Assembly component - H3C UniServer R4900 G3 25SFF-RS3Z8R4900C-CTO server - domestic and overseas unified edition
NA
Two of the customer's R4900 G3 servers raised MCA alarms and then rebooted.
【Server 1】
328 Critical 1 0 0 2021-11-30 20:29:39 2021-11-30 12:29:39 SensorType: Processor, SensorName: CPU1_Status, EventType: Discrete, Event: Machine Check Exception, Data3: 0 CPU 1 triggered an uncorrectable error. - CPU1 triggered an uncorrectable error (UCE)
332 Warning 193 0 1 2021-11-30 20:29:40 2021-11-30 12:29:40 "Socket Address[48] MCA Error Src Log Info: 00h 14h 00h 00h 40h
MCA_ERR_SRC_LOG : 0x00140000
[20] MSMI internal
[18] MSMI_MCERR internal" - internal error
334 Warning 193 0 1 2021-11-30 20:29:40 2021-11-30 12:29:40 "Socket Address[48] MCE Error Log Reg Info: 00h 00h 03h c6h 40h
MCERRLOGGINGREG : 0x000003c6
[9] FirstMCerrSrcFromCbo
[8] FirstMCerrSrcValid
[7:0] FirstMCerrSrcId = 0xc6" - CPU1 internal error; the source ID points to core 6, bank 9 (no detailed error record for this bank appears later in the log)
342 Warning 193 0 1 2021-11-30 20:29:40 2021-11-30 12:29:40 "Socket Address[49] MCA Error Src Log Info: 00h a0h 00h 00h 40h
MCA_ERR_SRC_LOG : 0x00a00000
[23] MSMI External
[21] MSMI_MCERR External" - CPU2 reports an external error
344 Warning 193 0 1 2021-11-30 20:29:40 2021-11-30 12:29:40 "Socket Address[49] MCE Error Log Reg Info: 00h 00h 01h 44h 40h
MCERRLOGGINGREG : 0x00000144
[8] FirstMCerrSrcValid
[7:0] FirstMCerrSrcId = 0x44" - CPU2's FirstMCerrSrcId shows that CPU2 itself is not the error source
Based on the SDS log, the fault is most likely in CPU1. Replacing CPU1 resolved the problem.
【Server 2】
576 Caution 1 0 0 2021-12-13 12:11:59 2021-12-13 04:11:59 SensorType: Memory, SensorName: CPU1_DIMM_A11, EventType: Discrete, Event: Correctable ECC or other correctable memory error, Data2: 66, Data3: 17 CPU1 A11 triggered a correctable error - correctable memory error (CE) on DIMM A11
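The register decoding used in this analysis can be sketched in Python. This is a minimal illustration, not the BMC's own decoder: the bit names are copied from the log entries in this case, bits that the log never prints are left unnamed, and the helper names (`bytes_to_reg`, `decode_src_log`, `decode_mcerr_logging`) are hypothetical. It also assumes that the first four logged bytes form the register value, most significant byte first, which matches every entry shown here; the trailing `40h` byte is ignored.

```python
from typing import Dict, List

# Bit names as printed in the MCA_ERR_SRC_LOG entries of this case.
MCA_ERR_SRC_LOG_BITS = {
    18: "MSMI_MCERR internal",
    19: "MSMI_IERR internal",
    20: "MSMI internal",
    21: "MSMI_MCERR External",
    22: "MSMI_IERR External",
    23: "MSMI External",
}


def bytes_to_reg(raw: str, width: int = 4) -> int:
    """Join the first `width` logged bytes ("00h 14h ...") into one integer."""
    parts = [int(tok.rstrip("h"), 16) for tok in raw.split()]
    return int.from_bytes(bytes(parts[:width]), "big")


def decode_src_log(value: int) -> List[str]:
    """List the named bits that are set, highest bit first."""
    return [
        name
        for bit, name in sorted(MCA_ERR_SRC_LOG_BITS.items(), reverse=True)
        if (value >> bit) & 1
    ]


def decode_mcerr_logging(value: int) -> Dict[str, object]:
    """Split MCERRLOGGINGREG into the three fields shown in the log."""
    return {
        "FirstMCerrSrcFromCbo": bool((value >> 9) & 1),
        "FirstMCerrSrcValid": bool((value >> 8) & 1),
        "FirstMCerrSrcId": value & 0xFF,
    }


if __name__ == "__main__":
    cpu1 = bytes_to_reg("00h 14h 00h 00h 40h")
    cpu2 = bytes_to_reg("00h a0h 00h 00h 40h")
    print(hex(cpu1), decode_src_log(cpu1))
    print(hex(cpu2), decode_src_log(cpu2))
    print(decode_mcerr_logging(bytes_to_reg("00h 00h 03h c6h 40h")))
```

Run against the Server 1 entries, the sketch reports only internal bits for Socket Address[48] and only external bits for Socket Address[49], which is the asymmetry the analysis relies on.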
579 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[48] MCA Error Src Log Info: 00h 18h 00h 00h 40h
MCA_ERR_SRC_LOG : 0x00180000
[20] MSMI internal
[19] MSMI_IERR internal"
581 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[48] MCE Error Log Reg Info: 00h 00h 01h 44h 40h
MCERRLOGGINGREG : 0x00000144
[8] FirstMCerrSrcValid
[7:0] FirstMCerrSrcId = 0x44"
586 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[48] Comm Bank[16]--IMC1 Chan1:[Status] 8ch 00h 00h 40h 00h 08h 00h c1h 40h;[Address] 00h 00h 00h 1bh 8ch 44h d3h 40h 40h;[Misc] 12h 21h 00h 00h 00h 00h 00h 86h 40h
Channel Num 2 Memory Scrubbing Error. This error indicates the patrol scrubber has detected an error.
MC16_STATUS : 0x8c000040000800c1
[63] Valid
[59] MC_MISC is valid
[58] MC_ADDR is valid
[52:38] Corrected Err Count = 0x0001
[31:16] Model Specific Error Code = 0x0008
[15:0] Machine Check Architecture Error Code = 0x00c1
MC16_ADDR : 0x0000001b8c44d340
[45:0] ADDRESS = 0x001b8c44d340
MC16_MISC : 0x1221000000000086
[63:9] EXTRA_ERR_INFO = 0x09108000000000
[8:6] ADDR_MODE = 0x02
[5:0] REC_ERR_LSB = 0x06" - points to IMC2 channel 2, which matches the memory error reported earlier on DIMM A11, so this DIMM needs to be replaced
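The field breakdown of the bank 16 registers above can be reproduced with a few bit-range extractions. The sketch below is illustrative only: the bit ranges are the ones printed in the log entry, while the split of the low byte of the error code into a memory-transaction type (bits 6:4) and a channel field (bits 3:0) follows the memory-controller compound error code layout in the Intel SDM and is an assumption not stated in this case. The function names are hypothetical.

```python
from typing import Dict

# Memory-transaction type encodings (MMM field), per the Intel SDM layout
# for memory controller errors. Assumption: not taken from this case's log.
MEMORY_ERROR_TYPES = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def bits(value: int, hi: int, lo: int) -> int:
    """Return the inclusive bit range value[hi:lo] as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_mc_status(status: int) -> Dict[str, object]:
    """Extract the MCi_STATUS fields that the log entry prints."""
    mcacod = bits(status, 15, 0)
    decoded = {
        "valid": bool(bits(status, 63, 63)),
        "misc_valid": bool(bits(status, 59, 59)),
        "addr_valid": bool(bits(status, 58, 58)),
        "corrected_err_count": bits(status, 52, 38),
        "mscod": bits(status, 31, 16),
        "mcacod": mcacod,
    }
    # Memory controller errors have bit 7 set and bits 15:13 and 11:8 clear.
    if (mcacod & 0xEF80) == 0x0080:
        decoded["memory_error_type"] = MEMORY_ERROR_TYPES.get(
            bits(mcacod, 6, 4), "reserved"
        )
        decoded["channel_field"] = bits(mcacod, 3, 0)
    return decoded


def decode_mc_misc(misc: int) -> Dict[str, int]:
    """Extract the MCi_MISC fields that the log entry prints."""
    return {
        "extra_err_info": bits(misc, 63, 9),
        "addr_mode": bits(misc, 8, 6),
        "rec_err_lsb": bits(misc, 5, 0),
    }


if __name__ == "__main__":
    print(decode_mc_status(0x8C000040000800C1))
    print(decode_mc_misc(0x1221000000000086))
    print(hex(bits(0x0000001B8C44D340, 45, 0)))
```

For `MC16_STATUS` the raw channel field decodes to 1, while the log text prints "Channel Num 2" and the analysis refers to channel 2. This suggests the log text numbers channels from 1, but that reading is an inference.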
589 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[49] MCA Error Src Log Info: 00h c0h 00h 00h 40h
MCA_ERR_SRC_LOG : 0x00c00000
[23] MSMI External
[22] MSMI_IERR External"
591 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[49] MCE Error Log Reg Info: 00h 00h 01h 44h 40h
MCERRLOGGINGREG : 0x00000144
[8] FirstMCerrSrcValid
[7:0] FirstMCerrSrcId = 0x44" - CPU2's FirstMCerrSrcId shows that CPU2 itself is not the error source
592 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[49] PCU First IERR Tsc Lo Info: 00h 00h 00h 00h 40h
PCU_FIRST_IERR_TSC_LO : 0x00000000"
593 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[49] PCU First IERR Tsc Hi Info: 00h 00h 00h 00h 40h
PCU_FIRST_IERR_TSC_HI : 0x00000000"
594 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[49] PCU First MCERR Tsc Lo Info: 00h 00h 00h 00h 40h
PCU_FIRST_MCEERR_TSC_LO : 0x00000000"
595 Warning 193 0 1 2021-12-13 12:12:00 2021-12-13 04:12:00 "Socket Address[49] PCU First MCERR Tsc Hi Info: 00h 00h 00h 00h 40h
PCU_FIRST_MCEERR_TSC_HI : 0x00000000" - all of CPU2's first-error timestamps are 0, so the other socket (CPU1) is the faulty side
See the analysis above.
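The timestamp check can be written down as a small helper. This is a sketch under assumptions: it treats each Hi/Lo register pair as the upper and lower 32 bits of one 64-bit TSC value, and it follows this case's reasoning that a socket whose first-IERR and first-MCERR timestamps are both zero did not record a first error. The function names are hypothetical.

```python
def first_error_tsc(hi: int, lo: int) -> int:
    """Combine a Hi/Lo register pair into one 64-bit timestamp value."""
    return ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)


def socket_recorded_first_error(
    ierr_hi: int, ierr_lo: int, mcerr_hi: int, mcerr_lo: int
) -> bool:
    """True if either first-error timestamp on this socket is non-zero."""
    return (
        first_error_tsc(ierr_hi, ierr_lo) != 0
        or first_error_tsc(mcerr_hi, mcerr_lo) != 0
    )


if __name__ == "__main__":
    # Socket Address[49] values from entries 592 to 595: all zero.
    print(socket_recorded_first_error(0x0, 0x0, 0x0, 0x0))
```

With the Socket Address[49] values from the log, the helper returns `False`, consistent with the conclusion that the fault originated on the other socket.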