Node 4 is down and cannot boot
Time : 2024-01-20 06:14:06.60 CST
Node : 0
Seq : 183895728
Class : Alert,Status change
Severity : Major
Type : Component state change
Component: hw_node:4
Tier : Hardware check
Spare_PN : 872569-001
Message : Node 4 Failed (Node Offline Due to Failure {0xd})
Time : 2024-01-20 06:28:36.16 CST
Node : 0
Seq : 183903329
Class : Service Alert
Severity : Critical
Type : Node-Failure-Analysis File Received From Remote/Local MCU
Component: hw_node:4
Tier : Corefiles
Message : Node-Failure-Analysis file received from Node 4.
Starting at 05:56, the node 4 syslog reports hardware errors and prints mcelog output. Nearly all records are "CPU 1: Machine Check: 0 Bank 7: cc00008000010092" and "CPU 1: Machine Check: 0 Bank 11: c800008a00800092". They continue until the node goes down at 06:12.
Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.683180] mce: [Hardware Error]: Machine check events logged
Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.683213] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 7: cc00008000010092
Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.698812] mce: [Hardware Error]: TSC 0 ADDR 2eaeb09340 MISC 4404c4e86
Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.712589] mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1705701415 SOCKET 1 APIC 20 microcode 42e
Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.730931] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 11: c800008a00800092
Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.746750] mce: [Hardware Error]: TSC 0 MISC c900178f163c1400
Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.758921] mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1705701415 SOCKET 1 APIC 20 microcode 42e
Jan 20 05:56:55 CNX148005S-4 mcelog: Hardware event. This is not a software error.
Jan 20 05:56:55 CNX148005S-4 mcelog: MCE 0
Jan 20 05:56:55 CNX148005S-4 mcelog: CPU 1 BANK 7
Jan 20 05:56:55 CNX148005S-4 mcelog: MISC 4404c4e86 ADDR 2eaeb09340
Jan 20 05:56:55 CNX148005S-4 mcelog: TIME 1705701415 Sat Jan 20 05:56:55 2024
Jan 20 05:56:55 CNX148005S-4 mcelog: MCG status:
Jan 20 05:56:55 CNX148005S-4 mcelog: MCi status:
Jan 20 05:56:55 CNX148005S-4 mcelog: Error overflow
Jan 20 05:56:55 CNX148005S-4 mcelog: Corrected error
Jan 20 05:56:55 CNX148005S-4 mcelog: MCi_MISC register valid
Jan 20 05:56:55 CNX148005S-4 mcelog: MCi_ADDR register valid
Jan 20 05:56:55 CNX148005S-4 mcelog: MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Jan 20 05:56:55 CNX148005S-4 mcelog: Transaction: Memory read error
Jan 20 05:56:55 CNX148005S-4 mcelog: STATUS cc00008000010092 MCGSTATUS 0
Jan 20 05:56:55 CNX148005S-4 mcelog: MCGCAP 1000c1b APICID 20 SOCKETID 1
Jan 20 05:56:55 CNX148005S-4 mcelog: CPUID Vendor Intel Family 6 Model 62
Jan 20 05:56:55 CNX148005S-4 mcelog: Hardware event. This is not a software error.
...
Jan 20 06:12:52 CNX148005S-4 mcelog: warning: 24 bytes ignored in each record
Jan 20 06:12:52 CNX148005S-4 mcelog: consider an update
Jan 20 09:03:26 CNX148005S-4 syslogd (GNU inetutils 1.9.4): restart
Jan 20 09:03:26 CNX148005S-4 vmunix: [ 0.000000] Initializing cgroup subsys cpuset
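The pattern described above (nearly all records coming from Bank 7 and Bank 11 on CPU 1) can be checked by tallying the kernel mce lines. The following is a minimal Python sketch, not a tool shipped with the array. The function name tally_mce and the regular expression are assumptions based only on the line format shown in this excerpt.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 7: cc00008000010092
# (pattern is an assumption derived from the excerpt above)
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ "
    r"Bank (\d+): ([0-9a-f]{16})"
)

def tally_mce(lines):
    """Count machine check records per (cpu, bank, status) tuple."""
    counts = Counter()
    for line in lines:
        m = MCE_LINE.search(line)
        if m:
            key = (int(m.group(1)), int(m.group(2)), m.group(3))
            counts[key] += 1
    return counts

sample = [
    "Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.683213] mce: "
    "[Hardware Error]: CPU 1: Machine Check: 0 Bank 7: cc00008000010092",
    "Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.730931] mce: "
    "[Hardware Error]: CPU 1: Machine Check: 0 Bank 11: c800008a00800092",
]

for key, n in sorted(tally_mce(sample).items()):
    print(key, n)
```

Run against the full syslog, a tally dominated by the same two (cpu, bank, status) tuples would support the reading that a single memory path is failing repeatedly.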
showeventlog reports a fault on node 4 DIMM 1.1.0
2024-01-20 05:56:57.84 CST 4 785625 Internal Communication Degraded Eagle memory cerr hw_eagle:4 General CC posted by node 4 MEM ADDR 0x0000002eaeb09340 - DIMM 1.1.0.
2024-01-20 06:02:07.87 CST 4 785674 Internal Communication Degraded Eagle memory cerr hw_eagle:4 General CC posted by node 4 MEM ADDR 0x00000026cada11c0 - DIMM 1.1.0.
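The first event log entry carries the same physical address (0x2eaeb09340) as the ADDR field in the 05:56:55 machine check record, which ties the MCE to DIMM 1.1.0. A small Python sketch of that cross-check follows. The helper names and regular expressions are hypothetical and assume only the line formats shown in this case.

```python
import re

# Event log format assumed from the excerpt: "... MEM ADDR 0x... - DIMM s.c.d."
EVENT_ADDR = re.compile(r"MEM ADDR (0x[0-9a-fA-F]+) - DIMM (\d+\.\d+\.\d+)")
# Kernel mce format assumed from the excerpt: "... TSC 0 ADDR 2eaeb09340 MISC ..."
MCE_ADDR = re.compile(r"\bADDR ([0-9a-fA-F]+)")

def event_addr_and_dimm(event_line):
    """Return (address, dimm_label) from a showeventlog line, or None."""
    m = EVENT_ADDR.search(event_line)
    return (int(m.group(1), 16), m.group(2)) if m else None

def mce_addr(syslog_line):
    """Return the ADDR value from a kernel mce line, or None."""
    m = MCE_ADDR.search(syslog_line)
    return int(m.group(1), 16) if m else None

event = ("2024-01-20 05:56:57.84 CST 4 785625 Internal Communication Degraded "
         "Eagle memory cerr hw_eagle:4 General CC posted by node 4 "
         "MEM ADDR 0x0000002eaeb09340 - DIMM 1.1.0.")
mce = ("Jan 20 05:56:55 CNX148005S-4 vmunix: [3221709.698812] mce: "
       "[Hardware Error]: TSC 0 ADDR 2eaeb09340 MISC 4404c4e86")

addr, dimm = event_addr_and_dimm(event)
if addr == mce_addr(mce):
    print(f"MCE address {addr:#x} matches event log entry for DIMM {dimm}")
```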
The NFA log from node 4 shows the same thing: the MRC log reports a fault on DIMM 1.1.0, attached to the second CPU socket (Socket: 0x1 in the log)
insplore.4B-CF22085-005S.node0.20240120.0958/var/core/nemoe/node4-nemoe-nfa-2024-01-20_06:28:35/console
MRC log data - Socket: 0x1, Channel : 0x1, DIMM: 0x0, Rank: 0x0
Corresponding DIMM on node board: DIMM 1.1.0
OemHook Assert discovered: codetype: 0x38 subcode: 0x311C data: 0x1010000
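In this excerpt the MRC fields Socket 0x1, Channel 0x1, DIMM 0x0 line up with the board label DIMM 1.1.0, which suggests a socket.channel.dimm naming scheme. The Python sketch below encodes that reading. The mapping is an assumption inferred from this single log sample, not a documented rule, and the function name is hypothetical.

```python
import re

# Field layout assumed from the console line in the NFA log above.
MRC_LINE = re.compile(
    r"Socket:\s*(0x[0-9a-fA-F]+),\s*Channel\s*:\s*(0x[0-9a-fA-F]+),"
    r"\s*DIMM:\s*(0x[0-9a-fA-F]+)"
)

def mrc_to_dimm_label(line):
    """Build a 'DIMM s.c.d' label from an MRC log line (assumed mapping)."""
    m = MRC_LINE.search(line)
    if not m:
        return None
    socket, channel, dimm = (int(g, 16) for g in m.groups())
    return f"DIMM {socket}.{channel}.{dimm}"

line = "MRC log data - Socket: 0x1, Channel : 0x1, DIMM: 0x0, Rank: 0x0"
print(mrc_to_dimm_label(line))
```

The console log itself prints the corresponding board label on the next line, so that line remains the authoritative source and the sketch is only a consistency check.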
Decoding the status of the banks reported in the MCEs also confirms a memory fault on channel 2
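The bank decode can be reproduced by hand from the STATUS value. The Python sketch below assumes the IA32_MCi_STATUS layout from the Intel architecture manual: flag bits in the top byte, model-specific code in bits 31:16, and the architectural error code in bits 15:0, where the memory controller form is 0000 0000 1MMM CCCC. The function name is hypothetical and only the fields needed for this case are decoded.

```python
# MMM field of a memory controller error code (bits 6:4 of MCACOD)
MEM_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}

def decode_mci_status(status):
    """Decode selected fields of a 64-bit MCi_STATUS value."""
    mcacod = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "mscod": (status >> 16) & 0xFFFF,
        "mcacod": mcacod,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC
    if (mcacod & 0xFF80) == 0x0080:
        info["memory_op"] = MEM_OPS.get((mcacod >> 4) & 0x7, "reserved")
        channel = mcacod & 0xF
        # CCCC == 1111 means the channel is not specified
        info["channel"] = None if channel == 0xF else channel
    return info

for bank, status in ((7, 0xCC00008000010092), (11, 0xC800008A00800092)):
    d = decode_mci_status(status)
    print(bank, d["memory_op"], "channel", d["channel"],
          "uncorrected" if d["uncorrected"] else "corrected")
```

For both status values the low 16 bits are 0x0092, which decodes to a memory read error on channel 2 and agrees with the mcelog lines "MEMORY CONTROLLER RD_CHANNEL2_ERR" and "Corrected error".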
Conclusion: a faulty control cache DIMM 1.1.0 on node 4 caused the node failure. The DIMM must be replaced to restore the node.