None
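At one site, a CVK host running CAS E0511 rebooted unexpectedly.
The CVK rebooted at around 03:00 on November 25. Logs collected on site afterwards showed that mcelog had recorded memory errors at the time of the reboot (around 03:00 on the 25th):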
mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 84 BANK 14 TSC 1e3dd733eae600
RIP !INEXACT! 10:ffffffff8104ef3c
MISC 908400880000086 ADDR 3150d06600
TIME 1574622767 Mon Nov 25 03:12:47 2019
MCG status:RIPV MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
SRAO
MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Transaction: Memory scrubbing error
MemCtrl: Uncorrected patrol scrub error
STATUS fd001e00001000c1 MCGSTATUS 5
MCGCAP f000814 APICID 29 SOCKETID 1
CPUID Vendor Intel Family 6 Model 85
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1 TSC 1e3dd7340a4106
ADDR 3150d06600
TIME 1574622768 Mon Nov 25 03:12:48 2019
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS 940000000000009f MCGSTATUS 0
MCGCAP f000814 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 85
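When reading the two mcelog records above, it helps to pull out the few fields that identify the fault location and compare them. The following is a minimal sketch, not part of the original case: `SAMPLE` is an abridged copy of the log above, and `FIELDS` and `parse_mcelog` are hypothetical helper names introduced only for illustration.

```python
import re

# Abridged copy of the two mcelog records shown above.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 84 BANK 14 TSC 1e3dd733eae600
MISC 908400880000086 ADDR 3150d06600
TIME 1574622767 Mon Nov 25 03:12:47 2019
STATUS fd001e00001000c1 MCGSTATUS 5
MCGCAP f000814 APICID 29 SOCKETID 1
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1 TSC 1e3dd7340a4106
ADDR 3150d06600
TIME 1574622768 Mon Nov 25 03:12:48 2019
STATUS 940000000000009f MCGSTATUS 0
MCGCAP f000814 APICID 0 SOCKETID 0
"""

# Field name -> (regex, numeric base). Hypothetical helper table.
FIELDS = {
    "cpu": (r"^CPU (\d+)", 10),
    "bank": (r"\bBANK (\d+)", 10),
    "addr": (r"\bADDR ([0-9a-f]+)", 16),
    "status": (r"^STATUS ([0-9a-f]+)", 16),
    "socket": (r"\bSOCKETID (\d+)", 10),
}

def parse_mcelog(text):
    """Split the text into records and extract the fields listed in FIELDS."""
    records = []
    # Every record starts with the "Hardware event." banner line.
    for chunk in text.split("Hardware event.")[1:]:
        rec = {}
        for name, (pattern, base) in FIELDS.items():
            m = re.search(pattern, chunk, re.MULTILINE)
            if m:
                rec[name] = int(m.group(1), base)
        records.append(rec)
    return records

records = parse_mcelog(SAMPLE)
for rec in records:
    print(rec["socket"], rec["bank"], hex(rec["addr"]), hex(rec["status"]))
```

Both records carry the same physical address (0x3150d06600): the uncorrected patrol scrub error is reported on socket 1, bank 14, and the corrected read error one second later on socket 0, bank 1.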
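The server's HDM shows a memory-related reboot at the time of the failure, but the entry for that event is marked as normal.
After in-depth analysis of the SDS logs, the experts concluded:
The DIMM in slot CPU2 B2 had an uncorrectable error, which caused the machine to reboot. Replacing the CPU2 B2 DIMM is recommended.
The analysis is as follows:
At the time of the fault, the system health log contains the following record: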
2019-11-25 03:12:48 2019-11-24 19:12:48 Uncorrected Machine Check Exception (Socket (1), APIC ID (0x20000000), Bank (0xe), Status (0xfd0009c0001000c1), Address (0x3150d06600), Misc (0xfd0009c0001000c1))
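The health log record carries two timestamps eight hours apart, and the mcelog `TIME` field carries a Unix epoch value. A short sketch of how the mcelog and health log timestamps can be cross-checked follows. It assumes the host's local time zone is UTC+8, which is an assumption consistent with the eight-hour offset in the log and not something the case states. `describe_epoch` is a hypothetical helper name.

```python
from datetime import datetime, timezone, timedelta

# Assumption: the host's local time zone is UTC+8.
ASSUMED_LOCAL_TZ = timezone(timedelta(hours=8))

def describe_epoch(epoch):
    """Return the epoch value as (UTC, assumed local) timestamp strings."""
    utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
    local = utc.astimezone(ASSUMED_LOCAL_TZ)
    fmt = "%Y-%m-%d %H:%M:%S"
    return utc.strftime(fmt), local.strftime(fmt)

# TIME value of the first (uncorrected) mcelog record.
print(describe_epoch(1574622767))
# ('2019-11-24 19:12:47', '2019-11-25 03:12:47')
```

Under that assumption, the mcelog event at 03:12:47 local time precedes the health log entry (03:12:48 local, 19:12:48 UTC) by one second, so both logs describe the same incident.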
//
[61] 1 The error was not corrected.
[31:16] 0000000000010000 Model Specific Error Code (MSCOD)
[31:16] 0000000000010000 0x0010 UnCorr Patrol Scrub Error
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[15:0] 0000000011000001 Machine Check Architecture Error Code (MCACOD).
[12] 0 No Corrected Filtering
[7] 1 Memory Controller Error.
Memory Controller Error, format : 000F 0000 1MMM CCCC.
[6:4] 100 Memory Scrubbing Error (MMM bits).
[3:0] 0001 Channel Number 1 (CCCC bits).
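The bit-field breakdown above can be reproduced with a few shifts and masks. The sketch below is illustrative only: `decode_mc_status`, `bit` and `MMM_NAMES` are hypothetical names, the flag positions follow the architectural IA32_MCi_STATUS layout, and the meaning of MSCOD 0x0010 (uncorrected patrol scrub error) is model-specific and taken from the analysis above rather than decoded by the code.

```python
# MMM values of the memory controller error code (format 000F 0000 1MMM CCCC).
MMM_NAMES = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}

def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)

def decode_mc_status(status):
    """Decode the architectural fields of a 64-bit MCi_STATUS value."""
    mcacod = status & 0xFFFF
    out = {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),
        "uncorrected": bit(status, 61),
        "enabled": bit(status, 60),
        "misc_valid": bit(status, 59),
        "addr_valid": bit(status, 58),
        "mscod": (status >> 16) & 0xFFFF,  # model-specific error code
        "mcacod": mcacod,                  # architectural error code
    }
    # Memory controller errors: every bit except F, MMM and CCCC is fixed.
    if (mcacod & 0xEF80) == 0x0080:
        mmm = (mcacod >> 4) & 0x7
        cccc = mcacod & 0xF
        out["memory_error"] = MMM_NAMES.get(mmm, "reserved")
        out["channel"] = None if cccc == 0xF else cccc  # 0xF: not specified
    return out

# Status value from the health log record, then the two mcelog STATUS values.
for value in (0xfd0009c0001000c1, 0xfd001e00001000c1, 0x940000000000009f):
    print(hex(value), decode_mc_status(value))
```

The health log status and the first mcelog STATUS share the same low 32 bits (0x001000c1), so both decode to an uncorrected memory scrubbing error on channel 1 with MSCOD 0x0010. The second mcelog STATUS decodes to a corrected memory read error with no channel specified, matching the mcelog text.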
The DIMM in slot CPU2 B2 has an uncorrectable error, which caused the machine to reboot. Replace the CPU2 B2 DIMM.