The alarm information is as follows:
"109","Major","RAID Card","Communication between the iBMC and RAID controller card 1 failed (SN:xxxxxxxx, BN:xxxxxxxxx).","2024-12-06 22:33:18","Asserted","0x06000025","1. Restart the server and open BIOS Device Manager in UEFI mode. Enter the Driver Health menu, and select Repair the whole platform.@#AB;2. Check and upgrade the firmware and driver of the RAID controller card to the latest version in the OS.@#AB;3. Check whether the OptionRom space in legacy mode is sufficient in the OS.@#AB;4. Check whether the PCIe port of the RAID controller card is disabled on the BIOS advanced settings.@#AB;5. Replace the RAID controller card.@#AB;6. Replace BBU."
"108","Critical","RAID Card","The RAID controller card 1 triggered an uncorrectable error, (SN:xxxxxxxx, BN:xxxxxxxxx).","2024-12-06 22:29:49","Asserted","0x06000007","1. Power off the server and check whether there is damage or poor contact between the RAID controller card and its slot.@#AB;2. Replace the RAID controller card.@#AB;3. Replace BBU.@#AB;4. Replace the mainboard."
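The customer reported a RAID controller card fault. After the RAID card was replaced, the fault recurred and the logs showed the same alarms.
Log analysis: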
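The exported alarm rows at the top of this case are easier to read once each row is split into fields and the handling suggestions are separated into steps. The following is a minimal Python sketch, not an official tool. It assumes each row is a quoted, comma-separated record with eight fields and that "@#AB;" is the separator between suggestion steps. The field names and the function name parse_alarm_rows are hypothetical, chosen here for illustration.

```python
import csv
import io

# Assumption: "@#AB;" separates the numbered handling suggestions in the export.
STEP_SEPARATOR = "@#AB;"


def parse_alarm_rows(text):
    """Parse exported alarm rows into dicts, splitting the suggestions into steps."""
    alarms = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != 8:
            continue  # skip blank or malformed rows
        alarms.append({
            # Field names are assumptions inferred from the row contents.
            "id": fields[0],
            "severity": fields[1],
            "subject": fields[2],
            "description": fields[3],
            "time": fields[4],
            "status": fields[5],
            "event_code": fields[6],
            "steps": [s.strip() for s in fields[7].split(STEP_SEPARATOR) if s.strip()],
        })
    return alarms
```

Calling parse_alarm_rows on the exported text returns one dict per alarm, so the suggested steps for the Critical and Major alarms can be compared side by side.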
#\dump_info\LogDump\maintenance_log
2024-12-06 00:41:48 INFO : SVR-0000000,Collecting physical drive log from OOB started.
2024-12-06 00:42:05 INFO : SVR-0000000,Collecting physical drive log from OOB ended.
2024-12-06 22:30:35 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)
2024-12-06 22:33:13 ERROR: SVR-0080006,RAID controller (RAID Card1) communication loss - Asserted
2024-12-06 22:33:37 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal deasserted(1 to 0)
2024-12-06 22:34:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)
2024-12-06 22:38:05 ERROR: SVR-0080006,RAID controller (RAID Card1) communication loss - Asserted
2024-12-06 22:58:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal deasserted(1 to 0)
2024-12-06 22:59:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)
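The heartbeat entries above alternate between asserted and deasserted, which shows the link to the card flapping rather than failing once. A rough Python sketch for measuring how long each abnormal period lasted is given below. It assumes the line layout seen in this excerpt (timestamp, level, event code, message). The function name heartbeat_intervals is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "<date> <time> <LEVEL>: <SVR-code>,<message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+)\s*:\s*(SVR-\d+),(.*)$"
)


def heartbeat_intervals(lines):
    """Return (closed, open_since).

    closed is a list of (start, end, seconds) for each abnormal period that
    was later deasserted; open_since is the start of a period that is still
    asserted at the end of the log, or None.
    """
    closed = []
    open_since = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(4)
        # Check "deasserted" first so it is never mistaken for "asserted".
        if "heartbeat abnormal deasserted" in message:
            if open_since is not None:
                seconds = int((stamp - open_since).total_seconds())
                closed.append((open_since, stamp, seconds))
                open_since = None
        elif "heartbeat abnormal asserted" in message:
            open_since = stamp
    return closed, open_since
```

On the excerpt above, this yields two closed abnormal periods and a third that is still asserted when the excerpt ends.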
#\dump_info\LogDump\app_debug_log_all
"GetCtrlPhyConnectionsInfo failed, return 0x1001" indicates a communication interruption.
2024-12-13 00:42:01 StorageMgnt ERROR: sml_lsi.c(13839): smlib: LSI:GetCtrlInfo failed, CtrlId = 0, return 0x1001
2024-12-13 00:42:01 StorageMgnt ERROR: sml_lsi.c(14127): smlib: LSI:GetCtrlPhyConnectionsInfo failed, CtrlId = 0, return 0x1001
2024-12-13 00:42:02 StorageMgnt ERROR: sml_lsi.c(13839): smlib: LSI:GetCtrlInfo failed, CtrlId = 0, return 0x1001
2024-12-13 00:42:02 StorageMgnt ERROR: sml_lsi.c(14127): smlib: LSI:GetCtrlPhyConnectionsInfo failed, CtrlId = 0, return 0x1001
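To see how often each call fails with 0x1001 across a full app_debug_log_all file, the entries can be tallied per function and controller. This is an illustrative Python sketch that assumes the message format shown in the lines above. The function name count_failures is hypothetical and is not part of any Huawei tool.

```python
import re
from collections import Counter

# Assumed message format: "LSI:<Function> failed, CtrlId = <n>, return <hex code>"
FAIL_RE = re.compile(r"LSI:(\w+) failed, CtrlId = (\d+), return (0x[0-9A-Fa-f]+)")


def count_failures(lines):
    """Count failures keyed by (function name, controller id, return code)."""
    counts = Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            function, ctrl_id, code = match.groups()
            counts[(function, int(ctrl_id), code.lower())] += 1
    return counts
```

A count that keeps growing for the same controller id suggests that the communication loss is persistent rather than a one-off event.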
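Based on the information above, the case was escalated to Huawei for analysis. Their conclusion: the RAID card logged a "surprise dom" fault message (most likely the PCIe "Surprise Down" link error), after which the RAID card firmware repeatedly failed to initialize.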
Solution: replace the mainboard to ensure the RAID card link is healthy. A spare RAID card can be brought along as well.