
R3820 G3 RAID card error: "Communication between the iBMC and RAID controller card 1 failed"

Published 2025-01-09

Alarm Information

The following alarms were reported:

"109","Major","RAID Card","Communication between the iBMC and RAID controller card 1 failed (SN:xxxxxxxx, BN:xxxxxxxxx).","2024-12-06 22:33:18","Asserted","0x06000025","1. Restart the server and open BIOS Device Manager in UEFI mode. Enter the Driver Health menu, and select Repair the whole platform.@#AB;2. Check and upgrade the firmware and driver of the RAID controller card to the latest version in the OS.@#AB;3. Check whether the OptionRom space in legacy mode is sufficient in the OS.@#AB;4. Check whether the PCIe port of the RAID controller card is disabled on the BIOS advanced settings.@#AB;5. Replace the RAID controller card.@#AB;6. Replace BBU."

"108","Critical","RAID Card","The RAID controller card 1 triggered an uncorrectable error, (SN:xxxxxxxx, BN:xxxxxxxxx).","2024-12-06 22:29:49","Asserted","0x06000007","1. Power off the server and check whether there is damage or poor contact between the RAID controller card and its slot.@#AB;2. Replace the RAID controller card.@#AB;3. Replace BBU.@#AB;4. Replace the mainboard."

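The two alarm records above can be split into fields for easier reading. The following is a minimal sketch: the field names are assumptions inferred from the record layout, not an official schema, and "@#AB;" is assumed to separate the suggested handling steps.

```python
import csv
import io

# Field names are assumed from the record layout, not from an official schema.
FIELDS = ["id", "severity", "subject", "description",
          "time", "status", "event_code", "suggestions"]

def parse_alarm(line):
    """Parse one quoted, comma-separated alarm record into a dict."""
    row = next(csv.reader(io.StringIO(line)))
    alarm = dict(zip(FIELDS, row))
    # "@#AB;" appears to separate the suggested handling steps (assumption).
    alarm["suggestions"] = [s.strip() for s in alarm["suggestions"].split("@#AB;")]
    return alarm

sample = (
    '"108","Critical","RAID Card",'
    '"The RAID controller card 1 triggered an uncorrectable error, '
    '(SN:xxxxxxxx, BN:xxxxxxxxx).",'
    '"2024-12-06 22:29:49","Asserted","0x06000007",'
    '"1. Power off the server and check whether there is damage or poor contact '
    'between the RAID controller card and its slot.@#AB;'
    '2. Replace the RAID controller card.@#AB;3. Replace BBU.@#AB;'
    '4. Replace the mainboard."'
)

alarm = parse_alarm(sample)
print(alarm["severity"], alarm["event_code"], alarm["time"])
for step in alarm["suggestions"]:
    print(step)
```

Using the csv module instead of a plain split keeps the commas inside the quoted description from breaking the field boundaries.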
 

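Problem Description

The customer reported a RAID card fault. After the RAID card was replaced, the fault recurred and the logs still showed the same alarms.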

 

Analysis

Log analysis:

#\dump_info\LogDump\maintenance_log

2024-12-06 00:41:48 INFO : SVR-0000000,Collecting physical drive log from OOB started.

2024-12-06 00:42:05 INFO : SVR-0000000,Collecting physical drive log from OOB ended.

2024-12-06 22:30:35 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)

2024-12-06 22:33:13 ERROR: SVR-0080006,RAID controller (RAID Card1) communication loss - Asserted

2024-12-06 22:33:37 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal deasserted(1 to 0)

2024-12-06 22:34:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)

2024-12-06 22:38:05 ERROR: SVR-0080006,RAID controller (RAID Card1) communication loss - Asserted

2024-12-06 22:58:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal deasserted(1 to 0)

2024-12-06 22:59:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)
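The heartbeat entries above show the RAID card flapping between abnormal and normal states. The following minimal sketch measures how long each abnormal period lasted. The line pattern is inferred from this excerpt only, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the excerpt: "<timestamp> <LEVEL>: <code>,<message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+)\s*: (SVR-\d+),(.*)$"
)

LOG = """\
2024-12-06 22:30:35 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)
2024-12-06 22:33:13 ERROR: SVR-0080006,RAID controller (RAID Card1) communication loss - Asserted
2024-12-06 22:33:37 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal deasserted(1 to 0)
2024-12-06 22:34:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)
2024-12-06 22:38:05 ERROR: SVR-0080006,RAID controller (RAID Card1) communication loss - Asserted
2024-12-06 22:58:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal deasserted(1 to 0)
2024-12-06 22:59:36 ERROR: SVR-0072002,RAID Card1 heartbeat abnormal asserted(0 to 1)
"""

def parse_events(text):
    """Return (timestamp, level, code, message) tuples for matching lines."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((ts, m.group(2), m.group(3), m.group(4)))
    return events

def heartbeat_intervals(events):
    """Pair heartbeat asserted/deasserted entries into abnormal intervals."""
    intervals, start = [], None
    for ts, _level, code, msg in events:
        if code != "SVR-0072002":
            continue
        # Check "deasserted" first because it contains "asserted".
        if "deasserted" in msg:
            if start is not None:
                intervals.append((start, ts))
                start = None
        elif "asserted" in msg:
            start = ts
    return intervals, start  # start is not None if still abnormal at the end

intervals, still_abnormal = heartbeat_intervals(parse_events(LOG))
for begin, end in intervals:
    print(begin, "->", end, int((end - begin).total_seconds()), "s")
if still_abnormal is not None:
    print("still abnormal since", still_abnormal)
```

On this excerpt the heartbeat recovers twice and then fails again within about a minute each time, which matches the repeated communication loss entries.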

 

#\dump_info\LogDump\app_debug_log_all

A "GetCtrlPhyConnectionsInfo failed" entry with return 0x1001 indicates that communication was interrupted:

2024-12-13 00:42:01 StorageMgnt ERROR: sml_lsi.c(13839): smlib: LSI:GetCtrlInfo failed, CtrlId = 0, return 0x1001

2024-12-13 00:42:01 StorageMgnt ERROR: sml_lsi.c(14127): smlib: LSI:GetCtrlPhyConnectionsInfo failed, CtrlId = 0, return 0x1001

2024-12-13 00:42:02 StorageMgnt ERROR: sml_lsi.c(13839): smlib: LSI:GetCtrlInfo failed, CtrlId = 0, return 0x1001

2024-12-13 00:42:02 StorageMgnt ERROR: sml_lsi.c(14127): smlib: LSI:GetCtrlPhyConnectionsInfo failed, CtrlId = 0, return 0x1001
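To see how often each call fails with 0x1001, the entries above can be counted per function. This is a minimal sketch: the regular expression is inferred from the four lines shown here, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the excerpt: "LSI:<function> failed, CtrlId = <n>, return <hex>"
FAIL_RE = re.compile(r"LSI:(\w+) failed, CtrlId = (\d+), return (0x[0-9A-Fa-f]+)")

DEBUG_LOG = """\
2024-12-13 00:42:01 StorageMgnt ERROR: sml_lsi.c(13839): smlib: LSI:GetCtrlInfo failed, CtrlId = 0, return 0x1001
2024-12-13 00:42:01 StorageMgnt ERROR: sml_lsi.c(14127): smlib: LSI:GetCtrlPhyConnectionsInfo failed, CtrlId = 0, return 0x1001
2024-12-13 00:42:02 StorageMgnt ERROR: sml_lsi.c(13839): smlib: LSI:GetCtrlInfo failed, CtrlId = 0, return 0x1001
2024-12-13 00:42:02 StorageMgnt ERROR: sml_lsi.c(14127): smlib: LSI:GetCtrlPhyConnectionsInfo failed, CtrlId = 0, return 0x1001
"""

def count_failures(text):
    """Count failures keyed by (function, controller id, return code)."""
    counts = Counter()
    for line in text.splitlines():
        m = FAIL_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)), m.group(3))] += 1
    return counts

failures = count_failures(DEBUG_LOG)
for (func, ctrl_id, code), n in sorted(failures.items()):
    print(f"{func} CtrlId={ctrl_id} return={code}: {n}")
```

A steady stream of 0x1001 results for the same controller, as in this excerpt, points to a persistent loss of communication rather than a one-off failed query.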

 

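Based on the information above, the case was escalated to Huawei, the original manufacturer. Their analysis concluded that the RAID card logged a surprise down fault, after which the RAID card firmware repeatedly failed to initialize.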

Solution

Replace the mainboard to make sure the RAID card link works properly. A spare RAID card can also be brought along.