None.
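At one site, a user of an H3C UIS-Cell 3010 G3 all-in-one appliance reported that the indicator LEDs on all six front drives were flashing alternately orange and blue at the same time. Services were not interrupted.
The underlying hardware of the H3C UIS-Cell 3010 G3 is an H3C UniServer R4900 G3 server. The analysis proceeded as follows:
1. Log in to the HDM web interface, open "Sensor Information" under "System Information" in the left pane, and select the "Hard Disk" tab in the main pane on the right. Drive F04 reports a critical error, as shown in the figure below:
2. Click the "Hardware Information" menu on the left and select the "Storage" tab in the main pane on the right. Physical drive 3 is missing from the list, as shown in the figure below:
3. Collect the SDS logs. The dynamic monitoring log contains the following records of the drive (F04) dropping offline and the logical drive becoming degraded: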
2019-05-03 05:38:30 PD is offline ---Pos: Front Panel index: 3
2019-05-03 05:38:30 LD 0 has changed from optimal to degraded.
2019-05-03 05:38:33 SensorType: Drive Slot (Bay), SensorName: HDD_F04_Status, EventType: Discrete, Event: Drive Fault Drive fault
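When the monitoring log is long, the three kinds of records quoted above can be picked out mechanically. The following is a minimal sketch, assuming the log lines keep exactly the layout shown; the function name `summarize_sds_events` and the regular expressions are illustrative and not part of any H3C tool.

```python
import re

# Lines copied from the dynamic monitoring log quoted above.
SDS_LOG = """\
2019-05-03 05:38:30 PD is offline ---Pos: Front Panel index: 3
2019-05-03 05:38:30 LD 0 has changed from optimal to degraded.
2019-05-03 05:38:33 SensorType: Drive Slot (Bay), SensorName: HDD_F04_Status, EventType: Discrete, Event: Drive Fault Drive fault
"""

# Hypothetical patterns, written against the sample lines only.
PD_OFFLINE = re.compile(r"PD is offline\s+-+\s*Pos: (?P<pos>.+?) index: (?P<index>\d+)")
LD_CHANGE = re.compile(r"LD (?P<ld>\d+) has changed from (?P<old>\w+) to (?P<new>\w+)")
SENSOR = re.compile(r"SensorName: (?P<name>\w+),.*Event: (?P<event>.+)$")


def summarize_sds_events(text):
    """Return (timestamp, kind, details) tuples for drive-related log lines."""
    events = []
    for line in text.splitlines():
        # The timestamp "YYYY-MM-DD HH:MM:SS" occupies the first 19 characters.
        timestamp, message = line[:19], line[20:]
        m = PD_OFFLINE.search(message)
        if m:
            index = int(m["index"])
            # The case equates 0-based slot 3 with bay 4 (F04).
            events.append((timestamp, "pd_offline",
                           {"pos": m["pos"], "index": index, "bay": index + 1}))
            continue
        m = LD_CHANGE.search(message)
        if m:
            events.append((timestamp, "ld_state_change",
                           {"ld": int(m["ld"]), "old": m["old"], "new": m["new"]}))
            continue
        m = SENSOR.search(message)
        if m:
            events.append((timestamp, "sensor_event",
                           {"sensor": m["name"], "event": m["event"]}))
    return events


for event in summarize_sds_events(SDS_LOG):
    print(event)
```

The sketch only restates what the log already says: the drive at front-panel index 3 went offline, logical drive 0 went from optimal to degraded in the same second, and the F04 sensor raised a drive fault three seconds later.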
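4. The SDS logs also show that the configured RAID controller is a PMC P430-M2, so the arcconf tool was used to collect the controller logs. The controller_1_config.txt file shows that six drives are configured as a RAID 5 array and that logical drive 0 is currently degraded, with the drive in slot 3 (also called bay 4) missing. The log reads as follows: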
Logical Device number 0
Logical Device name : DefaultValue0
Block Size of member drives : 512 Bytes
RAID level : 5
Unique Identifier : E1FB32B3
Status of Logical Device : Degraded
Additional details : Initialized with Build/Clear
Size : 8575985 MB
Parity space : 1715199 MB
Stripe-unit size : 256 KB
Interface Type : Serial Attached SCSI
Device Type : HDD
Read-cache setting : Enabled
Read-cache status : On
Write-cache setting : Enabled
Write-cache status : On
Partitioned : Yes
Protected by Hot-Spare : No
Bootable : Yes
Failed stripes : No
Power settings : Disabled
--------------------------------------------------------
Logical Device segment information
--------------------------------------------------------
Segment 0 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:0) W3Z17NQV0000K817KUPS
Segment 1 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:1) W3Z171LP0000K8170VLP
Segment 2 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:2) W3Z18A0G0000K815JR1W
Segment 3 : Missing (0MB, SAS, HDD, Connector:0, Device:3)
Segment 4 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:4) W3Z1745N0000K817KSS3
Segment 5 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:5) W3Z1740D0000K817KU3K
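The segment list above can be checked programmatically for absent members. The following is a minimal sketch, assuming the segment lines keep the layout shown in controller_1_config.txt; `missing_segments` and `raid5_state` are illustrative helper names, and the state labels simply mirror the wording of the log.

```python
import re

# Segment lines copied from controller_1_config.txt as quoted above.
SEGMENTS = """\
Segment 0 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:0) W3Z17NQV0000K817KUPS
Segment 1 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:1) W3Z171LP0000K8170VLP
Segment 2 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:2) W3Z18A0G0000K815JR1W
Segment 3 : Missing (0MB, SAS, HDD, Connector:0, Device:3)
Segment 4 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:4) W3Z1745N0000K817KSS3
Segment 5 : Present (1716957MB, SAS, HDD, Enclosure:0, Slot:5) W3Z1740D0000K817KU3K
"""

# Captures the segment number and its state word ("Present", "Missing", ...).
SEGMENT_RE = re.compile(r"^Segment (\d+)\s*:\s*(\w+)", re.MULTILINE)


def missing_segments(text):
    """Return the numbers of all segments that are not reported as Present."""
    return [int(num) for num, state in SEGMENT_RE.findall(text) if state != "Present"]


def raid5_state(missing_count):
    """RAID 5 keeps data available with one member lost, but not with two."""
    if missing_count == 0:
        return "Optimal"
    if missing_count == 1:
        return "Degraded"
    return "Failed"


missing = missing_segments(SEGMENTS)
print(missing, raid5_state(len(missing)))
```

With exactly one member missing, a RAID 5 array stays online without redundancy, which is consistent with the reported symptom that services were not interrupted.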
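5. The Controller_1_Logs.txt entries for the corresponding time show a hardware error on dev03 (that is, bay 4). The KCQ value is 04/15/01, which the controller log decodes as a random positioning error of the drive head. No IOP reset, hang, or similar error of the RAID controller or its firmware was found. The entries are as follows: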
[14098]: 05:41:42 ProcessSRB_Errors: Service Response 0,
Scsi Status 2
[14099]: Fri - May 3 05:41:42 2019.587360
ScsiStatus=2 ServResp=0 devt=0x3
Cdb[0:15]=0x2800ccef:e5200000:08000000:00000000
[14100]: Fri - May 3 05:41:42 2019.587477 RS:
Check Condition hhmmss=0x00054149 incident=0x00003aaf nexus=0x01020002
devt=0x00000003
[14101]: 05:41:42 expevent 0001000C - 00:03:00
SCSI Sense code key=04
asc=15 ascq=01
[14102]: 05:41:42 ID(0:03:0); Error Event
[Cmd:0x28]
[14103]: Fri - May 3 05:41:42 2019.587777
DC_DecodeSenseInfo: ID(0:03:0); [k:0x4;c:0x15;q:0x1]
[14104]: 05:41:42 Hardware Error
[14105]: Fri - May 3 05:41:42 2019.587908
DC_DecodeSenseInfo: ID(0:03:0)
[14106]: 05:41:42 Random Positioning Error
[14107]: Fri - May 3 05:41:42 2019.588045
SP_CloseNexus: Setting ID(0:03:0) offline. failure Reason Code=4
[14108]: 05:41:42 IsContainerPartition:
Container device 0x3
[14109]: 05:41:42 Vendor ID: SEAGATE Product
ID: ST1800MM0018 Serial Number: W3Z189X00000K816NVFV
[14110]: 05:41:42 SAS WWN: 50 00 C5 00 A0 DA
44 44
[14111]: 05:41:42 DDLog for devt: 3 with
reason: 4
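The KCQ triple cited in this step comes from the `DC_DecodeSenseInfo` line. The following is a minimal sketch of extracting it, assuming the bracketed `[k:...;c:...;q:...]` layout shown above; the helper names are illustrative, and the sense-key table lists only the common standard SCSI sense keys.

```python
import re

# Common SCSI sense keys (the "K" in KCQ); not an exhaustive table.
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0x7: "Data Protect",
    0xB: "Aborted Command",
}

SENSE_RE = re.compile(r"\[k:0x([0-9a-fA-F]+);c:0x([0-9a-fA-F]+);q:0x([0-9a-fA-F]+)\]")


def parse_sense(line):
    """Return (key, asc, ascq) as integers, or None if the line has no sense info."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    return tuple(int(group, 16) for group in m.groups())


def format_kcq(key, asc, ascq):
    """Format the triple in the two-digit hex K/C/Q notation used in this case."""
    return f"{key:02X}/{asc:02X}/{ascq:02X}"


line = "DC_DecodeSenseInfo: ID(0:03:0); [k:0x4;c:0x15;q:0x1]"
key, asc, ascq = parse_sense(line)
print(format_kcq(key, asc, ascq), SENSE_KEYS.get(key, "Unknown"))
```

Sense key 04 is reported by the drive itself, which is why the step attributes the fault to the drive rather than to the controller.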
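6. Controller_1_Monitor_Log.txt likewise contains error records pointing to devt03 with failure reason code 4 and KCQ 04/15/01. No errors related to the RAID controller or its firmware were found. The records are as follows: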
05/03/19 05:41:42.587360: ScsiStatus=2 ServResp=0 devt=0x3 Cdb[0:15]=0x2800ccef:e5200000:08000000:00000000
05/03/19 05:41:42.587477: RS: Check Condition hhmmss=0x00054149 incident=0x00003aaf nexus=0x01020002 devt=0x00000003
05/03/19 05:41:42.587777: DC_DecodeSenseInfo: ID(0:03:0); [k:0x4;c:0x15;q:0x1]
05/03/19 05:41:42.587908: DC_DecodeSenseInfo: ID(0:03:0)
05/03/19 05:41:42.588045: SP_CloseNexus: Setting ID(0:03:0) offline. failure Reason Code=4
05/03/19 05:41:42.588937: ID_AIC_DEV_TASK: rmw_nexus=0x01020002 state=0x01000100:01000201:02020f00:0f090f0a:0f090f0a:02050203:02040600
05/03/19 05:41:42.591771: ScsiStatus=109 ServResp=1 devt=0x3 Cdb[0:15]=0x2a00ccef:fe000002:00000000:00000000
05/03/19 05:41:42.591875: SRV_DLVRY_TGT_FAILURE Abort handling for STATUS: 0x6d on devt=0x00000003 !
05/03/19 05:41:42.592174: ScsiStatus=109 ServResp=1 devt=0x3 Cdb[0:15]=0x2800ccef:e6880000:08000000:00000000
05/03/19 05:41:42.592277: SRV_DLVRY_TGT_FAILURE Abort handling for STATUS: 0x6d on devt=0x00000003 !
05/03/19 05:41:42.592573: ScsiStatus=109 ServResp=1 devt=0x3 Cdb[0:15]=0x2800ccef:e1c80000:08000000:00000000
05/03/19 05:41:42.592676: SRV_DLVRY_TGT_FAILURE Abort handling for STATUS: 0x6d on devt=0x00000003 !
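The `Cdb[0:15]` fields in these records show which commands were in flight when the drive failed. The following is a minimal sketch of decoding them, assuming the dump prints the CDB bytes in order (the first byte 0x28 matches the `[Cmd:0x28]` entry in Controller_1_Logs.txt) and handling only the 10-byte READ and WRITE commands; `decode_cdb10` is an illustrative name.

```python
import re
import struct

# Standard SCSI opcodes for the two 10-byte commands seen in this log.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
CDB_RE = re.compile(r"Cdb\[0:15\]=0x([0-9a-f:]+)")


def decode_cdb10(line):
    """Return (command name, starting LBA, block count) for a READ(10)/WRITE(10)."""
    m = CDB_RE.search(line)
    if m is None:
        raise ValueError("no Cdb field in line")
    raw = bytes.fromhex(m.group(1).replace(":", ""))
    opcode = raw[0]
    if opcode not in OPCODES:
        raise ValueError(f"unsupported opcode 0x{opcode:02x}")
    # 10-byte CDB layout: bytes 2-5 hold the LBA, bytes 7-8 the transfer
    # length, both big-endian.
    (lba,) = struct.unpack(">I", raw[2:6])
    (blocks,) = struct.unpack(">H", raw[7:9])
    return OPCODES[opcode], lba, blocks


first = "05/03/19 05:41:42.587360: ScsiStatus=2 ServResp=0 devt=0x3 Cdb[0:15]=0x2800ccef:e5200000:08000000:00000000"
later = "05/03/19 05:41:42.591771: ScsiStatus=109 ServResp=1 devt=0x3 Cdb[0:15]=0x2a00ccef:fe000002:00000000:00000000"
for record in (first, later):
    name, lba, blocks = decode_cdb10(record)
    print(name, hex(lba), blocks)
```

Under that assumption, the command that returned Check Condition was an 8-block read, and the commands aborted afterwards were reads and writes to nearby addresses on the same devt. Note also that `ScsiStatus=109` is the decimal form of the `STATUS: 0x6d` shown on the following line of each pair.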
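7. The Controller_1_SmartStats.xml log contains no entry for the bay 4 drive, and no errors are recorded for the other drives.
8. The configured RAID controller is a RAID-P430-M2, and neither its firmware nor its driver is the latest version. The log reads as follows: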
Controller Type : RAID-P430-M2
Firmware Version : 33270
--------------------------------------------------------
Controller Version Information
--------------------------------------------------------
BIOS : 7.13-0 (33270)
Firmware : 7.13-0 (33270)
Driver : 1.2-1 (50792)
Boot Flash : 7.13-0 (33270)
CPLD (Load version/ Flash version) : 8/ 8
SEEPROM (Load version/ Flash version) : 1/ 1
FCT Custom Init String Version : 0x3
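From the above, the root cause is a failed drive in bay 4. The drive was also retained for analysis and could not be recognized at all, which points to a fault in the drive's head.
1. Back up the data, then replace the failed drive.
2. Update the RAID controller firmware and driver to the latest versions (FW: 33303, Drv: 57013).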
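The version check behind recommendation 2 can be sketched as follows. This is a minimal illustration, assuming the build number in parentheses is what gets compared and that a higher build number means a newer release; the helper names are hypothetical, and the recommended values are the ones stated above.

```python
import re

# Version lines copied from the controller log quoted in step 8.
VERSION_INFO = """\
BIOS : 7.13-0 (33270)
Firmware : 7.13-0 (33270)
Driver : 1.2-1 (50792)
Boot Flash : 7.13-0 (33270)
"""

# Target builds taken from recommendation 2 (FW: 33303, Drv: 57013).
RECOMMENDED = {"Firmware": 33303, "Driver": 57013}

VERSION_RE = re.compile(
    r"^(?P<name>[A-Za-z ]+?)\s*:\s*\S+\s*\((?P<build>\d+)\)\s*$", re.MULTILINE
)


def installed_builds(text):
    """Map each component name to the build number shown in parentheses."""
    return {m["name"]: int(m["build"]) for m in VERSION_RE.finditer(text)}


def outdated(installed, recommended):
    """Return {component: (installed, recommended)} for components below target.

    A component missing from the log is treated as outdated (build 0).
    """
    return {
        name: (installed.get(name, 0), target)
        for name, target in recommended.items()
        if installed.get(name, 0) < target
    }


print(outdated(installed_builds(VERSION_INFO), RECOMMENDED))
```

For the log in this case, both the firmware and the driver fall below the recommended builds.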