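At around 10:00 on July 30, a stack of S6800-54QF switches on site hung abnormally. Service recovered after each member was power-cycled in turn (slot 2 was power-cycled first; after it recovered, slot 1 was power-cycled).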
<HN-GZNSD201-CB5-S6800-190.Int>dis version
H3C Comware Software, Version 7.1.045, Feature 2426
Copyright (c) 2004-2016 Hangzhou H3C Tech. Co., Ltd. All rights reserved.
H3C S6800-54QF uptime is 0 weeks, 0 days, 1 hour, 18 minutes
Last reboot reason : Cold reboot
Boot image: flash:/s6800-cmw710-boot-f2426.bin
Boot image version: 7.1.045, Feature 2426
Compiled Jan 19 2016 16:00:00
System image: flash:/s6800-cmw710-system-f2426.bin
System image version: 7.1.045, Feature 2426
Compiled Jan 19 2016 16:00:00
Patch image(s) list:
flash:/s6800-cmw710-boot-patch-f2426h03.bin, version: Feature 2426H03
Compiled Jan 19 2016 16:00:00
flash:/s6800-cmw710-system-patch-f2426h06.bin, version: Feature 2426H06
Compiled Jan 19 2016 16:00:00
Slot 1:
Uptime is 0 weeks,0 days,0 hours,50 minutes
S6800-54QF with 2 Processors
BOARD TYPE: S6800-54QF
DRAM: 2048M bytes
FLASH: 512M bytes
PCB 1 Version: VER.A
Bootrom Version: 150
CPLD 1 Version: 001
CPLD 2 Version: 001
Release Version: H3C S6800-54QF-2426
Patch Version : Feature 2426H06
Reboot Cause : ColdReboot
[SubSlot 0] 48SFP Plus+6QSFP Plus
Slot 2:
Uptime is 0 weeks,0 days,1 hour,18 minutes
S6800-54QF with 2 Processors
BOARD TYPE: S6800-54QF
DRAM: 2048M bytes
FLASH: 512M bytes
PCB 1 Version: VER.A
Bootrom Version: 150
CPLD 1 Version: 001
CPLD 2 Version: 001
Release Version: H3C S6800-54QF-2426
Patch Version : Feature 2426H06
Reboot Cause : ColdReboot
[SubSlot 0] 48SFP Plus+6QSFP Plus
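Slot 1 was power-cycled after its CPU hung, so no relevant information was retained on that device.
The slot 2 logs, however, confirm the sequence of events. After the slot 1 CPU failed and hung, the stack heartbeat packets timed out and the stack split, and slot 2 was promoted to master. Because this early software version does not support health checking, MAD could only place the device with the larger member ID, slot 2, in the MAD DOWN state, leaving only slot 1 to carry traffic. Since slot 1 had already hung, all downstream services were interrupted.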
%@1653%Jul 30 10:42:47:907 2023 HN-GZNSD201-CB5-S6800-190.Int HA/5/HA_STANDBY_TO_MASTER: Standby board in slot 2 changed to master.
%@1654%Jul 30 10:42:48:207 2023 HN-GZNSD201-CB5-S6800-190.Int DEV/3/BOARD_REMOVED: Board was removed from slot 1, type is S6800-54QF.
%@1655%Jul 30 10:42:48:733 2023 HN-GZNSD201-CB5-S6800-190.Int LAGG/6/LAGG_INACTIVE_PHYSTATE: Member port XGE1/0/3 of aggregation group BAGG3 changed to the inactive state, because the physical state of the port is down.
%@1656%Jul 30 10:42:48:756 2023 HN-GZNSD201-CB5-S6800-190.Int LAGG/6/LAGG_ACTIVE: Member port XGE2/0/11 of aggregation group BAGG11 changed to the active state.
%@1657%Jul 30 10:42:48:756 2023 HN-GZNSD201-CB5-S6800-190.Int LAGG/6/LAGG_INACTIVE_CONFIGURATION: Member port XGE1/0/11 of aggregation group BAGG11 changed to the inactive state, because the aggregation configuration of the port is incorrect.
%@1658%Jul 30 10:42:48:773 2023 HN-GZNSD201-CB5-S6800-190.Int LAGG/6/LAGG_INACTIVE_PHYSTATE: Member port XGE1/0/15 of aggregation group BAGG15 changed to the inactive state, because the physical state of the port is down.
%@1659%Jul 30 10:42:48:806 2023 HN-GZNSD201-CB5-S6800-190.Int LAGG/6/LAGG_INACTIVE_PHYSTATE: Member port XGE1/0/34 of aggregation group BAGG34 changed to the inactive state, because the physical state of the port is down.
%@1660%Jul 30 10:42:49:267 2023 HN-GZNSD201-CB5-S6800-190.Int BFD/5/BFD_CHANGE_FSM: Sess[192.168.0.2/192.168.0.1, LD/RD:97/97, Interface:Vlan2, SessType:Ctrl, LinkType:INET], Sta: DOWN->UP, Diag: 0
%@1661%Jul 30 10:42:49:269 2023 HN-GZNSD201-CB5-S6800-190.Int DEV/1/MAD_DETECT: Multi-active devices detected, please fix it.
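In summary, a CPU hardware fault on slot 1 caused the stack to split, and MAD then isolated slot 2, which caused the service impact. R&D has since released a patch that supports health checking. If the fault recurs, the faulty device can be placed in MAD DOWN and isolated, so that the healthy device continues to carry traffic.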
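The difference between the two behaviors can be illustrated with a simplified model. This is only a sketch of the decision described in this case, not H3C's implementation: the `IrfFragment` class, the boolean `healthy` field, and both function names are hypothetical, and the real health check presumably evaluates more than a single flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IrfFragment:
    """One side of a split stack (hypothetical model)."""
    master_member_id: int
    healthy: bool  # assumption: health collapsed to a single flag


def survivor_by_member_id(fragments):
    """Early-version behavior as described in this case:
    the fragment with the smaller member ID stays active,
    the others go to MAD DOWN, regardless of health."""
    return min(fragments, key=lambda f: f.master_member_id)


def survivor_health_aware(fragments):
    """Patched behavior as described in this case: prefer a healthy
    fragment, and fall back to the smaller member ID as a tiebreaker."""
    # False sorts before True, so "not healthy" ranks healthy fragments first.
    return min(fragments, key=lambda f: (not f.healthy, f.master_member_id))


if __name__ == "__main__":
    slot1 = IrfFragment(master_member_id=1, healthy=False)  # CPU hung
    slot2 = IrfFragment(master_member_id=2, healthy=True)

    print("ID only      :", survivor_by_member_id([slot1, slot2]).master_member_id)
    print("health aware :", survivor_health_aware([slot1, slot2]).master_member_id)
```

With the inputs from this incident, the ID-only rule keeps the hung slot 1 active, while the health-aware rule keeps slot 2, which matches the outcome the patch is meant to achieve.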
Replace the slot 1 device and install the patch that supports health checking.