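On October 3, the IRF fabric formed by two 6800 devices split suddenly at 18:19 during normal operation, then recovered on its own at about 18:24 without any manual intervention: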
%Oct 3 18:19:27:691 2022 HK-FT-0201-E02-H6800QTH3-LA-01 BFD/4/BFD_MAD_INTERFACE_CHANGE_STATE: BFD MAD function enabled on Vlan-interface199 changed to the faulty state.
%Oct 3 18:19:29:987 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DRVPLAT/4/DrvDebug:
The port Forty1/0/53 can't receive irf pkt and has been changed to inactive status, please check.
%Oct 3 18:19:29:987 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DRVPLAT/4/DrvDebug:
The port Forty1/0/54 can't receive irf pkt, please check.
%Oct 3 18:19:40:615 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DRVPLAT/4/DrvDebug:
The port Forty1/0/54 can't receive irf pkt, please check. This message repeated 1 times in last 10 seconds.
%Oct 3 18:19:40:568 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/2/STM_LINK_TIMEOUT: IRF port 1 went down because the heartbeat timed out.
%Oct 3 18:19:40:573 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/3/STM_LINK_DOWN: IRF port 1 went down.
%Oct 3 18:19:40:650 2022 HK-FT-0201-E02-H6800QTH3-LA-01 LAGG/6/LAGG_INACTIVE_PHYSTATE: Member port XGE2/0/1 of aggregation group BAGG1 changed to the inactive state, because the physical state of the port is down.
%Oct 3 18:19:40:663 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DEV/3/BOARD_REMOVED: Board was removed from slot 2, type is S6800-54QT.
%Oct 3 18:22:41:332 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/6/STM_LINK_UP: IRF port 1 came up.
%Oct 3 18:22:41:636 2022 HK-FT-0201-E02-H6800QTH3-LA-01 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE1/0/54 changed to up.
%Oct 3 18:22:41:637 2022 HK-FT-0201-E02-H6800QTH3-LA-01 IFNET/5/LINK_UPDOWN: Line protocol state on the interface FortyGigE1/0/54 changed to up.
%Oct 3 18:23:22:656 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DEV/2/BOARD_STATE_FAULT: Board state changed to Fault on slot 2, type is unknown.
%Oct 3 18:23:29:277 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DEV/5/BOARD_STATE_NORMAL: Board state changed to Normal on slot 2, type is S6800-54QT.
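The timeline in the log above can be checked with a little arithmetic. The following is a minimal Python sketch, not a vendor tool: it assumes each log header has the form `%Mon D HH:MM:SS:mmm YYYY host MODULE/severity/MNEMONIC:` as seen in the excerpt, and the names `LOG_RE`, `parse_line` and `first_seen` are hypothetical helpers introduced only for illustration. The four sample lines are copied from the excerpt.

```python
import re
from datetime import datetime

# Header layout assumed from the excerpt:
# %Oct 3 18:19:27:691 2022 <host> MODULE/severity/MNEMONIC: text
LOG_RE = re.compile(
    r"^%(?P<mon>\w{3})\s+(?P<day>\d{1,2}) "
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}):(?P<ms>\d{3}) "
    r"(?P<year>\d{4}) (?P<host>\S+) "
    r"(?P<module>\w+)/(?P<sev>\d)/(?P<mnemonic>\w+):"
)


def parse_line(line):
    """Return (timestamp, mnemonic) for a log header, or None for other lines."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(
        f"{m['year']} {m['mon']} {m['day']} {m['h']}:{m['m']}:{m['s']}.{m['ms']}",
        "%Y %b %d %H:%M:%S.%f",
    )
    return stamp, m["mnemonic"]


def first_seen(lines):
    """Map each mnemonic to the timestamp of its first occurrence."""
    seen = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            stamp, mnemonic = parsed
            seen.setdefault(mnemonic, stamp)
    return seen


# Four lines copied from the excerpt above.
SAMPLE = [
    "%Oct 3 18:19:27:691 2022 HK-FT-0201-E02-H6800QTH3-LA-01 BFD/4/BFD_MAD_INTERFACE_CHANGE_STATE: BFD MAD function enabled on Vlan-interface199 changed to the faulty state.",
    "%Oct 3 18:19:40:573 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/3/STM_LINK_DOWN: IRF port 1 went down.",
    "%Oct 3 18:22:41:332 2022 HK-FT-0201-E02-H6800QTH3-LA-01 STM/6/STM_LINK_UP: IRF port 1 came up.",
    "%Oct 3 18:23:29:277 2022 HK-FT-0201-E02-H6800QTH3-LA-01 DEV/5/BOARD_STATE_NORMAL: Board state changed to Normal on slot 2, type is S6800-54QT.",
]

events = first_seen(SAMPLE)
mad_to_down = events["STM_LINK_DOWN"] - events["BFD_MAD_INTERFACE_CHANGE_STATE"]
down_to_up = events["STM_LINK_UP"] - events["STM_LINK_DOWN"]
down_to_normal = events["BOARD_STATE_NORMAL"] - events["STM_LINK_DOWN"]

print(f"MAD faulty -> IRF port down : {mad_to_down.total_seconds():.3f} s")
print(f"IRF port down -> port up    : {down_to_up.total_seconds():.3f} s")
print(f"IRF port down -> slot normal: {down_to_normal.total_seconds():.3f} s")
```

On these four lines, the BFD MAD session went faulty roughly 13 seconds before the IRF port was declared down, and Slot 2 was back in the Normal state a little under four minutes after the port went down.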
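If this had been an IRF split followed by a merge, the recorded reboot cause should be STM or IRF merge. The reboot cause recorded after the restart, however, is warm reboot. This strongly suggests that Slot 2 itself had already failed before the split: at that point neither the IRF heartbeat packets nor the MAD function could be exchanged normally, and Slot 2 detected its own fault and rebooted itself without ever perceiving an IRF split. Looking further at the records from before the reboot, the reboot log contains no information, and secondary_log holds only a single boot record, just as it would after a cold boot. The History interrupt output does contain entries, but they are garbled, which indicates that the records in high memory were either never written or were written incorrectly.
So although this was a warm reboot, the recorded information looks like that of a cold reboot. Based on past experience, this pattern is generally seen only on devices with a PCIe problem. To eliminate the hidden risk, we recommend returning Slot 2 for analysis.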
Slot 2:
Uptime is 0 weeks,0 days,1 hour,26 minutes
S6800-54QT with 2 Processor
BOARD TYPE: S6800-54QT
DRAM: 4096M bytes
FLASH: 1024M bytes
PCB 1 Version: VER.A
PCB 2 Version: VER.A
FPGA Version: NONE
Bootrom Version: 229
CPLD 1 Version: 002
CPLD 2 Version: 002
Release Version: H3C S6800-54QT-2609
Patch Version: Release 2609H09
Reboot Cause: WarmReboot
[SubSlot 0] 48XGT+6QSFP Plus
None of the display kernel information was recorded.
===============display kernel deadloop 20 verbose slot 2 ===============
No information to display.
=================================================================
===============display kernel exception 10 verbose slot 2 ===============
No information to display.
=================================================================
===============display kernel reboot 20 verbose slot 2 ===============
No information to display.
The interrupt information recorded before the reboot is garbled:
===============display reboot interrupt 2===============
============ History interrupt info of slot 2 ============
Last 200 interrupts time:
Irq ID jiffies year/month/day hour:min:sec Count 1
12 0x3c65cda6 1766/02/03 10:19:05 1
09 0x73c6a59ac 2022/10/03 10:19:05 1
10 0x73c68962f 0998/10/03 10:19:04 1
13 0x73c662d32 1830/10/01 02:19:07 1
13 0x71c69a00f 1894/10/03 10:19:08 1
09 0x73c6a556b 1254/10/03 00:03:09 1
05 0x73c6578ea 2022/10/03 10:19:10 1
13 0x73c663439 1254/10/03 10:19:10 1
10 0x73c6a7ab8 2020/10/01 08:03:10 1
10 0x73c6a7e90 1478/10/03 08:19:11 1
13 0x3c61e42b 1958/10/03 10:03:04 1
13 0x3c64f189 1734/10/03 08:19:13 1
05 0x73c6a057e 1382/10/03 02:03:12 1
13 0x73c2a041d 2018/10/03 10:01:14 1
13 0x73c658c3c 1254/10/01 08:03:13 1
08 0x73c09ce40 2018/08/03 10:19:15 1
08 0x73c649410 1990/10/02 10:02:01 1
05 0x73c2a47b5 1732/10/03 00:18:57 1
13 0x734693223 1988/10/01 10:19:00
=================================================================
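One way to make "garbled" concrete is to count how many rows carry a date other than the day of the incident. The sketch below is only an illustration under stated assumptions: the row layout (IRQ ID, hexadecimal jiffies, date, time, optional count) is inferred from the output above, the expected date 2022/10/03 is taken from the syslog excerpt, and `ROW_RE` and `implausible_rows` are hypothetical names. The sample rows are the first seven rows of the table, copied as-is.

```python
import re

# Row layout inferred from the output above:
# <irq> <hex jiffies> <yyyy/mm/dd> <hh:mm:ss> [count]
ROW_RE = re.compile(
    r"^(?P<irq>\d+)\s+(?P<jiffies>0x[0-9a-fA-F]+)\s+"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})(?:\s+(?P<count>\d+))?\s*$"
)

# Assumption: valid entries should be dated on the day of the incident.
EXPECTED_DATE = (2022, 10, 3)


def implausible_rows(rows, expected=EXPECTED_DATE):
    """Return rows that fail to parse or whose date differs from `expected`."""
    bad = []
    for row in rows:
        m = ROW_RE.match(row.strip())
        if m is None:
            bad.append(row)
            continue
        date = (int(m["year"]), int(m["month"]), int(m["day"]))
        if date != expected:
            bad.append(row)
    return bad


# First seven rows of the History interrupt table above.
SAMPLE_ROWS = [
    "12 0x3c65cda6 1766/02/03 10:19:05 1",
    "09 0x73c6a59ac 2022/10/03 10:19:05 1",
    "10 0x73c68962f 0998/10/03 10:19:04 1",
    "13 0x73c662d32 1830/10/01 02:19:07 1",
    "13 0x71c69a00f 1894/10/03 10:19:08 1",
    "09 0x73c6a556b 1254/10/03 00:03:09 1",
    "05 0x73c6578ea 2022/10/03 10:19:10 1",
]

suspect = implausible_rows(SAMPLE_ROWS)
print(f"{len(suspect)} of {len(SAMPLE_ROWS)} rows carry a date other than 2022/10/03")
```

In this sample, five of the seven rows fail the date check, which is consistent with the description of the interrupt history as corrupted rather than merely sparse.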
The jiffies and task-switch information from before the reboot is empty.
===============display reboot last-time 2===============
slot 2 Last Running Info:
CPU Time jiffies TASK
=================================================================
============================================================
The secondary log buffer also contains the output of only one boot:
===============printk irq trace info on slot 2===============
===============printk log buffer info on slot 2===============
<4>---------- secondary log buffer [1] ----------
<6>[ 0.000000] 0:Initializing cgroup subsys cpuset <6>0:done
<6>[ 0.000000] 0:Initializing cgroup subsys cpu <6>0:done
<5>[ 0.000000] 0:Linux version (none) (CMO@host) (gcc version 4.4.5 20100516 (prerelease) (GCC) ) #2 SMP Tue Nov 7 16:00:00 CST 2017
<4>[ 0.000000] Standard version 0.50
Return the Slot 2 device for repair.