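On May 28 the site brought several newly added groups of servers online, and at 19:01 that day the network was interrupted for about 10 minutes. The relevant logs are as follows: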
%May 28 18:48:51:115 2025 B_YNKMA_SVR_AS17 OPTMOD/4/MODULE_IN: Ten-GigabitEthernet1/0/44: The transceiver is 10G_BASE_SR_SFP.
%May 28 19:01:21:005 2025 B_YNKMA_SVR_AS17 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
%May 28 19:03:54:005 2025 B_YNKMA_SVR_AS17 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
%May 28 19:06:33:003 2025 B_YNKMA_SVR_AS17 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
%May 28 19:08:07:260 2025 B_YNKMA_SVR_AS17 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
%May 28 19:01:24:456 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
%May 28 19:03:04:456 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
%May 28 19:04:45:899 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
%May 28 19:06:24:456 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
%May 28 19:08:04:456 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
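To see how regularly the aging events recur on each switch, the log lines above can be parsed and the gaps between consecutive STP_BPDU_RECEIVE_EXPIRY events computed. The following is a minimal sketch, not a vendor tool: the regular expression is written only against the log format quoted above, and the function name `expiry_gaps` is hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches the timestamp and device name of the STP expiry lines quoted above.
LOG_RE = re.compile(
    r"^%(?P<ts>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}):(?P<ms>\d{3}) (?P<year>\d{4}) "
    r"(?P<dev>\S+) STP/5/STP_BPDU_RECEIVE_EXPIRY:"
)

def expiry_gaps(lines):
    """Return {device: [seconds between consecutive expiry events]}."""
    events = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue  # ignore lines that are not STP expiry events
        base = datetime.strptime(m["year"] + " " + m["ts"], "%Y %b %d %H:%M:%S")
        events[m["dev"]].append(base + timedelta(milliseconds=int(m["ms"])))
    gaps = {}
    for dev, stamps in events.items():
        stamps.sort()
        gaps[dev] = [round((b - a).total_seconds(), 3)
                     for a, b in zip(stamps, stamps[1:])]
    return gaps

# Sample lines rebuilt from the timestamps in the logs above.
MSG = ("STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 "
       "received no BPDU within the rcvdInfoWhile interval. "
       "Information of the port aged out.")
SAMPLE = [
    f"%May 28 {t} 2025 {dev} {MSG}"
    for dev, t in [
        ("B_YNKMA_SVR_AS17", "19:01:21:005"),
        ("B_YNKMA_SVR_AS17", "19:03:54:005"),
        ("B_YNKMA_SVR_AS17", "19:06:33:003"),
        ("B_YNKMA_SVR_AS17", "19:08:07:260"),
        ("B_YNKMA_SVR_AS18", "19:01:24:456"),
        ("B_YNKMA_SVR_AS18", "19:03:04:456"),
        ("B_YNKMA_SVR_AS18", "19:04:45:899"),
        ("B_YNKMA_SVR_AS18", "19:06:24:456"),
        ("B_YNKMA_SVR_AS18", "19:08:04:456"),
    ]
]

if __name__ == "__main__":
    for dev, g in expiry_gaps(SAMPLE).items():
        print(dev, g)
```

On these logs, AS18 ages out at intervals of roughly 100 seconds, while the gaps on AS17 vary between about 94 and 159 seconds.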
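1. A detailed analysis of the device logs and the diagfile shows that, during the fault, the Ruijie device received inferior BPDUs from the S6520X side and blocked its interface as a result. On the S6520X side, at the corresponding times both devices received no BPDU within the timeout (rcvdInfoWhile) interval, so the uplink port BAGG1 switched from root port to designated port and began sending BPDUs. The upstream device received these inferior BPDUs and blocked its port, which caused the fault.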
Port Bridge-Aggregation1
Role change : DESI->ROOT
Time : 2025/05/28 19:08:15
Port priority : 4096.80e4-5543-8600 2000 32768.905d-7c5b-ec80 0
61440.0074-9c59-125c 128.53 128.785
Designated priority : 4096.80e4-5543-8600 2001 32768.905d-7c5b-ec80 0
32768.905d-7c5b-ec80 128.785 128.785
Port Bridge-Aggregation1
Role change : ROOT->DESI (Aged)
Time : 2025/05/28 19:08:07
Port priority : 4096.80e4-5543-8600 2000 32768.905d-7c5b-ec80 0
61440.0074-9c59-125c 128.53 128.785
Designated priority : 32768.905d-7c5b-ec80 0 32768.905d-7c5b-ec80 0
32768.905d-7c5b-ec80 128.785 128.785
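Why the BPDUs sent after aging count as inferior can be illustrated by comparing bridge IDs. The sketch below assumes that the first field of each priority vector in the output above is the root bridge ID in `priority.MAC` form; that reading of the output is an assumption, and the helper names are hypothetical. The comparison rule itself (lower priority wins, then lower MAC address) is the standard spanning tree rule.

```python
def parse_bridge_id(text):
    """Split 'priority.xxxx-xxxx-xxxx' into a comparable (priority, mac) tuple."""
    priority, mac = text.split(".", 1)
    return int(priority), int(mac.replace("-", ""), 16)

def is_superior(a, b):
    """Return True if bridge ID a beats bridge ID b (numerically lower wins)."""
    return parse_bridge_id(a) < parse_bridge_id(b)

# Root ID learned from the upstream device (port priority vector above).
REAL_ROOT = "4096.80e4-5543-8600"
# Root ID the switch advertises after aging (designated priority vector above),
# assumed here to be the switch's own bridge ID.
SELF_AS_ROOT = "32768.905d-7c5b-ec80"

if __name__ == "__main__":
    print(is_superior(REAL_ROOT, SELF_AS_ROOT))
```

Under that assumption, once the port information ages out the switch advertises a root with priority 32768, which loses to the priority 4096 root the upstream device already knows.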
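2. At the time of the fault, the diagnostic information contains no log alarms for protocol packets exceeding the rate limit, and the driver-level records of packets sent up to the CPU show no protocol packet drops. This rules out STP packet loss caused by protocol packets flooding the CPU. In addition, the LLDP neighbor relationship with the Ruijie device did not age out at the time of the fault, which indicates that the link state was normal.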
===============debug rxtx softcar show slot 1===============
ID Type RcvPps Rcv_All DisPkt_All Pps Dyn Swi Hash Am Apps
37 STP 0 70153832 0 100 S On SMAC 8 1024
===============debug rxtx softcar show slot 1===============
ID Type RcvPps Rcv_All DisPkt_All Pps Dyn Swi Hash Am Apps
37 STP 0 70166211 0 100 S On SMAC 8 1024
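The two softcar snapshots can be compared by subtracting the counters. This is a minimal sketch that assumes the column order shown in the header line above; `stp_counters` is a hypothetical helper.

```python
def stp_counters(row):
    """Extract Rcv_All and DisPkt_All from one softcar row.

    Assumed columns: ID Type RcvPps Rcv_All DisPkt_All Pps Dyn Swi Hash Am Apps
    """
    fields = row.split()
    return {"type": fields[1], "rcv_all": int(fields[3]), "dis_all": int(fields[4])}

# The two STP rows quoted above, in the order they appear.
before = stp_counters("37 STP 0 70153832 0 100 S On SMAC 8 1024")
after = stp_counters("37 STP 0 70166211 0 100 S On SMAC 8 1024")

received = after["rcv_all"] - before["rcv_all"]  # STP packets accepted between snapshots
dropped = after["dis_all"] - before["dis_all"]   # STP packets discarded between snapshots

if __name__ == "__main__":
    print(f"received={received} dropped={dropped}")
```

Between the two snapshots the receive counter grows while the discard counter stays at zero, consistent with the statement that the software rate limiter did not drop STP packets.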
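3. The device's CPU-bound queues were also checked. STP protocol packets are sent up through queue 5, and no drops caused by queue congestion were found on that queue.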
[B_YNKMA_SVR_AS17-probe]debug rxtx coscar show slot 1
Index RcvPkt DisPkt RcvPkt/s DisPkt/s PPS
0 0 0 0 0 1000
1 542975298 2 2 0 1000
2 300239 0 0 0 1000
3 6229744 0 0 0 1000
4 5555374 0 2 0 1000
5 129507755 0 7 0 1000
6 0 0 0 0 1000
7 0 0 0 0 1000
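4. In summary, no fault-related anomaly was found on the S6520X side. During the fault the devices could not receive STP packets from the upstream device, so the port's STP information aged out and the port switched from root port to designated port.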
1. Reproduce the fault and capture packets at the same time to locate exactly where the BPDUs are being lost.
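Once captures are available, the BPDU arrival times at each capture point can be scanned for gaps longer than the receive timeout. The sketch below assumes a hello time of 2 seconds and a timeout of 3 times the hello time, which are the standard defaults and not values confirmed for this site. The function name and the sample timestamps are hypothetical.

```python
def find_bpdu_gaps(arrivals, hello=2.0, multiplier=3):
    """Return (previous, next) arrival pairs whose spacing exceeds hello * multiplier.

    arrivals: BPDU capture timestamps in seconds, in any order.
    hello, multiplier: assumed defaults; replace with the configured values.
    """
    limit = hello * multiplier
    ordered = sorted(arrivals)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]

if __name__ == "__main__":
    # Synthetic timestamps for illustration only, not taken from a real capture.
    sample = [0.0, 2.0, 4.0, 6.0, 14.0, 16.0]
    print(find_bpdu_gaps(sample))
```

Running the same check on captures taken at the upstream device's egress and at the S6520X ingress would show whether the BPDUs were never sent or were lost in between.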