STP State Anomaly on an S6520X at a Customer Site

Published 2025-10-16

Problem Description

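On May 28, the site added several new groups of servers. At 19:01 that day, the network was interrupted for about 10 minutes. The relevant logs from the two S6520X switches (AS17 and AS18) are as follows: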

%May 28 18:48:51:115 2025 B_YNKMA_SVR_AS17 OPTMOD/4/MODULE_IN: Ten-GigabitEthernet1/0/44: The transceiver is 10G_BASE_SR_SFP.

%May 28 19:01:21:005 2025 B_YNKMA_SVR_AS17 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.

%May 28 19:03:54:005 2025 B_YNKMA_SVR_AS17 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.

%May 28 19:06:33:003 2025 B_YNKMA_SVR_AS17 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.

%May 28 19:08:07:260 2025 B_YNKMA_SVR_AS17 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.

 

%May 28 19:01:24:456 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.

%May 28 19:03:04:456 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.

%May 28 19:04:45:899 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.

%May 28 19:06:24:456 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.

%May 28 19:08:04:456 2025 B_YNKMA_SVR_AS18 STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 received no BPDU within the rcvdInfoWhile interval. Information of the port aged out.
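To make the timing of these expiry events easier to compare across the two switches, the log lines can be parsed and the interval between consecutive events computed. The following is a minimal sketch, not a vendor tool: the regular expression and the function names `expiry_times` and `gaps_seconds` are illustrative assumptions, and the sample input is limited to the first two entries from each device above.

```python
import re
from datetime import datetime

# Matches the header of an STP_BPDU_RECEIVE_EXPIRY log line as shown above.
# Groups: "May 28 19:01:21", milliseconds, year, hostname.
LOG_RE = re.compile(
    r"^%(\w{3} \d{1,2} \d{2}:\d{2}:\d{2}):(\d{3}) (\d{4}) (\S+) "
    r"STP/5/STP_BPDU_RECEIVE_EXPIRY:"
)


def expiry_times(lines):
    """Return {hostname: [datetime, ...]} for every expiry log line."""
    result = {}
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue
        stamp, millis, year, host = m.groups()
        # %b parses English month abbreviations in the default C locale.
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        ts = ts.replace(microsecond=int(millis) * 1000)
        result.setdefault(host, []).append(ts)
    return result


def gaps_seconds(times):
    """Seconds between consecutive timestamps."""
    return [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]


MSG = ("STP/5/STP_BPDU_RECEIVE_EXPIRY: Instance 0's port Bridge-Aggregation1 "
       "received no BPDU within the rcvdInfoWhile interval. "
       "Information of the port aged out.")
sample = [
    f"%May 28 19:01:21:005 2025 B_YNKMA_SVR_AS17 {MSG}",
    f"%May 28 19:03:54:005 2025 B_YNKMA_SVR_AS17 {MSG}",
    f"%May 28 19:01:24:456 2025 B_YNKMA_SVR_AS18 {MSG}",
    f"%May 28 19:03:04:456 2025 B_YNKMA_SVR_AS18 {MSG}",
]

gaps = {host: gaps_seconds(ts) for host, ts in expiry_times(sample).items()}
print(gaps)
```

Feeding the full log excerpts through the same functions shows that both switches lost BPDUs repeatedly within the same window, roughly 19:01 to 19:08.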

Process Analysis

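1. Detailed analysis of the device logs and diagfile information showed that, at the time of the fault, the Ruijie device received inferior BPDUs from the S6520X side, which caused its interface to be blocked. On the S6520X side, at the corresponding times both switches received no BPDU within the keepalive (rcvdInfoWhile) interval, so the uplink port BAGG1 changed from root port to designated port and started sending BPDUs. The upstream device then received these inferior BPDUs and blocked its port, which caused the fault.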

Port Bridge-Aggregation1           

   Role change         : DESI->ROOT            

   Time                : 2025/05/28 19:08:15

   Port priority       : 4096.80e4-5543-8600 2000 32768.905d-7c5b-ec80 0

                         61440.0074-9c59-125c 128.53 128.785

   Designated priority : 4096.80e4-5543-8600 2001 32768.905d-7c5b-ec80 0

                         32768.905d-7c5b-ec80 128.785 128.785

 

 Port Bridge-Aggregation1           

   Role change         : ROOT->DESI (Aged)           

   Time                : 2025/05/28 19:08:07

   Port priority       : 4096.80e4-5543-8600 2000 32768.905d-7c5b-ec80 0

                         61440.0074-9c59-125c 128.53 128.785

   Designated priority : 32768.905d-7c5b-ec80 0 32768.905d-7c5b-ec80 0

                         32768.905d-7c5b-ec80 128.785 128.785
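The reason the upstream device treats the BPDUs sent after aging as inferior can be illustrated by comparing the two priority vectors in the output above. The sketch below assumes the fields are ordered as in the MSTP priority vector (root ID, external path cost, regional root ID, internal path cost, designated bridge ID, designated port ID); that field interpretation is an assumption and is not stated in the source. The helper names are hypothetical.

```python
def bridge_id(text):
    """Parse 'priority.mac' (e.g. '4096.80e4-5543-8600') into a comparable tuple."""
    prio, mac = text.split(".", 1)
    return (int(prio), int(mac.replace("-", ""), 16))


def port_id(text):
    """Parse 'priority.number' (e.g. '128.53') into a comparable tuple."""
    prio, num = text.split(".")
    return (int(prio), int(num))


def priority_vector(root, ext_cost, regional_root, int_cost, bridge, port):
    # Tuples compare field by field, left to right, like the assumed
    # spanning tree priority-vector comparison. Numerically lower is better.
    return (
        bridge_id(root), ext_cost,
        bridge_id(regional_root), int_cost,
        bridge_id(bridge), port_id(port),
    )


def is_superior(a, b):
    """True if vector a is better (numerically lower) than vector b."""
    return a < b


# "Port priority" line: the information last learned from the upstream device.
learned = priority_vector(
    "4096.80e4-5543-8600", 2000, "32768.905d-7c5b-ec80", 0,
    "61440.0074-9c59-125c", "128.53",
)
# "Designated priority" line after ROOT->DESI (Aged): the switch advertises
# itself as the root.
after_aging = priority_vector(
    "32768.905d-7c5b-ec80", 0, "32768.905d-7c5b-ec80", 0,
    "32768.905d-7c5b-ec80", "128.785",
)

print(is_superior(learned, after_aging))
```

Under this reading, the root ID alone decides the comparison: priority 4096 beats 32768, so what the S6520X advertises after aging is worse than what the upstream side already knows.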

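2. At the time of the fault, the diagnostic information contains no log alarms for protocol packets exceeding the rate limit, and the driver-level records of packets sent to the CPU show no protocol packet drops. This rules out the possibility that a flood of protocol packets hit the CPU and caused STP packets to be dropped. In addition, the LLDP neighbor relationship with the Ruijie device did not age out at the time of the fault, which indicates that the link state was normal.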

===============debug rxtx softcar show slot 1=============== 

ID  Type                RcvPps Rcv_All    DisPkt_All Pps  Dyn Swi Hash Am Apps

37  STP                 0      70153832   0          100  S   On  SMAC 8 1024

 

===============debug rxtx softcar show slot 1=============== 

ID  Type                RcvPps Rcv_All    DisPkt_All Pps  Dyn Swi Hash Am Apps

37  STP                 0      70166211   0          100  S   On  SMAC 8 1024
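The two snapshots above can be compared programmatically to confirm that STP packets kept arriving at the CPU and that the drop counter did not move. This is a minimal sketch: the column meanings are inferred from the header row (`Rcv_All` as total received, `DisPkt_All` as total discarded), and `parse_softcar_row` is a hypothetical helper, not a device command.

```python
def parse_softcar_row(row):
    """Split one 'debug rxtx softcar show' data row into named counters."""
    fields = row.split()
    return {
        "id": int(fields[0]),
        "type": fields[1],
        "rcv_all": int(fields[3]),      # assumed: total packets received
        "dis_pkt_all": int(fields[4]),  # assumed: total packets discarded
    }


first = parse_softcar_row(
    "37  STP                 0      70153832   0          100  S   On  SMAC 8 1024"
)
second = parse_softcar_row(
    "37  STP                 0      70166211   0          100  S   On  SMAC 8 1024"
)

received_delta = second["rcv_all"] - first["rcv_all"]
dropped_delta = second["dis_pkt_all"] - first["dis_pkt_all"]
print(received_delta, dropped_delta)
```

The receive counter grows between the snapshots while the discard counter stays at zero, which is consistent with the conclusion that the software rate limiter did not drop STP packets.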

 

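3. The queues used to send packets to the CPU were also checked. STP protocol packets are sent up through queue 5, and no packet loss caused by queue congestion was found on that queue.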

[B_YNKMA_SVR_AS17-probe]debug rxtx coscar show slot 1

 

 Index        RcvPkt           DisPkt           RcvPkt/s     DisPkt/s     PPS 

 0            0                0                0            0            1000

 1            542975298        2                2            0            1000

 2            300239           0                0            0            1000

 3            6229744          0                0            0            1000

 4            5555374          0                2            0            1000

 5            129507755        0                7            0            1000

 6            0                0                0            0            1000

 7            0                0                0            0            1000
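The per-queue check can be expressed as a short script that flags any queue with a non-zero discard counter. This is a sketch under the assumption that the columns follow the header row shown above (Index, RcvPkt, DisPkt, RcvPkt/s, DisPkt/s, PPS); the function names are illustrative.

```python
def parse_coscar(rows):
    """Return {queue_index: {'rcv_pkt': int, 'dis_pkt': int}} from data rows."""
    queues = {}
    for row in rows:
        fields = row.split()
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # skip blank lines and the header row
        queues[int(fields[0])] = {
            "rcv_pkt": int(fields[1]),
            "dis_pkt": int(fields[2]),
        }
    return queues


def queues_with_drops(queues):
    """Return {queue_index: discarded_count} for queues that dropped packets."""
    return {i: q["dis_pkt"] for i, q in queues.items() if q["dis_pkt"] > 0}


rows = [
    " Index        RcvPkt           DisPkt           RcvPkt/s     DisPkt/s     PPS",
    " 0            0                0                0            0            1000",
    " 1            542975298        2                2            0            1000",
    " 2            300239           0                0            0            1000",
    " 3            6229744          0                0            0            1000",
    " 4            5555374          0                2            0            1000",
    " 5            129507755        0                7            0            1000",
    " 6            0                0                0            0            1000",
    " 7            0                0                0            0            1000",
]

queues = parse_coscar(rows)
print(queues[5]["dis_pkt"], queues_with_drops(queues))
```

In this output only queue 1 shows any discards (2 packets), while queue 5, which carries STP, shows none.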

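4. In summary, no fault-related anomaly was found on the S6520X side. During the fault the switches could not receive STP packets from the upstream device, so the STP information on the port aged out and the port switched from root port to designated port.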

Solution

1. Reproduce the fault and capture packets at the same time to locate exactly where the BPDUs are being lost.
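Once a capture is available, one way to locate the loss is to export the arrival times of BPDUs at each capture point and look for silences long enough to trigger aging. The sketch below assumes a hello time of 2 seconds and a timeout factor of 3 (so a gap longer than 6 seconds would expire the received information); these values are common defaults and are not confirmed by the source. The function name and the sample timestamps are purely illustrative.

```python
def find_bpdu_gaps(timestamps, hello_time=2.0, timeout_factor=3):
    """Return (start, end, duration) for every silence longer than the timeout.

    timestamps: BPDU arrival times in seconds, in capture order.
    hello_time and timeout_factor are assumed values; adjust them to the
    actual configuration of the devices under test.
    """
    threshold = hello_time * timeout_factor
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur, cur - prev))
    return gaps


# Illustrative timestamps only: BPDUs every 2 s with one 8 s silence.
example = [0.0, 2.0, 4.0, 12.0, 14.0]
print(find_bpdu_gaps(example))
```

Running the same check on captures taken at the upstream device's egress and at the S6520X ingress would show whether the gap already exists when the BPDUs leave the upstream device or only appears on the receiving side.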