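Two new access switches were connected to interfaces 1/0/29 and 1/0/30 of the core S9820. After they were connected, OSPF neighbor relationships dropped and CPU usage became abnormally high.
1. Check the logs. According to the on-site description, the fault started after the two access switches were connected to the S9820, which matches the log entries.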
%@223553%Mar 18 19:31:46:524 2026 JB081533-S9820-10.255.161.45 IFNET/3/PHY_UPDOWN: Physical state on the interface HundredGigE1/0/30 changed to up.
%@223554%Mar 18 19:31:46:525 2026 JB081533-S9820-10.255.161.45 LAGG/6/LAGG_LACP_RECEIVE_TIMEOUT: LACPDU reception timed out on member port HGE1/0/30 in aggregation group BAGG30.
%@223555%Mar 18 19:31:46:670 2026 JB081533-S9820-10.255.161.45 LLDP/6/LLDP_CREATE_NEIGHBOR: Nearest bridge agent neighbor created on port HundredGigE1/0/30 (IfIndex 90), neighbor's chassis ID is 1451-7ea6-3f22, port ID is HundredGigE1/0/29.
%@223556%Mar 18 19:31:46:694 2026 JB081533-S9820-10.255.161.45 LAGG/6/LAGG_ACTIVE: Member port HGE1/0/30 of aggregation group BAGG30 changed to the active state.
%@223557%Mar 18 19:31:46:695 2026 JB081533-S9820-10.255.161.45 IFNET/5/LINK_UPDOWN: Line protocol state on the interface HundredGigE1/0/30 changed to up.
%@223558%Mar 18 19:31:46:706 2026 JB081533-S9820-10.255.161.45 IFNET/3/PHY_UPDOWN: Physical state on the interface Bridge-Aggregation30 changed to up.
%@223559%Mar 18 19:31:46:706 2026 JB081533-S9820-10.255.161.45 IFNET/5/LINK_UPDOWN: Line protocol state on the interface Bridge-Aggregation30 changed to up.
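2. The logs then show an ARP packet queue buildup, an STP event, and rising CPU usage. The site is configured with a 1 s OSPF hello interval, and the adjacency is torn down when no hello arrives within four hello intervals (4 s). The hello timestamps recorded in the OSPF log show one hello sent at 19:31:55 and the next not until 19:32:00. The hellos were not sent on time, so the neighbor timed out, and the late transmission was caused by high CPU usage.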
%@223560%Mar 18 19:31:46:717 2026 JB081533-S9820-10.255.161.45 M-LAG/6/MLAG_IFEVT_MLAGIF_SELECTED: Local M-LAG interface Bridge-Aggregation30 in M-LAG group 30 has Selected member ports.
%@223561%Mar 18 19:31:46:933 2026 JB081533-S9820-10.255.161.45 M-LAG/6/MLAG_IFEVT_MLAGIF_GLOBALUP: The state of M-LAG group 30 changed to up.
%@223562%Mar 18 19:31:53:424 2026 JB081533-S9820-10.255.161.45 STP/4/STP_DISPUTE: VLAN 100's port Bridge-Aggregation40 received an inferior BPDU from a designated port which is in forwarding or learning state. The designated bridge ID contained in the BPDU is 32768.10b3-d54c-ea27, and the designated port ID contained in the BPDU is 144.1.
%@223563%Mar 18 19:31:54:870 2026 JB081533-S9820-10.255.161.45 ARP/6/ARP_PKTQUE_ALERT: The current size of the ARP_PKT queue has reached 14896. Please check the network environment.
%@223564%Mar 18 19:32:00:608 2026 JB081533-S9820-10.255.161.45 OSPF/5/OSPF_NBR_CHG_REASON: OSPF 200 Area 0.0.0.0 Router 10.255.161.45(HGE1/0/51) CPU usage: 27%, IfMTU: 1500, Neighbor address: 10.255.191.193, NbrID:10.255.161.39 changed from Full to EXSTART because a SeqNumberMismatch event was triggered by the maste-slave relationship change at 2026-03-18 19:32:00:607.
Last 4 hello packets received at:
2026-03-18 19:31:55:323
2026-03-18 19:31:55:328
2026-03-18 19:31:55:328
2026-03-18 19:31:55:329
Last 4 hello packets sent at:
2026-03-18 19:31:51:814
2026-03-18 19:31:53:419
2026-03-18 19:31:55:319
2026-03-18 19:32:00:580
%@223565%Mar 18 19:32:00:608 2026 JB081533-S9820-10.255.161.45 OSPF/5/OSPF_NBR_CHG: OSPF 200 Neighbor 10.255.191.193(HundredGigE1/0/51) changed from FULL to EXSTART.
%@223566%Mar 18 19:32:02:053 2026 JB081533-S9820-10.255.161.45 OSPF/5/OSPF_NBR_CHG_REASON: OSPF 200 Area 0.0.0.0 Router 10.255.161.45(HGE1/0/51) CPU usage: 27%, IfMTU: 1500, Neighbor address: 10.255.191.193, NbrID:10.255.161.39 changed from ExStart to INIT because a 1-way hello packet was received at 2026-03-18 19:32:02:053.
Last 4 hello packets received at:
2026-03-18 19:32:02:049
2026-03-18 19:32:02:052
2026-03-18 19:32:02:053
2026-03-18 19:32:02:053
Last 4 hello packets sent at:
2026-03-18 19:31:51:814
2026-03-18 19:31:53:419
2026-03-18 19:31:55:319
2026-03-18 19:32:00:580
%@223567%Mar 18 19:32:03:943 2026 JB081533-S9820-10.255.161.45 OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 200 Neighbor 10.255.161.46(Route-Aggregation115) received SeqNumberMismatch and its state from FULL to EXSTART.
%@223568%Mar 18 19:32:10:192 2026 JB081533-S9820-10.255.161.45 OSPFV3/6/OSPFv3_LAST_NBR_DOWN: OSPFv3 200 Last neighbor down event: Router ID: 10.255.161.46 Local interface ID: 2158 Remote interface ID: 2158 Reason: DeadInterval timer expired.
%@223569%Mar 18 19:32:10:192 2026 JB081533-S9820-10.255.161.45 OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 200 Neighbor 10.255.161.46(Route-Aggregation115) received InactivityTimer and its state from EXCHANGE to DOWN.
%@223570%Mar 18 19:32:11:908 2026 JB081533-S9820-10.255.161.45 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 200 Last neighbor down event: Router ID: 10.255.161.46 Local address: 10.255.191.201 Remote address: 10.255.191.202 Reason: DeadInterval timer expired.
%@223571%Mar 18 19:32:11:909 2026 JB081533-S9820-10.255.161.45 OSPF/5/OSPF_NBR_CHG_REASON: OSPF 200 Area 0.0.0.0 Router 10.255.161.45(RAGG115) CPU usage: 99%, IfMTU: 1500, Neighbor address: 10.255.191.202, NbrID:10.255.161.46 changed from Full to DOWN because the dead timer expired at 2026-03-18 19:32:11:908.
Last 4 hello packets received at:
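The timing argument in step 2 can be checked directly from the "Last 4 hello packets sent at" values in the log above. The following is a minimal sketch, not a vendor tool: the helper names `parse_ts` and `hello_gaps` are hypothetical, and the 4 s dead interval is taken from the site description (four 1 s hello intervals).

```python
from datetime import datetime

# Dead interval as described for this site: four 1 s hello intervals.
DEAD_INTERVAL_S = 4.0

# "Last 4 hello packets sent at" values copied from the OSPF_NBR_CHG_REASON log.
SENT = [
    "2026-03-18 19:31:51:814",
    "2026-03-18 19:31:53:419",
    "2026-03-18 19:31:55:319",
    "2026-03-18 19:32:00:580",
]


def parse_ts(ts: str) -> datetime:
    # The log separates milliseconds with a colon; %f accepts the 3-digit fraction.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S:%f")


def hello_gaps(timestamps: list[str]) -> list[float]:
    # Seconds elapsed between each pair of consecutive hello transmissions.
    times = [parse_ts(t) for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


gaps = hello_gaps(SENT)
late = [g for g in gaps if g > DEAD_INTERVAL_S]

for g in gaps:
    flag = "exceeds dead interval" if g > DEAD_INTERVAL_S else "ok"
    print(f"gap {g:.3f} s: {flag}")
```

Every gap is already longer than the configured 1 s hello interval, and the last one (between 19:31:55:319 and 19:32:00:580) is longer than the 4 s dead interval, which is consistent with the neighbor timing out.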
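3. Because the on-site fault had already recovered, the root cause of the abnormal CPU usage could not be determined directly, so a test environment was built in the lab to reproduce it. CPU usage spiked briefly at the moment a downstream device was connected. After several attempts the OSPF adjacency dropped: the device CPU was busy, keepalives were lost, and the same logs as in the production network appeared. Further analysis of the CPU spike showed that the device generated a large number of ARP packets at the moment of connection.
4. Why the ARP broadcasts are sent: when a device is connected, the S9820 sees an STP state change, and the state change triggers a refresh of ARP probing. ARP probes are rate limited to 100 probes every three seconds, but the site runs PVST, where each VLAN has its own spanning tree, so every VLAN performs its own round of ARP broadcast probing.
Recommendation: move part of the services off this device so that its CPU no longer shows high spikes, or add devices to share the gateway load.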
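To illustrate why a per-VLAN trigger can overwhelm the limit, here is a rough back-of-the-envelope sketch. It assumes, as a simplification that the case does not confirm, that the limit of 100 probes per three seconds effectively applies to each VLAN's probing round rather than to the device as a whole. The function name `arp_probe_burst` and the VLAN count of 150 are hypothetical; the case does not state how many VLANs the site carries.

```python
def arp_probe_burst(vlan_count: int,
                    probes_per_window: int = 100,
                    window_s: float = 3.0) -> tuple[int, float]:
    """Estimate the aggregate ARP probe load when every VLAN probes at once.

    Assumption (not confirmed by the case): each VLAN's spanning tree
    triggers its own rate-limited batch of probes in the same window.
    Returns (total probes in one window, probes per second).
    """
    total = vlan_count * probes_per_window
    return total, total / window_s


# Hypothetical VLAN count, for illustration only.
total, per_second = arp_probe_burst(vlan_count=150)
print(f"{total} probes per window, about {per_second:.0f} probes/s")
```

Under this assumption the load grows linearly with the number of VLANs, so a limit that is modest for a single spanning tree can still produce a large burst on a PVST gateway.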