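S9827-128DH switch: interface 1/0/126 flaps up/down frequently and the BGP peer does not show up
The device is newly purchased and in first use. The transceiver module and cable have both been replaced, with no improvement. The peer reports the same errors, and both devices are the same model. The logs are as follows: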
%Jun 25 00:19:22:940 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to down.
%Jun 25 00:19:22:940 2026 XDC-AI-S9827-LF01(H05) IFNET/5/LINK_UPDOWN: Line protocol state on the interface FourHundredGigE1/0/126 changed to down.
%Jun 25 00:19:22:944 2026 XDC-AI-S9827-LF01(H05) DEV/2/INTERNALLINK_ALARM_CLEAR: Internal link alarm cleared. (PhysicalIndex=563, PhysicalName=FourHundredGigE1/0/126, RelativeResource=FourHundredGigE1/0/126, ErrorCode=482002, Reason=The interface Physical state changed to up.)
%Jun 25 00:19:22:945 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to up.
%Jun 25 00:19:22:946 2026 XDC-AI-S9827-LF01(H05) IFNET/5/LINK_UPDOWN: Line protocol state on the interface FourHundredGigE1/0/126 changed to up.
%Jun 25 00:19:23:074 2026 XDC-AI-S9827-LF01(H05) DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=563, PhysicalName=FourHundredGigE1/0/126, RelativeResource=FourHundredGigE1/0/126, ErrorCode=482002, Reason=The interface Physical state changed to down.)
%Jun 25 00:19:23:075 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to down.
%Jun 25 00:19:23:075 2026 XDC-AI-S9827-LF01(H05) IFNET/5/LINK_UPDOWN: Line protocol state on the interface FourHundredGigE1/0/126 changed to down.
%Jun 25 00:19:23:079 2026 XDC-AI-S9827-LF01(H05) DEV/2/INTERNALLINK_ALARM_CLEAR: Internal link alarm cleared. (PhysicalIndex=563, PhysicalName=FourHundredGigE1/0/126, RelativeResource=FourHundredGigE1/0/126, ErrorCode=482002, Reason=The interface Physical state changed to up.)
%Jun 25 00:19:23:081 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to up.
%Jun 25 00:19:23:081 2026 XDC-AI-S9827-LF01(H05) IFNET/5/LINK_UPDOWN: Line protocol state on the interface FourHundredGigE1/0/126 changed to up.
%Jun 25 00:19:23:119 2026 XDC-AI-S9827-LF01(H05) DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=563, PhysicalName=FourHundredGigE1/0/126, RelativeResource=FourHundredGigE1/0/126, ErrorCode=482002, Reason=The interface Physical state changed to down.)
%Jun 25 00:19:23:119 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to down.
%Jun 25 00:19:23:119 2026 XDC-AI-S9827-LF01(H05) IFNET/5/LINK_UPDOWN: Line protocol state on the interface FourHundredGigE1/0/126 changed to down.
The interface configuration is as follows:
interface FourHundredGigE1/0/126
port link-mode route
description "Link_to_XDC-AI-S9827-LF02(H07):FourHundredGigE1/0/126"
priority-flow-control enable
priority-flow-control no-drop dot1p 5
priority-flow-control deadlock enable
flow-interval 5
speed 400000
ip address 192.168.4.13 255.255.255.252
forwarding split-horizon
qos trust dscp
qos wfq byte-count
qos wfq af1 group 1 byte-count 2
qos wfq af2 group 1 byte-count 3
qos wfq af3 group 1 byte-count 15
qos wfq af4 group 1 byte-count 20
qos wfq ef group 1 byte-count 60
qos wfq cs6 group sp
qos wfq cs7 group sp
telemetry ifa role transit
qos wred queue 5 drop-level 0 low-limit 6000 high-limit 12000 discard-probability 40
qos wred queue 5 drop-level 1 low-limit 6000 high-limit 12000 discard-probability 40
qos wred queue 5 drop-level 2 low-limit 6000 high-limit 12000 discard-probability 40
qos wred queue 5 ecn
qos wred queue 5 weighting-constant 0
qos gts queue 6 cir 200000000 cbs 16000000
display interface:
FourHundredGigE1/0/126
Current state: UP
Line protocol state: UP
Description: "Link_to_XDC-AI-S9827-LF02(H07):FourHundredGigE1/0/126"
Bandwidth: 400000000 kbps
Link delay (up): 0 msec. Link delay (down): 0 msec
Maximum transmission unit: 1500
Allow jumbo frames to pass
Internet address: 192.168.4.13/30 (Primary)
IP packet frame type: Ethernet II, hardware address: 105e-ae5c-9215
IPv6 packet frame type: Ethernet II, hardware address: 105e-ae5c-9215
Loopback is not set
Media type is optical fiber, port hardware type is 400G_BASE_VR4_QSFP112
Port power is 12W
Fec counter Last 15mins uncorr errors: 0, corr errors: 181
Peak1 Fec counter uncorr errors: 2731, corr errors: 6535427 06/14/2026 03:56:26:579000
Peak2 Fec counter uncorr errors: 12091, corr errors: 5729468 06/18/2026 19:44:19:226000
PRE-FEC Last 30 seconds BER: 8.00e-08
SER[0]: 1.03e-07 SER[1]: 4.46e-07 SER[2]: 1.07e-07 SER[3]: 2.54e-06
Each serdes rate is:106.25 Gbps
Packets received of length [Byte]:
[64]: 3849519 [65-127]: 53717 [128-255]: 1
[256-511]: 215616 [512-1023]: 0 [1024-1518]: 0
[1519-2047]: 0 [2048-4095]: 0 [4096-9216]: 0 [9217-16383]: 0
Packets transmitted of length [Byte]:
[64]: 4945704 [65-127]: 53885 [128-255]: 1
[256-511]: 217268 [512-1023]: 0 [1024-1518]: 0
[1519-2047]: 0 [2048-4095]: 0 [4096-9216]: 0 [9217-16383]: 0
Port priority: 0
400Gbps-speed mode, full-duplex mode
Link speed type is force link, link duplex type is autonegotiation
Flow-control is not enabled
The Maximum Frame Length is 9216
Last link flapping: 0 hours 0 minutes 7 seconds
Last clearing of counters: Never
Last hardware down reason: The interface receives the remote fault from the peer device.
Current system time:2026-06-25 00:19:45 BeiJing+08:00:00
Last time when physical state changed to up:2026-06-25 00:19:37 BeiJing+08:00:00
Last time when physical state changed to down:2026-06-25 00:19:37 BeiJing+08:00:00
Peak input rate: 3305 bytes/sec, at 2026-06-24 23:16:19
Peak output rate: 5780 bytes/sec, at 2026-06-21 08:05:29
Last 5 seconds input: 1 packets/sec 334 bytes/sec 0%
Last 5 seconds output: 1 packets/sec 334 bytes/sec 0%
Traffic statistic: Not include Inter-frame Gaps and Preambles
Input (total): 4118853 packets, 330084990 bytes
62980 unicasts, 3839966 broadcasts, 215907 multicasts, 0 pauses
Input (normal): 4118853 packets, - bytes
62980 unicasts, 3839966 broadcasts, 215907 multicasts, 0 pauses
Input: 0 input errors, 0 runts, 0 giants, 0 throttles
0 CRC, 0 frame, - overruns, 0 aborts
- ignored, - parity errors
Output (total): 5216858 packets, 400855539 bytes
62101 unicasts, 4937167 broadcasts, 217590 multicasts, 0 pauses
Output (normal): 5216858 packets, - bytes
62101 unicasts, 4937167 broadcasts, 217590 multicasts, 0 pauses
Output: 0 output errors, - underruns, - buffer failures
0 aborts, 0 deferred, 0 collisions, 0 late collisions
0 lost carrier, - no carrier
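Fault analysis: S9827-128DH 400G port flapping and BGP peer failing to establish
I. Root causes (from the on-site logs and interface output)
1. Degraded signal quality at the physical layer (the most fundamental cause)
Key indicators from the interface output: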
Fec counter Last 15mins uncorr errors: 0, corr errors: 181
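Peak1 Fec counter uncorr errors: 2731, corr errors: 6535427 06/14/2026 03:56:26:579000
Peak2 Fec counter uncorr errors: 12091, corr errors: 5729468 06/18/2026 19:44:19:226000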
PRE-FEC Last 30 seconds BER: 8.00e-08
SER[3]: 2.54e-06
Last hardware down reason: The interface receives the remote fault from the peer device.
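Elevated symbol error rate on one lane: of the four SER lanes, SER[3] is at 2.54e-06, roughly 6 to 25 times the other three. The 1e-12 figure often quoted for 400G optics is a post-FEC target rather than a pre-FEC limit, so it is not the right yardstick for these values. What stands out is that the lanes are uneven, correctable FEC errors keep accumulating, and the recorded peaks reach 2731 and 12091 uncorrectable errors;
The device detects a remote fault reported by the peer, which takes the physical port down; the link then retrains and the port cycles up/down repeatedly;
Although the module and fiber were replaced, 400G QSFP112 VR4 links are extremely sensitive to end-face cleanliness, end-face loss, the condition of the fiber path, and electromagnetic interference in the cabinet, so swapping the cable or module alone does not necessarily fix optical-path attenuation.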
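To make the lane comparison above concrete, here is a minimal Python sketch that pulls the per-lane SER values and the pre-FEC BER out of pasted `display interface` text and reports which lane is worst. The function names are illustrative, and the parsing assumes the field layout shown in the question's output; other software versions may print these fields differently.

```python
import re

# Two lines copied from the `display interface` output in the question.
SAMPLE = """\
PRE-FEC Last 30 seconds BER: 8.00e-08
SER[0]: 1.03e-07 SER[1]: 4.46e-07 SER[2]: 1.07e-07 SER[3]: 2.54e-06
"""

# A decimal number with an optional exponent, e.g. 2.54e-06.
_NUM = r"([0-9.]+(?:[eE][+-]?\d+)?)"


def parse_ser(text):
    """Return {lane: symbol_error_rate} for every 'SER[n]: value' field."""
    pairs = re.findall(r"SER\[(\d+)\]:\s*" + _NUM, text)
    return {int(lane): float(value) for lane, value in pairs}


def parse_pre_fec_ber(text):
    """Return the pre-FEC BER as a float, or None if the field is absent."""
    match = re.search(r"PRE-FEC.*?BER:\s*" + _NUM, text)
    return float(match.group(1)) if match else None


def worst_lane(ser):
    """Return (lane, ratio), where ratio = worst SER / best SER."""
    lane = max(ser, key=ser.get)
    best = min(ser.values())
    ratio = float("inf") if best == 0 else ser[lane] / best
    return lane, ratio


if __name__ == "__main__":
    ser = parse_ser(SAMPLE)
    lane, ratio = worst_lane(ser)
    print(f"pre-FEC BER {parse_pre_fec_ber(SAMPLE):.2e}, "
          f"worst lane {lane}, {ratio:.1f}x the best lane")
```

A single lane that is consistently far worse than the rest usually points at one fiber or one optical channel rather than the whole link, which is why the cross-test in Step 2 is worth doing.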
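2. Forced speed combined with negotiated duplex (configuration issue)
The configuration hard-codes speed 400000 to force 400G, while the display output still shows the duplex type as autonegotiation: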
400Gbps-speed mode, full-duplex mode
Link speed type is force link, link duplex type is autonegotiation
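For 400G optical ports, negotiation should be turned off when the speed is forced. A negotiation mismatch can leave the SerDes on both ends unable to lock stably, which further aggravates the bit errors and flapping.
3. PFC deadlock detection amplifying the flapping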
priority-flow-control enable
priority-flow-control deadlock enable
flow-interval 5
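The link is currently almost idle. With PFC deadlock detection enabled where there is no congestion, even slight error-induced packet loss may trigger the deadlock mechanism and reset the port, which amplifies the flapping.
4. Direct cause of the BGP peer not coming up
BGP establishment depends on the Layer 3 interface staying up, but this interface goes up and down several times per second:
The line protocol keeps changing state, so the TCP port 179 session is repeatedly torn down and rebuilt;
When the port goes down the route is removed immediately and the BGP peer resets, so the OPEN exchange cannot complete;
Persistent bit errors on the link can also cause BGP packets to be dropped, interrupting the negotiation.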
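To put a number on "several times per second", the following Python sketch measures how long the interface actually stayed up between flaps, using the PHY_UPDOWN lines pasted in the question. The regular expression and function names are illustrative and assume the log format shown in this thread.

```python
import re
from datetime import datetime, timedelta

# PHY_UPDOWN lines copied from the question's log excerpt.
LOG = """\
%Jun 25 00:19:22:940 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to down.
%Jun 25 00:19:22:945 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to up.
%Jun 25 00:19:23:075 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to down.
%Jun 25 00:19:23:081 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to up.
%Jun 25 00:19:23:119 2026 XDC-AI-S9827-LF01(H05) IFNET/3/PHY_UPDOWN: Physical state on the interface FourHundredGigE1/0/126 changed to down.
"""

# Captures the timestamp and the new physical state.
PATTERN = re.compile(
    r"^%(\w{3} +\d+ \d\d:\d\d:\d\d:\d{3} \d{4}) \S+ IFNET/3/PHY_UPDOWN: "
    r".* changed to (up|down)\.",
    re.MULTILINE,
)


def phy_events(log):
    """Return [(datetime, 'up'|'down'), ...] in log order."""
    return [
        (datetime.strptime(stamp, "%b %d %H:%M:%S:%f %Y"), state)
        for stamp, state in PATTERN.findall(log)
    ]


def up_durations_ms(log):
    """Return how long each up period lasted, in whole milliseconds."""
    durations, went_up = [], None
    for when, state in phy_events(log):
        if state == "up":
            went_up = when
        elif went_up is not None:
            durations.append((when - went_up) // timedelta(milliseconds=1))
            went_up = None
    return durations


if __name__ == "__main__":
    print(up_durations_ms(LOG))
```

On the excerpt in the question the up periods last only tens of milliseconds, which leaves little chance for a TCP handshake plus a BGP OPEN/KEEPALIVE exchange to finish before the next drop.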
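II. Step-by-step remediation (adjust the configuration first to reduce flapping, then troubleshoot the optical path)
Step 1: Correct the 400G port negotiation settings (do this first to reduce the flap rate)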
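interface FourHundredGigE1/0/126
# With the speed forced, set duplex explicitly instead of leaving it negotiated
# (confirm that this port type accepts the command on your software version)
duplex full
# No congestion on this link: disable PFC deadlock detection so it cannot be triggered by mistake
undo priority-flow-control deadlock enable
# The link is idle: disable PFC temporarily while troubleshooting, and re-enable it as needed once traffic is present
undo priority-flow-control enable
Apply exactly the same negotiation settings to 1/0/126 on the peer switch of the same model.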
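Step 2: Find the hardware source of the errors in the optical path (addressing the high SER)
Clean the fiber end faces
Dust specks or scratches on a 400G VR4 end face raise the SER considerably. Clean the module end face and both ends of the patch cord with lint-free wipes and fiber-grade alcohol;
Check fiber loss
400GBASE-VR4 is a short-reach multimode interface specified for up to 50 m over OM4 (100 m is the SR4 figure), so confirm the link length is within the limit;
Measure the module's receive power with an optical power meter. It must fall within the vendor's recommended range (typically about -1 to -7 dBm; check the module datasheet). Receive power that is too low drives the SER up sharply;
Avoid electromagnetic interference
400G SerDes signals are high-frequency and sensitive. Route the fiber away from power cables, high-power UPS units, and cabinet fans, and do not crush or sharply bend it;
Cross-test the port
Move this 400G module and patch cord to another free 400G port on the device and see whether the SER errors disappear:
Errors disappear after the move: the original port is faulty in hardware;
Errors persist after the move: the attenuation is in the fiber or the module itself.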
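The receive-power check in this step can be scripted once the per-lane readings have been collected. The sketch below is a minimal Python helper, not a vendor tool: the default window of -7 to -1 dBm is the "typical" range quoted in this answer, the function names are illustrative, and the readings in the usage line are hypothetical. Substitute the thresholds from the module datasheet or its digital diagnostics.

```python
import math


def dbm_to_mw(dbm):
    """Convert optical power from dBm to milliwatts."""
    return 10 ** (dbm / 10)


def mw_to_dbm(mw):
    """Convert optical power from milliwatts to dBm."""
    return 10 * math.log10(mw)


def check_rx_power(lanes_dbm, low_dbm=-7.0, high_dbm=-1.0):
    """Classify each lane's receive power as 'low', 'ok', or 'high'.

    low_dbm and high_dbm default to the typical window quoted in the
    answer; they are an assumption, not a vendor specification.
    """
    results = []
    for lane, dbm in enumerate(lanes_dbm):
        if dbm < low_dbm:
            status = "low"
        elif dbm > high_dbm:
            status = "high"
        else:
            status = "ok"
        results.append((lane, dbm, status))
    return results


if __name__ == "__main__":
    # Hypothetical readings for a four-lane module, in dBm.
    for lane, dbm, status in check_rx_power([-2.5, -3.1, -2.8, -8.4]):
        print(f"lane {lane}: {dbm:+.1f} dBm ({dbm_to_mw(dbm):.3f} mW) {status}")
```

Checking per lane matters here because the SER figures in the question are also per lane; one weak lane can be hidden by a healthy average.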
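Step 3: Temporary port debounce to suppress the rapid up/down oscillation
interface FourHundredGigE1/0/126
# Enable link-state change suppression to filter out short flaps
# (per-direction syntax assumed from Comware 7; verify it in the command reference for your version)
link-delay up msec 3000
link-delay down msec 3000
Effect: port up/down events are reported after a 3-second delay, which keeps routes and BGP from oscillating.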
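The effect of the delay can be illustrated with a small simulation. This Python sketch is a simplified model, not Comware's actual implementation: it assumes a raw state change is reported only if the new state persists for the configured delay, and that raw events alternate between up and down. The event times are the PHY_UPDOWN timestamps from the question, expressed in milliseconds after 00:19:22:940; the end-of-observation time is hypothetical.

```python
def debounce(events, up_delay_ms, down_delay_ms, end_ms, initial="up"):
    """Return the state changes that survive the hold-down delay.

    events: time-ordered list of (t_ms, 'up'|'down') raw transitions.
    end_ms: end of the observation window; the last raw state is
            assumed to hold until then.
    A change is reported at t + delay only if the raw state stays
    unchanged for at least that delay.
    """
    reported = initial
    out = []
    for i, (t, state) in enumerate(events):
        next_t = events[i + 1][0] if i + 1 < len(events) else end_ms
        delay = up_delay_ms if state == "up" else down_delay_ms
        if state != reported and next_t - t >= delay:
            out.append((t + delay, state))
            reported = state
    return out


# Raw physical transitions taken from the question's log excerpt.
RAW = [(0, "down"), (5, "up"), (135, "down"), (141, "up"), (179, "down")]

if __name__ == "__main__":
    print("no delay:  ", debounce(RAW, 0, 0, end_ms=5000))
    print("3 s delay: ", debounce(RAW, 3000, 3000, end_ms=5000))
```

With no delay every raw transition reaches the routing layer; with a 3-second delay only a state that actually holds is reported. The delay hides the flaps from upper layers but does nothing for the underlying errors, so it is a stopgap while the optical path is investigated.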
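Step 4: Matching BGP-side damping (apply after the link is stable)
bgp X
# Do not withdraw routes immediately on an interface flap; allow a buffer period
dampening
# Raise the BGP keepalive and hold timers to ride out brief link disturbances
timer keepalive 30 hold 90
# Keep the peer for a while after the interface goes down, for faster reconnection
# (line kept as posted; Comware BGP normally uses peer commands, so confirm the exact form in the command reference)
neighbor 192.168.4.14 route-hold 60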
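III. Key verification commands (to confirm the fix)
1. Check the SER and FEC counters
display interface FourHundredGigE1/0/126
Watch the SER[n], PRE-FEC BER, and Fec counter corr/uncorr fields; after stable operation the values should stop climbing.
2. Check the port flap logs
display logbuffer | include PHY_UPDOWN
After the fix there should be no dense bursts of up/down entries.
3. Check the BGP peer state
display bgp peer ipv4
Once the link is stable the peer should reach the Established state.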
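When repeating the first check over time, it can help to extract the FEC counters programmatically instead of reading them by eye. The Python sketch below parses the three FEC lines as they appear in the question's `display interface` output; the function names are illustrative and the field layout is assumed to match that output.

```python
import re

# FEC lines copied from the `display interface` output in the question.
SAMPLE = """\
Fec counter Last 15mins uncorr errors: 0, corr errors: 181
Peak1 Fec counter uncorr errors: 2731, corr errors: 6535427 06/14/2026 03:56:26:579000
Peak2 Fec counter uncorr errors: 12091, corr errors: 5729468 06/18/2026 19:44:19:226000
"""


def parse_fec(text):
    """Return the last-15-minute counters and the recorded peaks.

    Result: {'recent': (uncorr, corr) or None,
             'peaks': {n: (uncorr, corr, timestamp_string)}}
    """
    recent = re.search(
        r"Fec counter Last 15mins uncorr errors:\s*(\d+), corr errors:\s*(\d+)",
        text,
    )
    peaks = re.findall(
        r"Peak(\d+) Fec counter uncorr errors:\s*(\d+), "
        r"corr errors:\s*(\d+)\s+(\S+ \S+)",
        text,
    )
    return {
        "recent": tuple(int(x) for x in recent.groups()) if recent else None,
        "peaks": {int(n): (int(u), int(c), when) for n, u, c, when in peaks},
    }


def recent_window_clean(fec):
    """True if the last 15 minutes recorded no uncorrectable errors."""
    return fec["recent"] is not None and fec["recent"][0] == 0


if __name__ == "__main__":
    fec = parse_fec(SAMPLE)
    print("recent (uncorr, corr):", fec["recent"])
    for n, (uncorr, corr, when) in sorted(fec["peaks"].items()):
        print(f"peak {n}: uncorr={uncorr} corr={corr} at {when}")
    print("last window clean:", recent_window_clean(fec))
```

Note that the sample shows a clean last-15-minute window alongside large historical peaks, so a single snapshot is not enough; the counters need to stay clean across several windows before the link is considered stable.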
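IV. Summary
Root problem: heavy errors on the 400G optical path, visible in the SER lanes; the peer reports a remote fault and the port flaps repeatedly;
Configuration factors making it worse: forced speed alongside negotiated duplex, and PFC deadlock detection amplifying link drops;
The BGP failure is a knock-on effect of the link flapping; BGP recovers on its own once the link is stable;
Since replacing the module and fiber did not help, focus on end-face cleanliness, receive power, transmission distance, and electromagnetic interference.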