Upstream -- (10GE) SR88-F (GE2/3/2) -- downstream device
%Sep 1 11:36:17:725 2023 **_SR8803_01 IFMON/4/OUTPUT_BUFFER_DROP_RECOVERY: -Slot=2; The number of output buffer drop packets dropped the lower threshold: Interface name=GigabitEthernet2/3/2, upper threshold 1000, lower threshold 100, number of output buffer drop packets 0, Interval 10s.
%Sep 1 14:33:57:680 2023 **_A0102_SR8803_01 IFMON/4/OUTPUT_BUFFER_DROP_THRESHOLD: -Slot=2; The number of output buffer drop packets exceeded the high threshold: Interface name=GigabitEthernet2/3/2, upper threshold 1000, lower threshold 100, number of output buffer drop packets 1188, Interval 10s.
%Sep 1 14:34:07:680 2023 **_A0102_SR8803_01 IFMON/4/OUTPUT_BUFFER_DROP_RECOVERY: -Slot=2; The number of output buffer drop packets dropped the lower threshold: Interface name=GigabitEthernet2/3/2, upper threshold 1000, lower threshold 100, number of output buffer drop packets 0, Interval 10s.
%Sep 1 15:14:17:723 2023 **_A0102_SR8803_01 IFMON/4/OUTPUT_BUFFER_DROP_THRESHOLD: -Slot=2; The number of output buffer drop packets exceeded the high threshold: Interface name=GigabitEthernet2/3/2, upper threshold 1000, lower threshold 100, number of output buffer drop packets 6024, Interval 10s.
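During traffic peaks, the on-site user noticed that services became sluggish. The device log contained the messages above, indicating packet loss on the router, and the times of the service problems roughly matched the times of the packet-loss messages.
The fault appears only during business-hour traffic peaks, which indicates a strong correlation with traffic volume.
Observing interface traffic shows that the main traffic pattern is a relatively large volume received on 10GE interfaces and forwarded out a GE interface.
Several high-speed interfaces (XGE2/1/1 and XGE2/1/2) send packets to one low-speed interface, GE2/3/2, and traffic is heavy during peaks (the interface's peak 15-second average is close to 700 Mbps). Frequent traffic bursts are therefore likely, and the outbound buffer of the low-speed interface can run short. Once the buffer is exhausted, GE2/3/2 drops packets, and when the drops exceed 1000 per 10 seconds the alarm is logged.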
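The alarm and recovery messages in the log behave like a two-threshold (hysteresis) check on the per-interval drop count. The sketch below illustrates that idea using the upper threshold 1000 and lower threshold 100 from the log. The function name and the exact comparison rules (strictly above the upper threshold, strictly below the lower one) are assumptions, not the device's documented logic.

```python
def alarm_events(drops_per_interval, upper=1000, lower=100):
    """Return (interval_index, event) pairs for a two-threshold alarm.

    Assumed rule: raise the alarm when drops in one interval exceed
    `upper`; clear it when drops in a later interval fall below `lower`.
    """
    alarmed = False
    events = []
    for index, drops in enumerate(drops_per_interval):
        if not alarmed and drops > upper:
            alarmed = True
            events.append((index, "THRESHOLD"))
        elif alarmed and drops < lower:
            alarmed = False
            events.append((index, "RECOVERY"))
    return events


# Drop counts per 10 s interval; 1188, 0 and 6024 come from the log above,
# the other values are made up for illustration.
print(alarm_events([0, 1188, 0, 500, 6024]))
```

With this rule an interval with 500 drops produces no message, which is why moderate loss can go unreported in the log.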
<**_SR8803_01>dis counters rate interface
Usage: Bandwidth utilization in percentage
Interface       InUsage(%)  InTotal(pps)  OutUsage(%)  OutTotal(pps)  Description
GE2/3/1               0.01             5         0.01              5
GE2/3/2               0.78          7492        20.13          32292  // traffic outbound
GE2/3/3               0.15           769         0.16            717
GE2/3/4               0.01            19         0.00              0
XGE2/1/1              1.77         21677         0.13           8408
XGE2/1/1.398          1.69         20967         0.13           8387  // traffic inbound
XGE2/1/2              0.37         11474         0.01             75
XGE2/1/2.1398         0.37         12097         0.01             70  // traffic inbound
GigabitEthernet2/3/2
Bandwidth: 1000000 kbps
……
Peak input rate: 17510909 bytes/sec, at 2023-09-06 13:09:46
Peak output rate: 86886935 bytes/sec, at 2023-09-06 11:25:02 //86886935*8=695,095,480bps
Last 15 seconds input: 4983 packets/sec 1036493 bytes/sec 8291947 bits/sec 0.83%
Last 15 seconds output: 34727 packets/sec 21087695 bytes/sec 168701563 bits/sec 16.87%
Input (total): 28999484459 packets, 9379537862849 bytes
28999483983 unicasts, 454 broadcasts, 2 multicasts, - pauses
Input (normal): 28999484439 packets, - bytes
- unicasts, - broadcasts, - multicasts, 0 pauses
Input: 0 input errors, 0 runts, 0 giants, 0 throttles
0 CRC, 0 frame, 0 overruns, - aborts
0 ignored, - parity errors
Output (total): 130293116963 packets, 84511958603373 bytes
130292774293 unicasts, 13 broadcasts, 342640 multicasts, - pauses
Output (normal): 130293116963 packets, 84511958603373 bytes
- unicasts, - broadcasts, - multicasts, 0 pauses
Output: 0 output errors, - giants, - underruns, 700316 buffer failures
0 aborts, 0 deferred, - collisions, 0 late collisions
- lost carrier, - no carrier
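The annotation on the peak output rate converts bytes/sec to bits/sec. A minimal sketch of that arithmetic follows, also expressing the result as a share of the 1000000 kbps interface bandwidth. The function name is illustrative only.

```python
def utilization(bytes_per_sec, bandwidth_kbps):
    """Convert a byte rate to bits/sec and to percent of interface bandwidth."""
    bits_per_sec = bytes_per_sec * 8
    percent = 100.0 * bits_per_sec / (bandwidth_kbps * 1000)
    return bits_per_sec, percent


# Peak output rate and bandwidth taken from the GigabitEthernet2/3/2 output.
bits, percent = utilization(86886935, 1000000)
print(bits, round(percent, 2))  # 695095480 69.51
```

Even at its recorded peak the interface averaged about 70% of line rate, so the buffer failures cannot be explained by sustained overload alone.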
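The root cause is that the high-speed ports have more bandwidth than the low-speed port, and they transmit without regard for what the low-speed port can carry.
Traffic rate also fluctuates constantly. The figures normally observed are time-averaged; in practice, as long as the device has enough forwarding performance and the interface bandwidth is not exceeded, traffic is forwarded as sharp bursts.
During peak hours (average rate around 700 Mbps), it is fairly common for the instantaneous burst rate arriving on the 10GE ports to exceed 1 Gbps, and a GE interface cannot handle such bursts without loss.
When a GE interface meets a burst, it buffers the excess, but once the buffer is exceeded it starts dropping packets.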
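The effect described above can be illustrated with a toy queue model: traffic arrives at 10 Gbit/s in bursts, drains at 1 Gbit/s, and anything beyond the buffer is dropped. This is only a sketch under stated assumptions. The 100-microsecond time step, the 1,000,000-byte buffer size, and the burst patterns are hypothetical and do not describe the SR88's real buffer.

```python
def simulate(burst_steps, period_steps, periods,
             in_bytes_per_step=125_000,   # 10 Gbit/s during one 100 us step
             out_bytes_per_step=12_500,   # 1 Gbit/s during one 100 us step
             buffer_bytes=1_000_000):     # hypothetical egress buffer size
    """Return (offered_bytes, dropped_bytes) for a periodic on/off burst."""
    queue = offered = dropped = 0
    for step in range(period_steps * periods):
        # The source sends at full 10GE rate for the first `burst_steps`
        # steps of each period and is silent for the rest.
        arriving = in_bytes_per_step if step % period_steps < burst_steps else 0
        offered += arriving
        queue += arriving
        queue -= min(queue, out_bytes_per_step)   # GE port drains the queue
        if queue > buffer_bytes:                  # overflow is tail-dropped
            dropped += queue - buffer_bytes
            queue = buffer_bytes
    return offered, dropped


# Both patterns average 700 Mbit/s over 0.1 s; only the burst length differs.
print(simulate(burst_steps=7, period_steps=100, periods=10))
print(simulate(burst_steps=14, period_steps=200, periods=5))
```

In this model the 0.7 ms bursts fit in the buffer and nothing is lost, while the 1.4 ms bursts at the same average rate overflow it. That is the sense in which the averaged counters hide the cause of the drops.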
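1. Change the downstream egress to a 10GE interface as well.
2. Change the downstream egress to an aggregation group and add member interfaces to share the bursts. (This works for traffic with evenly distributed characteristics; it helps little where hashing is uneven, such as a single large flow, and may require adding further interfaces or adjusting the hash parameters.)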
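The caveat on option 2 comes from per-flow load sharing: every packet of one flow hashes to the same member link. The sketch below shows this with CRC32 as a stand-in hash. The device's actual hash algorithm and input fields are not given in this case, and the flow keys and rates are invented for illustration.

```python
import zlib


def member_load(flows, members):
    """Sum the Mbps carried by each member link under per-flow hashing.

    `flows` is an iterable of (flow_key, mbps). CRC32 is only a stand-in
    for whatever hash the device really uses.
    """
    load = {member: 0 for member in range(members)}
    for key, mbps in flows:
        member = zlib.crc32(key.encode("utf-8")) % members
        load[member] += mbps
    return load


# 700 hypothetical small flows of 1 Mbps each, versus one 700 Mbps flow.
many_small = [(f"10.0.{i // 250}.{i % 250}->10.1.0.1", 1) for i in range(700)]
one_large = [("10.0.0.1->10.1.0.1", 700)]

print(member_load(many_small, 2))
print(member_load(one_large, 2))
```

The single large flow always lands entirely on one member, so that link still sees the full burst regardless of how many members the group has.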