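Wireless terminals on a campus network experienced sluggish Internet access. When a terminal pinged its gateway (the BRAS), small packets went through normally, but pings larger than 94 bytes failed. The on-site topology is roughly as follows: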
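1. After receiving the problem report, we first collected the device's diagnostic and log information and connected remotely to the site for troubleshooting. Checking the device configuration and diagnostics revealed no obvious configuration errors. We also deployed traffic statistics to analyze how the failing packets were forwarded, but because the core S10506X had an outbound packet filter configured with the hardware-count parameter, outbound packets could not be counted on the switch. After this initial investigation, the site disconnected the interconnect ports between the master chassis of the core S10506X and the master BRAS to restore service, and the fault cleared. The packet filter configuration in question was: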
#
packet-filter 3001 global inbound hardware-count
packet-filter 3001 global outbound hardware-count
# With this parameter, traffic that hits the packet filter cannot be counted by other QoS policies
acl advanced 3001
rule 1005 permit ip
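# The default action of a packet filter on the switch is permit, so rule 1005 is redundant. It also causes all outbound traffic to hit this packet filter, which distorts the QoS traffic statistics.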
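The effect of the redundant rule can be illustrated with a toy model in Python. This is a simplified sketch, not the switch's actual matching logic: `Rule`, `first_hit`, `verdict` and the packet dictionary are hypothetical names introduced only for illustration. The model captures just the two points noted above: a catch-all `permit ip` rule is hit by every IP packet, and without it packets hit no rule and fall through to the default permit action.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    rule_id: int
    action: str                      # "permit" or "deny"
    matches: Callable[[dict], bool]  # predicate over a packet description


def first_hit(rules, packet) -> Optional[Rule]:
    """Return the first rule the packet hits, or None if it hits no rule."""
    for rule in sorted(rules, key=lambda r: r.rule_id):
        if rule.matches(packet):
            return rule
    return None


def verdict(rules, packet) -> str:
    """Apply the hit rule's action, or the default action (permit)."""
    hit = first_hit(rules, packet)
    return hit.action if hit else "permit"


# "rule 1005 permit ip": matches any IP packet (hypothetical packet encoding)
acl_3001 = [Rule(1005, "permit", lambda pkt: pkt.get("l3") == "ip")]
icmp_reply = {"l3": "ip", "proto": "icmp"}
```

In both cases the packet is permitted, but only with rule 1005 present does it count as a hit on the packet filter, which is the situation that interfered with the traffic statistics.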
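2. After the redundant packet filter configuration was removed, traffic statistics on the switch could count both inbound and outbound packets. At 23:00 on April 9, the site reconnected the interconnect ports between the BRAS and the core switch, and the fault was reproduced: on some PCs, pings to the BRAS gateway address failed with 95-byte packets and succeeded with packets smaller than 95 bytes. Based on the traffic statistics, the forwarding path for terminal pings to the BRAS gateway was determined as follows.
Taking the test terminal 1.1.2.28 (VLAN 3300, MAC 1111-1111-924E) and the BRAS gateway address 1.1.2.1 (MAC 1111-1111-B002) as an example:
Request path:
The core switch receives the ping packet from the terminal on downlink Bridge-Aggregation3 and forwards it at Layer 2 to the BRAS. The BRAS matches PBR on routed sub-interface .3300 and sends the packet back to the switch, which forwards it at Layer 3 to the FW. The FW forwards it at Layer 3 back to the switch in VLAN 4001, and the switch then forwards it at Layer 2 to the BRAS.
Return path:
The BRAS does a Layer 3 route lookup on .4001, then sends the ICMP reply to the switch tagged with VLAN 3300, and the switch forwards it at Layer 2 to the terminal.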
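The two paths above can be written down as data to make the asymmetry easier to see. The following Python sketch is only an illustration of the description in this case; the `Hop` type and the `switch_passes` helper are hypothetical names, not part of any device tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    device: str  # device handling the packet at this step
    action: str  # what that device does with the packet


# ICMP echo request, terminal -> BRAS gateway
REQUEST_PATH = [
    Hop("switch", "L2 forward from Bridge-Aggregation3 to the BRAS"),
    Hop("bras", "PBR on sub-interface .3300 returns the packet to the switch"),
    Hop("switch", "L3 forward to the FW"),
    Hop("fw", "L3 forward back to the switch in VLAN 4001"),
    Hop("switch", "L2 forward to the BRAS"),
]

# ICMP echo reply, BRAS gateway -> terminal
REPLY_PATH = [
    Hop("bras", "L3 route lookup on .4001, reply tagged with VLAN 3300"),
    Hop("switch", "L2 forward to the terminal"),
]


def switch_passes(path):
    """Count how many times the core switch forwards the packet."""
    return sum(1 for hop in path if hop.device == "switch")
```

The request crosses the core switch three times, while the reply crosses it only once and only at Layer 2, which is why the later analysis concentrates on Layer 2 forwarding of the reply.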
At this point, the member ports of the switch's uplink and downlink aggregation groups were all Selected as expected:
Aggregate Interface: Bridge-Aggregation14
Local:
Port Status Priority Index Oper-Key Flag
FGE1/0/0/1(R) S 32768 1 1 {ACDEF}
FGE1/0/0/2 S 32768 45 1 {ACDEF}
FGE2/0/0/1 S 32768 14 1 {ACDEF}
FGE2/0/0/2 S 32768 46 1 {ACDEF}
Aggregate Interface: Bridge-Aggregation3
Local:
Port Status Priority Index Oper-Key Flag
XGE1/3/0/25 S 32768 7 12 {ACDEF}
XGE1/3/0/41 S 32768 11 12 {ACDEF}
XGE1/3/0/42 S 32768 24 12 {ACDEF}
XGE2/3/0/41 S 32768 33 12 {ACDEF}
XGE2/3/0/42 S 32768 41 12 {ACDEF}
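When there are many member ports, the same check can be scripted. The sketch below is a minimal Python example that assumes the output has exactly the column layout shown above and that status `S` means Selected; the regular expression and the function names are illustrative, not an official parser.

```python
import re

# One member-port line: port name, optional "(R)" mark, status,
# priority, index, oper-key and the flag set in braces.
MEMBER_RE = re.compile(
    r"^(?P<port>[A-Za-z]+[\d/]+)(?:\((?P<mark>\w)\))?\s+"
    r"(?P<status>[A-Z])\s+(?P<priority>\d+)\s+(?P<index>\d+)\s+"
    r"(?P<oper_key>\d+)\s+\{(?P<flags>[A-Z]*)\}$"
)


def parse_members(text):
    """Return one dict per member-port line; other lines are skipped."""
    members = []
    for line in text.splitlines():
        match = MEMBER_RE.match(line.strip())
        if match:
            members.append(match.groupdict())
    return members


def unselected_ports(text):
    """List ports whose status is anything other than 'S'."""
    return [m["port"] for m in parse_members(text) if m["status"] != "S"]


SAMPLE = """\
Aggregate Interface: Bridge-Aggregation14
Local:
Port Status Priority Index Oper-Key Flag
FGE1/0/0/1(R) S 32768 1 1 {ACDEF}
FGE1/0/0/2 S 32768 45 1 {ACDEF}
FGE2/0/0/1 S 32768 14 1 {ACDEF}
FGE2/0/0/2 S 32768 46 1 {ACDEF}
"""
```

An empty result from `unselected_ports(SAMPLE)` corresponds to the observation above that all member ports were Selected.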
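3. According to the detailed traffic statistics on the switch, the ICMP request from the terminal enters on Bridge-Aggregation3 and the switch sends it to the BRAS from port 1/0/0/2 or 1/0/0/1. The switch then receives, on port 1/0/0/1, the ICMP request that the BRAS returned after matching PBR, and sends it to the FW. The packet coming back from the FW is sent to the BRAS from port 1/0/0/1 or 1/0/0/2, and the BRAS sends its ICMP reply to port 2/0/0/1 of the switch. The statistics show that the ICMP reply entered on port 2/0/0/1 but was never forwarded out. We then mirrored the ICMP reply packets to the CPU on chassis 2 and printed them with dis rxtx. Comparing the ICMP replies for the 64-byte and 95-byte pings showed no abnormality in packet encapsulation and no difference in the HG header.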
*Apr 10 01:31:42:054 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
rx parse info: mdcid=1, userport=0, tagnum=1, outervlan=3300, vlan=3300, srcslot=23,srcunit=0,srcphyport=1,rxflags=0x54,pktype=33,acltype=44,ifindex=0x8b8
*Apr 10 01:31:42:054 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
LPS_RecvIn(): unit=0, ifindex=2232, uiDrvRxFlags=0x54, opcode=0x1, dest_port=0x0, dest_mod=64, uiMod=64, length=110, uiPktPriority_Platform=1, bOnlyMirrorPkt=1, pstSdkPkt=c000000064a0c4b8, pfLinkRcv=ffffffffc3b89790.
*Apr 10 01:31:42:054 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
drv_rxtx_rx_to_platform():unit=0,ifindex=2232, uiDrvRxFlags=0x54, length=110, priority=1, pfLinkRcv=ffffffffc3b89790
*Apr 10 01:31:42:059 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
drv_rxtx_rx():unit=0, port=1,pvlan=3300,cos=2,reason=0x0,length=110,sMod=64,sPort=1,dMod=64,dPort=0,opcode=1,rxMatched=0,rx_untagged=2,uiDrvRxFlags=0x44,srcvp=4294967295,mcgroup=16384,reasOns=0x0000 0000 0000 0000 0000 0000 0000 ,hghdr: fb 00 40 00 40 01 6c 00 29 00 00 00 0c e4 01 00
*Apr 10 01:31:42:059 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
From board 23: received packet from chip0,port1,reason=0x0,cos=2,sMod=64,sPort=1,len=110,Matched=0,time=2,src_vp=-1
*Apr 10 01:31:42:059 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
-----------------------------------------------------
0000 90 78 41 e7 92 4e 70 c6 dd 09 b0 02 81 00 0c e4
0010 08 00 45 00 00 5c d8 1f 40 00 3c 01 91 1b 0a a4
0020 e0 01 0a a4 e0 1c 00 00 70 04 00 01 1d 7f 61 62
0030 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72
0040 73 74 75 76 77 61 62 63 64 65 66 67 68 69 6a 6b
0050 6c 6d 6e 6f 70 71 72 73 74 75 76 77 61 62 63 64
0060 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72
-----------------------------------------------------
*Apr 10 01:31:22:698 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
rx parse info: mdcid=1,userport=0,tagnum=1,outervlan=3300, vlan=3300, srcslot=23,srcunit=0,srcphyport=1,rxflags=0x54,pktype=33,acltype=44,ifindex=0x8b8
*Apr 10 01:31:22:698 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
LPS_RecvIn(): unit=0, ifindex=2232, uiDrvRxFlags=0x54, opcode=0x1, dest_port=0x0, dest_mod=64, uiMod=64, length=141, uiPktPriority_Platform=1, bOnlyMirrorPkt=1, pstSdkPkt=c000000064a02e68, pfLinkRcv=ffffffffc3b89790.
*Apr 10 01:31:22:698 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
drv_rxtx_rx_to_platform():unit=0,ifindex=2232, uiDrvRxFlags=0x54, length=141, priority=1, pfLinkRcv=ffffffffc3b89790
*Apr 10 01:31:27:705 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
drv_rxtx_rx():unit=0,port=1,pvlan=3300,cos=2, reason=0x0,length=141, sMod=64, sPort=1,dMod=64,dPort=0,opcode=1,rxMatched=0,rx_untagged=2,uiDrvRxFlags=0x44,srcvp=4294967295,mcgroup=16384,reasOns=0x0000 0000 0000 0000 0000 0000 0000 ,hghdr: fb 00 40 00 40 01 6c 00 29 00 00 00 0c e4 01 00
*Apr 10 01:31:27:705 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
From board 23: received packet from chip0,port1,reason=0x0,cos=2, sMod=64, sPort=1, len=141, Matched=0,time=2,src_vp=-1
*Apr 10 01:31:27:706 2025 S10506X DRVPLAT/7/RxTxDebug: -MDC=1-Chassis=2-Slot=0;
-----------------------------------------------------
0000 90 78 41 e7 92 4e 70 c6 dd 09 b0 02 81 00 0c e4
0010 08 00 45 00 00 7b d8 1d 40 00 3c 01 90 fe 0a a4
0020 e0 01 0a a4 e0 1c 00 00 a2 a3 00 01 1d 7d 61 62
0030 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72
0040 73 74 75 76 77 61 62 63 64 65 66 67 68 69 6a 6b
0050 6c 6d 6e 6f 70 71 72 73 74 75 76 77 61 62 63 64
0060 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74
0070 75 76 77 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d
0080 6e 6f 70 71 72 73 74 75 76 77 61 62 63
-----------------------------------------------------
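The two dumps can be cross-checked against the ping sizes with a short script. The Python sketch below is an illustration only: it assumes the dump format shown above (an offset followed by hex bytes), a single 802.1Q tag, and lengths that exclude the FCS, as in the `length=` fields of the log. The function names are hypothetical.

```python
DUMP_64 = """
0000 90 78 41 e7 92 4e 70 c6 dd 09 b0 02 81 00 0c e4
0010 08 00 45 00 00 5c d8 1f 40 00 3c 01 91 1b 0a a4
0020 e0 01 0a a4 e0 1c 00 00 70 04 00 01 1d 7f 61 62
0030 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72
0040 73 74 75 76 77 61 62 63 64 65 66 67 68 69 6a 6b
0050 6c 6d 6e 6f 70 71 72 73 74 75 76 77 61 62 63 64
0060 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72
"""


def parse_dump(text):
    """Turn 'offset b0 b1 ...' hex dump lines into raw bytes."""
    data = bytearray()
    for line in text.strip().splitlines():
        fields = line.split()
        data.extend(int(byte, 16) for byte in fields[1:])  # skip the offset
    return bytes(data)


def decode(frame):
    """Decode a single-tagged Ethernet/IPv4/ICMP frame into a few fields."""
    if frame[12:14] != b"\x81\x00" or frame[16:18] != b"\x08\x00":
        raise ValueError("expected one 802.1Q tag followed by IPv4")
    ip = frame[18:]
    ihl = (ip[0] & 0x0F) * 4                    # IPv4 header length in bytes
    total_len = int.from_bytes(ip[2:4], "big")  # IPv4 total length
    icmp = ip[ihl:total_len]
    return {
        "dst_mac": frame[0:6].hex(":"),
        "src_mac": frame[6:12].hex(":"),
        "vlan": int.from_bytes(frame[14:16], "big") & 0x0FFF,
        "src_ip": ".".join(str(b) for b in ip[12:16]),
        "dst_ip": ".".join(str(b) for b in ip[16:20]),
        "icmp_type": icmp[0],                   # 0 = echo reply
        "icmp_payload_len": len(icmp) - 8,      # minus the ICMP header
        "frame_len": len(frame),
    }


def expected_frame_len(ping_payload):
    """Ethernet (14) + 802.1Q tag (4) + IPv4 (20) + ICMP (8) + payload."""
    return 14 + 4 + 20 + 8 + ping_payload
```

Under these assumptions a 64-byte ping gives a 110-byte frame and a 95-byte ping gives a 141-byte frame, which matches the `length=110` and `length=141` values in the two log excerpts.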
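4. Because the switch only forwards the ICMP reply at Layer 2, we checked the chip-level MAC entry for the packet's destination MAC and found it normal. On the LSUM2CQGS12SG0 card of the S10500X, packets with the same outer header are by default hashed by ingress port when forwarded across the internal switching fabric, so packets entering on the same source port always take the same HG link. According to the internal HG traffic statistics, the ICMP reply from the BRAS entering on port 2/0/0/1 takes the following internal path:
1. When pinging with 64-byte packets, inbound traffic was counted on internal HG port 2 of the fabric card in slot 6 of chassis 2, and on internal HG port 26 of slot 3 of chassis 2. Two pings were run (4 packets each) and the counters incremented correctly both times. The test terminal showed the pings succeeding: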
[S10506X-probe]debug qacl show packet pattern c 2 s 6 c 0 2 in
========
Acl-Type Statistics based PktPattern, Stage IFP, Pipe 0, SinglePort, Installed, Active
Prio Mjr/Sub 512/7, Group 3 [3], Slice/Idx 3/2, Entry 9, Double: 2306/3074
Rule Match --------
In Port: 2
Source IP: 10.164.224.1, 255.255.255.255
Dest IP: 10.164.224.28, 255.255.255.255
Actions --------
Account mode packets, green and non-green
Accounting: Hi 4, LO 0
[S10506X-probe]debug qacl show packet pattern c 2 s 3 c 0 26 in
========
Acl-Type Statistics based PktPattern, Stage IFP, Pipe 0, SinglePort, Installed, Active
Prio Mjr/Sub 512/7, Group 3 [3], Slice/Idx 9/22, Entry 1724, Triple: 6934/7702/8470
Rule Match --------
In Port: 26
Source IP: 10.164.224.1, 255.255.255.255
Dest IP: 10.164.224.28, 255.255.255.255
Actions --------
Account mode packets, green and non-green
Accounting: Hi 4, LO 0
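2. When pinging with 95-byte packets, the inbound counter on internal HG port 2 of the fabric card in slot 6 of chassis 2 did not increase. This confirms that slot 0 of chassis 2 was not delivering the packets to the fabric card. The terminal showed the pings failing.
## During a continuous ping, repeated queries show no increase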
[S10506X-probe]debug qacl show packet pattern c 2 s 6 c 0 2 in
========
Acl-Type Statistics based PktPattern, Stage IFP, Pipe 0, SinglePort, Installed, Active
Prio Mjr/Sub 512/7, Group 3 [3], Slice/Idx 3/2, Entry 9, Double: 2306/3074
Rule Match --------
In Port: 2
Source IP: 10.164.224.1, 255.255.255.255
Dest IP: 10.164.224.28, 255.255.255.255
Actions --------
Account mode packets, green and non-green
Accounting: Hi 8, LO 0
[S10506X-probe]debug qacl show packet pattern c 2 s 3 c 0 26 in
========
Acl-Type Statistics based PktPattern, Stage IFP, Pipe 0, SinglePort, Installed, Active
Prio Mjr/Sub 512/7, Group 3 [3], Slice/Idx 9/22, Entry 1724, Triple: 6934/7702/8470
Rule Match --------
In Port: 26
Source IP: 10.164.224.1, 255.255.255.255
Dest IP: 10.164.224.28, 255.255.255.255
Actions --------
Account mode packets, green and non-green
Accounting: Hi 8, LO 0
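The check in this step amounts to reading the same statistic twice while the ping is running and seeing whether the `Accounting` line changes. A minimal Python sketch of that comparison follows; it assumes the output format shown above, makes no assumption about what the `Hi` and `LO` fields mean beyond comparing them as a pair, and uses hypothetical function names.

```python
import re

ACCOUNTING_RE = re.compile(r"Accounting:\s*Hi\s+(\d+),\s*LO\s+(\d+)", re.IGNORECASE)


def read_counter(output):
    """Extract the (Hi, LO) pair from one block of statistics output."""
    match = ACCOUNTING_RE.search(output)
    if match is None:
        raise ValueError("no Accounting line found")
    return int(match.group(1)), int(match.group(2))


def counter_changed(before, after):
    """True if the Accounting values differ between two snapshots."""
    return read_counter(before) != read_counter(after)


# Snapshots taken from the outputs in this case
after_first_ping = "In Port: 2\nAccounting: Hi 4, LO 0"
after_second_ping = "In Port: 2\nAccounting: Hi 8, LO 0"
during_95_byte_ping = "In Port: 2\nAccounting: Hi 8, LO 0"
```

Here `counter_changed(after_first_ping, after_second_ping)` is true, while comparing `after_second_ping` with `during_95_byte_ping` shows no change, which is the sign that the 95-byte replies never reached the fabric card.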
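3. After the link was moved from port 2/0/0/1 to port 2/0/0/4, the internal HG traffic statistics showed the following internal path for the ICMP reply from the BRAS, which now entered on port 2/0/0/4.
Pings with both large and small packets now succeeded, and service returned to normal: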
[S10506X-probe]debug qacl show packet pattern c 2 s 8 c 0 2 in
========
Acl-Type Statistics based PktPattern, Stage IFP, Pipe 0, SinglePort, Installed, Active
Prio Mjr/Sub 512/7, Group 3 [3], Slice/Idx 3/2, Entry 32, Double: 2306/3074
Rule Match --------
In Port: 2
Source IP: 10.164.224.1, 255.255.255.255
Dest IP: 10.164.224.28, 255.255.255.255
Actions --------
Account mode packets, green and non-green
Accounting: Hi 345, LO 0
[S10506X-probe]debug qacl show packet pattern c 2 s 3 c 0 33 in
========
Acl-Type Statistics based PktPattern, Stage IFP, Pipe 0, SinglePort, Installed, Active
Prio Mjr/Sub 512/7, Group 3 [3], Slice/Idx 9/24, Entry 1726, Triple: 6936/7704/8472
Rule Match --------
In Port: 33
Source IP: 10.164.224.1, 255.255.255.255
Dest IP: 10.164.224.28, 255.255.255.255
Actions --------
Account mode packets, green and non-green
Accounting: Hi 345, LO 0
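4. To verify this analysis, the following tests were also run to change the hash path of the BRAS reply packets. They confirmed that service is normal as long as the BRAS does not hash to port 2/0/0/1 of the switch:
a) Shut down only port 2/0/0/1.
b) Remove the PBR policy from sub-interface .3300 on the BRAS.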
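The isolation logic used across these tests can be summarized as a small table of observations. The Python sketch below encodes only the four port and size combinations reported in this case; the `Obs` type and helper functions are hypothetical names used for illustration.

```python
from collections import namedtuple

Obs = namedtuple("Obs", "ingress_port payload_bytes ping_ok")

# Results reported in this case for the ICMP reply from the BRAS
OBSERVATIONS = [
    Obs("2/0/0/1", 64, True),
    Obs("2/0/0/1", 95, False),
    Obs("2/0/0/4", 64, True),
    Obs("2/0/0/4", 95, True),
]


def failing_ports(observations):
    """Ingress ports on which at least one ping failed."""
    return sorted({o.ingress_port for o in observations if not o.ping_ok})


def smallest_failing_size(observations, port):
    """Smallest payload that failed on the given port, or None."""
    sizes = [o.payload_bytes for o in observations
             if o.ingress_port == port and not o.ping_ok]
    return min(sizes) if sizes else None
```

The failures are confined to one ingress port and start at 95 bytes there, while the same sizes pass on another port of the same aggregation group. That pattern is what points at the card behind port 2/0/0/1 rather than at the BRAS, the FW, or the configuration.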
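5. In summary, the chip on the slot 0 card of chassis 2 in the on-site S10506X was processing packets abnormally: it could not forward packets with a specific destination MAC that entered on port 2/0/0/1 to the fabric card, which caused the fault. The related chip-level MAC entries were normal, yet packets of 95 bytes or more could not be forwarded, so the analysis points to a hardware fault on the card. After the slot 0 card of chassis 2 was replaced on site, the fault cleared. For this type of hardware fault, the later R7643P02 release includes an optimization: the system can detect the fault automatically, raise an alarm, and restart.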
202408190396
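• Symptom: the device drops packets or traffic is sluggish.
• Trigger condition: errors occur while packets are transmitted over the device's internal links.
1. Replace the faulty service card in slot 0 of chassis 2.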