一、 组网:
二、 问题描述:
SR8812设备作为中心视频会议核心设备,使用CPOS拆分出的63个E1连接多个分支,分支的设备为SR6604路由器。核心SR8812路由器使用CPOS拆分出来的S8/1/8/1:0和S8/1/8/2:0时隙做MP捆绑,和分支SR6604的S3/2/0、S3/2/1通过MP捆绑互联。客户反馈一个分支SR6604下挂的视频会议经常出现无法正常召开,通过手工重启SR6604后问题解决。
三、 过程分析:
首先,我们先来总结一下,因设备运行中引发协议down掉的情况,通常有以下几种。其一,设备硬件故障,比如子卡故障引发协议down掉;其二,协议软件本身故障引发协议down;其三,某段时间内线路信号质量差或者错包持续总量较大,上层协议(如OSPF/BGP/LDP)、甚至链路协议报文丢失,最终定时器超时引发协议down;其四,某段时间内错包比例或速率过大,直接导致接口关闭,引发协议down。
通过收集SR8812和SR6604路由器诊断和日志信息分析如下:
SR8812路由器日志分析:
%@3211988#Aug 18 20:27:24:024 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.155 LsdbRouterId 10.239.5.1 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.
%@3211989#Aug 18 20:27:29:024 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.155 LsdbRouterId 10.0.5.18 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4. //OSPF重传
……
%@3211993%Aug 18 20:27:29:251 2014 SR8812 LDP/5/LDP_SESSION_DOWN: Session(2.2.2.2:0, public instance)'s state changed to down. //MPLS LDP会话DOWN
……
%@3211997%Aug 18 20:27:37:322 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.110(Mp-group8/1/4) from Full to ExStart. //OSPF邻居DOWN
%@3211998#Aug 18 20:27:39:023 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.1.96 LsdbRouterId 10.240.211.38 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.
%@3211999#Aug 18 20:27:42:352 2014 SR8812 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.2 LSP 0.0.161.21 Changes to Down
%@3212000#Aug 18 20:27:44:023 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.2.32 LsdbRouterId 10.0.7.6 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4. //OSPF重传
%@3212001#Aug 18 20:27:49:025 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.2.32 LsdbRouterId 10.0.7.6 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.
%@3212002#Aug 18 20:27:50:679 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 10.0.8.109 index 0 Retransmit to Nbr 10.0.8.110 0 LsdbAreaId 0.0.0.28 LsdbType 1 LsdbLsid 10.0.3.5LsdbRouterId 10.0.3.5Router 10.0.3.5Nbr RID 2.2.2.2 Type is 2.
%@3212003#Aug 18 20:27:51:692 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 10.0.8.109 index 0 Retransmit to Nbr 10.0.8.110 0 LsdbAreaId 0.0.0.28 LsdbType 3 LsdbLsid 10.0.4.48 LsdbRouterId 10.0.3.5Router 10.0.3.5Nbr RID 2.2.2.2 Type is 2.
……
从SR8812侧日志来看,首先发生OSPF重传,接着MPLS LDP会话DOWN,然后OSPF邻居DOWN,这说明LDP会话DOWN并非OSPF邻居DOWN造成。大量OSPF报文的重传说明本端SR8812未收到对端的确认报文,这属于引发协议down的第三种情况。
另外,所有UP的serial端口的接收方向都有不少错包,有一些端口的错包还非常多,说明线路质量存在问题。如MP8/1/4的成员口S8/1/8/1:0、S8/1/8/2:0下则有较多错包:
S8/1/8/1:0 current state: UP
Input(total): 12917891 packets, 5042061874 bytes
Input(Bad): 12 Abort, 2881 FCS-Error, 0 FIFO-Abort, 0 Giant, 0 Runt
S8/1/8/2:0 current state: UP
Input(total): 12919422 packets, 5042729785 bytes
Input(Bad): 856 Abort, 3431 FCS-Error, 0 FIFO-Abort, 0 Giant, 0 Runt
第二次收集的诊断信息中,发现Mp-group 9/1/4的成员端口错包有明显增长:
S8/1/8/1:0 current state: UP
Input(Bad): 14 Abort, 2881 FCS-Error, 0 FIFO-Abort, 0 Giant, 0 Runt
S8/1/8/2:0 current state: UP
Input(Bad): 1298 Abort, 4178 FCS-Error, 0 FIFO-Abort, 0 Giant, 0 Runt //增长明显
除上述两端口外,其他端口也增长明显。
第三次收集的诊断信息中,发现Mp-group 9/1/4的成员端口错包有明显增长:
S8/1/8/1:0 current state: UP
Input(Bad): 14 Abort, 2881 FCS-Error, 0 FIFO-Abort, 0 Giant, 0 Runt
S8/1/8/2:0 current state: UP
Input(Bad): 1940 Abort, 4692 FCS-Error, 0 FIFO-Abort, 0 Giant, 0 Runt //增长明显
还有,设备一直存在大量物理线路错误告警(都跟信号丢失、帧丢失,或错包相关),而且,其中就包含与对端SR66相连的两条线路。
%@3053231%Aug 23 08:38:42:241 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8; ——这是第一次
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm RAI recover! Start Time : 2013-08-23 08:38:33:44!
%@3053232%Aug 23 08:38:43:282 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm CRC4 recover! Start Time : 2013-08-23 08:38:42:465!
%@3053233%Aug 23 08:38:51:618 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm AIS recover! Start Time : 2013-08-23 08:38:43:506!
%@3053234%Aug 23 08:38:51:618 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm LOF recover! Start Time : 2013-08-23 08:38:42:465!
%@3053235%Aug 23 08:38:51:618 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm LMFA recover! Start Time : 2013-08-23 08:38:42:465!
……
%Sep 1 03:48:40:813 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8; ——这是最近一次(这个日志就是9.1日收集的)
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 4 TU12 3 E1 Alarm RAI recover! Start Time : 2014-09-01 03:48:40:288!
%Sep 1 11:01:10:871 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 3 TU12 1 Alarm STU-P recover! Start Time : 2014-09-01 11:01:09:516!
%Sep 1 11:01:10:871 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 2 TU12 2 Alarm STU-P recover! Start Time : 2014-09-01 11:01:09:519!
而且,这些物理线路错误比较严重时,会直接导致接口震荡,以致协议震荡(如下所示)。这属于引发协议down的第三种情况。
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 1 TU12 3 E1 Alarm RAI recover! Start Time : 2014-08-19 00:29:54:478!
%@3216415%Aug 19 00:29:56:843 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 1 TU12 3 E1 Alarm CRC4 recover! Start Time : 2014-08-19 00:29:55:520!
%@3216416%Aug 19 00:48:00:088 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 1 TU12 3 E1 Alarm CRC4 recover! Start Time : 2014-08-19 00:47:58:772!
%@3216417%Aug 19 03:34:09:808 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;
Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 2 TU12 1 E1 Alarm RAI recover! Start Time : 2014-08-19 03:34:06:390!
%@3216418#Aug 19 03:34:12:059 2014 SR8812 IFNET/4/INTERFACE UPDOWN:
Trap 1.3.6.1.6.3.1.1.5.3: Interface 78315779 is Down, ifAdminStatus is 1, ifOperStatus is 2
%@3216419%Aug 19 03:34:12:059 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface Serial8/1/8/1:0 is DOWN.
%@3216420%Aug 19 03:34:12:059 2014 SR8812 IFNET/5/PROTOCOL_UPDOWN: Protocol PPP IPCP on the interface Serial8/1/8/1:0 is DOWN.
%@3216421#Aug 19 03:34:12:065 2014 SR8812 OSPF/5/NEIGHBOR_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.2: Non-virtual neighbor 10.240.201.86 index 0 Router 10.0.3.5NbrRouter 10.240.204.22 state change to 1.
%@3216422#Aug 19 03:34:12:065 2014 SR8812 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.22 LsdbType 1 LsdbLsid 10.0.3.5LsdbRouterId 10.0.3.5Router 10.0.3.5.
%@3216423#Aug 19 03:34:12:066 2014 SR8812 OSPF/5/IF_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.16: Non-virtual interface 10.240.201.85 index 0 Router 10.0.3.5state change to 1.
%@3216424%Aug 19 03:34:12:066 2014 SR8812 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 1 Last neighbor down event: Router ID: 10.240.204.22 Local address: 10.240.201.85 Remote address: 10.240.201.86 Reason: Vlink down.
%@3216425%Aug 19 03:34:12:066 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.240.201.86(Serial8/1/8/1:0) from Full to Down.
%@3216426%Aug 19 03:34:12:066 2014 SR8812 LDP/5/LDP_SESSION_DOWN: Session(10.240.204.22:0, public instance)'s state changed to down.
以上所有SR88的信息都反映线路质量有问题,建议排查SR8812与SR6604之间的线路情况。
SR6604侧设备日志分析:
从设备日志文件中可见,业务中断是路由协议震荡造成;而路由协议震荡都是收不到对端发送的协议报文后,定时器超时引起。这属于引发协议down的第三种情况。
%@32165%Jul 2 14:46:34:645 2013 SR6604 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 1 Last neighbor down event: Router ID: 10.0.3.5 Local address: 10.0.8.110 Remote address: 10.0.8.109 Reason: DeadInterval timer expired. //收不到对端OSPF协议报文
%@32166%Jul 2 14:46:34:645 2013 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.109(Mp-group3/2/7) from Full to Down. //OSPf邻居的DOWN
%@32167#Jul 2 14:47:16:224 2013 SR6604 BGP/4/BACKWARD: BGP TrapID1.3.6.1.2.1.15.7.2: PeerID 10.1.2.6 last error 60 changed state to 1.
%@32168%Jul 2 14:47:16:224 2013 SR6604 BGP/5/BGP_STATE_CHANGED: 10.1.2.6 state is changed from ESTABLISHED to IDLE. //OSPF DOWN 导致BGP邻居 DOWN
%@32169#Jul 2 14:47:21:233 2013 SR6604 BGP/4/BACKWARD: BGP TrapID1.3.6.1.2.1.15.7.2: PeerID 1.1.1.1 last error 60 changed state to 1.
%@32170%Jul 2 14:47:21:233 2013 SR6604 BGP/5/BGP_STATE_CHANGED: 1.1.1.1 state is changed from ESTABLISHED to IDLE.
另外,SR66侧跟SR88一样,相关接口下也有不少错包。2014.8.28日把接口统计清空后,还是能观察到错包数的持续增长。这说明线路质量存在问题。
Serial3/2/0 current state: UP
Last clearing of counters: 16:09:38 GMT Thu 08/28/2014 ——8.28下午清空的
Input error packet detect: Initial
Last 5 seconds input rate 72937.60 bytes/sec, 583500 bits/sec, 189.79 packets/sec
Last 5 seconds output rate 96109.20 bytes/sec, 768873 bits/sec, 205.19 packets/sec
Input: 6856074 packets, 2640748639 bytes, 0 no buffers
0 broadcasts, 0 multicasts
136 errors, 0 runts, 1 giants ——错包一直再增长
121 CRC, 0 align errors, 0 overruns
0 dribbles, 0 aborts, 14 frame errors
Serial3/2/0 current state: UP
Last clearing of counters: 16:09:38 GMT Thu 08/28/2014
Input error packet detect: Initial
Last 5 seconds input rate 85223.60 bytes/sec, 681788 bits/sec, 226.19 packets/sec
Last 5 seconds output rate 101720.39 bytes/sec, 813763 bits/sec, 236.20 packets/sec
Input: 11300985 packets, 4304683148 bytes, 0 no buffers
0 broadcasts, 0 multicasts
3435 errors, 0 runts, 1 giants ——错包一直再增长
2804 CRC, 0 align errors, 0 overruns
0 dribbles, 0 aborts, 630 frame errors
前面既然说是线路质量导致的业务中断,那为什么业务中断时,手工重启SR66后业务就能恢复呢?
那么我们来分析一下手动重启SR6604的过程,手工启动时间——Aug 4 14:32:49:108
从SR66的日志里可以清晰地看到,这次手工启动之前,SR6604持续收到大量错包,Serial3/2/0频繁关闭开启,直接导致协议震荡(这是引发震荡的第四中情况)。
%@57613%Aug 4 14:13:02:824 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;
(Serial3/2/0)
Channel is deactived because of too many continuous input errors! This should be caused by poor signal or improper channel-config.
%@57614%Aug 4 14:13:05:468 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;
(Serial3/2/0)
Channel is actived!
%@57615%Aug 4 14:16:59:629 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;
(Serial3/2/0)
Channel is deactived because of too many continuous input errors! This should be caused by poor signal or improper channel-config.
%@57616%Aug 4 14:17:00:467 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;
(Serial3/2/0)
Channel is actived!
%@57617%Aug 4 14:17:02:995 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;
(Serial3/2/0)
Channel is deactived because of too many continuous input errors! This should be caused by poor signal or improper channel-config.
%@57618%Aug 4 14:17:05:467 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;
(Serial3/2/0)
Channel is actived!
%@57619%Aug 4 14:19:01:242 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;
(Serial3/2/0)
Channel is deactived because of too many continuous input errors! This should be caused by poor signal or improper channel-config. //S3/2/0瞬间收到大量错包冲击后关闭
%@57640%Aug 4 14:19:05:240 2014 SR6604 LDP/5/LDP_SESSION_DOWN: Session(1.1.1.1:0, public instance)'s state changed to down. //协议开始震荡
%@57676%Aug 4 14:19:05:287 2014 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 10 Shamlink Neighbor 10.240.214.28 from Full to Down.
%@57677%Aug 4 14:19:05:466 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;
(Serial3/2/0)
Channel is actived! //端口恢复
%@57678%Aug 4 14:19:05:512 2014 SR6604 LDP/5/LDP_SESSION_UP: Session(1.1.1.1:0, public instance)'s state changed to up.
%@57682#Aug 4 14:19:28:038 2014 SR6604 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.1 LSP 0.0.186.201 Changes to Up
%@57683#Aug 4 14:19:28:039 2014 SR6604 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.1 LSP 0.0.186.202 Changes to Up
……
%@57734#Aug 4 14:19:28:117 2014 SR6604 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.0 LsdbRouterId 10.0.10.109 Router 10.0.10.109.
%@57735#Aug 4 14:19:28:117 2014 SR6604 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.239.8.0 LsdbRouterId 10.0.10.109 Router 10.0.10.109.
……
%@57753#Aug 4 14:19:47:162 2014 SR6604 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.28 LsdbType 1 LsdbLsid 10.0.9.109 LsdbRouterId 10.0.9.109 Router 10.0.9.109 .
%@57754#Aug 4 14:19:49:271 2014 SR6604 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.28 LsdbType 1 LsdbLsid 10.0.9.109 LsdbRouterId 10.0.9.109 Router 10.0.9.109.
%@57755%Aug 4 14:19:49:271 2014 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 10 Shamlink Neighbor 10.240.214.28 from Loading to Full. //这个时间点,SR66已经恢复正常;
%@57756%Aug 4 06:32:14 2014 H3C %%10IC/6/SYS_RESTART:System restarted
–H3C Comware Software. //但在13分钟后,设备手工重启了
(此处因设备尚未启动完毕,时间还是原始值,加上8小时才是实际时间,即14:32:14)
%@57757#Aug 4 06:32:32:529 2014 H3C DEVM/1/BOARD INSERTED:
Trap 1.3.6.1.4.1.25506.8.35.12.1.9: chassisIndex is 0, slotIndex 0.1
%@57758#Aug 4 06:32:32:529 2014 H3C DEVM/1/BOARD INSERTED:
Trap 1.3.6.1.4.1.25506.8.35.12.1.9: chassisIndex is 0, slotIndex 0.3
%@57759%Aug 4 06:32:32:529 2014 H3C DEVM/2/BOARD_STATE_FAULT: Board state changes to FAULT on Chassis 0 Slot 1, type is unknown.
%@57760%Aug 4 06:32:32:529 2014 H3C DEVM/2/BOARD_STATE_FAULT: Board state changes to FAULT on Chassis 0 Slot 3, type is unknown.
%@57761#Aug 4 14:32:49:108 2014 SR6604 DEVM/4/SYSTEM COLD START:
Trap 1.3.6.1.4.1.25506.6.8.4: system cold start.
……
%@57902%Aug 4 14:34:06:801 2014 SR6604 IFNET/3/LINK_UPDOWN: Serial3/2/0 link status is UP.
%@57903%Aug 4 14:34:06:802 2014 SR6604 IFNET/3/LINK_UPDOWN: Serial3/2/3 link status is UP.
%@57913%Aug 4 14:34:07:848 2014 SR6604 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface Mp-group3/2/7 is UP.
%@57922%Aug 4 14:34:10:879 2014 SR6604 IFNET/5/PROTOCOL_UPDOWN: Protocol PPP IPCP on the interface Mp-group3/2/7 is UP.
%@57926%Aug 4 14:34:20:284 2014 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.109(Mp-group3/2/7) from Loading to Full.
%@57929%Aug 4 14:34:31:156 2014 SR6604 LDP/5/LDP_SESSION_UP: Session(1.1.1.1:0, public instance)'s state changed to up.
%@57983#Aug 4 14:34:42:841 2014 SR6604 BGP/4/ESTABLISHED: BGP TrapID1.3.6.1.2.1.15.7.1: PeerID 1.1.1.1 last error 0 changed state to 6.
%@57984%Aug 4 14:34:42:841 2014 SR6604 BGP/5/BGP_STATE_CHANGED:
1.1.1.1 state is changed from OPENCONFIRM to ESTABLISHED.
%@58022%Aug 4 14:35:00:079 2014 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 10 Shamlink Neighbor 10.240.214.28 from Loading to Full. //手工重启3分钟后,SR66于14:35恢复
如上日志中,SR66其实在14:19已经恢复,为什么在此后的13分钟内客户感知业务还是没恢复,导致14:32手工重启呢?我们看SR88的日志发现,原来此时SR88正要恢复却又因为线路质量问题再次发生震荡。
%@3205479%Aug 4 14:19:05:241 2014 SR8812 LDP/5/LDP_SESSION_DOWN: Session(2.2.2.2:0, public instance)'s state changed to down.
%@3205480%Aug 4 14:19:05:511 2014 SR8812 LDP/5/LDP_SESSION_UP: Session(2.2.2.2:0, public instance)'s state changed to up.
%@3205486%Aug 4 14:19:31:058 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 101 Shamlink Neighbor 10.0.0.128 from Full to ExStart.
%@3205499%Aug 4 14:19:49:266 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 101 Shamlink Neighbor 10.0.0.128 from Loading to Full. //此时SR88刚刚恢复
……
%@3205517#Aug 4 14:30:17:738 2014 SR8812 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.1 LSP 0.0.159.53 Changes to Up
%@3205518%Aug 4 14:30:17:739 2014 SR8812 LDP/5/LDP_SESSION_DOWN: Session(2.2.2.2:0, public instance)'s state changed to down. //14:30又陷入震荡
%@3205519#Aug 4 14:30:26:545 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.240.214.1 LsdbRouterId 10.240.211.34 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.
%@3205520#Aug 4 14:30:31:546 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.240.214.1 LsdbRouterId 10.240.211.34 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.
%@3205521#Aug 4 14:30:35:790 2014 SR8812 OSPF/5/NEIGHBOR_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.2: Non-virtual neighbor 10.0.8.110 index 0 Router 10.0.3.5 NbrRouter 2.2.2.2 state change to 1.
%@3205522#Aug 4 14:30:35:791 2014 SR8812 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.28 LsdbType 1 LsdbLsid 10.0.3.5 LsdbRouterId 10.0.3.5 Router 10.0.3.5.
%@3205523%Aug 4 14:30:35:791 2014 SR8812 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 1 Last neighbor down event: Router ID: 2.2.2.2 Local address: 10.0.8.109 Remote address: 10.0.8.110 Reason: DeadInterval timer expired. //定时器超时
%@3205524%Aug 4 14:30:35:791 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.110(Mp-group8/1/4) from Full to Down.
%@3205525#Aug 4 14:30:36:545 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.240.214.1 LsdbRouterId 10.240.211.34 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.
%@3205526#Aug 4 14:30:40:813 2014 SR8812 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.2 LSP 0.0.159.53 Changes to Down
%@3205527#Aug 4 14:30:41:545 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.128 LsdbRouterId 10.239.5.1 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.
%@3205528#Aug 4 14:30:42:861 2014 SR8812 OSPF/5/NEIGHBOR_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.2: Non-virtual neighbor 10.0.0.128 index 0 Router 10.239.5.1 NbrRouter 10.0.9.109 state change to 1.
%@3205529%Aug 4 14:30:42:862 2014 SR8812 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 101 Last neighbor down event: Router ID: 10.0.9.109 Local address: 10.240.214.28 Remote address: 10.0.0.128 Reason: DeadInterval timer expired.
%@3205530%Aug 4 14:30:42:862 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 101 Shamlink Neighbor 10.0.0.128 from Full to Down.
%@3205534#Aug 4 14:30:59:176 2014 SR8812 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.128 LsdbRouterId 10.239.5.4 Router 10.239.5.1 .
%@3205535#Aug 4 14:31:41:950 2014 SR8812 BGP/4/BACKWARD: BGP TrapID1.3.6.1.2.1.15.7.2: PeerID 2.2.2.2 last error 60 changed state to 1.
%@3205536%Aug 4 14:31:41:950 2014 SR8812 BGP/5/BGP_STATE_CHANGED:
2.2.2.2 state is changed from ESTABLISHED to IDLE.
%@3205537#Aug 4 14:31:43:653 2014 SR8812 IFNET/4/INTERFACE UPDOWN:
Trap 1.3.6.1.6.3.1.1.5.3: Interface 78315762 is Down, ifAdminStatus is 1, ifOperStatus is 2
%@3205538%Aug 4 14:31:43:653 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface S8/1/8/1:0 is DOWN.
%@3205539%Aug 4 14:31:43:666 2014 SR8812 CPET/4/IF_STATUS_CHANGE: -Slot=8;
S8/1/8/1:0: removed from Mp-group8/1/4 //串口也开始down,并被踢出MP口
%@3205540#Aug 4 14:31:45:650 2014 SR8812 IFNET/4/INTERFACE UPDOWN:
Trap 1.3.6.1.6.3.1.1.5.3: Interface 78315769 is Down, ifAdminStatus is 1, ifOperStatus is 2
%@3205541#Aug 4 14:31:45:650 2014 SR8812 IFNET/4/INTERFACE UPDOWN:
Trap 1.3.6.1.6.3.1.1.5.3: Interface 77856771 is Down, ifAdminStatus is 1, ifOperStatus is 2
%@3205542%Aug 4 14:31:45:650 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface S8/1/8/2:0 is DOWN.
%@3205543%Aug 4 14:31:45:651 2014 SR8812 IFNET/3/LINK_UPDOWN: Mp-group8/1/4 link status is DOWN.
%@3205544%Aug 4 14:31:45:651 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface Mp-group8/1/4 is DOWN.
%@3205545%Aug 4 14:31:45:651 2014 SR8812 IFNET/5/PROTOCOL_UPDOWN: Protocol PPP IPCP on the interface Mp-group8/1/4 is DOWN.
%@3205546#Aug 4 14:31:45:665 2014 SR8812 OSPF/5/IF_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.16: Non-virtual interface 10.0.8.109 index 0 Router 10.0.3.5 state change to 1.
%@3205547%Aug 4 14:31:45:697 2014 SR8812 CPET/4/IF_STATUS_CHANGE: -Slot=8;
S8/1/8/2:0: removed from Mp-group8/1/4
%@3205548#Aug 4 14:31:49:195 2014 SR8812 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.28 LsdbType 1 LsdbLsid 10.0.3.5 LsdbRouterId 10.0.3.5 Router 10.0.3.5 .
由上可见,线路质量存在问题,导致SR88在14:31将MP9/1/4的2个成员口都踢出捆绑,且MP协议也down掉,此时业务必然不通;于是在14:32,SR66被手工重启。
既然此时是线路质量导致SR88震荡引发业务中断,让大家心存困扰的是,为什么重启SR66,业务也能恢复呢?从上面日志可见,此时SR88的MP下已经没有成员口,其本身也down掉了。正是14:32 SR66的重启让SR88的MP链路重新UP(见如下日志),各协议邻居才重新恢复正常,业务必然跟着恢复!
另外,文中开头提到,协议的震荡跟错包的速率、比例以及持续的总量是有关系的。设备重启或MP接口down/up后的初期,错包相对较少,协议不容易马上再出现震荡,业务也就能稳定一段时间。
%@3205549#Aug 4 14:33:01:210 2014 SR8812 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.0 LsdbType 5 LsdbLsid 10.239.243.192 LsdbRouterId 10.239.5.1 Router 10.239.5.1 .
%@3205550#Aug 4 14:33:01:210 2014 SR8812 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.0 LsdbType 5 LsdbLsid 10.239.243.192 LsdbRouterId 10.240.211.81 Router 10.239.5.1 .
%@3205551#Aug 4 14:34:30:702 2014 SR8812 IFNET/4/INTERFACE UPDOWN:
Trap 1.3.6.1.6.3.1.1.5.4: Interface 78315769 is Up, ifAdminStatus is 1, ifOperStatus is 1
%@3205552%Aug 4 14:34:30:702 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface S8/1/8/2:0 is UP.
%@3205553%Aug 4 14:34:30:714 2014 SR8812 CPET/4/IF_STATUS_CHANGE: -Slot=8;
S8/1/8/2:0: attached to Mp-group8/1/4
%@3205554#Aug 4 14:34:30:725 2014 SR8812 IFNET/4/INTERFACE UPDOWN:
Trap 1.3.6.1.6.3.1.1.5.4: Interface 77856771 is Up, ifAdminStatus is 1, ifOperStatus is 1
%@3205555%Aug 4 14:34:30:725 2014 SR8812 IFNET/3/LINK_UPDOWN: Mp-group8/1/4 link status is UP.
%@3205556%Aug 4 14:34:30:725 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface Mp-group8/1/4 is UP.
%@3205557#Aug 4 14:34:31:694 2014 SR8812 IFNET/4/INTERFACE UPDOWN:
Trap 1.3.6.1.6.3.1.1.5.4: Interface 78315762 is Up, ifAdminStatus is 1, ifOperStatus is 1
%@3205558%Aug 4 14:34:31:694 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface S8/1/8/1:0 is UP.
%@3205559%Aug 4 14:34:31:714 2014 SR8812 CPET/4/IF_STATUS_CHANGE: -Slot=8;
S8/1/8/1:0: attached to Mp-group8/1/4
%@3205560%Aug 4 14:34:33:709 2014 SR8812 IFNET/5/PROTOCOL_UPDOWN: Protocol PPP IPCP on the interface Mp-group8/1/4 is UP. //MP链路恢复正常
%@3205572%Aug 4 14:34:43:092 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.110(Mp-group8/1/4) from Loading to Full.
%@3205581%Aug 4 14:34:53:989 2014 SR8812 LDP/5/LDP_SESSION_UP: Session(2.2.2.2:0, public instance)'s state changed to up.
%@3205586%Aug 4 14:35:05:670 2014 SR8812 BGP/5/BGP_STATE_CHANGED:
2.2.2.2 state is changed from OPENCONFIRM to ESTABLISHED.
%@3205592%Aug 4 14:35:22:884 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 101 Shamlink Neighbor 10.0.0.128 from Loading to Full. //至此,SR88也于14:35恢复正常
以上所有信息都反映线路质量有问题,建议排查SR8812与SR6604之间的链路情况。
四、 解决方法:
排查SR8812与SR6604之间线路问题,可根本解决此问题,同时还建议排查SR6604、SR8812设备接地/共地情况,以排除因施工接地不良给设备带来的影响。
该案例暂时没有网友评论
✖
案例意见反馈
亲~登录后才可以操作哦!
确定你的邮箱还未认证,请认证邮箱或绑定手机后进行当前操作