• 全部
  • 经验案例
  • 典型配置
  • 技术公告
  • FAQ
  • 全部
  • 全部
产品线
搜索
取消
案例类型
发布者
是否解决
是否官方
时间
高级搜索

SR66路由器重启后解决视频会议中断经验案例

2014-10-21发表
  • 2关注
  • 1收藏,2387浏览
粉丝:0人 关注:0人

一、   组网:

http://zhiliao.h3c.com/repository/52646.png

二、   问题描述:

SR8812设备作为中心视频会议核心设备,使用CPOS拆分出的63个E1连接多个分支,分支的设备为SR6604路由器。核心SR8812路由器使用CPOS拆分出来的S8/1/8/1:0和S8/1/8/2:0时隙做MP捆绑,和分支SR6604的S3/2/0、S3/2/1通过MP捆绑互联。客户反馈一个分支SR6604下挂的视频会议经常出现无法正常召开,通过手工重启SR6604后问题解决。

三、   过程分析:

首先,我们先来总结一下,因设备运行中引发协议down掉的情况,通常有以下几种。其一,设备硬件故障,比如子卡故障引发协议down掉;其二,协议软件本身故障引发协议down;其三,某段时间内线路信号质量差或者错包持续总量较大,上层协议(如OSPF/BGP/LDP)、甚至链路协议报文丢失,最终定时器超时引发协议down;其四,某段时间内错包比例或速率过大,直接导致接口关闭,引发协议down。

通过收集SR8812和SR6604路由器诊断和日志信息分析如下:

SR8812路由器日志分析:

%@3211988#Aug 18 20:27:24:024 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.155 LsdbRouterId 10.239.5.1 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.

%@3211989#Aug 18 20:27:29:024 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.155 LsdbRouterId 10.0.5.18 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.        //OSPF重传

……

%@3211993%Aug 18 20:27:29:251 2014 SR8812 LDP/5/LDP_SESSION_DOWN: Session(2.2.2.2:0, public instance)'s state changed to down.  //MPLS LDP会话DOWN

……

%@3211997%Aug 18 20:27:37:322 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.110(Mp-group8/1/4) from Full to ExStart.     //OSPF邻居DOWN

%@3211998#Aug 18 20:27:39:023 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.1.96 LsdbRouterId 10.240.211.38 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.

%@3211999#Aug 18 20:27:42:352 2014 SR8812 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.2 LSP 0.0.161.21 Changes to Down

%@3212000#Aug 18 20:27:44:023 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.2.32 LsdbRouterId 10.0.7.6 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.         //OSPF重传

%@3212001#Aug 18 20:27:49:025 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.2.32 LsdbRouterId 10.0.7.6 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.

%@3212002#Aug 18 20:27:50:679 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 10.0.8.109 index 0 Retransmit to Nbr 10.0.8.110 0 LsdbAreaId 0.0.0.28 LsdbType 1 LsdbLsid 10.0.3.5LsdbRouterId 10.0.3.5Router 10.0.3.5Nbr RID 2.2.2.2 Type is 2.

%@3212003#Aug 18 20:27:51:692 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 10.0.8.109 index 0 Retransmit to Nbr 10.0.8.110 0 LsdbAreaId 0.0.0.28 LsdbType 3 LsdbLsid 10.0.4.48 LsdbRouterId 10.0.3.5Router 10.0.3.5Nbr RID 2.2.2.2 Type is 2.

……

从SR8812侧日志来看,首先发生OSPF重传,接着MPLS LDP会话DOWN,然后OSPF邻居DOWN,这说明LDP会话DOWN并非OSPF邻居DOWN造成。大量OSPF报文的重传说明本端SR8812未收到对端的确认报文,这属于引发协议down的第三种情况。

另外,所有UP的serial端口的接收方向都有不少错包,有一些端口的错包还非常多,说明线路质量存在问题。如MP8/1/4的成员口S8/1/8/1:0、S8/1/8/2:0下则有较多错包:

S8/1/8/1:0 current state: UP

Input(total):  12917891 packets,  5042061874 bytes

Input(Bad):  12 Abort,  2881 FCS-Error,  0 FIFO-Abort,  0 Giant,  0 Runt

 

S8/1/8/2:0 current state: UP

Input(total):  12919422 packets,  5042729785 bytes

Input(Bad):  856 Abort,  3431 FCS-Error,  0 FIFO-Abort,  0 Giant,  0 Runt

第二次收集的诊断信息中,发现Mp-group 9/1/4的成员端口错包有明显增长:

S8/1/8/1:0 current state: UP

Input(Bad):  14 Abort,  2881 FCS-Error,  0 FIFO-Abort,  0 Giant,  0 Runt

 

S8/1/8/2:0 current state: UP

Input(Bad):  1298 Abort,  4178 FCS-Error,  0 FIFO-Abort,  0 Giant,  0 Runt             //增长明显

除上述两端口外,其他端口也增长明显。

第三次收集的诊断信息中,发现Mp-group 9/1/4的成员端口错包有明显增长:

S8/1/8/1:0 current state: UP

Input(Bad):  14 Abort,  2881 FCS-Error,  0 FIFO-Abort,  0 Giant,  0 Runt

 

S8/1/8/2:0 current state: UP

Input(Bad):  1940 Abort,  4692 FCS-Error,  0 FIFO-Abort,  0 Giant,  0 Runt             //增长明显

还有,设备一直存在大量物理线路错误告警(都跟信号丢失、帧丢失,或错包相关),而且,其中就包含与对端SR66相连的两条线路。

%@3053231%Aug 23 08:38:42:241 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8; ——这是第一次

Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm RAI recover! Start Time : 2013-08-23 08:38:33:44!

%@3053232%Aug 23 08:38:43:282 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

 Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm CRC4 recover! Start Time : 2013-08-23 08:38:42:465!

%@3053233%Aug 23 08:38:51:618 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

 Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm AIS recover! Start Time : 2013-08-23 08:38:43:506!

%@3053234%Aug 23 08:38:51:618 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

 Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm LOF recover! Start Time : 2013-08-23 08:38:42:465!

%@3053235%Aug 23 08:38:51:618 2013 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

 Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 6 TU12 3 E1 Alarm LMFA recover! Start Time : 2013-08-23 08:38:42:465!

……

%Sep  1 03:48:40:813 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;  ——这是最近一次(这个日志就是9.1日收集的)

Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 4 TU12 3 E1 Alarm RAI recover! Start Time : 2014-09-01 03:48:40:288!

%Sep  1 11:01:10:871 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 3 TU12 1 Alarm STU-P recover! Start Time : 2014-09-01 11:01:09:516!

%Sep  1 11:01:10:871 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 2 TU12 2 Alarm STU-P recover! Start Time : 2014-09-01 11:01:09:519!

而且,这些物理线路错误比较严重时,会直接导致接口震荡,以致协议震荡(如下所示)。这属于引发协议down的第三种情况。

Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 1 TU12 3 E1 Alarm RAI recover! Start Time : 2014-08-19 00:29:54:478!

%@3216415%Aug 19 00:29:56:843 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

 Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 1 TU12 3 E1 Alarm CRC4 recover! Start Time : 2014-08-19 00:29:55:520!

%@3216416%Aug 19 00:48:00:088 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

 Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 1 TU12 3 E1 Alarm CRC4 recover! Start Time : 2014-08-19 00:47:58:772!

%@3216417%Aug 19 03:34:09:808 2014 SR8812 CPET/4/CPET_LOG_WARN: -Slot=8;

 Cpos8/1/8 : VC4 1 TUG-3 1 TUG-2 2 TU12 1 E1 Alarm RAI recover! Start Time : 2014-08-19 03:34:06:390!

%@3216418#Aug 19 03:34:12:059 2014 SR8812 IFNET/4/INTERFACE UPDOWN:

 Trap 1.3.6.1.6.3.1.1.5.3: Interface 78315779 is Down, ifAdminStatus is 1, ifOperStatus is 2

%@3216419%Aug 19 03:34:12:059 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface Serial8/1/8/1:0 is DOWN.

%@3216420%Aug 19 03:34:12:059 2014 SR8812 IFNET/5/PROTOCOL_UPDOWN: Protocol PPP IPCP on the interface Serial8/1/8/1:0 is DOWN.

%@3216421#Aug 19 03:34:12:065 2014 SR8812 OSPF/5/NEIGHBOR_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.2: Non-virtual neighbor 10.240.201.86 index 0 Router 10.0.3.5NbrRouter 10.240.204.22 state change to 1.

%@3216422#Aug 19 03:34:12:065 2014 SR8812 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.22 LsdbType 1 LsdbLsid 10.0.3.5LsdbRouterId 10.0.3.5Router 10.0.3.5.

%@3216423#Aug 19 03:34:12:066 2014 SR8812 OSPF/5/IF_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.16: Non-virtual interface 10.240.201.85 index 0 Router 10.0.3.5state change to 1.

%@3216424%Aug 19 03:34:12:066 2014 SR8812 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 1 Last neighbor down event: Router ID: 10.240.204.22 Local address: 10.240.201.85 Remote address: 10.240.201.86 Reason: Vlink down.

%@3216425%Aug 19 03:34:12:066 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.240.201.86(Serial8/1/8/1:0) from Full to Down.

%@3216426%Aug 19 03:34:12:066 2014 SR8812 LDP/5/LDP_SESSION_DOWN: Session(10.240.204.22:0, public instance)'s state changed to down.

 

以上所有SR88的信息都反映线路质量有问题,建议排查SR8812与SR6604之间的线路情况。

SR6604侧设备日志分析:

从设备日志文件中可见,业务中断是路由协议震荡造成;而路由协议震荡都是收不到对端发送的协议报文后,定时器超时引起。这属于引发协议down的第三种情况。

%@32165%Jul  2 14:46:34:645 2013 SR6604 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 1 Last neighbor down event: Router ID: 10.0.3.5 Local address: 10.0.8.110 Remote address: 10.0.8.109 Reason: DeadInterval timer expired.      //收不到对端OSPF协议报文

%@32166%Jul  2 14:46:34:645 2013 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.109(Mp-group3/2/7) from Full to Down.       //OSPf邻居的DOWN

%@32167#Jul  2 14:47:16:224 2013 SR6604 BGP/4/BACKWARD: BGP TrapID1.3.6.1.2.1.15.7.2: PeerID 10.1.2.6 last error 60 changed state to 1.

%@32168%Jul  2 14:47:16:224 2013 SR6604 BGP/5/BGP_STATE_CHANGED:  10.1.2.6 state is changed from ESTABLISHED to IDLE.    //OSPF DOWN 导致BGP邻居 DOWN

%@32169#Jul  2 14:47:21:233 2013 SR6604 BGP/4/BACKWARD: BGP TrapID1.3.6.1.2.1.15.7.2: PeerID 1.1.1.1 last error 60 changed state to 1.

%@32170%Jul  2 14:47:21:233 2013 SR6604 BGP/5/BGP_STATE_CHANGED:  1.1.1.1 state is changed from ESTABLISHED to IDLE.

另外,SR66侧跟SR88一样,相关接口下也有不少错包。2014.8.28日把接口统计清空后,还是能观察到错包数的持续增长。这说明线路质量存在问题。

Serial3/2/0 current state: UP

Last clearing of counters: 16:09:38 GMT Thu 08/28/2014     ——8.28下午清空的

Input error packet detect: Initial

    Last 5 seconds input rate 72937.60 bytes/sec, 583500 bits/sec, 189.79 packets/sec

    Last 5 seconds output rate 96109.20 bytes/sec, 768873 bits/sec, 205.19 packets/sec

    Input: 6856074 packets, 2640748639 bytes, 0 no buffers

           0 broadcasts, 0 multicasts

           136 errors, 0 runts, 1 giants      ——错包一直再增长

           121 CRC, 0 align errors, 0 overruns

           0 dribbles, 0 aborts, 14 frame errors

 

Serial3/2/0 current state: UP

Last clearing of counters: 16:09:38 GMT Thu 08/28/2014

Input error packet detect: Initial

    Last 5 seconds input rate 85223.60 bytes/sec, 681788 bits/sec, 226.19 packets/sec

    Last 5 seconds output rate 101720.39 bytes/sec, 813763 bits/sec, 236.20 packets/sec

    Input: 11300985 packets, 4304683148 bytes, 0 no buffers

           0 broadcasts, 0 multicasts

           3435 errors, 0 runts, 1 giants   ——错包一直再增长

           2804 CRC, 0 align errors, 0 overruns

           0 dribbles, 0 aborts, 630 frame errors

前面既然说是线路质量导致的业务中断,那为什么业务中断时,手工重启SR66后业务就能恢复呢?

那么我们来分析一下手动重启SR6604的过程,手工启动时间——Aug  4 14:32:49:108

从SR66的日志里可以清晰地看到,这次手工启动之前,SR6604持续收到大量错包,Serial3/2/0频繁关闭开启,直接导致协议震荡(这是引发震荡的第四中情况)。

%@57613%Aug  4 14:13:02:824 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;

(Serial3/2/0)

Channel is deactived because of too many continuous input errors! This should be caused by poor signal or improper channel-config.

%@57614%Aug  4 14:13:05:468 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;

(Serial3/2/0)

Channel is actived!

%@57615%Aug  4 14:16:59:629 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;

(Serial3/2/0)

Channel is deactived because of too many continuous input errors! This should be caused by poor signal or improper channel-config.

%@57616%Aug  4 14:17:00:467 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;

(Serial3/2/0)

Channel is actived!

%@57617%Aug  4 14:17:02:995 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;

(Serial3/2/0)

Channel is deactived because of too many continuous input errors! This should be caused by poor signal or improper channel-config.

%@57618%Aug  4 14:17:05:467 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;

(Serial3/2/0)

Channel is actived!

%@57619%Aug  4 14:19:01:242 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;

(Serial3/2/0)

Channel is deactived because of too many continuous input errors! This should be caused by poor signal or improper channel-config.    //S3/2/0瞬间收到大量错包冲击后关闭

%@57640%Aug  4 14:19:05:240 2014 SR6604 LDP/5/LDP_SESSION_DOWN: Session(1.1.1.1:0, public instance)'s state changed to down.  //协议开始震荡

%@57676%Aug  4 14:19:05:287 2014 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 10 Shamlink Neighbor 10.240.214.28 from Full to Down.      

%@57677%Aug  4 14:19:05:466 2014 SR6604 DRVICOUT/1/DrvIcOutStr: -Slot=3;

(Serial3/2/0)

Channel is actived!         //端口恢复

%@57678%Aug  4 14:19:05:512 2014 SR6604 LDP/5/LDP_SESSION_UP: Session(1.1.1.1:0, public instance)'s state changed to up.

%@57682#Aug  4 14:19:28:038 2014 SR6604 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.1 LSP 0.0.186.201 Changes to Up

%@57683#Aug  4 14:19:28:039 2014 SR6604 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.1 LSP 0.0.186.202 Changes to Up

……

%@57734#Aug  4 14:19:28:117 2014 SR6604 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.0 LsdbRouterId 10.0.10.109 Router 10.0.10.109.

%@57735#Aug  4 14:19:28:117 2014 SR6604 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.239.8.0 LsdbRouterId 10.0.10.109 Router 10.0.10.109.

……

%@57753#Aug  4 14:19:47:162 2014 SR6604 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.28 LsdbType 1 LsdbLsid 10.0.9.109 LsdbRouterId 10.0.9.109 Router 10.0.9.109 .

%@57754#Aug  4 14:19:49:271 2014 SR6604 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.28 LsdbType 1 LsdbLsid 10.0.9.109 LsdbRouterId 10.0.9.109 Router 10.0.9.109.

%@57755%Aug  4 14:19:49:271 2014 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 10 Shamlink Neighbor 10.240.214.28 from Loading to Full.  //这个时间点,SR66已经恢复正常;

%@57756%Aug  4 06:32:14 2014 H3C %%10IC/6/SYS_RESTART:System restarted

–H3C Comware Software.                  //但在13分钟后,设备手工重启了

(此处因设备尚未启动完毕,时间还是原始值,加上8小时才是实际时间,即14:32:14)

%@57757#Aug  4 06:32:32:529 2014 H3C DEVM/1/BOARD INSERTED:

 Trap 1.3.6.1.4.1.25506.8.35.12.1.9: chassisIndex is 0, slotIndex 0.1

%@57758#Aug  4 06:32:32:529 2014 H3C DEVM/1/BOARD INSERTED:

 Trap 1.3.6.1.4.1.25506.8.35.12.1.9: chassisIndex is 0, slotIndex 0.3

%@57759%Aug  4 06:32:32:529 2014 H3C DEVM/2/BOARD_STATE_FAULT: Board state changes to FAULT on Chassis 0 Slot 1, type is unknown.

%@57760%Aug  4 06:32:32:529 2014 H3C DEVM/2/BOARD_STATE_FAULT: Board state changes to FAULT on Chassis 0 Slot 3, type is unknown.

%@57761#Aug  4 14:32:49:108 2014 SR6604 DEVM/4/SYSTEM COLD START:

Trap 1.3.6.1.4.1.25506.6.8.4: system cold start.

……

%@57902%Aug  4 14:34:06:801 2014 SR6604 IFNET/3/LINK_UPDOWN: Serial3/2/0 link status is UP.

%@57903%Aug  4 14:34:06:802 2014 SR6604 IFNET/3/LINK_UPDOWN: Serial3/2/3 link status is UP.

%@57913%Aug  4 14:34:07:848 2014 SR6604 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface Mp-group3/2/7 is UP.

%@57922%Aug  4 14:34:10:879 2014 SR6604 IFNET/5/PROTOCOL_UPDOWN: Protocol PPP IPCP on the interface Mp-group3/2/7 is UP.

%@57926%Aug  4 14:34:20:284 2014 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.109(Mp-group3/2/7) from Loading to Full.

%@57929%Aug  4 14:34:31:156 2014 SR6604 LDP/5/LDP_SESSION_UP: Session(1.1.1.1:0, public instance)'s state changed to up.

%@57983#Aug  4 14:34:42:841 2014 SR6604 BGP/4/ESTABLISHED: BGP TrapID1.3.6.1.2.1.15.7.1: PeerID 1.1.1.1 last error 0  changed state to 6.

%@57984%Aug  4 14:34:42:841 2014 SR6604 BGP/5/BGP_STATE_CHANGED:

 1.1.1.1 state is changed from OPENCONFIRM to ESTABLISHED.

%@58022%Aug  4 14:35:00:079 2014 SR6604 OSPF/5/OSPF_NBR_CHG: OSPF 10 Shamlink Neighbor 10.240.214.28 from Loading to Full.  //手工重启3分钟后,SR66于14:35恢复

如上日志中,SR66其实在14:19已经恢复,为什么在此后的13分钟内客户感知业务还是没恢复,导致14:32手工重启呢?我们看SR88的日志发现,原来此时SR88正要恢复却又因为线路质量问题再次发生震荡。

%@3205479%Aug  4 14:19:05:241 2014 SR8812 LDP/5/LDP_SESSION_DOWN: Session(2.2.2.2:0, public instance)'s state changed to down.

%@3205480%Aug  4 14:19:05:511 2014 SR8812 LDP/5/LDP_SESSION_UP: Session(2.2.2.2:0, public instance)'s state changed to up.

%@3205486%Aug  4 14:19:31:058 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 101 Shamlink Neighbor 10.0.0.128 from Full to ExStart.

%@3205499%Aug  4 14:19:49:266 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 101 Shamlink Neighbor 10.0.0.128 from Loading to Full.   //此时SR88刚刚恢复

……

%@3205517#Aug  4 14:30:17:738 2014 SR8812 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.1 LSP 0.0.159.53 Changes to Up

%@3205518%Aug  4 14:30:17:739 2014 SR8812 LDP/5/LDP_SESSION_DOWN: Session(2.2.2.2:0, public instance)'s state changed to down.  //14:30又陷入震荡

%@3205519#Aug  4 14:30:26:545 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.240.214.1 LsdbRouterId 10.240.211.34 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.

%@3205520#Aug  4 14:30:31:546 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.240.214.1 LsdbRouterId 10.240.211.34 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.

%@3205521#Aug  4 14:30:35:790 2014 SR8812 OSPF/5/NEIGHBOR_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.2: Non-virtual neighbor 10.0.8.110 index 0 Router 10.0.3.5 NbrRouter 2.2.2.2 state change to 1.

%@3205522#Aug  4 14:30:35:791 2014 SR8812 OSPF/6/ORIGINATE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.12: Originate new LSA AreaId 0.0.0.28 LsdbType 1 LsdbLsid 10.0.3.5 LsdbRouterId 10.0.3.5 Router 10.0.3.5.

%@3205523%Aug  4 14:30:35:791 2014 SR8812 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 1 Last neighbor down event: Router ID: 2.2.2.2 Local address: 10.0.8.109 Remote address: 10.0.8.110 Reason: DeadInterval timer expired.     //定时器超时

%@3205524%Aug  4 14:30:35:791 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.110(Mp-group8/1/4) from Full to Down.

%@3205525#Aug  4 14:30:36:545 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.240.214.1 LsdbRouterId 10.240.211.34 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.

%@3205526#Aug  4 14:30:40:813 2014 SR8812 LSPM/4/TRAP: 1.3.6.1.2.1.10.166.2.0.2 LSP 0.0.159.53 Changes to Down

%@3205527#Aug  4 14:30:41:545 2014 SR8812 OSPF/4/IF_RETX: OSPF TrapID1.3.6.1.2.1.14.16.2.10<ospfTxRetransmit>: Interface 0.0.0.0 index 0 Retransmit to Nbr 10.0.0.128 0 LsdbAreaId 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.128 LsdbRouterId 10.239.5.1 Router 10.239.5.1 Nbr RID 10.0.9.109 Type is 4.

%@3205528#Aug  4 14:30:42:861 2014 SR8812 OSPF/5/NEIGHBOR_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.2: Non-virtual neighbor 10.0.0.128 index 0 Router 10.239.5.1 NbrRouter 10.0.9.109 state change to 1.

%@3205529%Aug  4 14:30:42:862 2014 SR8812 OSPF/6/OSPF_LAST_NBR_DOWN: OSPF 101 Last neighbor down event: Router ID: 10.0.9.109 Local address: 10.240.214.28 Remote address: 10.0.0.128 Reason: DeadInterval timer expired.

%@3205530%Aug  4 14:30:42:862 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 101 Shamlink Neighbor 10.0.0.128 from Full to Down.

%@3205534#Aug  4 14:30:59:176 2014 SR8812 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.0 LsdbType 5 LsdbLsid 10.0.0.128 LsdbRouterId 10.239.5.4 Router 10.239.5.1 .

%@3205535#Aug  4 14:31:41:950 2014 SR8812 BGP/4/BACKWARD: BGP TrapID1.3.6.1.2.1.15.7.2: PeerID 2.2.2.2 last error 60 changed state to 1.

%@3205536%Aug  4 14:31:41:950 2014 SR8812 BGP/5/BGP_STATE_CHANGED:

 2.2.2.2 state is changed from ESTABLISHED to IDLE.

%@3205537#Aug  4 14:31:43:653 2014 SR8812 IFNET/4/INTERFACE UPDOWN:

 Trap 1.3.6.1.6.3.1.1.5.3: Interface 78315762 is Down, ifAdminStatus is 1, ifOperStatus is 2 

%@3205538%Aug  4 14:31:43:653 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface S8/1/8/1:0 is DOWN.

%@3205539%Aug  4 14:31:43:666 2014 SR8812 CPET/4/IF_STATUS_CHANGE: -Slot=8;

 S8/1/8/1:0: removed from Mp-group8/1/4  //串口也开始down,并被踢出MP口

%@3205540#Aug  4 14:31:45:650 2014 SR8812 IFNET/4/INTERFACE UPDOWN:

 Trap 1.3.6.1.6.3.1.1.5.3: Interface 78315769 is Down, ifAdminStatus is 1, ifOperStatus is 2 

%@3205541#Aug  4 14:31:45:650 2014 SR8812 IFNET/4/INTERFACE UPDOWN:

 Trap 1.3.6.1.6.3.1.1.5.3: Interface 77856771 is Down, ifAdminStatus is 1, ifOperStatus is 2 

%@3205542%Aug  4 14:31:45:650 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface S8/1/8/2:0 is DOWN.

%@3205543%Aug  4 14:31:45:651 2014 SR8812 IFNET/3/LINK_UPDOWN: Mp-group8/1/4 link status is DOWN.

%@3205544%Aug  4 14:31:45:651 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface Mp-group8/1/4 is DOWN.

%@3205545%Aug  4 14:31:45:651 2014 SR8812 IFNET/5/PROTOCOL_UPDOWN: Protocol PPP IPCP on the interface Mp-group8/1/4 is DOWN.

%@3205546#Aug  4 14:31:45:665 2014 SR8812 OSPF/5/IF_STATE_CHANGE: OSPF TrapID1.3.6.1.2.1.14.16.2.16: Non-virtual interface 10.0.8.109 index 0 Router 10.0.3.5 state change to 1.

%@3205547%Aug  4 14:31:45:697 2014 SR8812 CPET/4/IF_STATUS_CHANGE: -Slot=8;

 S8/1/8/2:0: removed from Mp-group8/1/4

%@3205548#Aug  4 14:31:49:195 2014 SR8812 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.28 LsdbType 1 LsdbLsid 10.0.3.5 LsdbRouterId 10.0.3.5 Router 10.0.3.5 .

由上可见,线路质量存在问题,导致SR88在14:31将MP9/1/4的2个成员口都踢出捆绑,且MP协议也down掉,此时业务必然不通;于是在14:32,SR66被手工重启。

既然此时是线路质量导致SR88震荡引发业务中断,让大家心存困扰的是,为什么重启SR66,业务也能恢复呢?从上面日志可见,此时SR88的MP下已经没有成员口,其本身也down掉了。正是14:32 SR66的重启让SR88的MP链路重新UP(见如下日志),各协议邻居才重新恢复正常,业务必然跟着恢复!

另外,文中开头提到,协议的震荡跟错包的速率、比例以及持续的总量是有关系的。设备重启或MP接口down/up后的初期,错包相对较少,协议不容易马上再出现震荡,业务也就能稳定一段时间。

%@3205549#Aug  4 14:33:01:210 2014 SR8812 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.0 LsdbType 5 LsdbLsid 10.239.243.192 LsdbRouterId 10.239.5.1 Router 10.239.5.1 .

%@3205550#Aug  4 14:33:01:210 2014 SR8812 OSPF/5/MAXAGE_LSA: OSPF TrapID1.3.6.1.2.1.14.16.2.13: Aged a LSA Area Id 0.0.0.0 LsdbType 5 LsdbLsid 10.239.243.192 LsdbRouterId 10.240.211.81 Router 10.239.5.1 .

%@3205551#Aug  4 14:34:30:702 2014 SR8812 IFNET/4/INTERFACE UPDOWN:

 Trap 1.3.6.1.6.3.1.1.5.4: Interface 78315769 is Up, ifAdminStatus is 1, ifOperStatus is 1 

%@3205552%Aug  4 14:34:30:702 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface S8/1/8/2:0 is UP.

%@3205553%Aug  4 14:34:30:714 2014 SR8812 CPET/4/IF_STATUS_CHANGE: -Slot=8;

 S8/1/8/2:0: attached to Mp-group8/1/4

%@3205554#Aug  4 14:34:30:725 2014 SR8812 IFNET/4/INTERFACE UPDOWN:

 Trap 1.3.6.1.6.3.1.1.5.4: Interface 77856771 is Up, ifAdminStatus is 1, ifOperStatus is 1 

%@3205555%Aug  4 14:34:30:725 2014 SR8812 IFNET/3/LINK_UPDOWN: Mp-group8/1/4 link status is UP.

%@3205556%Aug  4 14:34:30:725 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface Mp-group8/1/4 is UP.

%@3205557#Aug  4 14:34:31:694 2014 SR8812 IFNET/4/INTERFACE UPDOWN:

 Trap 1.3.6.1.6.3.1.1.5.4: Interface 78315762 is Up, ifAdminStatus is 1, ifOperStatus is 1 

%@3205558%Aug  4 14:34:31:694 2014 SR8812 IFNET/5/LINEPROTO_UPDOWN: Line protocol on the interface S8/1/8/1:0 is UP.

%@3205559%Aug  4 14:34:31:714 2014 SR8812 CPET/4/IF_STATUS_CHANGE: -Slot=8;

 S8/1/8/1:0: attached to Mp-group8/1/4

%@3205560%Aug  4 14:34:33:709 2014 SR8812 IFNET/5/PROTOCOL_UPDOWN: Protocol PPP IPCP on the interface Mp-group8/1/4 is UP.    //MP链路恢复正常

%@3205572%Aug  4 14:34:43:092 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 1 Neighbor 10.0.8.110(Mp-group8/1/4) from Loading to Full.

%@3205581%Aug  4 14:34:53:989 2014 SR8812 LDP/5/LDP_SESSION_UP: Session(2.2.2.2:0, public instance)'s state changed to up.

%@3205586%Aug  4 14:35:05:670 2014 SR8812 BGP/5/BGP_STATE_CHANGED:

 2.2.2.2 state is changed from OPENCONFIRM to ESTABLISHED.

%@3205592%Aug  4 14:35:22:884 2014 SR8812 OSPF/5/OSPF_NBR_CHG: OSPF 101 Shamlink Neighbor 10.0.0.128 from Loading to Full.  //至此,SR88也于14:35恢复正常

以上所有信息都反映线路质量有问题,建议排查SR8812与SR6604之间的链路情况。

四、   解决方法:

排查SR8812与SR6604之间线路问题,可根本解决此问题,同时还建议排查SR6604、SR8812设备接地/共地情况,以排除因施工接地不良给设备带来的影响。

 


0 个评论

该案例暂时没有网友评论

编辑评论

举报

×

侵犯我的权益 >
对根叔知了社区有害的内容 >
辱骂、歧视、挑衅等(不友善)

侵犯我的权益

×

泄露了我的隐私 >
侵犯了我企业的权益 >
抄袭了我的内容 >
诽谤我 >
辱骂、歧视、挑衅等(不友善)
骚扰我

泄露了我的隐私

×

您好,当您发现根叔知了上有泄漏您隐私的内容时,您可以向根叔知了进行举报。 请您把以下内容通过邮件发送到zhiliao@h3c.com 邮箱,我们会尽快处理。
  • 1. 您认为哪些内容泄露了您的隐私?(请在邮件中列出您举报的内容、链接地址,并给出简短的说明)
  • 2. 您是谁?(身份证明材料,可以是身份证或护照等证件)

侵犯了我企业的权益

×

您好,当您发现根叔知了上有关于您企业的造谣与诽谤、商业侵权等内容时,您可以向根叔知了进行举报。 请您把以下内容通过邮件发送到 zhiliao@h3c.com 邮箱,我们会在审核后尽快给您答复。
  • 1. 您举报的内容是什么?(请在邮件中列出您举报的内容和链接地址)
  • 2. 您是谁?(身份证明材料,可以是身份证或护照等证件)
  • 3. 是哪家企业?(营业执照,单位登记证明等证件)
  • 4. 您与该企业的关系是?(您是企业法人或被授权人,需提供企业委托授权书)
我们认为知名企业应该坦然接受公众讨论,对于答案中不准确的部分,我们欢迎您以正式或非正式身份在根叔知了上进行澄清。

抄袭了我的内容

×

原文链接或出处

诽谤我

×

您好,当您发现根叔知了上有诽谤您的内容时,您可以向根叔知了进行举报。 请您把以下内容通过邮件发送到zhiliao@h3c.com 邮箱,我们会尽快处理。
  • 1. 您举报的内容以及侵犯了您什么权益?(请在邮件中列出您举报的内容、链接地址,并给出简短的说明)
  • 2. 您是谁?(身份证明材料,可以是身份证或护照等证件)
我们认为知名企业应该坦然接受公众讨论,对于答案中不准确的部分,我们欢迎您以正式或非正式身份在根叔知了上进行澄清。

对根叔知了社区有害的内容

×

垃圾广告信息
色情、暴力、血腥等违反法律法规的内容
政治敏感
不规范转载 >
辱骂、歧视、挑衅等(不友善)
骚扰我
诱导投票

不规范转载

×

举报说明

提出建议

    +
<

亲~登录后才可以操作哦!

确定

你的邮箱还未认证,请认证邮箱或绑定手机后进行当前操作