
Case Study: HG Up/Down Alarms Between Boards and Service Check Alarms on the S12600G

Posted 23 hours ago

Problem Description

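An on-site S12600-08-G device suddenly began reporting frequent internal HG down/up alarms on the boards in slot 2 and slot 10. At the same time, the boards in slot 2, slot 3, and slot 6 reported: An error occurred on the data channel between switch chips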

The alarms are as follows:

%Jan 20 09:23:45:021 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:2,chipid:2,portid:30),destination port:(chiptype:5,slot:10,chipid:0,portid:6)), ErrorCode=481001, Reason=HiGig link went down.)

%Jan 20 09:23:45:122 2026S12608G DEV/2/INTERNALLINK_ALARM_CLEAR: Internal link alarm cleared. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:2,chipid:2,portid:30),destination port:(chiptype:5,slot:10,chipid:0,portid:6)), ErrorCode=481001, Reason=HiGig link came up.)

%Jan 20 09:23:49:822 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:2,chipid:2,portid:30),destination port:(chiptype:5,slot:10,chipid:0,portid:6)), ErrorCode=481001, Reason=HiGig link went down.)

%Jan 20 09:23:49:924 2026S12608G DEV/2/INTERNALLINK_ALARM_CLEAR: Internal link alarm cleared. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:2,chipid:2,portid:30),destination port:(chiptype:5,slot:10,chipid:0,portid:6)), ErrorCode=481001, Reason=HiGig link came up.)

%Jan 20 09:23:55:774 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=75, PhysicalName=Board 10, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:10,chipid:0,portid:6),destination port:(chiptype:5,slot:2,chipid:2,portid:30)), ErrorCode=481001, Reason=HiGig link went down.)

 

%Jan 20 09:28:43:195 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=71, PhysicalName=Board 6, RelativeResource=(bustype:data channel,sourceport:(chiptype:switch,slot:2.2,chipid:2,portid:30),destination port:(chiptype:switch,slot:6,chipid:0)), ErrorCode=473002, Reason=An error occurred on the data channel between switch chips.)

%Jan 20 09:28:43:197 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:data channel,sourceport:(chiptype:switch,slot:2.2,chipid:2,portid:30),destination port:(chiptype:switch,slot:6,chipid:0)), ErrorCode=473002, Reason=An error occurred on the data channel between switch chips.)

%Jan 20 09:28:43:199 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=71, PhysicalName=Board 6, RelativeResource=(bustype:data channel,sourceport:(chiptype:switch,slot:2.2,chipid:2,portid:30),destination port:(chiptype:switch,slot:6,chipid:1)), ErrorCode=473002, Reason=An error occurred on the data channel between switch chips.)

%Jan 20 09:28:43:201 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:data channel,sourceport:(chiptype:switch,slot:2.2,chipid:2,portid:30),destination port:(chiptype:switch,slot:6,chipid:1)), ErrorCode=473002, Reason=An error occurred on the data channel between switch chips.)

%Jan 20 09:28:43:384 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=68, PhysicalName=Board 3, RelativeResource=(bustype:data channel,sourceport:(chiptype:switch,slot:2.2,chipid:2,portid:30),destination port:(chiptype:switch,slot:3,chipid:0)), ErrorCode=473002, Reason=An error occurred on the data channel between switch chips.)

%Jan 20 09:28:43:386 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:data channel,sourceport:(chiptype:switch,slot:2.2,chipid:2,portid:30),destination port:(chiptype:switch,slot:3,chipid:0)), ErrorCode=473002, Reason=An error occurred on the data channel between switch chips.)

Analysis

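1. The HG link between the service board in slot 2 and the fabric board in slot 10 is flapping frequently. Service check packets pass through the internal port whose HG link is going up and down, so the boards in slot 2, slot 3, and slot 6 also report service check alarms.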

%@60495%Jan 20 09:05:17:188 2026S12608G DEV/2/INTERNALLINK_ALARM_CLEAR: Internal link alarm cleared. (PhysicalIndex=75, PhysicalName=Board 10, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:10,chipid:0,portid:6),destination port:(chiptype:5,slot:2,chipid:2,portid:30)), ErrorCode=481001, Reason=HiGig link came up.)

%@60496%Jan 20 09:05:18:129 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=75, PhysicalName=Board 10, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:10,chipid:0,portid:6),destination port:(chiptype:5,slot:2,chipid:2,portid:30)), ErrorCode=481001, Reason=HiGig link went down.)

%@60497%Jan 20 09:05:18:230 2026S12608G DEV/2/INTERNALLINK_ALARM_CLEAR: Internal link alarm cleared. (PhysicalIndex=75, PhysicalName=Board 10, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:10,chipid:0,portid:6),destination port:(chiptype:5,slot:2,chipid:2,portid:30)), ErrorCode=481001, Reason=HiGig link came up.)

%@60498%Jan 20 09:05:18:250 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=75, PhysicalName=Board 10, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:10,chipid:0,portid:6),destination port:(chiptype:5,slot:2,chipid:2,portid:30)), ErrorCode=481002, Reason=HiGig link flapped.)

%@60499%Jan 20 09:05:19:271 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=75, PhysicalName=Board 10, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:10,chipid:0,portid:6),destination port:(chiptype:5,slot:2,chipid:2,portid:30)), ErrorCode=481001, Reason=HiGig link went down.)

%@60500%Jan 20 09:05:19:374 2026S12608G DEV/2/INTERNALLINK_ALARM_CLEAR: Internal link alarm cleared. (PhysicalIndex=75, PhysicalName=Board 10, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:10,chipid:0,portid:6),destination port:(chiptype:5,slot:2,chipid:2,portid:30)), ErrorCode=481001, Reason=HiGig link came up.)
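To gauge how often a given HG link flaps, the down events can be counted per link from a saved log. The following is a minimal Python sketch, not a vendor tool: the regular expression is inferred only from the alarm lines quoted in this case, and the function name `count_hg_down_events` is hypothetical. Both ends of a link are folded into one key, so reports from Board 2 and Board 10 about the same link are counted together.

```python
import re
from collections import Counter

# Field layout assumed from the alarm samples in this case only.
ALARM_RE = re.compile(
    r"INTERNALLINK_ALARM_(?P<kind>OCCUR|CLEAR):.*?"
    r"sourceport:\(chiptype:\w+,slot:(?P<src_slot>[\d.]+),"
    r"chipid:(?P<src_chip>\d+),portid:(?P<src_port>\d+)\),"
    r"destination port:\(chiptype:\w+,slot:(?P<dst_slot>[\d.]+),"
    r"chipid:(?P<dst_chip>\d+)(?:,portid:(?P<dst_port>\d+))?\)\),\s*"
    r"ErrorCode=(?P<code>\d+)"
)

HG_LINK_DOWN = "481001"  # ErrorCode used for HiGig down/up in the logs above


def count_hg_down_events(lines):
    """Count 'HiGig link went down' occurrences per undirected link."""
    counts = Counter()
    for line in lines:
        m = ALARM_RE.search(line)
        if not m or m["kind"] != "OCCUR" or m["code"] != HG_LINK_DOWN:
            continue
        src = (m["src_slot"], m["src_chip"], m["src_port"])
        dst = (m["dst_slot"], m["dst_chip"], m["dst_port"])
        # frozenset makes (A -> B) and (B -> A) the same link.
        counts[frozenset((src, dst))] += 1
    return counts


# Three lines copied from the alarms in this case.
SAMPLE_LINES = [
    "%Jan 20 09:23:45:021 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:2,chipid:2,portid:30),destination port:(chiptype:5,slot:10,chipid:0,portid:6)), ErrorCode=481001, Reason=HiGig link went down.)",
    "%Jan 20 09:23:45:122 2026S12608G DEV/2/INTERNALLINK_ALARM_CLEAR: Internal link alarm cleared. (PhysicalIndex=67, PhysicalName=Board 2, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:2,chipid:2,portid:30),destination port:(chiptype:5,slot:10,chipid:0,portid:6)), ErrorCode=481001, Reason=HiGig link came up.)",
    "%Jan 20 09:23:55:774 2026S12608G DEV/2/INTERNALLINK_ALARM_OCCUR: Internal link alarm occurred. (PhysicalIndex=75, PhysicalName=Board 10, RelativeResource=(bustype:DEV LINK,sourceport:(chiptype:5,slot:10,chipid:0,portid:6),destination port:(chiptype:5,slot:2,chipid:2,portid:30)), ErrorCode=481001, Reason=HiGig link went down.)",
]

if __name__ == "__main__":
    for link, n in count_hg_down_events(SAMPLE_LINES).items():
        print(sorted(link), n)
```

A single link dominating the counts, as it does here, points the investigation at that one board pair rather than at the chassis as a whole.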

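2. The next step is therefore to find out why the HG link between the service board in slot 2 and the fabric board in slot 10 goes down and up. The following information was collected again. It shows that the service board side in slot 2 has a large number of uncorrectable FEC errors. The counters were read three times, and new errors kept accumulating after each read-and-clear: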

[S12608G-probe]dis hardware internal port hg-monitor slot 2

Fec Counter Record:

[uiLlogicport]    [Correct]   [Uncorrect]   [Clock]                       [Number]

==================================================================================

UpLinkPort_274    21669       391           03:58:32:711797 01/20/2026    1 

UpLinkPort_274    1177        101           03:59:15:170891 01/20/2026    2 

UpLinkPort_274    61202       1337          04:03:17:873256 01/20/2026    3 

UpLinkPort_274    134         11            04:06:45:534417 01/20/2026    4 

The fabric board side is normal:

[S12608G-probe]dis hardware internal port hg-monitor slot 10

Fec Counter Record:

[uiLlogicport]    [Correct]   [Uncorrect]   [Clock]                       [Number]

==================================================================================

UpLinkPort_267    0           0             03:59:19:576707 01/20/2026    1 

UpLinkPort_267    0           0             04:03:22:848997 01/20/2026    2 

UpLinkPort_267    0           0             04:06:49:787219 01/20/2026    3

 

Based on the FEC counters, the fault is judged to be on the service board in slot 2.
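The comparison of the two sides can be sketched in Python as follows. This is an illustration only: the parser assumes the six-column row layout seen in the `hg-monitor` output above, and the function names `parse_fec_records` and `slots_with_uncorrectable_errors` are hypothetical.

```python
def parse_fec_records(text):
    """Extract FEC counter rows from 'hg-monitor' output (layout assumed from this case)."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have six fields: port, corrected, uncorrected, time, date, number.
        if len(parts) == 6 and parts[1].isdigit() and parts[2].isdigit():
            records.append(
                {"port": parts[0], "corrected": int(parts[1]), "uncorrected": int(parts[2])}
            )
    return records


def slots_with_uncorrectable_errors(outputs):
    """Given {slot: raw output}, return the slots that show any uncorrectable FEC errors."""
    return sorted(
        slot
        for slot, text in outputs.items()
        if any(r["uncorrected"] > 0 for r in parse_fec_records(text))
    )


# Rows copied from the output in this case.
SLOT_2_OUTPUT = """\
[uiLlogicport]    [Correct]   [Uncorrect]   [Clock]                       [Number]
==================================================================================
UpLinkPort_274    21669       391           03:58:32:711797 01/20/2026    1
UpLinkPort_274    1177        101           03:59:15:170891 01/20/2026    2
UpLinkPort_274    61202       1337          04:03:17:873256 01/20/2026    3
UpLinkPort_274    134         11            04:06:45:534417 01/20/2026    4
"""

SLOT_10_OUTPUT = """\
[uiLlogicport]    [Correct]   [Uncorrect]   [Clock]                       [Number]
==================================================================================
UpLinkPort_267    0           0             03:59:19:576707 01/20/2026    1
UpLinkPort_267    0           0             04:03:22:848997 01/20/2026    2
UpLinkPort_267    0           0             04:06:49:787219 01/20/2026    3
"""

if __name__ == "__main__":
    print(slots_with_uncorrectable_errors({2: SLOT_2_OUTPUT, 10: SLOT_10_OUTPUT}))
```

Only the side that keeps producing uncorrectable errors after each read is flagged, which mirrors the reasoning used above to single out the slot 2 board.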

Solution

Replace the service board in slot 2.

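Note: For similar problems on devices that use domestic (国芯) switch chips, collect the output of the commands below. The HG up/down between slot 2 and slot 10 in this case is used as the example: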

1. Run the following commands and record the output. Read the counters once every 3 minutes, three times in total:

sys

prob

dis clock

dis hardware internal  port hg-monitor slot  2

dis hardware internal  port hg-monitor slot  10

 

2. Determine the internal interconnect port mapping, then collect the internal port information:

  ====display devm hgport chassis 0 slot 2==== 

(slot 2, slot 10):
             Slot  2         connect         Slot 10
  Lindex (Lchip, Lport, Gport)  |  Lindex (Lchip, Lport, Gport)
  274    (2    , 30   ,0xe1e )  |  267    (0    , 6    ,0x1e06)

 

Sys
Prob
sdk slot 2 sdk
sdk slot 2 enter/lchip/2
sdk slot 2 show/port/0xe1e/all
sdk slot 2 port/0xe1e/self-checking

After collecting the information for the slot above, exit to system view, then enter probe view again and collect:
Sys
Prob
sdk slot 10 sdk
sdk slot 10 enter/lchip/0
sdk slot 10 show/port/0x1e06/all
sdk slot 10 port/0x1e06/self-checking
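The per-slot command sequence follows mechanically from the `display devm hgport` mapping, so it can be generated rather than typed by hand. The Python sketch below is hypothetical helper code, not a vendor tool. It assumes that the Lchip and Gport values shown in the listing belong at the end of the preceding command path (for example `enter/lchip/2`), and it only builds the command strings; it does not connect to a device.

```python
import re

# One mapping entry: Lindex (Lchip, Lport, Gport), as printed by "display devm hgport".
ENTRY_RE = re.compile(
    r"(\d+)\s*\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(0x[0-9a-fA-F]+)\s*\)"
)


def parse_hgport_entries(text):
    """Return (lindex, lchip, lport, gport) string tuples found in the mapping text."""
    return ENTRY_RE.findall(text)


def sdk_collection_commands(slot, lchip, gport):
    """Build the probe-view command sequence for one side of the HG link.

    Command syntax is copied from this case; the way arguments are joined
    to the path is an assumption.
    """
    return [
        "sys",
        "prob",
        f"sdk slot {slot} sdk",
        f"sdk slot {slot} enter/lchip/{lchip}",
        f"sdk slot {slot} show/port/{gport}/all",
        f"sdk slot {slot} port/{gport}/self-checking",
    ]


# Mapping row from this case: left entry is slot 2, right entry is slot 10.
MAPPING_ROW = "274    (2    , 30   ,0xe1e )  |  267    (0    , 6    ,0x1e06)"

if __name__ == "__main__":
    left, right = parse_hgport_entries(MAPPING_ROW)
    for slot, entry in ((2, left), (10, right)):
        _lindex, lchip, _lport, gport = entry
        print("\n".join(sdk_collection_commands(slot, lchip, gport)))
        print()
```

Generating both sides from the same mapping row keeps the Lchip and Gport values paired with the correct slot, which is the easiest detail to get wrong when entering the commands manually.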