H3C S3600 High CPU Usage Troubleshooting Case
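1. Networking:
None.
2. Problem description:
The device had been running normally, but recently the CPU usage on the S3600 has frequently been very high, as shown below: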
-------------------- display cpu --------------------
Unit 1
Board 0 CPU busy status:
95% in last 5 seconds
61% in last 1 minute
39% in last 5 minutes
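3. Analysis:
This kind of problem is usually caused by too many packets being sent to the CPU. Check the collected diagnostic information to see which port is delivering the most packets. In newer software versions, the diagnostic information includes the output of display Driver NI, which shows details of the packets sent to the CPU, for example: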
-------------------- display Driver NI --------------------
Display NI packet queue:
Que Inpos Outpos packet-num remain-num que-length full-error
XmitQueue0 589 588 1 1499 1500 0
XmitQueue1 0 0 0 1500 1500 0
XmitQueue2 1243 1243 0 1500 1500 0
XmitQueue3 232 232 0 1500 1500 122770
XmitQueue4 753 724 29 1471 1500 0
XmitQueue5 0 0 0 1500 1500 0
XmitQueue6 0 0 0 1500 1500 0
XmitQueue7 1189 1189 0 1500 1500 0
NI memory malloc-free:
All_malloc 1003087443
All_Free 1003087409
Drv_Malloc 688206663
Drv_Free 356650491
Plat_Malloc 314880780
Plat_Free 646436918
dma_malloc 486507
dma_free 486507
rx_malloc 690780143
rx_free 689826534
total tx 325448106
total tx ok 325934613
tx error 0
All_malloc should equal to All_Free, dma_malloc==dma_free, rx_malloc==rx_free
NI HandShake packet send 0 receive 0 fault times 0
type total count head tail drop
IUC 2000 0 0 0 0
IPC 1000 0 0 0 0
DDP 64 0 0 0 0
Display IUC packet counter:
unit send sendOK receive receiveok
unit 1 0 0 0 0
unit 2 0 0 0 0
unit 3 0 0 0 0
unit 4 0 0 0 0
unit 5 0 0 0 0
unit 6 0 0 0 0
unit 7 0 0 0 0
unit 8 0 0 0 0
the average CPU packet rx-rate(pkt/sec) during last 5 seconds:
CosQ-0 CosQ-1 CosQ-2 CosQ-3 CosQ-4 CosQ-5 CosQ-6 CosQ-7 All
-------------------------------------------------------------------------------
2 16 96 290 0 3 0 0 408
CPU packet rx-rate over threshold 355395 times, recent 10 times recorded:
--Record 1-- Feb 22 2009 14:17:41
CPU usage: 99%, RX-RATE: CPU-408, CosQ3-290, Pri-3, by Protocol/ by Port:
TELNET-282, Other-8
(0,11)-282
--Record 2-- Feb 22 2009 14:17:36
CPU usage: 85%, RX-RATE: CPU-315, CosQ3-211, Pri-3, by Protocol/ by Port:
TELNET-207, Other-3
(0,11)-207
--Record 3-- Feb 22 2009 14:16:56
CPU usage: 99%, RX-RATE: CPU-276, CosQ3-166, Pri-3, by Protocol/ by Port:
TELNET-165, Other-1
(0,11)-165
--Record 4-- Feb 22 2009 14:16:51
CPU usage: 99%, RX-RATE: CPU-285, CosQ3-187, Pri-3, by Protocol/ by Port:
TELNET-185, Other-1
(0,11)-185
--Record 5-- Feb 22 2009 14:16:46
CPU usage: 99%, RX-RATE: CPU-360, CosQ3-262, Pri-3, by Protocol/ by Port:
TELNET-262,
(0,11)-262
--Record 6-- Feb 22 2009 14:16:41
CPU usage: 61%, RX-RATE: CPU-225, CosQ3-120, Pri-3, by Protocol/ by Port:
TELNET-118, Other-1
(0,11)-118
--Record 7-- Feb 22 2009 14:16:31
CPU usage: 38%, RX-RATE: CPU-200, CosQ2-169, Pri-0, by Protocol/ by Port:
BC-169,
(0,12)-168
--Record 8-- Feb 22 2009 14:15:51
CPU usage: 38%, RX-RATE: CPU-157, CosQ2-114, Pri-0, by Protocol/ by Port:
BC-112, Other-1
(0,12)-112
--Record 9-- Feb 22 2009 14:15:31
CPU usage: 40%, RX-RATE: CPU-150, CosQ2-102, Pri-0, by Protocol/ by Port:
BC-102,
(0,12)-100
--Record 10-- Feb 22 2009 14:14:31
CPU usage: 38%, RX-RATE: CPU-141, CosQ2-118, Pri-0, by Protocol/ by Port:
BC-117,
(0,12)-117
In this output, the part to focus on is the following:
CPU usage: 99%, RX-RATE: CPU-408, CosQ3-290, Pri-3, by Protocol/ by Port:
TELNET-282, Other-8
(0,11)-282
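This record shows which port was sending the most packets to the CPU while CPU usage was high.
(0,11)-282 ----------- means that 282 packets from port e1/0/12 were sent to the CPU.
The fields are explained below.
(x,y)-z
x Internal chip number; always 0 on the S3600
y Internal chip port number
0~23 correspond to e1/0/1 to e1/0/24 (non-3Com-branded devices), the 24 FE ports on the left
24~27 correspond to the 4 GE ports, g1/1/1 to g1/1/4
32~55 correspond to e1/0/25 to e1/0/48 (non-3Com-branded 52-port devices), the 24 FE ports on the right
z Number of packets counted when the record was taken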
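The mapping above can be sketched in Python as follows. This is only an illustration under the assumptions stated in this case (a non-3Com-branded S3600 and the port ranges listed above); the function name chip_port_to_interface is hypothetical and is not part of any H3C tool.

```python
def chip_port_to_interface(chip, port):
    """Translate an internal (chip, port) pair into an interface name.

    Ranges follow the table in this case; anything outside them is rejected.
    """
    if chip != 0:
        # The case states the chip number is always 0 on the S3600.
        raise ValueError(f"unexpected chip number: {chip}")
    if 0 <= port <= 23:
        # Left-hand 24 FE ports: 0 -> e1/0/1 ... 23 -> e1/0/24
        return f"e1/0/{port + 1}"
    if 24 <= port <= 27:
        # Four GE ports: 24 -> g1/1/1 ... 27 -> g1/1/4
        return f"g1/1/{port - 23}"
    if 32 <= port <= 55:
        # Right-hand 24 FE ports: 32 -> e1/0/25 ... 55 -> e1/0/48
        return f"e1/0/{port - 7}"
    raise ValueError(f"internal port {port} is not covered by the table")


print(chip_port_to_interface(0, 11))
```

For the record discussed here, (0,11) maps to e1/0/12, which matches the interpretation given above.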
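With this information we can identify what is driving the CPU usage up. In a stack, collect the diagnostic information of every unit through its console port. This information can then be combined with a packet capture on the corresponding port and with debugging of the packets sent to the CPU to determine the content and source of those packets.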
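When the diagnostic file is long, the over-threshold records can be summarized with a short script. The sketch below assumes the record layout shown in this case (a "CPU usage: N%" line followed later by an "(x,y)-z" line); the function name packets_by_port and the min_cpu parameter are hypothetical, and the sample text is copied from two of the records above.

```python
import re
from collections import Counter

SAMPLE = """\
--Record 1-- Feb 22 2009 14:17:41
CPU usage: 99%, RX-RATE: CPU-408, CosQ3-290, Pri-3, by Protocol/ by Port:
TELNET-282, Other-8
(0,11)-282
--Record 7-- Feb 22 2009 14:16:31
CPU usage: 38%, RX-RATE: CPU-200, CosQ2-169, Pri-0, by Protocol/ by Port:
BC-169,
(0,12)-168
"""

USAGE_RE = re.compile(r"CPU usage:\s*(\d+)%")
PORT_RE = re.compile(r"^\((\d+),(\d+)\)\s*-\s*(\d+)$")


def packets_by_port(text, min_cpu=0):
    """Sum the per-port counts of all records whose CPU usage is >= min_cpu."""
    totals = Counter()
    cpu = None  # CPU usage of the record currently being read
    for raw in text.splitlines():
        line = raw.strip()
        usage = USAGE_RE.search(line)
        if usage:
            cpu = int(usage.group(1))
            continue
        port = PORT_RE.match(line)
        if port and cpu is not None and cpu >= min_cpu:
            chip, chip_port, count = map(int, port.groups())
            totals[(chip, chip_port)] += count
    return totals


# Only consider records taken while the CPU was at 90% or more.
for (chip, chip_port), count in packets_by_port(SAMPLE, min_cpu=90).most_common():
    print(f"({chip},{chip_port}) {count}")
```

Filtering on CPU usage separates the ports that were active during the spikes from those that only appear in the lower-usage records, which is the distinction the records in this case illustrate.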
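4. Solution:
Use the diagnostic information or the packet capture to locate the source of the packets sent to the CPU, and handle it accordingly.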