The switch shows 96% CPU usage, and display cpu-usage task shows the task named au0 at 91%. What is causing such high CPU usage?
You can refer to this case when troubleshooting high CPU usage:
http://kms2.h3c.com/View.aspx?id=41619
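1. Determine which task is using the most CPU

When CPU usage is high, first enter hidden mode and run display cpu-usage task slot Slot_ID to see which tasks are using the most CPU.

[H3C]_h
Now you enter a hidden command view for developer's testing, some commands may affect operation by wrong use, please carefully use it with our engineer's direction.
[H3C-hidecmd]display cpu-usage task slot 1
===== Current CPU usage info =====
CPU Usage Stat. Cycle: 35 (Second)
CPU Usage            : 13%
CPU Usage Stat. Time : 2015-02-06 11:09:02
CPU Usage Stat. Tick : 0x5860(CPU Tick High) 0x3e033858(CPU Tick Low)
Actual Stat. Cycle   : 0x2(CPU Tick High) 0xc140f836(CPU Tick Low)

TaskName  CPU   Runtime(CPU Tick High/CPU Tick Low)
VIDL      87%   2/6c7649a8
TICK      0%    0/ 2db28e6
STMR      0%    0/ 150d58c
RECV      0%    0/ 79145e
DSTK      0%    0/ 13247a
DDEV      0%    0/ 16dd34
L2X0      0%    0/ 6d14a84
bC.0      0%    0/ 1c77afa
bLK0      0%    0/ 9c4c2a
bC.1      0%    0/ 2192f08
bLK1      1%    0/ deeedac
DQFD      0%    0/ 347f98
DQIT      0%    0/ 1c5ccd4
LPDT      0%    0/ 11bc
STAT      0%    0/ 41f2a
FMCK      0%    0/ 112cfc
T_DM      0%    0/ 230e62
mIPC      0%    0/ 21fba8
T_VA      0%    0/ 323e6
DARP      0%    0/ 25a1fc
TPBR      0%    0/ 268fce
BGRT      0%    0/ 4e905a

2. Common tasks with high CPU usage

2.1 VIDL task

VIDL is the CPU idle task. The higher its share, the more idle the CPU is.

2.2 T_DM task

The T_DM task asynchronously deletes unicast MAC addresses from the chip. Deleting a VLAN, deleting a MAC address, or bringing a port up or down all trigger this task. If T_DM is high, first check whether the network has a loop, whether any port is flapping UP/DOWN, and whether there are large numbers of STP TC packets.

If the device receives many TC packets, the software has to refresh MAC entries frequently, and T_DM CPU usage rises. If link aggregation spans cards or devices, the L2X0/L2X1 tasks will also be high. L2X0/L2X1 are the MAC address synchronization tasks: when the device has aggregation groups with many MAC addresses under them, it performs global synchronization of aggregation MAC addresses. L2X0/L2X1 are not limited to aggregation MAC synchronization; they also handle hardware-to-software synchronization and MAC synchronization between boards.

If the device really is receiving many TC packets, use display stp tc to find the ports that receive them, then trace the TC packets back toward their source, starting from the directly connected device, until the cause is found.

<H3C>display stp tc
--------- STP chassis 1 slot 4 TC or TCN count --------
MSTID  Port                  Receive  Send
0      Bridge-Aggregation1   0        682
0      Bridge-Aggregation2   62       648
0      Bridge-Aggregation3   418      188
0      Bridge-Aggregation4   0        666
0      Bridge-Aggregation5   0        670
0      Bridge-Aggregation6   0        662
0      Bridge-Aggregation7   17       2804
0      Bridge-Aggregation8   0        688

2.3 INFO task

INFO is the information center task, responsible for outputting device information. If the device generates a large number of log or trap messages in a short time, or writes to logfile.log frequently, this task will be high. There are two main ways to deal with a high INFO task:

1) Stop the messages from being generated

Use the contents of logbuffer, trapbuffer, and logfile to find out why the messages are being generated, and fix the root cause. For example, if logbuffer contains many port UP/DOWN messages, investigate the corresponding port. Once the flapping is fixed, the log messages stop and the high INFO usage goes away with them.

2) Stop the messages from being output

If the log messages cannot be prevented, configure the device so that the unwanted messages are not output to the corresponding channel.

For example, suppose the device receives many STP TC packets and INFO is high as a result. These log messages are generated by the MSTP module at severity 6 (informational). To stop them from being recorded in logbuffer, restrict the MSTP module to output only severity 0 to 5 messages to the local logbuffer, so that severity 6 (informational) and 7 (debugging) messages are no longer recorded there. The log messages and the change are as follows:

%Jul 11 22:12:52:068 2000 S5500-58C-HI Core MSTP/6/MSTP_FORWARDING: Instance 0's port GigabitEthernet1/0/6 has been set to forwarding state.
%Jul 11 22:18:16:567 2000 S5500-58C-HI Core MSTP/6/MSTP_FORWARDING: Instance 0's port GigabitEthernet1/0/6 has been set to forwarding state.
%Jul 11 22:23:40:929 2000 S5500-58C-HI Core MSTP/6/MSTP_FORWARDING: Instance 0's port GigabitEthernet1/0/6 has been set to forwarding state.
[S5500-58C-HI Core]info-center source MSTP channel logbuffer log level ?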
  alerts         Action must be taken immediately (severity=1)
  critical       Critical conditions (severity=2)
  debugging      Debug-level messages (severity=7)
  emergencies    System is unusable (severity=0)
  errors         Error conditions (severity=3)
  informational  Informational messages (severity=6)
  notifications  Normal but significant conditions (severity=5)
  warnings       Warning conditions (severity=4)
[S5500-58C-HI Core]info-center source MSTP channel logbuffer log level notifications

As another example, if Portal is enabled, large numbers of Portal users come online and go offline during peak commuting hours, and the device generates log records and writes them to logfile.log. If the administrator does not need this online/offline information, the Portal module's logs can be kept out of logfile.log as follows.

%Mar 16 21:55:00:255 2014 S7506E PORTAL/5/PORTAL_USER_LOGON_SUCCESS: -UserName=[142702199506080314]-IPAddr=[172.16.89.124]-IfName=[Vlan-interface1009]-VlanID=[1009]-MACAddr=[0000-0000-0000]; User got online successfully.
%Mar 16 21:55:00:430 2014 S7506E PORTAL/5/PORTAL_USER_LOGOFF: -UserName=[220105196209032667]-IPAddr=[10.40.0.29]-IfName=[Vlan-interface6]-VlanID=[6]-MACAddr=[0000-0000-0000]-Reason=[User Request]; User logged off.
[S7503E]display info-center          //find the channel that logfile uses
Information Center:enabled
Log host:
    the interface name of the source address:LoopBack0
    100.1.1.122, port number : 514, host facility : local7,
    channel number : 4, channel name : logbuffer
    111.1.115.220, port number : 514, host facility : local7,
    channel number : 4, channel name : logbuffer
Console:
    channel number : 0, channel name : console
Monitor:
    channel number : 1, channel name : monitor
SNMP Agent:
    channel number : 5, channel name : snmpagent
Log buffer:
    enabled,max buffer size 1024, current buffer size 512,
    current messages 93, dropped messages 0, overwritten messages 0
    channel number : 4, channel name : logbuffer
Trap buffer:
    enabled,max buffer size 1024, current buffer size 256,
    current messages 46, dropped messages 0, overwritten messages 0
    channel number : 3, channel name : trapbuffer
logfile:
    channel number:9, channel name:channel9
syslog:
    channel number:6, channel name:channel6
Information timestamp setting:
    log - date, trap - date, debug - date, loghost - date
[S7503E]info-center source PORTAL channel 9 log state off trap state off debug state off

2.4 AGNT task

AGNT is the network management task. It handles SNMP exchanges with network management software. Once SNMP management is configured, the management server's periodic polling of the device's MIB nodes consumes some CPU. In certain cases this can drive AGNT CPU usage too high, for example:

1) The management software polls the device too frequently

Change the polling interval on the management server and lengthen it as appropriate for the actual situation.

2) Several management systems collect information from the device at the same time

Use only one management system. Occasionally an unauthorized management system may be collecting information from the device. Turn on debugging on the device to check whether SNMP packets are arriving from an unauthorized IP address:

<S7503E>terminal debugging
Info: Current terminal debugging is on.
<S7503E>terminal monitor
Info: Current terminal monitor is on.
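<S7503E>debugging snmp agent packet receive
*Feb 6 15:30:53:569 2015 S7503E SNMP/7/PACKET_SRC: packet received from 100.1.1.125 via UDP
*Feb 6 15:30:53:569 2015 S7503E SNMP/7/PACKET: get-bulk request request-id: 51 non-repeaters: 0 max-repetitions: 10

The fix is straightforward: bind the community name to a basic ACL to permit or deny management systems with specific source IP addresses.

[S7503E]acl number 2000
[S7503E-acl-basic-2000]rule permit source 100.1.1.100 0
[S7503E-acl-basic-2000]rule deny source any
[S7503E-acl-basic-2000]quit
[S7503E]snmp-agent community read public acl 2000
[S7503E]snmp-agent community write private acl 2000

3) Too much information is collected at once

If the management system collects a large number of MIB nodes in one go, the volume of data causes high CPU usage for a short time. Collect less per poll by reducing the number of MIB nodes in each polling cycle.

2.5 bDPC task

The bDPC task handles error alarms produced by chip anomalies while the device is running. Over long periods of operation there is a small probability that certain chip table entries become corrupted, which produces a continuous stream of chip-level error alarms. bDPC has to record these alarms, so it keeps using a lot of CPU, which in turn pushes the CPU usage of that slot too high (above 60%). The corrupted entries are usually ones the device is not currently using, so services are generally unaffected. However, sustained high CPU usage takes CPU time away from other tasks on the same slot and reduces their efficiency.

To confirm why bDPC CPU usage is high, view the local logbuffer of the slot with the following command:

[HP-diagnose]local logbuffer 0 display
Feb 13 2012 15:37:57:0301:unit 0 L2X entry 1146 parity error
Feb 13 2012 15:37:57:0302:unit 0 L2X entry 1146 parity error
Feb 13 2012 15:37:57:0302:unit 0 L2X entry 1146 parity error

If error messages like these appear in the local logbuffer, the high bDPC CPU usage is caused by one of the chip hardware table errors below. Five kinds of hardware table error are currently known to make bDPC rise and stay high:

1) VLAN_XLATE entry parity error
2) L2X entry parity error
3) ING_IPFIX_SESSION_TABLE/ EGR_IPFIX_SESSION_TABLE entry parity error
4) L3_ENTRY_ONLY entry parity error
5) START_BY_START_ERR

Solutions:

1) Upgrading to the latest software version resolves the CPU usage caused by these alarms.
2) If an on-site upgrade is not possible, note that the high CPU usage from this task does not actually affect services. As a workaround, reboot the service board with high CPU usage at a suitable time.

2.6 PTMT task

PTMT is the Portal task. High PTMT usage is usually caused by many users authenticating in a short time, or by a large number of Portal free rules being issued while the device is online. In this case the CPU returns to normal after a while.

2.7 FMCK task

FMCK polls the optical modules to detect in real time which type of module has been inserted. The task runs whenever optical modules are present, and it is higher when many modules are inserted.

2.8 bLK0/bLK1 tasks

These are the port scanning tasks. They scan all chip ports every 50 ms, so CPU usage is somewhat higher on devices with many interfaces.

2.9 vt0 task

The vt task is the one used by telnet. It runs when someone logs in to the device remotely over telnet, and its CPU usage is higher when telnet produces a lot of output.

2.10 bRX1/bRX2/SOCK

bRX1/bRX2/SOCK are the tasks that receive packets for the CPU. They are high when many packets are sent to the CPU. The following situations cause many packets to be sent to the CPU:

1) Loops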
A loop in the network causes some multicast/broadcast packets destined for the CPU, such as ARP packets, to be sent to the CPU over and over. Check whether traffic on the port is abnormally high and whether multicast/broadcast packets make up a large share of it. Alternatively, look at the chip's MAC address move records to see whether moves were recorded at the time of the fault and whether the move count keeps growing.

<H3C>display interface FortyGigE0/0/1
…………
Peak value of input: 34047398 bytes/sec, at 2014-07-16 00:35:40
Peak value of output: 4515355087 bytes/sec, at 2014-07-15 23:59:47
Last 300 seconds input: 0 packets/sec 10 bytes/sec 0%
Last 300 seconds output: 32522615 packets/sec 4122836214 bytes/sec 95%
Input (total): 54318147 packets, 10536126343 bytes
    0 unicasts, 1 broadcasts, 54201062 multicasts, 0 pauses
Input (normal): 54201063 packets, - bytes
    0 unicasts, 1 broadcasts, 54201062 multicasts, 0 pauses
Input: 128522 input errors, 0 runts, 0 giants, 0 throttles
    117084 CRC, 0 frame, - overruns, 11438 aborts
    - ignored, - parity errors
Output (total): 1812214814705 packets, 232088774233621 bytes
    0 unicasts, 27407065122 broadcasts, 1784807749583 multicasts, 0 pauses
Output (normal): 1812214814705 packets, - bytes
    0 unicasts, 27407065122 broadcasts, 1784807749583 multicasts, 0 pauses
Output: 0 output errors, - underruns, 0 buffer failures
    0 aborts, 0 deferred, 0 collisions, 0 late collisions
    0 lost carrier, - no carrier

[H3C-diagnose]debug l2 1 0 mac/move_rec/show
===================L2MACMOVEMODULE INFO=============================
L2MacMoveModule Enabled
L2MacMoveDebug Switch Off
===========================L2MACMOVE Record INFO====================
MacAddress         Vlan Agg Mod Port ->Agg Mod Port Cnt LatestTime         Del
34:40:b5:b2:42:15  19   1   0   6    ->1   0   0    7   2012/12/18 15:38:4 1
36:6c:67:93:9a:1f  31   1   0   6    ->1   0   0    14  2012/12/18 15:38:4 1
0 :5 :b7:8 :7d:39  128  1   0   6    ->1   0   0    16  2012/12/18 15:38:4 1
0 :50:56:b9:79:c6  18   1   0   6    ->1   0   0    10  2012/12/18 15:38:4 1

2) Table entries exceed specifications

When entries exceed the specification, they cannot be installed in hardware. The entries that fail to install generate error messages, and because there is no hardware entry, the related packets are sent to the CPU for processing. Both effects end in high CPU usage. Check the device's actual table-entry specifications to see whether they meet current service needs, or use commands to view each table's specification and actual occupancy and decide whether anything is over the limit. For example, the following command shows whether the current ARP, ND, and route counts on Slot 1 are within specification:

[H3C-diagnose]debug l3intf-drv show statistics slot 1
**********************************************************
-
L3INTF Statistics Slot 1
**********************************************************
- NH: 8192
- ARP SPECIFICATION: 8192          //ARP specification
COUNT: 3734                        //number of ARP entries installed on Slot 1
NHCOUNT: 3734
- IPV4 ROUTE SPECIFICATION: 12288  //route specification
COUNT: 1050                        //number of routes installed on Slot 1
- ND SPECIFICATION: 4096           //ND specification
COUNT: 1139                        //number of ND entries installed on Slot 1
NHCOUNT: 1141
SPECIFICATION: 6144
ROUTE COUNT: 0

3) Attack packets

The network carries a large number of protocol packets that are sent to the CPU, such as ARP, STP, OSPF, and VRRP.

4) ip unreachables enable is configured

With this command configured, any IP packet for which the hardware table lookup finds no outgoing interface is sent to the CPU for software forwarding. Common reasons for a failed lookup include routes exceeding the specification and route entries not being installed correctly. Unless there is a specific requirement, configuring this command is not recommended.

When many packets are being sent to the CPU, troubleshoot with the following methods:

1) Identify which port sends the most packets to the CPU

In diagnose mode, run debug rxtx show Slot_ID Chip_ID to see which port receives the most CPU-bound packets.

[H3C-diagnose]debug rxtx show 1 0
RxDv: Dv=4,Dvhead=0x163d2308,Dvtail=0x163d3748,token=0,Pps=0
TxDv: Dv=0,Dvactive=0x0,Dvfree=0x17b1bb98,Dvfreecnt=1
Intr: Desc=989890,Chain=330224,Tx=33,Rx=1320769
Cos[0]=0 Cos[1]=0 Cos[2]=0 Cos[3]=0
Cos[4]=0 Cos[5]=0 Cos[6]=0 Cos[7]=1320226
P01_rx=0 P02_rx=0 P03_rx=0 P04_rx=0
P05_rx=86 P06_rx=0 P07_rx=432 P08_rx=0
P09_rx=0 P10_rx=0 P11_rx=0 P12_rx=0
P13_rx=0 P14_rx=0 P15_rx=0 P16_rx=0
P17_rx=0 P18_rx=0 P19_rx=25 P20_rx=0
P21_rx=0 P22_rx=0 P23_rx=0 P24_rx=0
P25_rx=0 P26_rx=0 P27_rx=325973 P28_rx=645689
P29_rx=348564 P30_rx=0 P31_rx=0 P32_rx=0
P33_rx=0 P38_rx=0 P39_rx=0 P40_rx=0
P41_rx=0 P42_rx=0 P43_rx=0 P44_rx=0
P45_rx=0 P50_rx=0 P51_rx=0 P52_rx=0

Note: the port numbers in P01_rx through P52_rx are chip port numbers. Use debug port mapping Slot_ID to find the panel port that corresponds to a chip port. For example, the output below shows that chip port 3 corresponds to panel port GE1/0/1.

[H3C-diagnose]debug port mapping 1
[Interface] [Unit][Port][Name][Combo?][Active?][IfIndex] [MID][Link] [Attr]
=====================================================================
GE1/0/1     0     3     ge2   no      no       0x900000  4    down   Bridge
GE1/0/2     0     2     ge1   no      no       0x900001  4    down   Bridge
GE1/0/3     0     5     ge4   no      no       0x900002  4    up     Bridge
GE1/0/4     0     4     ge3   no      no       0x900003  4    down   Bridge
GE1/0/5     0     7     ge6   no      no       0x900004  4    down   Bridge
GE1/0/6     0     6     ge5   no      no       0x900005  4    down   Bridge

2) Identify which type of packet is sent to the CPU most

Run debug rxtx softcar show Slot_ID to see, for each protocol, the rate of packets sent to the CPU and the cumulative number dropped.

[H3C-diagnose]debug rxtx softcar show 1
ID  Type         Pkt_PSec  DisPkt_All  Pps  Dynamic  Switch  Hash  ACLmax
…………
28  IPV4_AUTORP  0         0           100  S        On      SMAC  8
29  ARP          53        43123       100  S        On      SMAC  8
30  ARP_REPLY    0         0           100  S        On      SMAC  8
31  DHCP_CLIENT  0         0           100  S        On      SMAC  8
32  DHCP_SERVER  0         0           100  S        On      SMAC  8

You can also use debug rxtx catch to count CPU-bound packets by source/destination IP address, source/destination MAC address, VLAN ID, packet type, and so on. To count by Ethernet frame type, for example, start counting with debug rxtx catch by etype Slot_ID, wait a while, then stop with debug rxtx catch end Slot_ID. The result is printed when counting ends.

[H3C-diagnose]debug rxtx catch by ?
  da      Dest packet mac
  dip     Dest IP
  etype   Packet type
  iptype  Packet IP type
  sa      Source packet mac
  sip     Source IP
  vlan    VLAN
[H3C-diagnose]debug rxtx catch by etype 1    //start counting
Slot 1: information of Module RxTx
[H3C-diagnose]debug rxtx catch end 1         //stop counting
Slot 1: information of Module RxTx
The Catch Result of etype is :
806 -------- 94    //Ethernet frame type 0x0806 is ARP; 94 ARP packets were counted
[H3C-diagnose]

The statistics above show that many ARP packets are being sent to the CPU and that many are being dropped. At this point, turn on debugging for the corresponding protocol module on the Comware platform and look at the packets the ARP module receives and sends. This shows the source and destination MAC addresses and the payload of the ARP packets, which can then be used to trace where the packets come from.

<H3C>terminal debugging
Info: Current terminal debugging is on.
<H3C>terminal monitor
Info: Current terminal monitor is on.
<H3C>debugging arp packet
<H3C>debug ethernet packet
*Apr 27 06:21:19:417 2000 H3C ETH/7/eth_rcv: Receive an eth packet, interface: GigabitEthernet1/0/1, format: 0, prototype: 0806, src_addr: 00d0-f800-0001, dst_addr: ffff-ffff-ffff
*Apr 27 06:21:19:637 2000 H3C ARP/7/arp_rcv: Receive an ARP Packet, operation : 1, sender_eth_addr : 00d0-f800-0001, sender_ip_addr : 20.1.1.1, target_eth_addr : 0000-0000-0000, target_ip_addr : 20.1.1.254

3) Print the packets sent to the CPU

To see the actual contents of CPU-bound packets, print them. There may be a great many of them, so printing everything is of little use. Print selectively by packet characteristics instead, filtering on source/destination MAC address, source/destination IP address, VLAN, packet type, and so on. For example, first set the filter with display rxtx source-mac Source-MAC-Address Slot_ID so that only packets with a specific source MAC address are output, then print them with debug rxtx -c Num -s Len pkt Slot_ID. The argument after "-c" is the number of packets to print, and the argument after "-s" is the length of each packet to print.

<H3C>terminal debugging
Info: Current terminal debugging is on.
<H3C>terminal monitor
Info: Current terminal monitor is on.
<H3C>system-view
System View: return to User View with Ctrl+Z.
[H3C]en_diag
CAUTION : Now you enter an en_diag command view for developer's testing, some commands may be dangerous, please carefully use it with our engineer's direction.
[H3C-diagnose]display rxtx ?
  all         All packet
  broadcast   Broadcast packet
  chip        Chip
  cos         COS
  dest_mac    Dest packet mac
  dip         Dest IP
  dipv6       Dest ipv6
  etype       Packet ethernet type
  iptype      Packet IP type
  length      length
  matchrule   rx match rule
  multicast   Multicast packet
  port        Port
  reason      Receive packet reason
  receive     Receive packet
  send        Send packet
  sip         Source IP
  sipv6       source ipv6
  source_mac  Source packet mac
  switchflag  display switch flag
  unicast     Unicast packet
  vlan        VLAN
  vp          VP packet
[H3C-diagnose]display rxtx source-mac 00d0-f800-0001 1
[H3C-diagnose]debug rxtx -c 5 -s 100 pkt 1
Slot 1: information of Module RxTx
*Apr 27 07:58:54:514 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
Debug RxTx packet is on!
*Apr 27 07:58:54:709 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
[H3C-diagnose]
*Apr 27 07:58:55:239 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
*Apr 27 07:58:55:460 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
*Apr 27 07:58:55:960 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
*Apr 27 07:58:56:161 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
*Apr 27 07:58:56:662 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
*Apr 27 07:58:56:852 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
*Apr 27 07:58:57:354 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
*Apr 27 07:58:57:557 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
*Apr 27 07:58:58:057 2000 H3C RXTX/7/pkt: From board 1: debug RxTx packet is off!
[H3C-diagnose]display rxtx all 1
Slot 1: information of Module RxTx

The printed output shows more than the packet contents. Combined with the debug port mapping Slot_ID information, it also shows which port the packets were received on (chip0,port3, sMod=4,sPort=3), the priority with which they were sent to the CPU (cos=8), and the reason they were sent to the CPU (reason=0x1000). All of this is valuable for locating the problem.

The packet contents are in hexadecimal, however, so anyone not familiar enough with the packet structure needs a packet analysis tool to decode them. Wireshark is used as the example here. From CMD, run text2pcap.exe, a utility shipped in the Wireshark installation directory, to convert the captured packets into a capture file that Wireshark can open directly.

C:\>cd C:\Program Files\Wireshark
C:\Program Files\Wireshark>text2pcap.exe captureCPU.txt captureCPU.cap
Input from: captureCPU.txt
Output to: captureCPU.cap
Output format: PCAP
Wrote packet of 68 bytes.
Wrote packet of 68 bytes.
Wrote packet of 68 bytes.
Wrote packet of 68 bytes.
Wrote packet of 68 bytes.
Read 10 potential packets, wrote 5 packets (444 bytes).
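
Preparing the input file for text2pcap can also be scripted. The following is a minimal Python sketch, not part of the original case, that keeps only the offset-plus-bytes rows of a saved console log and drops everything else (timestamps, separator lines, prompts). It assumes each hex row sits on its own line in the saved log, as it does on the console, with a four-digit hex offset and at most 16 bytes per row. The function name extract_hexdump is hypothetical.

```python
import re

# One hex-dump row: a 4-digit hex offset followed by 1 to 16 two-digit hex bytes.
HEX_ROW = re.compile(r"^\s*([0-9A-Fa-f]{4})((?:\s+[0-9A-Fa-f]{2}){1,16})\s*$")


def extract_hexdump(log_text: str) -> str:
    """Return only the hex-dump rows of a debug log, one row per line."""
    rows = []
    for line in log_text.splitlines():
        match = HEX_ROW.match(line)
        if match is None:
            continue  # log header, dashed separator, prompt, or blank line
        offset = match.group(1).lower()
        data = " ".join(match.group(2).split()).lower()
        rows.append(f"{offset} {data}")
    # End with a newline so the result can be written straight to a file.
    return "\n".join(rows) + ("\n" if rows else "")
```

Writing the returned text to captureCPU.txt gives a file in which every row at offset 0000 starts a new packet, which is how text2pcap separates packets in a hex dump.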
Open the converted capture file captureCPU.cap in Wireshark.

Sometimes text2pcap cannot convert the packets correctly. In that case, edit the raw printed output by hand and remove everything except the packet contents. The result should look like this:

0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00

Then import the cleaned text file into Wireshark.

3. Collect the call stack of the task using high CPU

The sections above list the tasks that commonly use a lot of CPU. If some other task is high and its purpose cannot be determined from its name, or the reason for its high CPU usage cannot be found, collect the task's call stack and send it to an L3 engineer or to R&D for analysis. Collection takes two steps:

1) Find the index of the abnormal task

Using the name of the abnormal task, run display task slot Slot_ID in hidden mode to find the Vid value of the task with high CPU usage.

[H3C]_h
Now you enter a hidden command view for developer's testing, some commands may affect operation by wrong use, please carefully use it with our engineer's direction.
[H3C-hidecmd]display task slot 2
name  Tid       Vid  TSize  Mod  priority  Status        Total/Max/Last(Millsecs)
==========================================================================
VIDL  85fdb000  1    40     P    1         preemptready  291121396/ 11/ 1
TICK  85fd1c00  2    40     P    250       preemptready  777699/ 1/ 0
STMR  85fd1a00  3    40     N    150       eventblock    213779/ 38/ 0
dGDB  85fd1800  4    40     N    180       eventblock    0/ 0/ 0
RECV  85fd1600  5    39     N    216       semblock      264509/ 101/ 0
DSTK  85fd1400  6    40     N    140       sleep         28878/ 1/ 0
DST2  85fd1200  7    40     N    180       eventblock    0/ 0/ 0
FEVT  85fd1000  8    40     N    180       eventblock    0/ 0/ 0
DDEV  85fd0e00  9    40     N    140       eventblock    0/ 0/ 0
SUBC  85fd0c00  10   40     N    140       sleep         1619/ 0/ 0
bDPC  85fd0a00  11   32     N    95        semblock      0/ 0/ 0
L2X0  85fd0800  12   32     N    55        sleep         3131055/ 2/ 0
bC.0  85fd0600  13   32     N    55        semblock      4659292/ 12/ 9
bTX   85fd0400  14   32     N    140       semblock      0/ 0/ 0

2) Print the call stack of the abnormal task

Using the task index obtained above, run display task Vid slot Slot_ID to print the call stack of the abnormal task.

[H3C-hidecmd]display task 12 slot 2
Task name               : L2X0
Task PLAT Index         : 12
Task OS Index           : 0x85fd0800
Task StackTop           : 0x82330000
Task priority           : 55
Task Status             : sleep
Last run time(CPU Tick) : 0x0(high) 0x6a32(low)
Max run time(CPU Tick)  : 0x0(high) 0x2a25f(low)
Total run time(CPU Tick): 0x30(high) 0xbd04b79d(low)
Stack Information:
0x822303c0 0x82233100 0x829c4ba4 0x829bb550 0x837be928

Solution

See the cause analysis section.

Recommendations and summary

1. Print CPU-bound packets over the console port, not over a remote telnet session. Otherwise a large number of telnet packets may appear in the output and make the problem harder to locate.
2. After setting a filter with the display rxtx command, remember to restore the filter with display rxtx all Slot_ID when you are done.
3. When reporting to L3 or R&D, also provide the diagnostic information collected at the time of the fault and the logfile.log file.
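
As a small aid for step 1) of section 3, the Vid lookup can be scripted once the display task slot Slot_ID output has been saved as text. The following is a minimal Python sketch, not an H3C tool. The helper names find_vid and stack_command are hypothetical, and the column order (name, Tid, Vid) is assumed from the sample output in this case.

```python
from typing import Optional


def find_vid(task_table: str, task_name: str) -> Optional[int]:
    """Return the Vid of task_name from saved 'display task slot' output."""
    for line in task_table.splitlines():
        fields = line.split()
        # Data rows start with: name, Tid (hex), Vid (decimal).
        # The header row and the '====' separator fail the isdigit() check.
        if len(fields) >= 3 and fields[0] == task_name and fields[2].isdigit():
            return int(fields[2])
    return None  # task not present in this table


def stack_command(vid: int, slot: int) -> str:
    """Build the follow-up command that prints the task's call stack."""
    return f"display task {vid} slot {slot}"
```

With the table from this case, looking up L2X0 yields the index used in the display task 12 slot 2 example. The command string is only built, not sent to the device.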
It says the data does not exist.
*Apr 27 07:58:54:709 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
[H3C-diagnose]
*Apr 27 07:58:55:239 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
*Apr 27 07:58:55:460 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
*Apr 27 07:58:55:960 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
*Apr 27 07:58:56:161 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
*Apr 27 07:58:56:662 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
*Apr 27 07:58:56:852 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
*Apr 27 07:58:57:354 2000 H3C RXTX/7/pkt: From board 1: received packet from chip0,port3,reason=0x1000,cos=8,sMod=4,sPort=3,len=68, Matched=29,time is 0
*Apr 27 07:58:57:557 2000 H3C RXTX/7/pkt:
-----------------------------------------------------
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
-----------------------------------------------------
*Apr 27 07:58:58:057 2000 H3C RXTX/7/pkt: From board 1: debug RxTx packet is off!
[H3C-diagnose]display rxtx all 1
Slot 1: information of Module RxTx

The printed output shows more than the packet content. Combined with the debug port mapping Slot_ID output, it also shows which port the packets were received on (chip0,port3, sMod=4,sPort=3), the priority with which they were sent to the CPU (cos=8), and the reason they were sent to the CPU (reason=0x1000). All of this is valuable for locating the problem.

The packet content is in hexadecimal, so unless you know the packet structure well you will need a protocol analyzer to decode it. Wireshark is used as the example here. From a command prompt, run text2pcap.exe, a tool shipped in the Wireshark installation directory, to convert the captured packets into a capture file that Wireshark can open directly.

C:\>cd C:\Program Files\Wireshark
C:\Program Files\Wireshark>text2pcap.exe captureCPU.txt captureCPU.cap
Input from: captureCPU.txt
Output to: captureCPU.cap
Output format: PCAP
Wrote packet of 68 bytes.
Wrote packet of 68 bytes.
Wrote packet of 68 bytes.
Wrote packet of 68 bytes.
Wrote packet of 68 bytes.
Read 10 potential packets, wrote 5 packets (444 bytes).
C:\Program Files\Wireshark>

Then open the converted capture file captureCPU.cap in Wireshark.

Sometimes text2pcap cannot convert the packets. In that case, edit the raw printout by hand and remove everything except the packet content. The cleaned-up text looks like this:

0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00
0000 ff ff ff ff ff ff 00 d0 f8 00 00 01 81 00 00 14
0010 08 06 00 01 08 00 06 04 00 01 d0 00 00 00 01 00
0020 14 01 01 01 00 00 00 00 00 00 14 01 01 fe 00 00
0030 00 00 00 00 00 00 00 01 00 02 80 00 63 05 00 00
0040 00 00 00 00

Then import the cleaned-up text file into Wireshark.

3. Collect the call stack of the process with high CPU usage
The sections above list the processes that commonly show high CPU usage. If a different process is high and its role cannot be determined from its name, or the cause of its high CPU usage cannot be found, collect its call stack and send it to L3 engineers or R&D for analysis. Collection takes two steps:

1) Find the index of the abnormal task
Using the name of the abnormal task, run display task slot Slot_ID in hidden mode to find the Vid value of the task with high CPU usage.

[H3C]_h
Now you enter a hidden command view for developer's testing, some commands may affect operation by wrong use, please carefully use it with our engineer's direction.
[H3C-hidecmd]display task slot 2
name  Tid       Vid  TSize  Mod  priority  Status        Total/Max/Last(Millsecs)
==========================================================================
VIDL  85fdb000  1    40     P    1         preemptready  291121396/ 11/ 1
TICK  85fd1c00  2    40     P    250       preemptready  777699/ 1/ 0
STMR  85fd1a00  3    40     N    150       eventblock    213779/ 38/ 0
dGDB  85fd1800  4    40     N    180       eventblock    0/ 0/ 0
RECV  85fd1600  5    39     N    216       semblock      264509/ 101/ 0
DSTK  85fd1400  6    40     N    140       sleep         28878/ 1/ 0
DST2  85fd1200  7    40     N    180       eventblock    0/ 0/ 0
FEVT  85fd1000  8    40     N    180       eventblock    0/ 0/ 0
DDEV  85fd0e00  9    40     N    140       eventblock    0/ 0/ 0
SUBC  85fd0c00  10   40     N    140       sleep         1619/ 0/ 0
bDPC  85fd0a00  11   32     N    95        semblock      0/ 0/ 0
L2X0  85fd0800  12   32     N    55        sleep         3131055/ 2/ 0
bC.0  85fd0600  13   32     N    55        semblock      4659292/ 12/ 9
bTX   85fd0400  14   32     N    140       semblock      0/ 0/ 0

2) Print the call stack of the abnormal task
With the task index obtained above, run display task Vid slot Slot_ID to print the call stack of the abnormal task.

[H3C-hidecmd]display task 12 slot 2
Task name : L2X0
Task PLAT Index : 12
Task OS Index : 0x85fd0800
Task StackTop : 0x82330000
Task priority : 55
Task Status : sleep
Last run time(CPU Tick) : 0x0(high) 0x6a32(low)
Max run time(CPU Tick) : 0x0(high) 0x2a25f(low)
Total run time(CPU Tick): 0x30(high) 0xbd04b79d(low)
Stack Information:
0x822303c0
0x82233100
0x829c4ba4
0x829bb550
0x837be928

Solution
See the cause analysis section.

Recommendations and summary
1. Print CPU-bound packets through the console port, not over a remote telnet session. Otherwise a large number of telnet packets may show up in the output and make the problem harder to locate.
2. After setting a filter with the display rxtx command, remember to restore the filter with display rxtx all Slot_ID when you are done.
3. When reporting to L3 and R&D, also include the diagnostic information collected at the time of the fault and the logfile.log file.
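
The manual cleanup described earlier, for cases where text2pcap cannot convert the raw printout, could also be scripted. Below is a minimal sketch in Python. It assumes the console log was saved with one dump row per line, as the device prints it. The function name clean_rxtx_dump and the file names in the usage comment are illustrative only and are not part of any H3C or Wireshark tool.

```python
import re
import sys

# A dump row: a 4-digit hex offset followed by 1 to 16 two-digit hex bytes.
# This row shape is inferred from the printout above, not from a published format.
ROW = re.compile(r"^\s*([0-9A-Fa-f]{4})((?:\s+[0-9A-Fa-f]{2}){1,16})\s*$")


def clean_rxtx_dump(lines):
    """Return only the hex-dump rows found in `lines`, normalized to single spaces."""
    rows = []
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # timestamped log header, dashed separator, CLI prompt, ...
        offset = match.group(1)
        data = " ".join(match.group(2).split())
        rows.append(f"{offset} {data}")
    return rows


if __name__ == "__main__":
    # Hypothetical usage: python clean_dump.py console.log captureCPU.txt
    if len(sys.argv) == 3:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as src:
            cleaned = clean_rxtx_dump(src)
        with open(sys.argv[2], "w", encoding="utf-8") as dst:
            dst.write("\n".join(cleaned) + "\n")
```

text2pcap treats a row with offset 0000 as the start of a new packet, so the filtered rows can be passed to it without adding separators between packets.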