Intermittent hangs on an HPE ProLiant DL380 Gen9 server at a customer site

Published 2018-04-06
Author: 周锋

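An HPE ProLiant DL380 Gen9 server at a customer site runs CentOS 7.01 and hangs intermittently. When it hangs, services stop running and the server's NICs cannot be pinged. iLO 4 remains reachable, but the iLO remote console shows a black screen with no output and the keyboard does not respond. The server can be restarted remotely through iLO; after the restart it boots normally and services recover, but the hang recurs after some time.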

No alerts were found, and the server's AHS (Active Health System) log contains no errors.

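The hardware AHS log contains no alerts and shows nothing abnormal during the periods when the server hung. It only records the manually triggered restarts that followed shortly after each hang. Separately, the server's BIOS and P440ar controller firmware are somewhat behind the latest versions.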

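Analysis of the operating system's sosreport logs found OOM (Out Of Memory) records in the period before each hang. The cause was eventually traced to the user's l3fw process: after l3fw was stopped, the fault did not recur. This confirms that the user's own process exhausted memory, which left the server hung and unresponsive, and that the problem is unrelated to the server hardware.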

The detailed log analysis is as follows:

1. The messages logs for March 13, 14, 15, 16, and 17 all contain many records of the OOM killer killing the l3fw process, for example:
Mar 13 19:43:03 localhost kernel: Out of memory: Kill process 8676 (l3fw) score 982 or sacrifice child
Mar 13 19:43:03 localhost kernel: Killed process 8676 (l3fw) total-vm:97243004kB, anon-rss:63878444kB, file-rss:0kB
Mar 13 19:43:03 localhost kernel: l3fw: page allocation failure: order:0, mode:0x2015a
Mar 13 19:43:03 localhost kernel: CPU: 0 PID: 8676 Comm: l3fw Not tainted 3.10.0-123.el7.x86_64 #1

Mar 14 09:27:18 localhost kernel: Out of memory: Kill process 4748 (l3fw) score 982 or sacrifice child
Mar 14 09:27:18 localhost kernel: Killed process 4748 (l3fw) total-vm:97241980kB, anon-rss:63826664kB, file-rss:0kB

Mar 15 13:21:31 localhost kernel: Out of memory: Kill process 7628 (l3fw) score 981 or sacrifice child
Mar 15 13:21:31 localhost kernel: Killed process 7628 (l3fw) total-vm:97111932kB, anon-rss:63811384kB, file-rss:356kB

Mar 16 10:44:47 localhost kernel: Out of memory: Kill process 12456 (l3fw) score 980 or sacrifice child
Mar 16 10:44:47 localhost kernel: Killed process 12456 (l3fw) total-vm:97045372kB, anon-rss:63801988kB, file-rss:0kB

Mar 17 10:42:41 localhost kernel: Out of memory: Kill process 6881 (l3fw) score 980 or sacrifice child
Mar 17 10:42:41 localhost kernel: Killed process 6881 (l3fw) total-vm:96980860kB, anon-rss:63894712kB, file-rss:564kB
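The kill records above can be pulled out of a messages file mechanically. The following is a minimal sketch, assuming the log lines keep exactly the format shown in these excerpts; the function name `parse_oom_kills` and the dictionary keys are illustrative and not part of any tool.

```python
import re

# Matches the "Killed process" lines written by the kernel OOM killer,
# in the format shown in the excerpts above (an assumption about the log format).
KILLED_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+kernel:\s+Killed process\s+"
    r"(?P<pid>\d+)\s+\((?P<name>[^)]+)\)\s+total-vm:(?P<vm>\d+)kB,\s+"
    r"anon-rss:(?P<anon>\d+)kB,\s+file-rss:(?P<file>\d+)kB"
)


def parse_oom_kills(lines):
    """Return one dict per 'Killed process' line, with resident memory in kB and GiB."""
    kills = []
    for line in lines:
        m = KILLED_RE.match(line)
        if not m:
            continue
        rss_kb = int(m["anon"]) + int(m["file"])
        kills.append({
            "timestamp": m["ts"],
            "pid": int(m["pid"]),
            "name": m["name"],
            "total_vm_kb": int(m["vm"]),
            "rss_kb": rss_kb,
            "rss_gib": rss_kb / (1024 * 1024),
        })
    return kills


# Lines copied from the excerpts above.
sample = [
    "Mar 13 19:43:03 localhost kernel: Out of memory: Kill process 8676 (l3fw) score 982 or sacrifice child",
    "Mar 13 19:43:03 localhost kernel: Killed process 8676 (l3fw) total-vm:97243004kB, anon-rss:63878444kB, file-rss:0kB",
    "Mar 15 13:21:31 localhost kernel: Killed process 7628 (l3fw) total-vm:97111932kB, anon-rss:63811384kB, file-rss:356kB",
]

kills = parse_oom_kills(sample)
for k in kills:
    print(f'{k["timestamp"]}  pid={k["pid"]}  {k["name"]}  rss={k["rss_gib"]:.1f} GiB')
```

Run against the full messages file, a list like this makes it easy to see whether every kill hits the same process name at a similar resident size, which is the pattern in the five days shown here.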

2. Memory exhaustion left the system unresponsive, yet the hardware reported no errors. Taking the log from the 13th as an example:

Mar 13 19:43:03 localhost kernel: l3fw invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Mar 13 19:43:03 localhost kernel: l3fw cpuset=/ mems_allowed=0-1
Mar 13 19:43:03 localhost kernel: CPU: 13 PID: 8696 Comm: l3fw Not tainted 3.10.0-123.el7.x86_64 #1

Mar 13 19:43:03 localhost kernel: active_anon:15173182 inactive_anon:979308 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:52843 slab_reclaimable:14200 slab_unreclaimable:25628
mapped:2296 shmem:2312 pagetables:50753 bounce:0
free_cma:0
Mar 13 19:43:03 localhost kernel: Node 0 DMA free:15748kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 13 19:43:03 localhost kernel: lowmem_reserve[]: 0 1641 31847 31847
Mar 13 19:43:03 localhost kernel: Node 0 DMA32 free:121684kB min:2304kB low:2880kB high:3456kB active_anon:1150040kB inactive_anon:411036kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1948156kB managed:1681388kB mlocked:0kB dirty:0kB writeback:0kB mapped:44kB shmem:40kB slab_reclaimable:1212kB slab_unreclaimable:3268kB kernel_stack:40kB pagetables:4500kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11 all_unreclaimable? yes
Mar 13 19:43:03 localhost kernel: lowmem_reserve[]: 0 0 30205 30205
Mar 13 19:43:03 localhost kernel: Node 0 Normal free:28732kB min:42452kB low:53064kB high:63676kB active_anon:28765404kB inactive_anon:1692180kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:31457280kB managed:30930200kB mlocked:0kB dirty:0kB writeback:0kB mapped:2376kB shmem:2372kB slab_reclaimable:21104kB slab_unreclaimable:43300kB kernel_stack:2776kB pagetables:104472kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:16 all_unreclaimable? yes
Mar 13 19:43:03 localhost kernel: lowmem_reserve[]: 0 0 0 0
Mar 13 19:43:03 localhost kernel: Node 1 Normal free:45208kB min:45328kB low:56660kB high:67992kB active_anon:30777284kB inactive_anon:1814016kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:33554432kB managed:33027104kB mlocked:0kB dirty:0kB writeback:0kB mapped:6764kB shmem:6836kB slab_reclaimable:34484kB slab_unreclaimable:55944kB kernel_stack:2984kB pagetables:94040kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 13 19:43:03 localhost kernel: lowmem_reserve[]: 0 0 0 0
Mar 13 19:43:03 localhost kernel: Node 0 DMA: 1*4kB (U) 2*8kB (U) 3*16kB (U) 0*32kB 1*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15748kB
Mar 13 19:43:03 localhost kernel: Node 0 DMA32: 210*4kB (UEM) 232*8kB (UEM) 171*16kB (UEM) 65*32kB (UEM) 178*64kB (UEM) 249*128kB (UEM) 139*256kB (UEM) 45*512kB (UEM) 12*1024kB (UEM) 0*2048kB 0*4096kB = 121688kB
Mar 13 19:43:03 localhost kernel: Node 0 Normal: 2437*4kB (UEM) 1246*8kB (UEM) 471*16kB (UEM) 34*32kB (UEM) 8*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 28852kB
Mar 13 19:43:03 localhost kernel: Node 1 Normal: 3768*4kB (UEM) 1952*8kB (UEM) 836*16kB (UEM) 73*32kB (UEM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46400kB
Mar 13 19:43:03 localhost kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 13 19:43:03 localhost kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 13 19:43:03 localhost kernel: 11645 total pagecache pages
Mar 13 19:43:03 localhost kernel: 9228 pages in swap cache
Mar 13 19:43:03 localhost kernel: Swap cache stats: add 10047161, delete 10037933, find 478315/649415
Mar 13 19:43:03 localhost kernel: Free swap = 0kB
Mar 13 19:43:03 localhost kernel: Total swap = 32972796kB
Mar 13 19:43:03 localhost kernel: 16777215 pages RAM
Mar 13 19:43:03 localhost kernel: 358910 pages reserved
Mar 13 19:43:03 localhost kernel: 6375790 pages shared
Mar 13 19:43:03 localhost kernel: 16351253 pages non-shared
Mar 13 19:43:03 localhost kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Mar 13 19:43:03 localhost kernel: [ 852] 0 852 14832 2719 33 48 0 systemd-journal
Mar 13 19:43:03 localhost kernel: [ 864] 0 864 44546 0 21 86 0 lvmetad
Mar 13 19:43:03 localhost kernel: [ 868] 0 868 10731 2 22 326 -1000 systemd-udevd
Mar 13 19:43:03 localhost kernel: [ 1593] 0 1593 12784 541 25 80 -1000 auditd
Mar 13 19:43:03 localhost kernel: [ 1599] 0 1599 20055 39 9 15 0 audispd
Mar 13 19:43:03 localhost kernel: [ 1600] 0 1600 6547 15 19 36 0 sedispatch
Mar 13 19:43:03 localhost kernel: [ 1621] 0 1621 4187 5 14 41 0 alsactl
Mar 13 19:43:03 localhost kernel: [ 1627] 997 1627 1084 9 7 21 0 lsmd
Mar 13 19:43:03 localhost kernel: [ 1628] 70 1628 7540 23 21 65 0 avahi-daemon
Mar 13 19:43:03 localhost kernel: [ 1634] 0 1634 56065 1037 38 118 0 rsyslogd
Mar 13 19:43:03 localhost kernel: [ 1637] 996 1637 5667 30 16 27 0 chronyd
Mar 13 19:43:03 localhost kernel: [ 1638] 0 1638 53020 1 56 401 0 abrtd
Mar 13 19:43:03 localhost kernel: [ 1640] 0 1640 52445 14 54 313 0 abrt-watch-log
Mar 13 19:43:03 localhost kernel: [ 1641] 70 1641 7507 5 19 53 0 avahi-daemon
Mar 13 19:43:03 localhost kernel: [ 1644] 0 1644 52445 2 55 325 0 abrt-watch-log
Mar 13 19:43:03 localhost kernel: [ 1647] 0 1647 100628 116 84 2413 0 tuned
Mar 13 19:43:03 localhost kernel: [ 1650] 0 1650 81602 0 62 283 0 ModemManager
Mar 13 19:43:03 localhost kernel: [ 1653] 0 1653 4844 78 13 31 0 irqbalance
Mar 13 19:43:03 localhost kernel: [ 1661] 0 1661 32512 0 20 139 0 smartd
Mar 13 19:43:03 localhost kernel: [ 1666] 0 1666 1076 7 8 16 0 rngd
Mar 13 19:43:03 localhost kernel: [ 1667] 0 1667 8671 39 21 50 0 systemd-logind
Mar 13 19:43:03 localhost kernel: [ 1669] 0 1669 92676 142 37 56 0 accounts-daemon
Mar 13 19:43:03 localhost kernel: [ 1685] 172 1685 41155 5 16 48 0 rtkit-daemon
Mar 13 19:43:03 localhost kernel: [ 1686] 81 1686 7372 199 18 83 -900 dbus-daemon
Mar 13 19:43:03 localhost kernel: [ 1693] 0 1693 95083 262 68 266 0 NetworkManager
Mar 13 19:43:03 localhost kernel: [ 1697] 0 1697 31574 25 18 129 0 crond
Mar 13 19:43:03 localhost kernel: [ 1698] 0 1698 6482 0 17 51 0 atd
Mar 13 19:43:03 localhost kernel: [ 1700] 0 1700 1621 0 9 29 0 iprupdate
Mar 13 19:43:03 localhost kernel: [ 1701] 0 1701 74759 0 36 667 0 gdm
Mar 13 19:43:03 localhost kernel: [ 1703] 0 1703 1621 0 9 29 0 iprinit
Mar 13 19:43:03 localhost kernel: [ 1710] 0 1710 28804 42 12 25 0 ksmtuned
Mar 13 19:43:03 localhost kernel: [ 1714] 999 1714 131271 4017 52 631 0 polkitd
Mar 13 19:43:03 localhost kernel: [ 1723] 0 1723 95634 0 41 214 0 gdm-simple-slav
Mar 13 19:43:03 localhost kernel: [ 1751] 0 1751 71534 891 94 1940 0 Xorg
Mar 13 19:43:03 localhost kernel: [ 1789] 0 1789 9781 1 9 23 0 iprdump
Mar 13 19:43:03 localhost kernel: [ 1895] 0 1895 89761 117 63 162 0 gdm-session-wor
Mar 13 19:43:03 localhost kernel: [ 1907] 42 1907 175819 185 126 780 0 gnome-session
Mar 13 19:43:03 localhost kernel: [ 1910] 42 1910 3486 0 12 47 0 dbus-launch
Mar 13 19:43:03 localhost kernel: [ 1911] 42 1911 7217 1 17 138 0 dbus-daemon
Mar 13 19:43:03 localhost kernel: [ 1914] 42 1914 84968 0 33 159 0 at-spi-bus-laun
Mar 13 19:43:03 localhost kernel: [ 1918] 42 1918 7127 0 18 83 0 dbus-daemon
Mar 13 19:43:03 localhost kernel: [ 1921] 42 1921 32379 0 32 168 0 at-spi2-registr
Mar 13 19:43:03 localhost kernel: [ 1950] 42 1950 214020 818 170 1053 0 gnome-settings-
Mar 13 19:43:03 localhost kernel: [ 1962] 0 1962 59183 88 48 162 0 upowerd
Mar 13 19:43:03 localhost kernel: [ 2115] 42 2115 383618 24031 327 5414 0 gnome-shell
Mar 13 19:43:03 localhost kernel: [ 2122] 42 2122 93223 12 72 265 0 pulseaudio
Mar 13 19:43:03 localhost kernel: [ 2124] 995 2124 83212 50 53 277 0 colord
Mar 13 19:43:03 localhost kernel: [ 2190] 42 2190 45125 0 23 116 0 dconf-service
Mar 13 19:43:03 localhost kernel: [ 2203] 0 2203 120823 1 129 1229 0 libvirtd
Mar 13 19:43:03 localhost kernel: [ 2208] 32 2208 9975 17 23 89 0 rpcbind
Mar 13 19:43:03 localhost kernel: [ 2211] 0 2211 13189 1 27 140 0 vsftpd
Mar 13 19:43:03 localhost kernel: [ 2215] 0 2215 20739 23 42 187 -1000 sshd
Mar 13 19:43:03 localhost kernel: [ 2249] 29 2249 11639 1 26 210 0 rpc.statd
Mar 13 19:43:03 localhost kernel: [ 2373] 42 2373 115288 76 46 452 0 ibus-daemon
Mar 13 19:43:03 localhost kernel: [ 2424] 42 2424 77608 0 40 188 0 ibus-dconf
Mar 13 19:43:03 localhost kernel: [ 2427] 42 2427 93196 0 101 512 0 ibus-x11
Mar 13 19:43:03 localhost kernel: [ 2988] 0 2988 23446 19 46 239 0 master
Mar 13 19:43:03 localhost kernel: [ 3014] 89 3014 23516 35 45 233 0 qmgr
Mar 13 19:43:03 localhost kernel: [ 3358] 42 3358 59122 0 36 165 0 ibus-engine-sim
Mar 13 19:43:03 localhost kernel: [ 3366] 99 3366 3881 2 11 46 0 dnsmasq
Mar 13 19:43:03 localhost kernel: [ 3871] 42 3871 80755 320 60 533 0 mission-control
Mar 13 19:43:03 localhost kernel: [ 3878] 42 3878 137870 614 136 1003 0 goa-daemon
Mar 13 19:43:03 localhost kernel: [ 8662] 1000 8662 28314 62 13 21 0 watchdog.sh
Mar 13 19:43:03 localhost kernel: [ 8673] 1000 8673 28314 62 12 20 0 watchdog.sh
Mar 13 19:43:03 localhost kernel: [ 8676] 1000 8676 24310751 15969611 47273 8216529 0 l3fw
Mar 13 19:43:03 localhost kernel: [ 8680] 1000 8680 278436 136280 308 912 0 mr_kpi
Mar 13 19:43:03 localhost kernel: [39255] 0 39255 44475 0 42 267 0 cupsd
Mar 13 19:43:03 localhost kernel: [ 544] 89 544 23472 259 47 0 0 pickup
Mar 13 19:43:03 localhost kernel: [ 558] 89 558 23473 254 44 0 0 trivial-rewrite
Mar 13 19:43:03 localhost kernel: [ 559] 89 559 23509 263 47 0 0 cleanup
Mar 13 19:43:03 localhost kernel: [ 560] 0 560 22977 267 45 0 0 local
Mar 13 19:43:03 localhost kernel: [ 944] 89 944 23481 257 45 0 0 bounce
Mar 13 19:43:03 localhost kernel: [ 4693] 0 4693 26973 23 11 0 0 sleep
Mar 13 19:43:03 localhost kernel: [ 4705] 1000 4705 26973 21 11 0 0 sleep
Mar 13 19:43:03 localhost kernel: [ 4706] 1000 4706 26973 25 11 0 0 sleep
Mar 13 19:43:03 localhost kernel: Out of memory: Kill process 8676 (l3fw) score 982 or sacrifice child
Mar 13 19:43:03 localhost kernel: Killed process 8676 (l3fw) total-vm:97243004kB, anon-rss:63878444kB, file-rss:0kB
Mar 13 19:43:03 localhost kernel: l3fw: page allocation failure: order:0, mode:0x2015a
Mar 13 19:43:03 localhost kernel: CPU: 0 PID: 8676 Comm: l3fw Not tainted 3.10.0-123.el7.x86_64 #1
Mar 13 19:43:03 localhost kernel: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/25/2017
Mar 13 19:43:03 localhost kernel: 000000000002015a 0000000054ec36a8 ffff880828e2da58 ffffffff815e19ba
Mar 13 19:43:03 localhost kernel: ffff880828e2dae8 ffffffff811472e0 00000000ffffffff 0000000000000000
Mar 13 19:43:03 localhost kernel: ffff88087ffd9e80 ffff88087ffd9e80 ffff880828e2dae8 0000000054ec36a8
Mar 13 19:43:03 localhost kernel: Call Trace:
Mar 13 19:43:03 localhost kernel: [<ffffffff815e19ba>] dump_stack+0x19/0x1b
Mar 13 19:43:03 localhost kernel: [<ffffffff811472e0>] warn_alloc_failed+0x110/0x180
Mar 13 19:43:03 localhost kernel: [<ffffffff81086ab0>] ? wake_up_bit+0x30/0x30
Mar 13 19:43:03 localhost kernel: [<ffffffff8114b47c>] __alloc_pages_nodemask+0x90c/0xb10
Mar 13 19:43:03 localhost kernel: [<ffffffff81188779>] alloc_pages_current+0xa9/0x170
Mar 13 19:43:03 localhost kernel: [<ffffffff811419f7>] __page_cache_alloc+0x87/0xb0
Mar 13 19:43:03 localhost kernel: [<ffffffff81143d48>] filemap_fault+0x188/0x430
Mar 13 19:43:03 localhost kernel: [<ffffffff811682ce>] __do_fault+0x7e/0x520
Mar 13 19:43:03 localhost kernel: [<ffffffff8116c615>] handle_mm_fault+0x3e5/0xd90
Mar 13 19:43:03 localhost kernel: [<ffffffff81011619>] ? __switch_to+0x179/0x490
Mar 13 19:43:03 localhost kernel: [<ffffffff815ed186>] __do_page_fault+0x156/0x540
Mar 13 19:43:03 localhost kernel: [<ffffffff815e6292>] ? do_nanosleep+0x92/0x130
Mar 13 19:43:03 localhost kernel: [<ffffffff8109b776>] ? __dequeue_entity+0x26/0x40
Mar 13 19:43:03 localhost kernel: [<ffffffff81011619>] ? __switch_to+0x179/0x490
Mar 13 19:43:03 localhost kernel: [<ffffffff815ed58a>] do_page_fault+0x1a/0x70
Mar 13 19:43:03 localhost kernel: [<ffffffff815e97c8>] page_fault+0x28/0x30

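The log above shows that the OOM (Out Of Memory) event was triggered by the l3fw process. The low two bits of gfp_mask=0x280da are 2 (binary 10), which indicates that this allocation was requested from the Normal zone (on x86_64 there is no HighMem zone, so such a request is served from Normal). In both Normal zones the free amount is below the min watermark (Node 0 Normal free:28732kB min:42452kB; Node 1 Normal free:45208kB min:45328kB), so the allocation by l3fw invoked the OOM killer. The l3fw process ID is 8676, and its resident memory is anon-rss + file-rss = 63878444kB + 0kB, about 60.9 GiB, while the machine has 64 GB of physical memory in total. Swap was also fully used (Free swap = 0kB). In other words, l3fw occupied so much physical memory that none was left for anything else.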
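The three checks in this step (zone bits of gfp_mask, free versus min per zone, and the size of l3fw relative to installed RAM) can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the bit values in `GFP_ZONE_BITS` are taken from the 3.10 kernel's include/linux/gfp.h as I understand it, the page size is assumed to be 4 kB, and the helper names are illustrative.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages (the x86_64 default)

# Zone-selector bits of gfp_mask; values assumed from the 3.10 kernel's gfp.h.
GFP_ZONE_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
}


def gfp_zone_flags(mask):
    """Return the names of the zone-selector bits set in a gfp_mask value."""
    return [name for bit, name in sorted(GFP_ZONE_BITS.items()) if mask & bit]


ZONE_RE = re.compile(r"Node (\d+) (\w+) free:(\d+)kB min:(\d+)kB")


def zones_below_min(lines):
    """Return (zone, free_kb, min_kb) for every zone whose free memory is below min."""
    below = []
    for line in lines:
        m = ZONE_RE.search(line)
        if m and int(m.group(3)) < int(m.group(4)):
            below.append((f"Node {m.group(1)} {m.group(2)}", int(m.group(3)), int(m.group(4))))
    return below


# Shortened copies of the zone lines from the Mar 13 log above.
zone_lines = [
    "Node 0 DMA free:15748kB min:20kB low:24kB high:28kB",
    "Node 0 DMA32 free:121684kB min:2304kB low:2880kB high:3456kB",
    "Node 0 Normal free:28732kB min:42452kB low:53064kB high:63676kB",
    "Node 1 Normal free:45208kB min:45328kB low:56660kB high:67992kB",
]

zone_flags = gfp_zone_flags(0x280DA)
starved = zones_below_min(zone_lines)

# Values from the process table and the "pages RAM" line of the same log.
l3fw_rss_kb = 15969611 * PAGE_KB      # rss column for pid 8676, in pages
total_ram_kb = 16777215 * PAGE_KB     # "16777215 pages RAM"
l3fw_share = l3fw_rss_kb / total_ram_kb

print(zone_flags)
print(starved)
print(f"l3fw rss = {l3fw_rss_kb} kB, {l3fw_share:.1%} of RAM")
```

The rss column of the process table, converted from pages to kB, matches the anon-rss value in the "Killed process" line, which is a useful cross-check that both records describe the same process.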

 

3. The sar logs for that period show that the performance statistics change sharply between 7:40 PM and 7:50 PM, as follows:

Mar 13
Paging statistics
12:00:01 pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
07:40:02 PM 1334.18 3644.38 3849.19 18.77 3041.35 1837.84 0.00 937.01 50.98
07:50:01 PM 5946.69 553.72 3590.47 184.80 30495.82 6011.73 572.05 1574.39 23.91
08:00:01 PM 7.22 382.74 2574.70 0.13 1738.53 0.00 0.00 0.00 0.00


Memory page statistics
12:00:01 PM frmpg/s bufpg/s campg/s
07:40:02 PM -4.16 0.00 -8.23
07:50:01 PM 25885.06 0.00 53.77

Memory and swap statistics (memory usage stayed persistently high from 4:50 PM to 7:40 PM)

12:00:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
04:50:01 PM 400680 65272540 99.39 1600 6491352 1496888 1.52 55230660 8404056 672
05:00:01 PM 279364 65393856 99.57 1600 5766280 1504860 1.53 55923088 7827212 67632
05:10:01 PM 264468 65408752 99.60 1600 4785304 1551176 1.57 56802700 6954784 3840
05:20:01 PM 268608 65404612 99.59 1600 3828068 1552508 1.57 57643556 6187596 872
05:30:01 PM 268392 65404828 99.59 1600 2679968 1493268 1.51 58447472 5453212 2016
05:40:01 PM 278408 65394812 99.58 1600 1528480 1490828 1.51 59165468 4731948 984
05:50:01 PM 269084 65404136 99.59 1600 352804 1489916 1.51 60002024 3911012 1892
06:00:03 PM 267752 65405468 99.59 8 95724 1491104 1.51 59947316 4011504 244
06:10:01 PM 300956 65372264 99.54 8 34924 1555036 1.58 59958244 3966468 96
06:20:01 PM 265864 65407356 99.60 8 36332 1490008 1.51 59984804 3965992 172
06:30:03 PM 331948 65341272 99.49 8 65288 1491276 1.51 59907428 3978700 0
06:40:01 PM 281392 65391828 99.57 8 38348 1554852 1.58 59962524 3966624 0
06:50:01 PM 262536 65410684 99.60 8 42616 1480724 1.50 59962460 3980400 0
07:00:03 PM 294100 65379120 99.55 8 78460 1484400 1.50 59922088 3989932 4504
07:10:01 PM 255696 65417524 99.61 8 48888 1483448 1.50 59965968 3974056 0
07:20:01 PM 279908 65393312 99.57 8 46924 1483144 1.50 60328660 3578184 0
07:30:04 PM 298464 65374756 99.55 8 62780 1553600 1.57 59914212 3969348 116
07:40:02 PM 288512 65384708 99.56 8 43100 1484432 1.50 60559344 3954704 112
07:50:01 PM 62351560 3321660 5.06 8 172016 1476864 1.50 2184296 153472 956
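The jump between the 07:40 and 07:50 samples can be located automatically. The sketch below assumes the sar memory output has the column order shown above (time, AM/PM, kbmemfree, kbmemused, %memused, ...); `parse_sar_mem` and `largest_drop` are hypothetical helper names.

```python
def parse_sar_mem(lines):
    """Parse sar memory rows into (time, kbmemfree, kbmemused, pct_used) tuples."""
    rows = []
    for line in lines:
        parts = line.split()
        # Data rows look like: HH:MM:SS AM|PM kbmemfree kbmemused %memused ...
        # The header row is skipped because its third field is not a number.
        if len(parts) >= 5 and parts[1] in ("AM", "PM") and parts[2].isdigit():
            rows.append((f"{parts[0]} {parts[1]}", int(parts[2]), int(parts[3]), float(parts[4])))
    return rows


def largest_drop(rows):
    """Return (from_time, to_time, kb_released) for the biggest fall in kbmemused."""
    best = None
    for prev, cur in zip(rows, rows[1:]):
        drop = prev[2] - cur[2]
        if best is None or drop > best[2]:
            best = (prev[0], cur[0], drop)
    return best


# Rows copied from the table above.
sar_lines = [
    "12:00:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty",
    "07:30:04 PM 298464 65374756 99.55 8 62780 1553600 1.57 59914212 3969348 116",
    "07:40:02 PM 288512 65384708 99.56 8 43100 1484432 1.50 60559344 3954704 112",
    "07:50:01 PM 62351560 3321660 5.06 8 172016 1476864 1.50 2184296 153472 956",
]

rows = parse_sar_mem(sar_lines)
drop = largest_drop(rows)
print(drop)
```

The interval it reports brackets the 19:43:03 kill of l3fw recorded in the messages log, so the memory released in that interval lines up with the process being killed.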

 

4. The owner and location of the l3fw process are as follows:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
user1 3764 0.0 0.0 113124 1464 ? - 09:20 0:00 /bin/bash ./watchdog.sh l3fw_ctrl.sh start 10
user1 - 0.0 - - - - S 09:20 0:00 -
user1 3775 0.0 0.0 113124 1468 ? - 09:20 0:00 /bin/bash ./watchdog.sh mrkpi_ctrl.sh start 10
user1 - 0.0 - - - - S 09:20 0:00 -
user1 3779 14.8 24.2 16367740 15894692 ? - 09:20 11:25 ./l3fw ../conf/l3fw.conf

 

 

 

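Stop the l3fw process, which consumes a large amount of memory. This process exhausts the machine's memory and causes the server to stop responding.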

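The analysis above shows:

1. The server hardware logs contain no errors, and iLO remains reachable and usable for control, so the problem has to be analyzed starting from the time at which the hang occurred;

2. The sar logs among the Linux system logs contain no LINUX RESTART record at the time of the failure;

3. The restart times recorded in the sar logs all correspond to manual restarts that the user performed through iLO;

4. Careful analysis of the messages log before the failure shows a large number of OOM records, and the other dates on which hangs occurred show similar OOM records. This is essentially enough to conclude that the hang occurred at the operating-system level and was caused by software running on the system, not by the hardware;

5. For problems that involve both the server and the operating system, check the OS compatibility list before starting the analysis, and confirm that the specific OS version has been tested and meets the support conditions.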
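Points 2 and 3 rely on looking for LINUX RESTART markers in sar output. A minimal sketch of that lookup follows, assuming sar's text output marks each boot with a line containing the string "LINUX RESTART" preceded by a timestamp. The sample lines are made up for illustration and are not taken from this server's logs, and `find_restarts` is a hypothetical helper.

```python
def find_restarts(sar_lines):
    """Return the timestamp text that precedes each 'LINUX RESTART' marker."""
    restarts = []
    for line in sar_lines:
        if "LINUX RESTART" in line:
            restarts.append(line.split("LINUX RESTART")[0].strip())
    return restarts


# Hypothetical sar text output, for illustration only.
example_sar = [
    "09:10:01 AM     all      1.20      0.00      0.35      0.02      0.00     98.43",
    "09:18:40 AM       LINUX RESTART",
    "09:20:01 AM     all      0.80      0.00      0.30      0.01      0.00     98.89",
]

restarts = find_restarts(example_sar)
print(restarts)
```

Comparing these timestamps with the times of the manual iLO resets shows whether any restart happened on its own; in this case none did, which points away from a spontaneous hardware reset.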
