R4900 G3 / QLA2692 / CentOS Linux 7.6
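SFP fault on the Fibre Channel HBA causes abnormal detection of the FC tape library
1. Hardware troubleshooting
1.1. Hardware system health log
No anomalies found; no events were logged.
1.2. Dynamic monitoring log
The excerpt below covers one of the reboot cycles; no anomalies were found.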
0 1 2022-04-07 11:50:55 2022-04-07 03:50:55 PDIndex(Front:6)----Inserted: PD 11(e1/s6) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e006,0000000000000000
0 1 2022-04-07 11:50:55 2022-04-07 03:50:55 PDIndex(Front:0)----Dedicated Hot Spare created on PD 0d(e8/s0) (ded,rev,ac=1)
0 1 2022-04-07 11:50:55 2022-04-07 03:50:55 Controller operating temperature within normal range, full operation restored---CtrlIndex(2)
0 1 2022-04-07 11:50:56 2022-04-07 03:50:56 Time established as 04/07/22 3:50:24; (94 seconds since power on)---CtrlIndex(2)
0 0 2022-04-07 11:51:18 2022-04-07 03:51:18 SensorType: OS Boot, SensorName: System, EventType: Discrete, Event: boot completed - boot device not specified Boot completed - boot device not specified
0 0 2022-04-07 11:51:18 2022-04-07 03:51:18 EventType: OEM, Event: ME Firmware Health Event---Event data:0xa0 0xe 0x2, Data2: 14, Data3: 2 ME Firmware Health Event---Event data:0xa0 0xe 0x2
0 0 2022-04-07 11:51:49 2022-04-07 03:51:49 EventType: System ACPI Power State, Event: LPC Reset occurred LPC Reset occurred
0 0 2022-04-07 11:51:50 2022-04-07 03:51:50 EventType: System Boot / Restart, Event: System Restart, Data2: 48 System restart---Unknown cause
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Firmware initialization started (PCI ID 005d/1000/9361/1000)---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Firmware version 4.680.00-8551---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Battery Present---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Package version 24.21.0-0146---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Board Revision 11C---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Battery charge complete---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Battery temperature is normal---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure (SES) discovered on PD 08(e1/s0)---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) communication restored---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) phy bad for slot 12---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) phy bad for slot 13---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) phy bad for slot 14---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) phy bad for slot 15---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) phy bad for slot 16---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) phy bad for slot 17---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) phy bad for slot 18---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Enclosure PD 08(e1/s0) phy bad for slot 19---CtrlIndex(2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Inserted: PD 08(e8/s255)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Inserted: PD 08(e1/s0) Info: enclPd=08, scsiType=d, portMap=00, sasAddr=578aa82cb193e07e,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Rear:9)----Inserted: PD 09(e8/s28)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Rear:9)----Inserted: PD 09(e1/s28) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e01c,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Rear:10)----Inserted: PD 0a(e8/s29)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Rear:10)----Inserted: PD 0a(e1/s29) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e01d,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:2)----Inserted: PD 0b(e8/s2)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:2)----Inserted: PD 0b(e1/s2) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e002,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:4)----Inserted: PD 0c(e8/s4)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:4)----Inserted: PD 0c(e1/s4) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e004,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:0)----Inserted: PD 0d(e8/s0)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:0)----Inserted: PD 0d(e1/s0) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e000,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:5)----Inserted: PD 0e(e8/s5)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:5)----Inserted: PD 0e(e1/s5) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e005,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:3)----Inserted: PD 0f(e8/s3)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:3)----Inserted: PD 0f(e1/s3) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e003,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:1)----Inserted: PD 10(e8/s1)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:1)----Inserted: PD 10(e1/s1) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e001,0000000000000000
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:6)----Inserted: PD 11(e8/s6)
0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 PDIndex(Front:6)----Inserted: PD 11(e1/s6) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=578aa82cb193e006,0000000000000000
0 1 2022-04-07 11:53:51 2022-04-07 03:53:51 PDIndex(Front:0)----Dedicated Hot Spare created on PD 0d(e8/s0) (ded,rev,ac=1)
0 1 2022-04-07 11:53:51 2022-04-07 03:53:51 Controller operating temperature within normal range, full operation restored---CtrlIndex(2)
0 1 2022-04-07 11:53:51 2022-04-07 03:53:51 Time established as 04/07/22 3:53:40; (94 seconds since power on)---CtrlIndex(2)
0 0 2022-04-07 11:54:35 2022-04-07 03:54:35 EventType: OEM, Event: ME Firmware Health Event---Event data:0xa0 0xe 0x2, Data2: 14, Data3: 2 ME Firmware Health Event---Event data:0xa0 0xe 0x2
0 0 2022-04-07 11:54:35 2022-04-07 03:54:35 SensorType: OS Boot, SensorName: System, EventType: Discrete, Event: boot completed - boot device not specified Boot completed - boot device not specified
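To put a number on the reboot cycle in the excerpt above, the log lines can be parsed and the time from a "System Restart" event to the next "boot completed" event measured. The following is a minimal sketch, not a vendor tool. It assumes each line consists of two numeric columns, two timestamps (which differ by exactly 8 hours in this excerpt, presumably local time and UTC), and a free-text message. The function names and the regular expression are illustrative.

```python
import re
from datetime import datetime

# Assumed layout: two numeric columns, two timestamps, then the message text.
LINE_RE = re.compile(
    r"^(\d+)\s+(\d+)\s+"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(.*)$"
)
FMT = "%Y-%m-%d %H:%M:%S"


def parse_line(line):
    """Return the timestamps and message of one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "first_ts": datetime.strptime(m.group(3), FMT),
        "second_ts": datetime.strptime(m.group(4), FMT),
        "message": m.group(5),
    }


def restart_to_boot_seconds(lines):
    """Seconds from each 'System Restart' event to the next 'boot completed' event."""
    gaps = []
    restart_ts = None
    for line in lines:
        ev = parse_line(line)
        if ev is None:
            continue
        if "System Restart" in ev["message"]:
            restart_ts = ev["first_ts"]
        elif "boot completed" in ev["message"].lower() and restart_ts is not None:
            gaps.append(int((ev["first_ts"] - restart_ts).total_seconds()))
            restart_ts = None
    return gaps


# Three lines copied from the excerpt above.
sample = [
    "0 0 2022-04-07 11:51:50 2022-04-07 03:51:50 EventType: System Boot / Restart, "
    "Event: System Restart, Data2: 48 System restart---Unknown cause",
    "0 1 2022-04-07 11:53:50 2022-04-07 03:53:50 Firmware version 4.680.00-8551---CtrlIndex(2)",
    "0 0 2022-04-07 11:54:35 2022-04-07 03:54:35 SensorType: OS Boot, SensorName: System, "
    "EventType: Discrete, Event: boot completed - boot device not specified "
    "Boot completed - boot device not specified",
]
print(restart_to_boot_seconds(sample))  # [165]
```

For this excerpt the restart at 11:51:50 is followed by a completed boot at 11:54:35, i.e. 165 seconds later.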
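1.3. Low-level hardware log check
No anomalies found.
1.4. QLA2692 FC HBA firmware
No anomalies found in the current version.
2. System troubleshooting
2.2.1. System driver version
No anomalies found in the driver version.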
filename: /lib/modules/3.10.0-957.el7.x86_64/extra/qlgc-qla2xxx/qla2xxx.ko
firmware: ql2700_fw.bin
firmware: ql8300_fw.bin
firmware: ql2600_fw.bin
firmware: ql2500_fw.bin
firmware: ql2400_fw.bin
firmware: ql2322_fw.bin
firmware: ql2300_fw.bin
firmware: ql2200_fw.bin
firmware: ql2100_fw.bin
version: 10.01.00.33.07.6-k
license: GPL
description: Cavium Fibre Channel HBA Driver
author: QLogic Corporation
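When the same driver check has to be repeated on several hosts, the "key: value" listing above can be parsed rather than read by eye. The sketch below assumes the text has the shape shown above (one "key: value" pair per line, with keys such as firmware repeated); `parse_modinfo` is a hypothetical helper name. The "/extra/" test relies on the common convention that separately installed modules live under the `extra` directory, which matches the `qla2xxx(OE)` taint flags in the crash log of this case.

```python
def parse_modinfo(text):
    """Collect 'key: value' lines; every key maps to a list because keys may repeat."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a colon
        info.setdefault(key.strip(), []).append(value.strip())
    return info


# A few lines copied from the listing above.
sample = """\
filename: /lib/modules/3.10.0-957.el7.x86_64/extra/qlgc-qla2xxx/qla2xxx.ko
firmware: ql2700_fw.bin
firmware: ql2600_fw.bin
version: 10.01.00.33.07.6-k
description: Cavium Fibre Channel HBA Driver
"""

info = parse_modinfo(sample)
print(info["version"][0])
# By convention, a path under .../extra/ suggests a separately installed module.
print("out-of-tree" if "/extra/" in info["filename"][0] else "in-tree")
```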
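2.2.2. FC HBA firmware version
The firmware version is "9.07.00"; no anomalies found.
2.2.3. According to the system logs, every reboot triggered by unplugging and replugging the fibre cable is accompanied by a vmcore-dmesg record; an excerpt follows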
[ 927.020984] qla2xxx [0000:d8:00.1]-5090:16: LOOP INIT ERROR (2003).
[ 927.022568] qla2xxx [0000:d8:00.1]-d011:16: -> fwdt0 running...
[ 927.038189] qla2xxx [0000:d8:00.1]-d015:16: -> Firmware dump saved to buffer (16/ffffa67cc8c6a000) <f>
[ 927.429050] qla2xxx [0000:d8:00.1]-00af:16: Performing ISP error recovery - ha=ffff9b5f72e06000.
[ 927.436162] qla2xxx [0000:d8:00.1]-0075:16: ZIO mode 6 enabled; timer delay (200 us).
[ 932.878261] qla2xxx [0000:d8:00.1]-5090:16: LOOP INIT ERROR (2003).
[ 932.879940] qla2xxx [0000:d8:00.1]-d01f:16: -> Firmware already dumped (ffffa67cc8c6a000) -- ignoring request
[ 934.935491] qla2xxx [0000:d8:00.1]-5090:16: LOOP INIT ERROR (200b).
[ 934.937253] qla2xxx [0000:d8:00.1]-d01f:16: -> Firmware already dumped (ffffa67cc8c6a000) -- ignoring request
[ 936.981735] qla2xxx [0000:d8:00.1]-5090:16: LOOP INIT ERROR (200b).
[ 936.983557] qla2xxx [0000:d8:00.1]-d01f:16: -> Firmware already dumped (ffffa67cc8c6a000) -- ignoring request
[ 937.006489] qla2xxx [0000:d8:00.1]-500a:16: LOOP UP detected (8 Gbps).
[ 937.072687] qla2xxx [0000:d8:00.1]-1005:16: Cmd 0x5d aborted with timeout since ISP Abort is pending
[ 937.072701] qla2xxx [0000:d8:00.1]-1005:16: Cmd 0x7c aborted with timeout since ISP Abort is pending
[ 937.072741] BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
[ 937.074647] IP: [<ffffffffc06c8195>] qla2x00_free_fcport+0x15/0x150 [qla2xxx]
[ 937.076539] PGD 0
[ 937.078362] Oops: 0000 [#1] SMP
[ 937.080162] Modules linked in: binfmt_misc target_core_user target_core_mod uio ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iTCO_wdt iTCO_vendor_support skx_edac coretemp intel_rapl iosf_mbi kvm_intel kvm vfat fat irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr ses enclosure scsi_transport_sas joydev sg mei_me i2c_i801 lpc_ich mei wmi ipmi_si ipmi_devintf
[ 937.089407] ipmi_msghandler acpi_power_meter ip_tables xfs sd_mod crc_t10dif crct10dif_generic qla2xxx(OE) crct10dif_pclmul crct10dif_common crc32c_intel ast i2c_algo_bit drm_kms_helper bnx2x nvme_fc syscopyarea sysfillrect nvme_fabrics sysimgblt fb_sys_fops nvme_core ttm scsi_transport_fc i40e scsi_tgt drm ahci libahci mdio ptp pps_core megaraid_sas libcrc32c libata drm_panel_orientation_quirks nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod
[ 937.094758] CPU: 5 PID: 9597 Comm: qla2xxx_16_dpc Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1
[ 937.096597] Hardware name: N/A N/A/RS33M2C9S, BIOS 2.00.48 03/10/2021
[ 937.097525] task: ffff9b66d2e46180 ti: ffff9b66c1dbc000 task.ti: ffff9b66c1dbc000
[ 937.098460] RIP: 0010:[<ffffffffc06c8195>] [<ffffffffc06c8195>] qla2x00_free_fcport+0x15/0x150 [qla2xxx]
[ 937.099426] RSP: 0018:ffff9b66c1dbfd20 EFLAGS: 00010282
[ 937.100376] RAX: 0000000000000100 RBX: 0000000000000000 RCX: ffffffffc0758ad0
[ 937.101336] RDX: 0000000000000000 RSI: ffff9b66dd530740 RDI: 0000000000000000
[ 937.102286] RBP: ffff9b66c1dbfd48 R08: 0000000000000100 R09: 0000000000000001
[ 937.103226] R10: 0000000000000a99 R11: ffff9b66c1dbf7be R12: ffff9b5f72e06000
[ 937.104159] R13: 0000000000000100 R14: 0000000000000100 R15: ffff9b66dd5307d0
[ 937.105081] FS: 0000000000000000(0000) GS:ffff9b5edcb40000(0000) knlGS:0000000000000000
[ 937.105913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 937.106724] CR2: 00000000000001a0 CR3: 000000073ea76000 CR4: 00000000007607e0
[ 937.107532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 937.108345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 937.109126] PKRU: 00000000
[ 937.109891] Call Trace:
[ 937.110661] [<ffffffffc06cc272>] qla2x00_loop_resync+0x722/0x1120 [qla2xxx]
[ 937.111456] [<ffffffffc06b957a>] qla2x00_do_dpc+0x9fa/0xbc0 [qla2xxx]
[ 937.112257] [<ffffffffc06b8b80>] ? qla24xx_process_purex_list+0xd0/0xd0 [qla2xxx]
[ 937.113069] [<ffffffff89ac1c31>] kthread+0xd1/0xe0
[ 937.113884] [<ffffffff89ac1b60>] ? insert_kthread_work+0x40/0x40
[ 937.114712] [<ffffffff8a174c1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 937.115557] [<ffffffff89ac1b60>] ? insert_kthread_work+0x40/0x40
[ 937.116386] Code: 84 00 00 00 00 00 e8 2b f2 3c c9 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 89 fb 48 83 ec 10 <48> 8b 97 a0 01 00 00 48 85 d2 74 77 48 8b 47 10 48 8b 8f a8 01
[ 937.118180] RIP [<ffffffffc06c8195>] qla2x00_free_fcport+0x15/0x150 [qla2xxx]
[ 937.119040] RSP <ffff9b66c1dbfd20>
[ 937.119879] CR2: 00000000000001a0
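Based on the information above, the call trace still points to qla2xxx. Since the checks above found no anomalies in the HBA firmware or driver versions, troubleshooting continues on the HBA itself.
2.2.4. Use the HBA tool "QConvergeConsole" to further check the real-time state of the HBA
2.2.4.1. Run "# qaucli -z" to view the HBA firmware activation status and related parameters
Checked remotely on site; no anomalies found.
2.2.4.2. Run "# qaucli -pr fc -dm all general" to view the status of the HBA SFP modules
The check showed that the SFP on one port had a transmit optical power below the critical threshold of "0.1259 mW". Reseating the module did not help. After the SFP was replaced, the port worked normally.
Repeated unplug and replug tests afterwards showed no further anomalies.
### 3. Conclusion
3.1. Optical power degradation of the SFP in the host FC HBA caused this fault
3.2. No anomalies were found in the other host hardware
3.3. No system, firmware, or driver anomaly was found to have caused this fault
### 4. Recommendation
4.1. Replace the faulty SFP module on the FC HBA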
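The register dump and the `Code:` line in the excerpt are enough to check what the faulting access was. The sketch below is only an illustration of that arithmetic, not a general disassembler: it decodes the single instruction form found at the `<48>` marker (`48 8b 97 a0 01 00 00`, a 64-bit `mov` from `[rdi + disp32]`) and compares `RDI + displacement` with `CR2`. The helper names are hypothetical.

```python
import re


def parse_registers(dmesg_text):
    """Extract 'NAME: <16 hex digits>' pairs (RDI, CR2, ...) from Oops text."""
    regs = {}
    pattern = r"\b([A-Z][A-Z0-9]{1,2}):\s+([0-9a-f]{16})\b"
    for name, value in re.findall(pattern, dmesg_text):
        regs[name] = int(value, 16)
    return regs


def decode_mov_rdi_disp32(insn):
    """Decode only 'REX.W 8B /r' with mod=10 and rm=RDI; return the 32-bit displacement."""
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a 64-bit MOV r64, r/m64")
    mod, rm = modrm >> 6, modrm & 0b111
    if mod != 0b10 or rm != 0b111:
        raise ValueError("not [rdi + disp32] addressing")
    return int.from_bytes(insn[3:7], "little")


# Lines copied from the excerpt above.
oops = """\
[ 937.072741] BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
[ 937.101336] RDX: 0000000000000000 RSI: ffff9b66dd530740 RDI: 0000000000000000
[ 937.106724] CR2: 00000000000001a0 CR3: 000000073ea76000 CR4: 00000000007607e0
"""
# Bytes from the Code: line, starting at the <48> marker (the faulting instruction).
FAULT_BYTES = bytes.fromhex("488b97a0010000")

regs = parse_registers(oops)
offset = decode_mov_rdi_disp32(FAULT_BYTES)
print(hex(offset))                         # 0x1a0
print(regs["RDI"] + offset == regs["CR2"])  # True
```

RDI carries the first function argument under the x86-64 System V calling convention, so the numbers are consistent with `qla2x00_free_fcport` being called with a NULL pointer and reading a field at offset 0x1a0 of it. Which structure field that is cannot be determined from the log alone.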
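Optical power is often quoted in dBm rather than mW, so it helps to convert the threshold mentioned above. The sketch below uses the standard relation dBm = 10 * log10(P / 1 mW). The threshold is the value quoted in this case; the two readings passed to `below_threshold` are hypothetical examples, not measurements from the site, and both function names are illustrative.

```python
import math

CRITICAL_MW = 0.1259  # critical threshold quoted in this case, in milliwatts


def mw_to_dbm(milliwatts):
    """Convert optical power from mW to dBm (referenced to 1 mW)."""
    if milliwatts <= 0:
        raise ValueError("power must be positive")
    return 10.0 * math.log10(milliwatts)


def below_threshold(reading_mw, threshold_mw=CRITICAL_MW):
    """True if the reading is under the critical threshold."""
    return reading_mw < threshold_mw


print(round(mw_to_dbm(CRITICAL_MW), 2))  # -9.0
print(below_threshold(0.05))             # hypothetical reading: True
print(below_threshold(0.5))              # hypothetical reading: False
```

The threshold of 0.1259 mW therefore corresponds to about -9 dBm.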