
Multipath link flapping causes host fence reboot

Published 2020-09-28

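Network topology and description


Problem description

On the CAS platform, some virtual machines experienced service anomalies and were displayed in blue status, hosts were fenced and rebooted, and shared storage commands run on the host backend hung.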


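Analysis

Running the shared storage check script on the CVK host backend reports that shared storage is blocked.


The host's kern.log shows multipath link flapping on shared storage. Taking the log from one host as an example, the multipath path repeatedly fails and then recovers: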


Jul 23 17:42:52 L01-E3-ZWW-CVK-R6900-01 kernel: [10958703.208382] (o2hb-B14C005525,10951,8):o2hb_thread:1447 do disk heartbeat used 3048 msecs on device(dm-1), ret = 0. // shared storage heartbeat timeout


Jul 23 17:42:52 L01-E3-ZWW-CVK-R6900-01 kernel: [10958703.208574] (o2hb-774E8026D9,10750,10):o2hb_thread:1447 do disk heartbeat used 2632 msecs on device(dm-0), ret = 0.


Jul 23 17:42:52 L01-E3-ZWW-CVK-R6900-01 kernel: [10958703.213676] (o2hb-DF5A141CCF,11150,89):o2hb_thread:1447 do disk heartbeat used 2189 msecs on device(dm-2), ret = 0.


Jul 23 17:43:11 L01-E3-ZWW-CVK-R6900-01 kernel: [10958722.124639] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:0 --  0 2002. // multipath link abnormal


Jul 23 17:47:17 L01-E3-ZWW-CVK-R6900-01 kernel: [10958968.401071] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:1 --  0 2002.


Jul 23 17:52:33 L01-E3-ZWW-CVK-R6900-01 kernel: [10959284.686307] (o2hb-69E6B455D9,11460,15):o2hb_thread:1447 do disk heartbeat used 3075 msecs on device(dm-3), ret = 0.


Jul 23 18:03:48 L01-E3-ZWW-CVK-R6900-01 kernel: [10959959.238115] sd 1:0:3:0: [sdo] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK


Jul 23 18:03:48 L01-E3-ZWW-CVK-R6900-01 kernel: [10959959.238125] sd 1:0:3:0: [sdo] tag#0 CDB: Read(16) 88 00 00 00 00 02 d3 e7 ac 00 00 00 02 00 00 00


Jul 23 18:03:48 L01-E3-ZWW-CVK-R6900-01 kernel: [10959959.238128] print_req_error: I/O error, dev sdo, sector 12145110016 // host I/O to the storage fails


Jul 23 18:03:48 L01-E3-ZWW-CVK-R6900-01 kernel: [10959959.238211] device-mapper: multipath: Failing path 8:224. // multipath path goes down


Jul 23 18:03:52 L01-E3-ZWW-CVK-R6900-01 kernel: [10959963.847899] device-mapper: multipath: Reinstating path 8:224. // multipath path recovers


Jul 23 18:03:52 L01-E3-ZWW-CVK-R6900-01 kernel: [10959963.859228] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:03:52 L01-E3-ZWW-CVK-R6900-01 kernel: [10959963.859327] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:13:19 L01-E3-ZWW-CVK-R6900-01 kernel: [10960530.029822] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:1 --  0 2002.


Jul 23 18:14:17 L01-E3-ZWW-CVK-R6900-01 kernel: [10960588.910855] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:0 --  0 2002.


Jul 23 18:17:16 L01-E3-ZWW-CVK-R6900-01 kernel: [10960767.090128] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:0 --  0 2002.


Jul 23 18:17:16 L01-E3-ZWW-CVK-R6900-01 kernel: [10960767.108009] (o2hb-774E8026D9,10750,4):o2hb_thread:1447 do disk heartbeat used 3166 msecs on device(dm-0), ret = 0.


Jul 23 18:17:39 L01-E3-ZWW-CVK-R6900-01 kernel: [10960790.185856] sd 1:0:3:0: [sdo] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK


Jul 23 18:17:39 L01-E3-ZWW-CVK-R6900-01 kernel: [10960790.185865] sd 1:0:3:0: [sdo] tag#0 CDB: Read(16) 88 00 00 00 00 02 d3 e7 ac 00 00 00 02 00 00 00


Jul 23 18:17:39 L01-E3-ZWW-CVK-R6900-01 kernel: [10960790.185867] print_req_error: I/O error, dev sdo, sector 12145110016


Jul 23 18:17:39 L01-E3-ZWW-CVK-R6900-01 kernel: [10960790.185936] device-mapper: multipath: Failing path 8:224.


Jul 23 18:17:43 L01-E3-ZWW-CVK-R6900-01 kernel: [10960794.503803] device-mapper: multipath: Reinstating path 8:224.


Jul 23 18:17:43 L01-E3-ZWW-CVK-R6900-01 kernel: [10960794.514581] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:17:43 L01-E3-ZWW-CVK-R6900-01 kernel: [10960794.514721] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:18:46 L01-E3-ZWW-CVK-R6900-01 kernel: [10960857.715695] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:1 --  0 2002.


Jul 23 18:19:18 L01-E3-ZWW-CVK-R6900-01 kernel: [10960889.204218] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:0 --  0 2002.


Jul 23 18:19:23 L01-E3-ZWW-CVK-R6900-01 kernel: [10960894.853971] (o2hb-DF5A141CCF,11150,4):o2hb_thread:1447 do disk heartbeat used 2173 msecs on device(dm-2), ret = 0.


Jul 23 18:19:23 L01-E3-ZWW-CVK-R6900-01 kernel: [10960894.860052] (o2hb-B14C005525,10951,8):o2hb_thread:1447 do disk heartbeat used 2243 msecs on device(dm-1), ret = 0.


Jul 23 18:19:54 L01-E3-ZWW-CVK-R6900-01 kernel: [10960925.813000] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:0 --  0 2002.


Jul 23 18:19:59 L01-E3-ZWW-CVK-R6900-01 kernel: [10960930.933122] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:0 --  0 2002.


Jul 23 18:19:59 L01-E3-ZWW-CVK-R6900-01 kernel: [10960930.953332] (o2hb-774E8026D9,10750,4):o2hb_thread:1447 do disk heartbeat used 8608 msecs on device(dm-0), ret = 0.


Jul 23 18:20:54 L01-E3-ZWW-CVK-R6900-01 kernel: [10960985.462076] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:0 --  0 2002.


Jul 23 18:20:54 L01-E3-ZWW-CVK-R6900-01 kernel: [10960985.478316] (o2hb-774E8026D9,10750,4):o2hb_thread:1447 do disk heartbeat used 6140 msecs on device(dm-0), ret = 0.


Jul 23 18:21:27 L01-E3-ZWW-CVK-R6900-01 kernel: [10961018.071958] sd 1:0:3:0: [sdo] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK


Jul 23 18:21:27 L01-E3-ZWW-CVK-R6900-01 kernel: [10961018.071978] sd 1:0:3:0: [sdo] tag#0 CDB: Read(16) 88 00 00 00 00 03 be d6 54 00 00 00 02 00 00 00


Jul 23 18:21:27 L01-E3-ZWW-CVK-R6900-01 kernel: [10961018.071980] print_req_error: I/O error, dev sdo, sector 16086619136


Jul 23 18:21:27 L01-E3-ZWW-CVK-R6900-01 kernel: [10961018.072049] device-mapper: multipath: Failing path 8:224.


Jul 23 18:21:31 L01-E3-ZWW-CVK-R6900-01 kernel: [10961022.686528] device-mapper: multipath: Reinstating path 8:224.


Jul 23 18:21:31 L01-E3-ZWW-CVK-R6900-01 kernel: [10961022.694713] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:21:31 L01-E3-ZWW-CVK-R6900-01 kernel: [10961022.694809] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:22:16 L01-E3-ZWW-CVK-R6900-01 kernel: [10961067.087595] sd 1:0:0:1: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:22:16 L01-E3-ZWW-CVK-R6900-01 kernel: [10961067.087697] sd 1:0:0:1: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:22:55 L01-E3-ZWW-CVK-R6900-01 kernel: [10961106.428510] (o2hb-774E8026D9,10750,3):o2hb_thread:1447 do disk heartbeat used 2000 msecs on device(dm-0), ret = 0.


Jul 23 18:23:25 L01-E3-ZWW-CVK-R6900-01 kernel: [10961136.464537] sd 1:0:3:0: [sdo] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK


Jul 23 18:23:25 L01-E3-ZWW-CVK-R6900-01 kernel: [10961136.464556] sd 1:0:3:0: [sdo] tag#0 CDB: Read(16) 88 00 00 00 00 04 a0 0f 70 00 00 00 02 00 00 00


Jul 23 18:23:25 L01-E3-ZWW-CVK-R6900-01 kernel: [10961136.464558] print_req_error: I/O error, dev sdo, sector 19865235456


Jul 23 18:23:25 L01-E3-ZWW-CVK-R6900-01 kernel: [10961136.464621] device-mapper: multipath: Failing path 8:224.


Jul 23 18:23:27 L01-E3-ZWW-CVK-R6900-01 kernel: [10961138.128479] device-mapper: multipath: Reinstating path 8:224.


Jul 23 18:23:27 L01-E3-ZWW-CVK-R6900-01 kernel: [10961138.136838] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:23:27 L01-E3-ZWW-CVK-R6900-01 kernel: [10961138.136939] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:23:38 L01-E3-ZWW-CVK-R6900-01 kernel: [10961149.050969] (o2hb-774E8026D9,10750,5):o2hb_thread:1447 do disk heartbeat used 2286 msecs on device(dm-0), ret = 0.


Jul 23 18:23:39 L01-E3-ZWW-CVK-R6900-01 kernel: [10961150.763518] sd 1:0:3:0: [sdo] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK


Jul 23 18:23:39 L01-E3-ZWW-CVK-R6900-01 kernel: [10961150.763538] sd 1:0:3:0: [sdo] tag#0 CDB: Read(16) 88 00 00 00 00 00 63 05 8a 00 00 00 02 00 00 00


Jul 23 18:23:39 L01-E3-ZWW-CVK-R6900-01 kernel: [10961150.763540] print_req_error: I/O error, dev sdo, sector 1661307392


Jul 23 18:23:39 L01-E3-ZWW-CVK-R6900-01 kernel: [10961150.763604] device-mapper: multipath: Failing path 8:224.


Jul 23 18:23:42 L01-E3-ZWW-CVK-R6900-01 kernel: [10961153.141681] device-mapper: multipath: Reinstating path 8:224.


Jul 23 18:23:42 L01-E3-ZWW-CVK-R6900-01 kernel: [10961153.153194] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:23:42 L01-E3-ZWW-CVK-R6900-01 kernel: [10961153.153293] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:25:12 L01-E3-ZWW-CVK-R6900-01 kernel: [10961243.258856] qla2xxx [0000:ad:00.0]-801c:1: Abort command issued nexus=1:3:0 --  0 2002.


Jul 23 18:25:12 L01-E3-ZWW-CVK-R6900-01 kernel: [10961243.275033] (o2hb-774E8026D9,10750,7):o2hb_thread:1447 do disk heartbeat used 25660 msecs on device(dm-0), ret = 0.


Jul 23 18:25:21 L01-E3-ZWW-CVK-R6900-01 kernel: [10961252.086171] (o2hb-DF5A141CCF,11150,4):o2hb_thread:1447 do disk heartbeat used 2407 msecs on device(dm-2), ret = 0.


Jul 23 18:25:21 L01-E3-ZWW-CVK-R6900-01 kernel: [10961252.086384] (o2hb-774E8026D9,10750,7):o2hb_thread:1447 do disk heartbeat used 2759 msecs on device(dm-0), ret = 0.


Jul 23 18:26:01 L01-E3-ZWW-CVK-R6900-01 kernel: [10961292.371735] sd 1:0:0:4: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:26:01 L01-E3-ZWW-CVK-R6900-01 kernel: [10961292.371835] sd 1:0:0:4: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:26:20 L01-E3-ZWW-CVK-R6900-01 kernel: [10961311.053520] sd 1:0:3:0: [sdo] tag#1 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK


Jul 23 18:26:20 L01-E3-ZWW-CVK-R6900-01 kernel: [10961311.053523] sd 1:0:3:0: [sdo] tag#1 CDB: Read(16) 88 00 00 00 00 02 9b bf b8 00 00 00 02 00 00 00


Jul 23 18:26:20 L01-E3-ZWW-CVK-R6900-01 kernel: [10961311.053526] print_req_error: I/O error, dev sdo, sector 11202967552


Jul 23 18:26:20 L01-E3-ZWW-CVK-R6900-01 kernel: [10961311.053543] device-mapper: multipath: Failing path 8:224.


Jul 23 18:26:24 L01-E3-ZWW-CVK-R6900-01 kernel: [10961315.374468] device-mapper: multipath: Reinstating path 8:224.


Jul 23 18:26:24 L01-E3-ZWW-CVK-R6900-01 kernel: [10961315.384142] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:26:24 L01-E3-ZWW-CVK-R6900-01 kernel: [10961315.384238] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:26:58 L01-E3-ZWW-CVK-R6900-01 kernel: [10961349.366801] (o2hb-69E6B455D9,11460,16):o2hb_thread:1447 do disk heartbeat used 2790 msecs on device(dm-3), ret = 0.


Jul 23 18:27:16 L01-E3-ZWW-CVK-R6900-01 kernel: [10961367.449795] sd 1:0:3:0: [sdo] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK


Jul 23 18:27:16 L01-E3-ZWW-CVK-R6900-01 kernel: [10961367.449814] sd 1:0:3:0: [sdo] tag#0 CDB: Read(16) 88 00 00 00 00 00 63 05 86 00 00 00 02 00 00 00


Jul 23 18:27:16 L01-E3-ZWW-CVK-R6900-01 kernel: [10961367.449816] print_req_error: I/O error, dev sdo, sector 1661306368


Jul 23 18:27:16 L01-E3-ZWW-CVK-R6900-01 kernel: [10961367.449874] device-mapper: multipath: Failing path 8:224.


Jul 23 18:27:19 L01-E3-ZWW-CVK-R6900-01 kernel: [10961370.416732] device-mapper: multipath: Reinstating path 8:224.


Jul 23 18:27:19 L01-E3-ZWW-CVK-R6900-01 kernel: [10961370.425136] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


Jul 23 18:27:19 L01-E3-ZWW-CVK-R6900-01 kernel: [10961370.425232] sd 1:0:3:0: alua: port group 01 state A preferred supports tolusnA


The log above shows that multipath link flapping blocked the host's access to shared storage, which in turn caused cluster locks.
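
The fail/reinstate pairs and slow heartbeats in a kern.log excerpt like the one above can be tallied mechanically. The following is a minimal sketch, not an H3C tool: it assumes log lines in exactly the format quoted above, and the names `summarize` and `sd_name` are hypothetical. The `sd_name` helper relies on standard Linux SCSI disk numbering (major 8, 16 minors per disk), under which path 8:224 is sdo, the same device that reports I/O errors just before each "Failing path" message.

```python
import re
from collections import defaultdict

# Patterns for the two kinds of kern.log lines of interest.
PATH_EVENT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] device-mapper: multipath: "
    r"(?P<event>Failing|Reinstating) path (?P<dev>\d+:\d+)"
)
HEARTBEAT = re.compile(
    r"do disk heartbeat used (?P<ms>\d+) msecs on device\((?P<dev>[\w-]+)\)"
)


def sd_name(major_minor):
    """Map a major:minor pair such as '8:224' to an sd device name.

    Only SCSI disk major 8 is handled (sda..sdp, 16 minors per disk);
    anything else is returned unchanged.
    """
    major, minor = (int(x) for x in major_minor.split(":"))
    if major != 8:
        return major_minor
    return "sd" + chr(ord("a") + minor // 16)


def summarize(lines):
    """Return (flaps, heartbeats) extracted from kern.log lines.

    flaps:      path -> list of outage durations in seconds
    heartbeats: dm device -> list of logged heartbeat durations in ms
    """
    down_since = {}
    flaps = defaultdict(list)
    heartbeats = defaultdict(list)
    for line in lines:
        m = PATH_EVENT.search(line)
        if m:
            ts, dev = float(m["ts"]), m["dev"]
            if m["event"] == "Failing":
                down_since[dev] = ts
            elif dev in down_since:
                flaps[dev].append(round(ts - down_since.pop(dev), 3))
            continue
        m = HEARTBEAT.search(line)
        if m:
            heartbeats[m["dev"]].append(int(m["ms"]))
    return dict(flaps), dict(heartbeats)


# Three lines copied from the log excerpt in this case.
SAMPLE = [
    "Jul 23 18:03:48 L01-E3-ZWW-CVK-R6900-01 kernel: [10959959.238211] device-mapper: multipath: Failing path 8:224.",
    "Jul 23 18:03:52 L01-E3-ZWW-CVK-R6900-01 kernel: [10959963.847899] device-mapper: multipath: Reinstating path 8:224.",
    "Jul 23 18:19:59 L01-E3-ZWW-CVK-R6900-01 kernel: [10960930.953332] (o2hb-774E8026D9,10750,4):o2hb_thread:1447 do disk heartbeat used 8608 msecs on device(dm-0), ret = 0.",
]

flaps, heartbeats = summarize(SAMPLE)
for path, outages in flaps.items():
    print(sd_name(path), path, len(outages), "flap(s)", outages)
for dev, durations in heartbeats.items():
    print(dev, "max heartbeat ms:", max(durations))
```

Run against a whole kern.log, a path that accumulates many short outages while other paths stay quiet suggests a fault on that one link rather than on the storage array as a whole.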

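Running the cluster lock check script repeatedly shows the cluster lock appearing at one moment and clearing the next, which matches the frequent multipath path failures and recoveries in the log.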


When the I/O blocking became severe, the host was fenced and rebooted. ocfs2_fence_restart.log recorded the following:

Restarted,Thu Jul 23 21:59:00 2020,dm-0,774E8026D9E642519B7AA9A7FA35D908,DISK fault leading to being fenced

Restarted,Thu Jul 23 22:56:48 2020,dm-1,B14C005525B4442C85761DB0D6C69330,DISK fault leading to being fenced

Restarted,Thu Jul 23 23:42:16 2020,dm-2,DF5A141CCF3E44F38C99AF272A6D1291,DISK fault leading to being fenced
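
The records above are comma-separated, so they are easy to parse when correlating fence reboots with kern.log. The sketch below is illustrative only: the field names (`action`, `when`, `device`, `volume_id`, `reason`) are assumptions inferred from the three lines shown, not a documented format. One observation from the data itself: the first ten characters of the fourth field match the o2hb thread names in kern.log (for example, `o2hb-774E8026D9` for dm-0).

```python
from datetime import datetime
from typing import NamedTuple


class FenceRecord(NamedTuple):
    # Field names are assumptions inferred from the sample lines.
    action: str
    when: datetime
    device: str
    volume_id: str
    reason: str


def parse_fence_line(line):
    """Parse one line of ocfs2_fence_restart.log as quoted in this case."""
    action, when, device, volume_id, reason = line.strip().split(",", 4)
    # %a and %b expect English day and month names (the default C locale).
    stamp = datetime.strptime(when, "%a %b %d %H:%M:%S %Y")
    return FenceRecord(action, stamp, device, volume_id, reason)


# The three records from this case.
FENCE_LOG = [
    "Restarted,Thu Jul 23 21:59:00 2020,dm-0,774E8026D9E642519B7AA9A7FA35D908,DISK fault leading to being fenced",
    "Restarted,Thu Jul 23 22:56:48 2020,dm-1,B14C005525B4442C85761DB0D6C69330,DISK fault leading to being fenced",
    "Restarted,Thu Jul 23 23:42:16 2020,dm-2,DF5A141CCF3E44F38C99AF272A6D1291,DISK fault leading to being fenced",
]

records = [parse_fence_line(line) for line in FENCE_LOG]

# Seconds between consecutive fence reboots.
gaps = [(b.when - a.when).total_seconds() for a, b in zip(records, records[1:])]

for rec in records:
    print(rec.when.isoformat(), rec.device, "o2hb-" + rec.volume_id[:10], rec.reason)
print("seconds between reboots:", gaps)
```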

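Possible causes of link flapping include the HBA, optical modules, fiber cables, FC switches, and the storage array.

On-site troubleshooting found that one port on the storage array had optical power degradation.


After this degraded port was disabled, host access to shared storage returned to normal.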



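Solution

Based on the information available, the shared storage blocking was caused by link flapping, and the link flapping was caused by optical power degradation on one storage-side port.

For this situation, periodically check the optical power of the optical modules on the FC switches and the storage array, and replace faulty parts promptly to avoid serious consequences.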
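
A periodic optical power check can be reduced to comparing each port's receive power against a warning threshold. The sketch below is a hypothetical illustration: the port names, readings, and the -12.0 dBm threshold are made-up values, and how the readings are collected from a given switch or array is outside its scope. In practice the threshold should come from the transceiver's datasheet or its reported alarm thresholds. The only fixed fact used is the unit conversion, dBm = 10 * log10(power in mW).

```python
import math


def mw_to_dbm(milliwatts):
    """Convert optical power from milliwatts to dBm."""
    return 10 * math.log10(milliwatts)


def weak_ports(rx_dbm_by_port, low_warning_dbm):
    """Return the ports whose receive power is below the warning threshold."""
    return sorted(
        port for port, dbm in rx_dbm_by_port.items() if dbm < low_warning_dbm
    )


# Hypothetical readings in dBm; real values come from the device itself.
readings = {"port-a": -3.2, "port-b": -14.8, "port-c": mw_to_dbm(0.5)}

# Hypothetical threshold; take the real one from the optical module's specification.
suspects = weak_ports(readings, low_warning_dbm=-12.0)
print("ports to inspect:", suspects)
```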

