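Two R5300 G5 servers are involved. Server A rebooted abnormally on the afternoon of the 21st, and its out-of-band (BMC) log contains a large number of bus uncorrectable errors pointing to a GPU. Server B, in the same cluster, also logged a large number of bus uncorrectable errors pointing to a GPU on the afternoon of the 21st.
1. Log output:
Server A:
1. In the SDS log, the reboot time is 14:12:54 on February 21: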
Informational System ACPI Power State ACPI_State Assertion event From BMC 2025-02-21 14:12:54 CUSTOMER LPC Reset occurred
Before the reboot, the log was flooded with UCE (uncorrectable error) events for slot 12; they cleared after the reboot.
1023 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:38 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4
1025 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:39 CUSTOMER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4
1026 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:40 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4
2. In the system log, the reboot time is Feb 21 14:10:02:
Feb 21 14:10:02 sna-12f-b-03-h5300-03-4u12 kernel: Linux version 3.10.0-957.27.8.2.g295089a.el7.x86_64 (root@172-20-53-23) (gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC) ) #1 SMP Mon Nov 14 04:25:17 EST 2022
Before the reboot, the following messages were printed in large numbers:
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: sched: RT throttling activated
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue
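There was also a reboot on the 14th; the SDS log and system log output was essentially the same as on the 21st.
Server B:
1. The SDS log contains a large number of UCE events for slot 10, and they have not cleared.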
Warning Critical Interrupt PCIE10_GPU Assertion event From BIOS 2025-02-21 14:09:59 CUSTOMER Bus Uncorrectable Error---Slot 10---PCIE Name: Tesla T4
2. System log: there is no reboot record on the 21st, but there are CPU soft lockups, along with the following messages:
Feb 21 14:04:17 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue
Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: sched: RT throttling activated
Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue
Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:04:50 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue
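The comparison of the two servers above comes down to counting UCE events per slot in the SDS log and NMI messages per host in the system log. A minimal Python sketch of that tally follows, assuming the logs have been exported as plain text with one event per line in the formats quoted above. The function names `tally_uce` and `tally_nmi` are hypothetical helpers, not part of any vendor tool.

```python
import re
from collections import Counter

# Matches SDS event lines such as:
# "... 2025-02-21 14:10:38 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4"
UCE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"Bus Uncorrectable Error---Slot (\d+)---PCIE Name: (.+)$"
)

# Matches kernel lines such as:
# "Feb 21 14:06:21 <host> kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0."
NMI_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: "
    r"Uhhuh\. NMI received for unknown reason ([0-9a-f]+) on CPU (\d+)"
)


def tally_uce(lines):
    """Count Bus Uncorrectable Error events per (slot, device name)."""
    counts = Counter()
    for line in lines:
        m = UCE_RE.search(line.rstrip())
        if m:
            counts[(int(m.group(2)), m.group(3).strip())] += 1
    return counts


def tally_nmi(lines):
    """Count 'NMI received for unknown reason' messages per (host, reason code)."""
    counts = Counter()
    for line in lines:
        m = NMI_RE.match(line)
        if m:
            counts[(m.group(2), m.group(3))] += 1
    return counts
```

Feeding each server's exported SDS and system log lines into these two functions gives a quick per-slot and per-host count, which makes it easier to confirm that the UCE flood and the NMI messages fall in the same time window.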
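2. Further analysis of the status value recorded with each Bus Uncorrectable Error shows that, on both servers, the status value is the same every time a UCE is reported. For example:
The status is 0x00100000, which means bit 20 is set to 1. The status of the out-of-band UCE alarm on 161 is identical to that on 162, also 0x00100000. As the accompanying figure shows, this bit indicates an Unsupported Request (UR) response from the T4 GPU. On this error, the PCIe Root Port triggers a non-maskable interrupt (NMI) on the system processor, which results in an unrecoverable system error.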
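To make the bit-20 reading concrete, here is a small Python sketch that decodes a status word. It assumes the reported status is the PCIe Advanced Error Reporting (AER) Uncorrectable Error Status register. The bit names follow the PCIe specification's layout for that register as I understand it, and `decode_uncorrectable_status` is a hypothetical helper name.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# Assumption: the "status" in the BMC alarm is this register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_uncorrectable_status(status):
    """Return the names of all known error bits set in the status word."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


print(decode_uncorrectable_status(0x00100000))
# ['Unsupported Request Error']
```

Since 0x00100000 equals 1 << 20, only the Unsupported Request bit is set, which matches the interpretation given in the analysis.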
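The out-of-band alarms were caused by the T4 GPU receiving an Unsupported Request (UR) response, which produced the out-of-band UCEs and the server reboot. Follow-up investigation and adjustment are to be carried out at the system and application level.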