HP Superdome 2-16s Server
Basic information from the IDC log
Enclosure/ Blade          Usage/                  CPU         Memory            Use  Par Pending
Blade      Product Name   Status*                 OK/         (GB)              On   Num Deletion
                                                  Indicted/   OK/               Next
                                                  Deconf/     Indicted/         Boot
                                                  Max         Deconf
========== ============== ======================= =========== ================= ==== === ========
1/1        CB900s i2      Inactive Base /OK       8/0/0/8     32.0/0.0/0.0      yes  1   -
1/2        CB900s i2      Inactive Base /OK       8/0/0/8     32.0/0.0/0.0      yes  1   -
1/3        CB900s i2      Inactive Base /OK       8/0/0/8     32.0/0.0/0.0      yes  1   -
1/4        CB900s i2      Inactive Base /I D      8/0/0/8     32.0/0.0/0.0      yes  1   - <<<<------------
1/5        CB900s i2      Inactive Base /OK       8/0/0/8     32.0/0.0/0.0      yes  1   -
1/6        CB900s i2      Inactive Base /OK       8/0/0/8     32.0/0.0/0.0      yes  1   -
1/7        CB900s i2      Inactive Base /OK       8/0/0/8     32.0/0.0/0.0      yes  1   -
1/8        CB900s i2      Inactive Base /OK       8/0/0/8     32.0/0.0/0.0      yes  1   -
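The status column of a parstatus blade listing can be scanned mechanically for indicted (I) or deconfigured (D) blades. The following is a minimal Python sketch, assuming the rows are whitespace-separated as they appear in this note. The function name `flagged_blades` and the regular expression are illustrative and are not part of any HP tool.

```python
import re

# Matches one parstatus blade row, for example:
#   "1/4 CB900s i2 Inactive Base /I D 8/0/0/8 32.0/0.0/0.0 yes 1 -"
# Assumption: the status follows the first "/" after the blade ID and is
# either "OK" or a combination of the flags I (Indicted) and D (Deconfigured).
ROW = re.compile(
    r"^\s*(\d+)/(\d+)\s+.*?/\s*(OK|[ID](?:\s*[ID])?)\s+\d+/\d+/\d+/\d+"
)

def flagged_blades(parstatus_text):
    """Return [(enclosure, blade, flags)] for rows whose status is not OK."""
    result = []
    for line in parstatus_text.splitlines():
        m = ROW.match(line)
        if m and m.group(3) != "OK":
            flags = "".join(m.group(3).split())  # "I D" -> "ID"
            result.append((int(m.group(1)), int(m.group(2)), flags))
    return result

sample = """\
1/3 CB900s i2 Inactive Base /OK 8/0/0/8 32.0/0.0/0.0 yes 1 -
1/4 CB900s i2 Inactive Base /I D 8/0/0/8 32.0/0.0/0.0 yes 1 - <<<<------------
1/5 CB900s i2 Inactive Base /OK 8/0/0/8 32.0/0.0/0.0 yes 1 -
"""
print(flagged_blades(sample))  # prints [(1, 4, 'ID')]
```

Rows that do not match the pattern (headers, separators, legend lines) are skipped rather than reported as errors.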
SHOW CAE -E -n 4423 output from the IDC log
Summary :
Global MCA due to multi-bit error in the Agent's on-chip SRAM
Full Description :
A multi-bit error has occurred in the Agent's on-chip SRAM, which caused a global MCA. The partition has
been rebooted without the indicted blade.
Probable Cause 1 :
Error in Agent's internal SRAM.
Recommended Action 1 :
Replace the blade.
Replaceable Unit(s) :
Part Manufacturer : HP
Spare Part No. : AH342-67101
Part Serial No. : SGH3509NDS
Board Serial No. : MYJ348021U
Part Location : 0x0100ff04ffffff94 enclosure1/blade4 <<<<------------
Additional Info : Not Applicable
------------------------------------------------------------------------------
Starting analysis
------------------------------------------------------------------------------
The most severe error found in the error logs was Fatal Primary.
Problem:
Enc 1, Blade 4, Agent 1, CA 0[29]: A multi-bit ECC error was detected in an
enabled L4 tag way of the agent's on chip SRAM.
Possible Cause:
SRAM failure.
Possible Fix:
1. Replace the cell blade in bay 4 of enclosure 1.
------------------------------------------------------------------------------
Analysis completed
------------------------------------------------------------------------------
After Blade 4 was replaced, the host could not boot and rebooted repeatedly.
FPL:
Line 324425: 302596 SFW 1,8,0,0,0 1 *5 a39c224701e10000 0000000000004008 MEM_SMBUS_WRITE_FAILED
Line 324433: 302603 SFW 1,8,0,0,0 1 *3 639c26b801e10000 0000000000000000 MEM_MC_INIT_FAIL
Line 324867: 302930 SFW 1,8,0,0,0 1 *3 649c1f1101e10000 0100ff08ffffff94 BLADE_BOOT_ERROR
Line 327280: 304680 SFW 1,8,0,0,0 1 *5 a39c224701e10000 0000000000004008 MEM_SMBUS_WRITE_FAILED
Line 327296: 304695 SFW 1,8,0,0,0 1 *3 639c26b801e10000 0000000000000000 MEM_MC_INIT_FAIL
Line 327552: 304870 SFW 1,8,0,0,0 1 *3 649c1f1101e10000 0100ff08ffffff94 BLADE_BOOT_ERROR
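To see which blade the boot errors point at, the FPL lines can be grouped by reporting location. The sketch below is a hedged Python example: the field order (sequence, source, location, nPar, alert level, two hex words, keyword) and the reading of an SFW location as enclosure, blade, socket, core, thread are assumptions read off the lines in this note, and `errors_by_blade` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed FPL field order, inferred from the lines in this note:
# sequence, source, location, nPar, alert level (optionally "*"-prefixed),
# two 64-bit hex words, keyword.
FPL = re.compile(
    r"\b(\d+)\s+(\w+)\s+(\d+(?:,\d+)*)\s+(\S+)\s+\*?(\d+)\s+"
    r"([0-9a-fA-F]{16})\s+([0-9a-fA-F]{16})\s+(\w+)"
)

def errors_by_blade(fpl_text, min_alert=3):
    """Group SFW keywords with alert level >= min_alert by (enclosure, blade)."""
    grouped = defaultdict(list)
    for line in fpl_text.splitlines():
        m = FPL.search(line)
        if not m or m.group(2) != "SFW":
            continue  # only system-firmware entries carry a blade location here
        loc = [int(x) for x in m.group(3).split(",")]
        if len(loc) >= 2 and int(m.group(5)) >= min_alert:
            grouped[(loc[0], loc[1])].append(m.group(8))
    return dict(grouped)

sample = """\
Line 324425: 302596 SFW 1,8,0,0,0 1 *5 a39c224701e10000 0000000000004008 MEM_SMBUS_WRITE_FAILED
Line 324433: 302603 SFW 1,8,0,0,0 1 *3 639c26b801e10000 0000000000000000 MEM_MC_INIT_FAIL
Line 324867: 302930 SFW 1,8,0,0,0 1 *3 649c1f1101e10000 0100ff08ffffff94 BLADE_BOOT_ERROR
"""
for blade, keywords in errors_by_blade(sample).items():
    print(blade, keywords)
```

Applied to the FPL lines above, every error is grouped under enclosure 1, blade 8.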
IPMI Event Code: 041c259e01e10000 0100ff080001ff71
Record Type = E1h
Reporting Entity ID = System Firmware - Enclosure# 1, Blade # 8, CPU Socket # 0, Core 0, Thread 0
Event ID = #9630
...........................................................
Keyword = MEM_SMI_EARLY_DDR_CHAN_INIT_1
Description:
Phase 1 of the memory controller initialization has begun.
Cause / Action:
System firmware has begun to initialize the memory controller.
Recommendation:
Informational only
___________________________________________________________
Alert Level = 0 - Minor Forward Progress
Data Type = 4 - Physical location
Source = 7 - Memory
Detail = 1 - Controller
Formatting physical location
----------------------------
Cabinet # = 01
Blade Slot # = 08
CPU Socket # = 0
DIMM Controller # = 1
Data = 01 00 ff 08 00 01 ff 71
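The physical-location word can be split into its fields by hand or with a few lines of code. The Python sketch below assumes a byte layout inferred only from the decoded samples in this note (byte 0 = cabinet/enclosure, byte 3 = blade slot, byte 4 = CPU socket, byte 5 = controller in the memory-controller sample, 0xff = not specified). It is not taken from an HP specification, and `decode_location` is a hypothetical name.

```python
def decode_location(value):
    """Split a 64-bit physical-location word into the fields seen in this note.

    Assumed layout (inferred from the samples here, not from an HP spec):
    byte 0 = enclosure, byte 3 = blade slot, byte 4 = CPU socket,
    byte 5 = controller, and 0xff marks a field that is not specified.
    The remaining bytes are ignored.
    """
    text = value.lower().replace(" ", "")
    if text.startswith("0x"):
        text = text[2:]
    raw = bytes.fromhex(text.zfill(16))  # restore a dropped leading zero
    if len(raw) != 8:
        raise ValueError("expected a 64-bit value")

    def field(b):
        return None if b == 0xFF else b

    return {
        "enclosure": field(raw[0]),
        "blade": field(raw[3]),
        "cpu_socket": field(raw[4]),
        "controller": field(raw[5]),
    }

print(decode_location("01 00 ff 08 00 01 ff 71"))
print(decode_location("0x0100ff04ffffff94"))
```

The first call corresponds to the memory-controller location decoded above (cabinet 01, blade slot 08, CPU socket 0, controller 1). The second corresponds to the Part Location string reported for enclosure1/blade4.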
________________________________________________
Title: HP Integrity Superdome 2 Server - nPartition Power On Failed with the Error MEM_SMBUS_WRITE_FAILED or MEM_SMBUS_READ_FAILED (HW ERT)
Object Name: mmr_kc-0124566
Document Type: Support Information
Original owner: KCS - HW Integrity Servers
Disclosure level: HP Confidential
Version state: final
Environment
FACT:HP Integrity Superdome 2 Server
Questions/Symptoms
SYMPTOM:nPartition Power On failed
SYMPTOM:Boot sequencing error halted forward progress
SYMPTOM:BLADE_BOOT_ERROR
SYMPTOM:MEM_SMBUS_READ_FAILED
SYMPTOM:MEM_MC_INIT_FAIL
SYMPTOM:MEM_SMBUS_WRITE_FAILED
The nPartition fails to power on after the system firmware update.
1. OA syslog
Jan 24 09:27:07 caemon: Indication : IndicationIdentifier= 1765920150124092707 ProviderName = FPL_IndicationProvider PerceivedSeverity = 3 EventID = 7659 NparID = 1
Jan 24 09:28:16 mgmt: Blade 1 Ambient thermal state is OK.
Jan 24 09:28:26 mgmt: Blade 2 Ambient thermal state is OK.
Jan 24 09:28:26 mgmt: Blade 3 Ambient thermal state is OK.
Jan 24 09:28:26 mgmt: Blade 4 Ambient thermal state is OK.
Jan 24 09:28:37 mgmt: Blade 5 Ambient thermal state is OK.
Jan 24 09:28:37 mgmt: Blade 6 Ambient thermal state is OK.
Jan 24 09:28:37 mgmt: Blade 7 Ambient thermal state is OK.
Jan 24 09:28:47 mgmt: Blade 8 Ambient thermal state is OK.
Jan 24 09:33:13 -cli: [hpadmin] CONNECT PARTITION 1
Jan 24 09:34:06 mgmt: Blade 1 Ambient thermal state is OK.
Jan 24 09:34:16 mgmt: Blade 2 Ambient thermal state is OK.
Jan 24 09:34:16 mgmt: Blade 3 Ambient thermal state is OK.
Jan 24 09:34:16 mgmt: Blade 4 Ambient thermal state is OK.
Jan 24 09:34:26 mgmt: Blade 5 Ambient thermal state is OK.
Jan 24 09:34:26 mgmt: Blade 6 Ambient thermal state is OK.
Jan 24 09:34:26 mgmt: Blade 7 Ambient thermal state is OK.
Jan 24 09:34:37 mgmt: Blade 8 Ambient thermal state is OK.
Jan 24 09:39:41 parcon: Error: nPartition 1: nPartition Power On failed. Error: Firmware operation was un-successful.
2. CAE log event
OA> SHOW CAE -L
Sl.No Severity    EventId EventCategory PartitionId EventTime                Summary
#####################################################################################################
16    Information 2000    System Har... N/A         Sat Jan 24 09:54:45 2015 An Acquittal has been performed.
15    Information 2000    System Har... N/A         Sat Jan 24 09:53:57 2015 An Acquittal has been performed.
14    Fatal       3020    System Power  N/A         Sat Jan 24 09:52:34 2015 Electronic fuse (e-Fuse) has blown
13    Fatal       3020    System Power  N/A         Sat Jan 24 09:52:28 2015 Electronic fuse (e-Fuse) has blown
12    Degraded    7659    System Fir... 1           Sat Jan 24 09:27:07 2015 Boot sequencing error halted forward...
OA> SHOW CAE -E -n 12
Alert Number : 12
Event Identification :
Event ID : 7659
Provider Name : FPL_IndicationProvider
Event Time : Sat Jan 24 09:27:07 2015
Indication Identifier : 1765920150124092707
Managed Entity :
OA Name : OA
System Type : 59
System Serial No. : USE1118161
OA IP Address : 123.126.99.178
Affected Domain :
Enclosure Name : Dome_6
RackName : Dome_6_Rack
RackUID : 02SGH5104ACV
Impacted Domain : Partition
Complex Name : Dome_6
Partition ID : 1
Summary :
Boot sequencing error halted forward progress. Check other events for more data.
Full Description :
Boot sequencing has encountered an error and forward progress cannot continue.
Probable Cause 1 :
Cause can be one of several possible cases. Other error indications will show the specific case.
Recommended Action 1 :
Examine error details and determine if the blade is viable to include in the npar.
Replaceable Unit(s) :
Part Manufacturer : HP
Spare Part No. : AH342-67001
Part Serial No. : USE2257HHJ
Board Serial No. : MYJ13906LM
Part Location : 0x0100ff07ffffff94 enclosure1/blade7 <----
Additional Info : Not Applicable
Additional Data :
Severity : Degraded/Warning
Alert Type : Device Alert
Event Category : System Firmware
Event Subcategory : Unknown
Probable Cause : Other
Event Threshold : 1
Event Time Window : 0 (minutes)
Actual Event Threshold : 1
Actual Event Time Window : 0 (minutes)
Record ID : 0x0
Record Type : E1
Reporting Entity : 0x0100ff07ff000017 enclosure1/blade7/cpusocket0/cpucore0
Alert Level : 0x3
Data Type : 0x4
Data Payload : 0x100ff07ffffff94
Extended Reporting Entity ID : 0x6
Reporting Entity ID : 0x1
IPMI Event ID : 0x1f11
OEM System Model : NA
Original Product Number : AH337A
Current Product Number : AH337A
OEM Serial Number : NA
Version Info :
Complex FW Version : 3.7.98
Provider Version : 4.90
Error Log Data :
Error Log Bundle : 4000000000000206
3. Blade status reported with parstatus command
[Compute Enclosure]
Enclosure Enclosure Num    Num    Bay   Enclosure
Num       Type      Blades IOBays Slots Name
========= ========= ====== ====== ===== ================================
1         Compute   8      0      8     Dome_6
[Blade]
Enclosure/ Blade          Usage/                  CPU         Memory            Use  Par Pending
Blade      Product Name   Status*                 OK/         (GB)              On   Num Deletion
                                                  Indicted/   OK/               Next
                                                  Deconf/     Indicted/         Boot
                                                  Max         Deconf
========== ============== ======================= =========== ================= ==== === ========
1/1        CB900s i2      Inactive Base /OK       8/0/0/8     128.0/0.0/0.0     yes  1   -
1/2        CB900s i2      Inactive Base /OK       8/0/0/8     128.0/0.0/0.0     yes  1   -
1/3        CB900s i2      Inactive Base /OK       8/0/0/8     128.0/0.0/0.0     yes  1   -
1/4        CB900s i2      Inactive Base /OK       8/0/0/8     128.0/0.0/0.0     yes  1   -
1/5        CB900s i2      Inactive Base /OK       8/0/0/8     128.0/0.0/0.0     yes  1   -
1/6        CB900s i2      Inactive Base /OK       8/0/0/8     128.0/0.0/0.0     yes  1   -
1/7        CB900s i2      Inactive Base /OK       8/0/0/8     128.0/0.0/0.0     yes  1   -
1/8        CB900s i2      Inactive Base /OK       8/0/0/8     128.0/0.0/0.0     yes  1   -
* D-Deconfigured I-Indicted
4. SHOW BLADE NAMES output. All blades report status OK, but every blade is powered off.
OA> SHOW BLADE NAMES
Bay Server Name                   Serial Number   Status   Power   UID Partner
--- ----------------------------- --------------- -------- ------- --- -------
  1 CB900s i2                     USE1118164      OK       Off     Off
  2 CB900s i2                     USE1118162      OK       Off     Off
  3 CB900s i2                     USE1118163      OK       Off     Off
  4 CB900s i2                     USE143J7CL      OK       Off     Off
  5 CB900s i2                     USE1118165      OK       Off     Off
  6 CB900s i2                     USE1118166      OK       Off     Off
  7 CB900s i2                     USE2257HHJ      OK       Off     Off
  8 CB900s i2                     USE2257HHK      OK       Off     Off
  9 [Subsumed]
 10 [Subsumed]
 11 [Subsumed]
 12 [Subsumed]
 13 [Subsumed]
 14 [Subsumed]
 15 [Subsumed]
 16 [Subsumed]
Totals: 8 server blades installed, 0 powered on.
Cause
CAUSE:Faulty blade in slot 7
Answer/Solution
FIX:Replaced the slot 7 blade.
Note:
- BLADE_BOOT_ERROR is seen only if the firmware bundle version is 3.7.60 or higher.
- CAE event 7956 is logged with firmware version 3.7.60 or higher.
1. Check the firmware update status. The firmware update operation appears to have completed successfully.
- OA syslog
Jan 24 08:21:28 mgmt: A USB Key was inserted into the Onboard Administrator.
Jan 24 08:26:12 -cli: initFirmwareUpdate: uri = usb://d2/hpsd2-3.7.98-fw.bundle
|
Jan 24 08:26:12 -cli: UPDATE FIRMWARE complex --all --timeout=10 --timeout_action=proceed usb://d2/hpsd2-3.7.98-fw.bundle
Jan 24 08:26:13 update_firmware[29496]: Firmware update PENDING
Jan 24 08:31:34 -cli: hpadmin logged out of the Onboard Administrator
Jan 24 08:31:36 hpoa: hpadmin logged out of the Onboard Administrator
Jan 24 08:33:07 update_firmware[29496]: Firmware update STARTED
Jan 24 08:33:11 update_firmware[3330]: Update Blade 1/1: Starting update now
Jan 24 08:33:11 update_firmware[3343]: Update Blade 1/3: Starting update now
Jan 24 08:33:11 update_firmware[3344]: Update Blade 1/2: Starting update now
Jan 24 08:33:12 update_firmware[3357]: Update Blade 1/4: Starting update now
|
Jan 24 08:38:12 update_firmware[4953]: Update Blade 1/5: Starting update now
Jan 24 08:38:12 update_firmware[4958]: Update Blade 1/6: Starting update now
Jan 24 08:38:13 update_firmware[4990]: Update Blade 1/7: Starting update now
Jan 24 08:38:13 update_firmware[4995]: Update Blade 1/8: Starting update now
|
Jan 24 08:43:15 update_firmware[6526]: Update OA 1/2: Starting update now
Jan 24 08:46:43 update_firmware[7563]: Update Blade 1/2: Success, complete
Jan 24 08:46:48 update_firmware[7631]: Update Blade 1/3: Success, complete
Jan 24 08:46:49 mgmt: Utility Processor on Blade 2 has completed boot.
|
Jan 24 08:46:53 update_firmware[7694]: Update Blade 1/1: Success, complete
Jan 24 08:46:53 update_firmware[7698]: Update Blade 1/4: Success, complete
Jan 24 08:46:57 mgmt: Utility Processor on Blade 3 has completed boot.
Jan 24 08:47:01 mgmt: Utility Processor on Blade 1 has completed boot.
Jan 24 08:47:03 mgmt: Utility Processor on Blade 4 has completed boot.
Jan 24 08:48:13 update_firmware[8444]: Update XFM 1/1: Starting update now
Jan 24 08:48:14 update_firmware[8462]: Update XFM 1/3: Starting update now
Jan 24 08:48:15 update_firmware[8490]: Update GPSM 1/2: Starting update now
Jan 24 08:48:15 update_firmware[8493]: Update XFM 1/2: Starting update now
Jan 24 08:48:15 update_firmware[8503]: Update XFM 1/4: Starting update now
Jan 24 08:48:17 update_firmware[8526]: Update GPSM 1/1: Starting update now
Jan 24 08:49:39 update_firmware[9130]: Update XFM 1/1: Success, complete
Jan 24 08:49:43 update_firmware[9153]: Update XFM 1/3: Success, complete
Jan 24 08:49:44 update_firmware[9184]: Update XFM 1/2: Success, complete
Jan 24 08:49:50 update_firmware[9240]: Update XFM 1/4: Success, complete
Jan 24 08:49:55 update_firmware[9306]: Update GPSM 1/2: Success, complete
Jan 24 08:50:00 update_firmware[9341]: Update GPSM 1/1: Success, complete
|
Jan 24 08:51:08 update_firmware[9775]: Update Blade 1/6: Success, complete
Jan 24 08:51:18 mgmt: Utility Processor on Blade 7 has completed boot.
Jan 24 08:51:19 update_firmware[9873]: Update Blade 1/7: Success, complete
Jan 24 08:51:26 mgmt: Utility Processor on Blade 6 has completed boot.
|
Jan 24 08:51:41 update_firmware[10111]: Update Blade 1/8: Success, complete
Jan 24 08:51:46 mgmt: Utility Processor on Blade 8 has completed boot.
Jan 24 08:51:47 mgmt: Utility Processor on Blade 5 appears responsive again.
Jan 24 08:51:59 update_firmware[10282]: Update Blade 1/5: Success, complete
Jan 24 08:52:06 mgmt: Utility Processor on Blade 5 has completed boot.
Jan 24 08:53:18 update_firmware[11048]: Update IOX 9/1: Starting update now
Jan 24 08:53:18 update_firmware[11051]: Update IOX 10/1: Starting update now
|
Jan 24 08:54:45 update_firmware[11564]: Update IOX 10/1: Success, complete
Jan 24 08:54:47 update_firmware[11622]: Update IOX 9/1: Success, complete
Jan 24 08:55:59 update_firmware[12066]: Update OA 1/2: Success, complete
Jan 24 08:56:02 update_firmware[12139]: Update OA 1/1: Starting update now
Jan 24 08:58:42 update_firmware[13067]: Update OA 1/1: Success, part 1 of update
|
Jan 24 09:03:11 OA: No NVRAM downgrade required for OA RPM: 4.83-0
- FPL
567992 OA 1,1 None 0 0b0020cb00e10000 0100000054c39f13 FIRMWARE_UPDATE_COMPLEX
567993 OA 1,1 None 0 0b0020cd00e10000 0100000054c39f13 FIRMWARE_UPDATE_NPAR
|
569392 OA 1,1 None 0 0b0020cc00e10000 0100000054c3a512 FIRMWARE_UPDATE_COMPLETE
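Checking that each component that started an update also reported completion is easy to do by eye on a short log, but tedious on a full OA syslog. The following Python sketch pairs the "Starting update now" and "Success, complete" messages per component. It assumes the `update_firmware[pid]: Update <component>: <message>` line format seen in this excerpt, and `unfinished_updates` is a hypothetical helper name.

```python
import re

# Assumed line format, taken from the OA syslog excerpt in this note.
UPDATE = re.compile(r"update_firmware\[\d+\]: Update (\w+ \d+/\d+): (.+)$")

def unfinished_updates(syslog_text):
    """Return components that logged a start but no 'Success, complete'."""
    started, done = [], set()
    for line in syslog_text.splitlines():
        m = UPDATE.search(line)
        if not m:
            continue
        component, message = m.group(1), m.group(2).strip()
        if message.startswith("Starting update"):
            if component not in started:
                started.append(component)
        elif message == "Success, complete":
            done.add(component)
    return [c for c in started if c not in done]

sample = """\
Jan 24 08:43:15 update_firmware[6526]: Update OA 1/2: Starting update now
Jan 24 08:55:59 update_firmware[12066]: Update OA 1/2: Success, complete
Jan 24 08:56:02 update_firmware[12139]: Update OA 1/1: Starting update now
Jan 24 08:58:42 update_firmware[13067]: Update OA 1/1: Success, part 1 of update
"""
print(unfinished_updates(sample))  # prints ['OA 1/1']
```

On the syslog excerpt above, only OA 1/1 would be listed, because its final completion line falls in a portion of the log that is elided. The FIRMWARE_UPDATE_COMPLETE entry in the FPL and the SHOW UPDATE FIRMWARE output remain the authoritative confirmation.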
2. Check the firmware version
OA> SHOW UPDATE FIRMWARE
Configured complex firmware bundle version: 3.7.98
===============================================
Firmware on all devices matches the complex configured bundle version.
3. Review SEL and FPL
- Last entries of the boot error
13706 SFW 1,7,1,0,0 1 *5 a398224741e1569f 0000000000004008 MEM_SMBUS_WRITE_FAILED
13706 02/15/2014 11:32:59
13707 SFW 1,7,1,0,0 1 *3 639826b841e156a1 0000000000000000 MEM_MC_INIT_FAIL
13707 02/15/2014 11:32:59
13708 SFW 1,7,0,0,0 1 *3 64981f1101e156a3 0100ff07ffffff94 BLADE_BOOT_ERROR
13708 02/15/2014 11:32:59
16669 SFW 1,7,1,0,0 1 *5 a398224641e16ce9 0000000000004007 MEM_SMBUS_READ_FAILED
- The entries of the boot error logged at the earliest time
16081 SFW 1,7,1,0,0 1 *5 a398224741e168a4 0000000000004008 MEM_SMBUS_WRITE_FAILED
16081 01/24/2015 09:26:58
16082 SFW 1,7,1,0,0 1 *3 639826b841e168a6 0000000000000000 MEM_MC_INIT_FAIL
16082 01/24/2015 09:26:58
16083 PDHC 1,1 1 2 568024b100e168a8 0000000000000000 ELS_RECOVER_WAIT_DEL
16083 01/24/2015 09:27:06
16084 SFW 1,7,0,0,0 1 *3 64981f1101e168aa 0100ff07ffffff94 BLADE_BOOT_ERROR
16084 01/24/2015 09:27:07
4. Decode the MEM_SMBUS_WRITE_FAILED error. Note that MEM_SMBUS_READ_FAILED is also
logged.
The most likely suspect hardware is the blade. Use the Reporting Entity ID to identify the
suspect blade location. In this case it is Enclosure # 1 : Blade # 7.
Note: The I2C interface between Mill Brook and the FPGA is defined as SMBUS
(System Management Bus). See the block diagram at the bottom of this document.
Event 13706
IPMI Event Code: a398224741e1569f 0000000000004008
Record Type = E1
Reporting Entity ID = System Firmware - Enclosure # 1 : Blade # 7 : CPU Socket # 1 : Core # 0 : Thread # 0
Event ID = #8775
...........................................................
Keyword = MEM_SMBUS_WRITE_FAILED
Description:
An SM Bus write operation failed during system firmware boot
Cause/Action:
Memory initialization failure
Recommendation:
Refer to related WS-Man alerts.
Alert Level = 5 - Critical
Data Type = 3 - Actual data
Data = 00 00 00 00 00 00 40 08
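The alert level, data type, and event ID in the decoded output above can be read directly from the first 64-bit word of the IPMI event code. The Python sketch below uses a bit layout inferred solely from the decoded samples in this note. It is an assumption, not a documented HP format, and `decode_event_word` is a hypothetical name.

```python
def decode_event_word(word):
    """Decode the first 64-bit word of an IPMI event code as seen in this note.

    Assumed layout, inferred from the decoded samples in this note only:
      byte 0, top 3 bits -> alert level
      byte 0, low 5 bits -> data type
      bytes 2-3          -> event ID (big-endian)
      byte 5             -> record type
    """
    raw = bytes.fromhex(word)
    if len(raw) != 8:
        raise ValueError("expected 16 hex digits")
    return {
        "alert_level": raw[0] >> 5,
        "data_type": raw[0] & 0x1F,
        "event_id": int.from_bytes(raw[2:4], "big"),
        "record_type": format(raw[5], "02X"),
    }

# Event 13706 (MEM_SMBUS_WRITE_FAILED) from this note.
print(decode_event_word("a398224741e1569f"))
```

For event 13706 this yields alert level 5, data type 3, event ID 8775, and record type E1, matching the decoded fields listed above. The same layout reproduces event ID 9630 for the MEM_SMI_EARLY_DDR_CHAN_INIT_1 code `041c259e01e10000`.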
5. Simple block diagram of CPU and memory subsystem
Search keywords:
ecen sd2
Replace the Blade 8 BU.