panic[cpu0]/thread=ffffff000961dc60: Unrecoverable Machine-Check Exception

Discussion:

Ed Kaczmarek

2009-01-17 14:42:49 UTC

Greetings. I get the following panic on a whitebox (Tyan) dual-socket box: dual-core AMD Opteron model 2222, 4 GB RAM, running build 106 of OpenSolaris. Single PATA DVD, single SATA 500 GB WD HDD. Ubuntu 7.10 and Win2k8 boot, install, and work, so I believe the hardware is functional.

panic[cpu0]/thread=ffffff000961dc60: Unrecoverable Machine-Check Exception

ffffff000961d690 unix:cmi_mca_panic+1b ()
ffffff000961d6d0 unix:cmi_mca_trap+170 ()
ffffff000961d6e0 unix:mcetrap+154 ()
ffffff000961d860 unix:ddi_getb+f ()
ffffff000961d8f0 ata:ata_drive_type+7f ()
ffffff000961d9a0 ata:ata_init_drive+c3 ()
ffffff000961da10 ata:ata_attach+6b ()
ffffff000961da70 genunix:devi_attach+80 ()
ffffff000961daa0 genunix:attach_node+95 ()
ffffff000961dae0 genunix:i_ndi_config_node+a5 ()
ffffff000961db00 genunix:i_ddi_attachchild+40 ()
ffffff000961db40 genunix:devi_attach_node+ac ()
ffffff000961dba0 genunix:config_immediate_children+d5 ()
ffffff000961dbf0 genunix:devi_config_common+a6 ()
ffffff000961dc40 genunix:mt_config_thread+53 ()
ffffff000961dc50 unix:thread_start+8 ()

Here are the panic messages from the same box with Solaris 10 Update 6:

SunOS Release 5.10 Version Generic_137138-09 32-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Configuring devices.
panic[cpu0]/thread=cd099de0: Unrecoverable Machine-Check Exception

cd099c00 unix:cmi_mca_panic+2d (cc942200, 376, cc94)
cd099c1c unix:cmi_mca_trap+13b (cd099c28)
cd099c28 unix:mcetrap+59 (cd0901b0, cd090000,)
cd099c94 unix:ddi_io_get8+13 (a0, cc942200, 170, )
cd099cd8 ata:ata_probe+78 (d20f6140)
cd099d00 genunix:devi_probe+32 (d20f6140, d20f6140,)
cd099d14 genunix:probe_node+30 (d20f6140)
cd099d2c genunix:i_ndi_config_node+a9 (d20f6140, 6, 0)
cd099d44 genunix:i_ddi_attachchild+32 (d20f6140)
cd099d5c genunix:devi_attach_node+65 (d20f6140, 4004048)
cd099d7c genunix:config_immediate_children+b4 (d20f63b0, 4004048, )
cd099d98 genunix:devi_config_common+76 (d20f63b0, 4004048, )
cd099dc8 genunix:mt_config_thread+90 (ccf9bc48, 0)
cd099dd8 unix:thread_start+8 ()

Thoughts?

--
This message posted from opensolaris.org

Gavin Maltby

2009-01-17 22:06:36 UTC

Hi,

Post by Ed Kaczmarek
Greetings. I get the following panic on a whitebox (Tyan) dual-socket box: dual-core AMD Opteron model 2222, 4 GB RAM, running build 106 of OpenSolaris. Single PATA DVD, single SATA 500 GB WD HDD. Ubuntu 7.10 and Win2k8 boot, install, and work, so I believe the hardware is functional.
panic[cpu0]/thread=ffffff000961dc60: Unrecoverable Machine-Check Exception
ffffff000961d690 unix:cmi_mca_panic+1b ()
ffffff000961d6d0 unix:cmi_mca_trap+170 ()
ffffff000961d6e0 unix:mcetrap+154 ()
ffffff000961d860 unix:ddi_getb+f ()
ffffff000961d8f0 ata:ata_drive_type+7f ()
ffffff000961d9a0 ata:ata_init_drive+c3 ()
ffffff000961da10 ata:ata_attach+6b ()
ffffff000961da70 genunix:devi_attach+80 ()
ffffff000961daa0 genunix:attach_node+95 ()
ffffff000961dae0 genunix:i_ndi_config_node+a5 ()
ffffff000961db00 genunix:i_ddi_attachchild+40 ()
ffffff000961db40 genunix:devi_attach_node+ac ()
ffffff000961dba0 genunix:config_immediate_children+d5 ()
ffffff000961dbf0 genunix:devi_config_common+a6 ()
ffffff000961dc40 genunix:mt_config_thread+53 ()
ffffff000961dc50 unix:thread_start+8 ()

Is that all the console info you see? I'd expect a few lines
of class=.... detector=... etc. as it dumps the terminal
error report summary. It looks like we may be too early in boot
for a crash dump to succeed.

Let's have it blow through machine checks so we can boot and
log them. At the grub menu edit the 'unix' line and
append '-kd' to it so we boot into the kernel debugger.
When it stops at the prompt type

cmi_panic_on_uncorrectable_error/W0
:c

The first line makes machine checks non-terminal, the second
continues.

You'll no doubt get a few events during boot as the driver
pokes at registers again, but we won't panic for them
(though if they are important then we may hang). Assuming
we boot OK, could you mail the output of 'fmdump -eV'
so we can see details of the error?

Solaris switches on error traps that other OSes do not, and
on Opteron it changes the NorthBridge error configuration.
There have been a few cases of that exposing us to errors
that are really harmless and that other OSes wouldn't notice
(e.g., a driver trying to read from a bogus register, which
can be harmless if it is just probing).

Gavin

Gavin Maltby

2009-01-18 23:20:35 UTC

Hi Ed,

Post by Gavin Maltby
Is that all the console info you see? I'd expect a few lines
of class=.... detector=... etc. as it dumps the terminal
error report summary. It looks like we may be too early in boot
for a crash dump to succeed.

attached is the full original console output.

Post by Gavin Maltby
Let's have it blow through machine checks so we can boot and
log them. At the grub menu edit the 'unix' line and
append '-kd' to it so we boot into the kernel debugger.
When it stops at the prompt type
cmi_panic_on_uncorrectable_error/W0
:c
The first line makes machine checks non-terminal, the second
continues.
You'll no doubt get a few events during boot as the driver
pokes at registers again, but we won't panic for them
(but if they are important then we may hang).

It looks hung, it's been at this point for 5+ minutes.
no spinning of the dial...
Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ unix krtld genunix ]
[0]> cmi_panic_on_uncorrectable_error/W0
cmi_panic_on_uncorrectable_error: 0x1 = 0x0
[0]> :c
SunOS Release 5.11 Version snv_106 64-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
WARNING: Last shutdown is later than time on time-of-day chip; check date.
Configuring /dev
\

Post by Gavin Maltby
Assuming
we boot ok could you mail the output of 'fmdump -eV'
so we can see details of the error.

Sorry it's well hung. :)

From the console output:

syncing file systems... done
ereport.cpu.amd.nb.wdog ena=1dd4c26ab700001 detector=[ version=0 scheme="hc"
hc-list=[ hc-name="motherboard" hc-id="0" hc-name="chip" hc-id="0" hc-name=
"core" hc-id="0" hc-name="strand" hc-id="0" ] ] compound_errorname=
"BUSLG_SRC_ERR__NOTIMEOUT_ERR" disp=
"processor_context_corrupt,return_ip_invalid,unconstrained,forcefatal"
IA32_MCG_STATUS=4 machine_check_in_progress=1 privileged=0 bank_number=4
bank_msr_offset=410 IA32_MCi_STATUS=b200000000070f0f overflow=0
error_uncorrected=1 error_enabled=1 processor_context_corrupt=1 error_code=f0f
model_specific_error_code=7

skipping system dump - no dump device configured

The ereport class is ereport.cpu.amd.nb.wdog - NorthBridge watchdog.
We panic because the error is uncorrected and processor context
is corrupt.

The NorthBridge watchdog fires (if it is enabled) when a request is not
replied to within some timeout period. The error address register
for watchdogs doesn't hold an address; it decodes to tell us who made the
request to whom, etc. Unfortunately there's a bug here: the
generic code that grabs the registers only grabs the
address register if the status register indicates the
address is valid (ADDRV bit), which it never is for a watchdog,
so we never grab and include the address in the error report above.
I'll log a bug on that. The address usually doesn't tell you
as much as you'd hope anyway.
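
For anyone who wants to check the status word from the ereport by hand, here is a minimal sketch in C. It assumes the architectural MCi_STATUS layout (flag bits 63 down to 57, the model-specific code in bits 31:16, the error code in bits 15:0). The macro and function names are made up for illustration and are not taken from the Solaris cpu module.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural flag bits at the top of an MCi_STATUS value. */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_OVER  (1ULL << 62)  /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_EN    (1ULL << 60)  /* reporting was enabled for this error */
#define MCI_STATUS_MISCV (1ULL << 59)  /* misc register contents are valid */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* address register contents are valid */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context corrupt */

/* Hypothetical helper: 1 if the given flag bit is set in status, else 0. */
int mci_flag(uint64_t status, uint64_t flag)
{
    return (status & flag) != 0;
}

/* Low 16 bits: the error code (error_code in the ereport). */
unsigned mci_error_code(uint64_t status)
{
    return (unsigned)(status & 0xffff);
}

/* Bits 31:16: the model-specific code (model_specific_error_code). */
unsigned mci_model_code(uint64_t status)
{
    return (unsigned)((status >> 16) & 0xffff);
}

/* Uncorrected plus context corrupt is the combination the panic is blamed on. */
int mci_is_fatal(uint64_t status)
{
    return mci_flag(status, MCI_STATUS_UC) && mci_flag(status, MCI_STATUS_PCC);
}

/* Print the decoded fields using the same names as the ereport. */
void mci_print(uint64_t status)
{
    printf("IA32_MCi_STATUS=%016" PRIx64 "\n", status);
    printf("overflow=%d error_uncorrected=%d error_enabled=%d\n",
        mci_flag(status, MCI_STATUS_OVER),
        mci_flag(status, MCI_STATUS_UC),
        mci_flag(status, MCI_STATUS_EN));
    printf("addr_valid=%d processor_context_corrupt=%d\n",
        mci_flag(status, MCI_STATUS_ADDRV),
        mci_flag(status, MCI_STATUS_PCC));
    printf("error_code=%x model_specific_error_code=%x\n",
        mci_error_code(status), mci_model_code(status));
}
```

Calling mci_print(0xb200000000070f0fULL) from a small main() lets you compare the decoded fields with the ereport. Note that the ADDRV bit is clear in that value, which is why the address register was never captured.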

The default policy is to enable the watchdog if not enabled in the BIOS.
We can change that.

/*
 * The default watchdog policy is to enable it (at the above rate) if it
 * is disabled; if it is enabled then we leave it enabled at the rate
 * chosen by the BIOS.
 */
enum {
    AO_NB_WDOG_LEAVEALONE,          /* Don't touch watchdog config */
    AO_NB_WDOG_DISABLE,             /* Always disable watchdog */
    AO_NB_WDOG_ENABLE_IF_DISABLED,  /* If disabled, enable at our rate */
    AO_NB_WDOG_ENABLE_FORCE_RATE    /* Enable and set our rate */
} ao_nb_watchdog_policy = AO_NB_WDOG_ENABLE_IF_DISABLED;

Changing that with kmdb involves setting deferred breakpoints. We'll cheat
by first disabling everything and setting what we want in /etc/system:

1) Boot into kmdb as before (add -kd to unix line in grub). At the prompt
utter 'cmi_no_init/W1' then ':c' to continue. We'll boot without loading
any cpu module support, and that should get you booted I think (if not,
there are bigger problems).

2) Append the following to /etc/system:

set cpu_ms\.AuthenticAMD\.15:ao_nb_watchdog_policy=0

3) Reboot normally

That will leave the watchdog as the BIOS had it, and I suspect it's
off by default, while leaving other functionality operational.
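
To make the four policy values concrete, here is a rough model in C of what each one does to the watchdog, given whether the BIOS left it enabled. This is only a sketch of the behaviour described in the enum comments quoted above, not the actual cpu module code; the wdog_* names and the result structure are hypothetical. The numeric values follow from the enum ordering, so 0 is "leave alone" as used in the /etc/system line.

```c
#include <stdbool.h>

/* Same ordering as the quoted enum, so the values run 0..3. */
enum wdog_policy {
    WDOG_LEAVEALONE,          /* 0: don't touch watchdog config */
    WDOG_DISABLE,             /* 1: always disable watchdog */
    WDOG_ENABLE_IF_DISABLED,  /* 2: the default policy */
    WDOG_ENABLE_FORCE_RATE    /* 3: enable and set the OS-chosen rate */
};

/* Hypothetical summary of the watchdog state after a policy is applied. */
struct wdog_result {
    bool enabled;  /* is the watchdog running afterwards? */
    bool os_rate;  /* true if the OS-chosen rate is in effect */
};

/* Model the outcome of a policy, starting from the state the BIOS left. */
struct wdog_result wdog_apply(enum wdog_policy policy, bool bios_enabled)
{
    struct wdog_result r = { bios_enabled, false };

    switch (policy) {
    case WDOG_LEAVEALONE:
        /* Keep whatever the BIOS configured. */
        break;
    case WDOG_DISABLE:
        r.enabled = false;
        break;
    case WDOG_ENABLE_IF_DISABLED:
        /* Only step in when the BIOS left the watchdog off. */
        if (!bios_enabled) {
            r.enabled = true;
            r.os_rate = true;
        }
        break;
    case WDOG_ENABLE_FORCE_RATE:
        r.enabled = true;
        r.os_rate = true;
        break;
    }
    return r;
}
```

Under this model, a BIOS that leaves the watchdog off ends up with it enabled under the default policy (value 2) and still off under policy 0, which is the point of the workaround.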

I think ata has necessitated this workaround on one or two other motherboards
before now. I don't know the true root cause.

Gavin

Gavin Maltby

2009-01-18 23:50:46 UTC

Post by Gavin Maltby
The NorthBridge watchdog fires (if it is enabled) when a request is not
replied to within some timeout period. The error address register
for watchdogs isn't an address, but decodes to tell us who made the
request to who etc. Unfortunately there's a bug here - the
generic code that grabs the registers only grabs the
address register if the status register indicates the
address is valid (ADDRV bit) which it never is for a watchdog -
so we never grab and include the address in the error report above.
I'll log a bug on that.

CR 6795177 raised for that.

Gavin

Ed Kaczmarek

2009-01-19 13:59:44 UTC

Post by Gavin Maltby
Changing that with kmdb involves setting deferred breakpoints. We'll cheat
1) Boot into kmdb as before (add -kd to unix line in grub). At the prompt
utter 'cmi_no_init/W1' then ':c' to continue. We'll boot without loading
any cpu module support, and that should get you booted I think (if not,
there are bigger problems).

I got big problems then...

Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ unix krtld genunix ]
[0]> cmi_no_init/W1
cmi_no_init: 0 = 0x1
[0]> :c
SunOS Release 5.11 Version snv_106 64-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
WARNING: Time-of-day chip unresponsive; dead batteries?
Configuring /dev
\

I searched thru BIOS screens for any mention of any watchdog timer
settings.
None.

Gavin Maltby

2009-01-20 00:51:26 UTC

Hi,

Post by Ed Kaczmarek

I got big problems then...
Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ unix krtld genunix ]
[0]> cmi_no_init/W1
cmi_no_init: 0 = 0x1

Good news for me - that switches off just about all my code :-)

Post by Ed Kaczmarek
[0]> :c
SunOS Release 5.11 Version snv_106 64-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
WARNING: Time-of-day chip unresponsive; dead batteries?
Configuring /dev
\

A few more tests to try

1) Is this a dual-core system? If so could you force use of
just a single cpu as follows and see if we boot ok:

boot kmdb (-kd on unix line in grub)
use_mp/W0
:c

2) We can force the NB watchdog to be disabled when the cpu
module is loaded (we know we're getting that far because we
saw the machine check details). We have to use
a deferred breakpoint for this:

boot kmdb
::bp cpu_ms.AuthenticAMD.15`ao_ms_init
:c

When the breakpoint hits on the boot cpu you'll return to kmdb. Now

ao_nb_watchdog_policy/W1
:z
:c

That sets policy AO_NB_WDOG_DISABLE which will unconditionally disable the
watchdog. The :z clears breakpoints so we don't hit them on other cpus.

3) If the BIOS is enabling the watchdog Solaris does not touch it by default.
We can force Solaris to apply its chosen watchdog rate (longest possible
timeout) with AO_NB_WDOG_ENABLE_FORCE_RATE:

boot kmdb
::bp cpu_ms.AuthenticAMD.15`ao_ms_init
:c

When the breakpoint hits on the boot cpu you'll return to kmdb. Now

ao_nb_watchdog_policy/W3
:z
:c

4) Now here's a real stab in the dark. If your BIOS offers an option
to present your SATA disks as AHCI devices (rather than the old
and busted IDE mode), make sure that is set. I guess ata may still
be involved if you have an IDE DVD drive, but we may get further.
Not sure if the path to disk devices will change if you do this - the
machine may not boot for other reasons!

Thanks

Gavin

Gavin Maltby

2009-01-20 00:57:34 UTC

Post by Gavin Maltby
A few more tests to try

One more. If any of these workarounds allows you to boot,
could you 'svcadm disable hal' and reboot without
any workarounds and see if that does the job? Re-enable
hal afterwards.

This is based on unsolved CR 6491248, in which a couple of
machines with snv_50/51 had similar symptoms and, for one, disabling hal
avoided the issue.

Thanks

Gavin