Discussion: fm and memory errors
Paul B. Henson
2009-09-16 04:39:42 UTC
I've been seeing a smattering of these in the error logs on an x4500:

Sep 11 15:12:08.7007 ereport.cpu.amd.nb.mem_ce
Sep 12 03:27:19.1011 ereport.cpu.amd.nb.mem_ce
Sep 12 09:34:49.3013 ereport.cpu.amd.nb.mem_ce
Sep 12 15:42:19.4911 ereport.cpu.amd.nb.mem_ce
Sep 12 21:49:59.6950 ereport.cpu.amd.nb.mem_ce
Sep 13 03:57:29.8841 ereport.cpu.amd.nb.mem_ce
Sep 13 10:05:00.0976 ereport.cpu.amd.nb.mem_ce
Sep 13 16:12:40.2817 ereport.cpu.amd.nb.mem_ce
Sep 13 22:20:10.4972 ereport.cpu.amd.nb.mem_ce
Sep 14 10:35:20.8700 ereport.cpu.amd.nb.mem_ce
Sep 14 22:50:21.2817 ereport.cpu.amd.nb.mem_ce
Sep 15 04:58:01.4661 ereport.cpu.amd.nb.mem_ce
Sep 15 11:05:31.6787 ereport.cpu.amd.nb.mem_ce

Nothing's been reported as faulted, but they are occurring on a pretty
regular basis. I was digging around trying to find out what these mean; the
only thing I really found was:

http://opensolaris.org/os/project/generic-mca/docs/portfolio/diagnosis/

I think they're correctable ECC errors? The link above indicates they are
diagnosed by "amd64.esc", but I haven't been able to find any details on
how many correctable errors have to occur before something is considered
faulted.

In another part, the above page says "The number of such page_sb faults is
counted for each chip-select, and when any chip-select has more than 64
pages faulted in this way we fault the chip-select with a
fault.memory.generic-x86.dimm_sb", which I think indicates there have to be
more than 64 correctable failures before a fault is generated?

Is that correct? Is there a better source of documentation on this?

Thanks...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Srihari Venkatesan
2009-09-16 07:10:25 UTC
Hi Paul,
Post by Paul B. Henson
Sep 11 15:12:08.7007 ereport.cpu.amd.nb.mem_ce
Sep 12 03:27:19.1011 ereport.cpu.amd.nb.mem_ce
Sep 12 09:34:49.3013 ereport.cpu.amd.nb.mem_ce
Sep 12 15:42:19.4911 ereport.cpu.amd.nb.mem_ce
Sep 12 21:49:59.6950 ereport.cpu.amd.nb.mem_ce
Sep 13 03:57:29.8841 ereport.cpu.amd.nb.mem_ce
Sep 13 10:05:00.0976 ereport.cpu.amd.nb.mem_ce
Sep 13 16:12:40.2817 ereport.cpu.amd.nb.mem_ce
Sep 13 22:20:10.4972 ereport.cpu.amd.nb.mem_ce
Sep 14 10:35:20.8700 ereport.cpu.amd.nb.mem_ce
Sep 14 22:50:21.2817 ereport.cpu.amd.nb.mem_ce
Sep 15 04:58:01.4661 ereport.cpu.amd.nb.mem_ce
Sep 15 11:05:31.6787 ereport.cpu.amd.nb.mem_ce
The cpu.amd.nb.mem_ce ereport signature indicates that this x4500 has the
AMD Family 0xF model-specific support.
Post by Paul B. Henson
Nothing's been reported as faulted, but they are occurring on a pretty
regular basis. I was digging around trying to find out what these mean; the
only thing I really found was:
http://opensolaris.org/os/project/generic-mca/docs/portfolio/diagnosis/
I think they're correctable ECC errors? The link above indicates they are
diagnosed by "amd64.esc", but I haven't been able to find any details on
how many correctable errors have to occur before something is considered
faulted.
Yes, mem_ce means a single-bit correctable ECC memory error or a
single-symbol (x4) correctable ChipKill ECC memory error (fmdump -eV
should tell you the syndrome-type, i.e. whether it's ChipKill or 64/8
ECC). I am looking at amd64.esc -
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc
- and I don't know the Eversholt language, so I'm guessing here, but
from lines 140 to 143 it appears a page will be faulted if there are 2
correctable errors within 72 hours (from the same page, I believe). On
top of that there is a threshold of "RANK_PGFLT_MAX" page faults (a
page fault here means the page will be submitted for page-retire); see
lines 217 and 257 of amd64.esc: 128 page faults will fault the rank,
and a threshold beyond that will fault the DIMM. Faulting a DIMM means
a fault record will be logged for that DIMM, but any correctable errors
that come from it will continue to cause page faults/page-retire
attempts.
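
To make that counting concrete, here is a rough Python model of the
Family 0xF rules as I read them; it is only a sketch of the logic, not
the actual Eversholt/fmd implementation, and names like
PAGE_CE_WINDOW_HOURS are mine (only RANK_PGFLT_MAX appears in
amd64.esc):

    from collections import defaultdict

    PAGE_CE_WINDOW_HOURS = 72   # 2 correctable errors within 72 hours...
    PAGE_CE_COUNT = 2           # ...fault (and try to retire) the page
    RANK_PGFLT_MAX = 128        # page faults that fault the whole rank

    page_ce_times = defaultdict(list)    # (rank, page) -> CE timestamps
    rank_page_faults = defaultdict(int)  # rank -> pages faulted so far

    def on_mem_ce(rank, page, now_hours):
        """Model one ereport.cpu.amd.nb.mem_ce against a page of a rank."""
        times = page_ce_times[(rank, page)]
        times.append(now_hours)
        # only CEs inside the 72-hour window count toward the page fault
        times[:] = [t for t in times if now_hours - t <= PAGE_CE_WINDOW_HOURS]
        if len(times) >= PAGE_CE_COUNT:
            rank_page_faults[rank] += 1   # page fault -> page-retire attempt
            page_ce_times.pop((rank, page))
            if rank_page_faults[rank] > RANK_PGFLT_MAX:
                print(f"rank {rank}: threshold crossed, fault the rank")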

What does fmdump -V / fmadm faulty show - no fault records at all?
What does "kstat -n page_retire" show?
Post by Paul B. Henson
In another part the above page says "The number of such page_sb faults is
counted for each chip-select, and when any chip-select has more than 64
pages faulted in this way we fault the chip-select with a
fault.memory.generic-x86.dimm_sb", which I think indicates there has to be
more than 64 correctable failures before a fault is generated?
That is correct, but the page faults per chip-select and the
generic-x86.dimm_sb fault apply to AMD Family 0x10 CPUs, which have the
Family 0x10 model-specific support; the gcpu_amd.esc file describes the
rules for this. It does not apply to the x4500's case. (While on Family
0xF the topology is expressed in terms of ranks and DIMMs, on Family
0x10 it is expressed in terms of channels and chip-selects; a poller
reads the NB's online spare control register for correctable ECC counts
per channel/chip-select, and the diagnosis engine faults pages, and
then the chip-select, according to thresholds.)
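
For comparison, here is a similarly rough sketch of that Family 0x10
flow - per-channel/chip-select page-fault counts fed by the poller,
with the 64-page threshold raising the dimm_sb fault. Again, the names
are illustrative only, not the real diagnosis engine:

    from collections import defaultdict

    CS_PGFLT_MAX = 64  # pages faulted per chip-select before dimm_sb

    cs_pages_faulted = defaultdict(int)  # (channel, chip_select) -> count

    def on_poll(channel, chip_select, new_page_faults):
        """Model the poller attributing newly faulted pages."""
        cs = (channel, chip_select)
        cs_pages_faulted[cs] += new_page_faults
        if cs_pages_faulted[cs] > CS_PGFLT_MAX:
            print(f"{cs}: raise fault.memory.generic-x86.dimm_sb")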

-Srihari
Post by Paul B. Henson
Is that correct? Is there a better source of documentation on this?
Thanks...
Paul B. Henson
2009-09-17 01:19:54 UTC
Post by Srihari Venkatesan
at amd64.esc -
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc
Ah, thanks for the pointer to this; I don't particularly understand it yet
either ;), but it does seem to be the definitive source for the behavior.
Post by Srihari Venkatesan
What does fmdump -V / fmadm faulty show - no fault records at all ?
Correct, there are currently no faults.
Post by Srihari Venkatesan
what does " kstat -n page_retire " show ?
module: unix instance: 0
name: page_retire class: misc
crtime 132.970994182
pages_deferred 1
pages_deferred_kernel 0
pages_fma 1
pages_limit 4192
pages_limit_exceeded 0
pages_multiple_ce 0
pages_notdequeued 0
pages_notenqueued 0
pages_pending 0
pages_pending_kas 0
pages_retire_request 0
pages_retire_request_free 0
pages_retired 1
pages_ue 0
pages_ue_cleared_freed 0
pages_ue_cleared_retired 0
pages_ue_persistent 0
pages_unretired 0
snaptime 482866.580166261


Looks like it's only retired one page, so not much of a problem yet. Are
these retired pages kept track of across reboots, or are they all placed in
use again until another error occurs?

Thanks much for the info...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Srihari Venkatesan
2009-09-17 16:15:19 UTC
Post by Paul B. Henson
Post by Srihari Venkatesan
What does fmdump -V / fmadm faulty show - no fault records at all ?
Correct, there are currently no faults.
Post by Srihari Venkatesan
what does " kstat -n page_retire " show ?
module: unix instance: 0
name: page_retire class: misc
crtime 132.970994182
pages_deferred 1
pages_deferred_kernel 0
pages_fma 1
pages_limit 4192
pages_limit_exceeded 0
pages_multiple_ce 0
pages_notdequeued 0
pages_notenqueued 0
pages_pending 0
pages_pending_kas 0
pages_retire_request 0
pages_retire_request_free 0
pages_retired 1
pages_ue 0
pages_ue_cleared_freed 0
pages_ue_cleared_retired 0
pages_ue_persistent 0
pages_unretired 0
snaptime 482866.580166261
Looks like it's only retired one page, so not much of a problem yet. Are
these retired pages kept track of across reboots, or are they all placed in
use again until another error occurs?
Retired pages will become usable again after a reboot (once a page is
retired, it is not released to the free list of pages, so it remains
unusable until the next reboot).

-Srihari
Steve Hanson
2009-09-17 16:35:58 UTC
Post by Srihari Venkatesan
Post by Paul B. Henson
Post by Srihari Venkatesan
What does fmdump -V / fmadm faulty show - no fault records at all ?
Correct, there are currently no faults.
Post by Srihari Venkatesan
what does " kstat -n page_retire " show ?
module: unix instance: 0
name: page_retire class: misc
crtime 132.970994182
pages_deferred 1
pages_deferred_kernel 0
pages_fma 1
pages_limit 4192
pages_limit_exceeded 0
pages_multiple_ce 0
pages_notdequeued 0
pages_notenqueued 0
pages_pending 0
pages_pending_kas 0
pages_retire_request 0
pages_retire_request_free 0
pages_retired 1
pages_ue 0
pages_ue_cleared_freed 0
pages_ue_cleared_retired 0
pages_ue_persistent 0
pages_unretired 0
snaptime 482866.580166261
Looks like it's only retired one page, so not much of a problem yet. Are
these retired pages kept track of across reboots, or are they all placed in
use again until another error occurs?
Retired Pages will become usable after a reboot (once a page is
retired, it is not released to the free list of pages, so remains
unusable till the next reboot)
There is a brief period after a reboot (prior to the fmd daemon
restarting) where the page can be used. However, once fmd starts up, it
will immediately re-retire the page.
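
A hypothetical sketch of that startup behavior - fmd keeping its fault
state in a persistent cache and replaying page-retire requests when it
comes back up. The file name and helpers here are made up for
illustration; this is not the real fmd code:

    import json, os

    STATE = "/tmp/fmd_retired_pages.json"  # hypothetical state file

    def load_retired_pages():
        if os.path.exists(STATE):
            with open(STATE) as f:
                return json.load(f)
        return []

    def fmd_startup(retire_page):
        # Between boot and this point the pages were briefly usable;
        # replaying the cached state closes that window.
        for pfn in load_retired_pages():
            retire_page(pfn)  # immediately re-retire each page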

Steve
Post by Srihari Venkatesan
-Srihari
Paul B. Henson
2009-09-17 20:11:21 UTC
Post by Steve Hanson
There is a brief period after a reboot (prior to the fmd daemon
restarting) where the page can be used. However once fmd starts up it
will immediately re-retire the page.
Ok, so it maintains state. The page will continue to be retired until at
some point the entire DIMM is marked faulty (if enough failures occur to do
that), presumably replaced, and when the fault is marked as resolved the
pages will no longer be retired? What happens if the memory is swapped out,
for example a memory upgrade? Will it notice that the DIMM has changed and
reset the fault information? Do DIMMs have serial numbers? Hypothetically
if the DIMM is swapped out for the exact same model (for whatever reason)
will fm know that it is a different DIMM and reset the fault state?
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Rob Johnston
2009-09-17 22:43:41 UTC
Post by Paul B. Henson
Post by Steve Hanson
There is a brief period after a reboot (prior to the fmd daemon
restarting) where the page can be used. However once fmd starts up it
will immediately re-retire the page.
Ok, so it maintains state. The page will continue to be retired until at
some point the entire DIMM is marked faulty (if enough failures occur to do
that), presumably replaced, and when the fault is marked as resolved the
pages will no longer be retired? What happens if the memory is swapped out,
for example a memory upgrade? Will it notice that the DIMM has changed and
reset the fault information? Do DIMMs have serial numbers? Hypothetically
if the DIMM is swapped out for the exact same model (for whatever reason)
will fm know that it is a different DIMM and reset the fault state?
Yes, DIMMs do have serial numbers, and FMA is able to use the serial
number to detect DIMM replacement and effect an automatic repair of
memory faults on most platforms - with the definition of "most
platforms" being Intel and SPARC platforms and a subset of AMD
platforms. I wrote about how some of this works in the following blog
entry:

http://blogs.sun.com/robj/entry/fma_and_dimm_serial_numbers
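
As a rough model of the check Rob describes - compare the serial
recorded in the open fault case against the serial now in the slot,
and auto-repair on mismatch (all names here are illustrative, not the
real FMA code):

    def check_dimm_replacement(open_cases, read_dimm_serial):
        """open_cases: {slot_label: recorded_serial} for faulted DIMMs;
        read_dimm_serial(slot) returns the serial currently in the slot."""
        for slot, recorded in list(open_cases.items()):
            current = read_dimm_serial(slot)
            if current is not None and current != recorded:
                print(f"{slot}: DIMM replaced, auto-repairing the fault")
                del open_cases[slot]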

rob

Gavin Maltby
2009-09-16 07:25:15 UTC
Hi,
Post by Paul B. Henson
Sep 11 15:12:08.7007 ereport.cpu.amd.nb.mem_ce
Sep 12 03:27:19.1011 ereport.cpu.amd.nb.mem_ce
Sep 12 09:34:49.3013 ereport.cpu.amd.nb.mem_ce
Sep 12 15:42:19.4911 ereport.cpu.amd.nb.mem_ce
Sep 12 21:49:59.6950 ereport.cpu.amd.nb.mem_ce
Sep 13 03:57:29.8841 ereport.cpu.amd.nb.mem_ce
Sep 13 10:05:00.0976 ereport.cpu.amd.nb.mem_ce
Sep 13 16:12:40.2817 ereport.cpu.amd.nb.mem_ce
Sep 13 22:20:10.4972 ereport.cpu.amd.nb.mem_ce
Sep 14 10:35:20.8700 ereport.cpu.amd.nb.mem_ce
Sep 14 22:50:21.2817 ereport.cpu.amd.nb.mem_ce
Sep 15 04:58:01.4661 ereport.cpu.amd.nb.mem_ce
Sep 15 11:05:31.6787 ereport.cpu.amd.nb.mem_ce
Nothing's been reported as faulted, but they are occurring on a pretty
regular basis. I was digging around trying to find out what these mean; the
only thing I really found was:
http://opensolaris.org/os/project/generic-mca/docs/portfolio/diagnosis/
I think they're correctable ECC errors? The link above indicates they are
diagnosed by "amd64.esc", but I haven't been able to find any details on
how many correctable errors have to occur before something is considered
faulted.
That link documents the generic machine check work (mostly broadening
it to Intel chips), so it doesn't describe the "legacy" AMD work in
detail.

Yes, these are memory ECC errors.
Post by Paul B. Henson
In another part the above page says "The number of such page_sb faults is
counted for each chip-select, and when any chip-select has more than 64
pages faulted in this way we fault the chip-select with a
fault.memory.generic-x86.dimm_sb", which I think indicates there has to be
more than 64 correctable failures before a fault is generated?
Again that is describing the later generic work, in this case for
AMD family 0x10 (you have family 0xf, the original Opteron family).

You shouldn't have to understand the thresholds applied - the whole
idea is that the software decides for you once there are enough errors
to be a problem. There are some corner cases where you may want to
look into early replacement - for example, if your error log were
filling up with many, many such ereports (which it isn't, I think,
since the times above are sporadic, not every 10 seconds as some form
of stuck-at fault might produce).

If the above is the sum total of the nb.mem_ce events logged, then you
have no problem, i.e. nothing to replace. A smallish number of
correctable errors from a DIMM is not uncommon, nor even unhealthy
(Solaris reports them, but many other OSes do not; the firmware of
third-party systems often will not report any until a threshold is met
and replacement is deemed necessary).

The AMD algorithm observes ECC errors for each DIMM rank and begins
to retire pages from further use, counting the number retired for
each DIMM/rank. The hope is that by attempting not to use the page
(we put it on a kernel page list that we never retrieve from) the
errors will stop. If the number of pages retired for a particular
rank on a DIMM exceeds some threshold (minimum 128 pages, scaled by
DIMM size but not exceeding 512, I think) then we fault the DIMM.
There's a lot of brute force in that algorithm, but it tends to
get to the correct answer, so we've never gotten around to
implementing some of the more sophisticated syndrome-analysis
algorithms used on some SPARC platforms. You can do that analysis
manually by looking at fmdump -eV output - each ereport has a
resource member which identifies the DIMM and rank, and other
payload tells you the syndrome-type (E for 64/8 ECC, C for 128/16
ChipKill) and the syndrome. You can see the counts to date using
fmstat -m eft.
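
For a concrete reading of that threshold, here is a small sketch; the
128-page floor and ~512 cap are from the paragraph above, but the
linear-with-size scaling rule and the 1 GiB reference size are my
assumptions:

    def rank_pgflt_threshold(dimm_bytes, base_dimm_bytes=1 << 30):
        """Scale the 128-page minimum with DIMM size, capped at 512.
        base_dimm_bytes (1 GiB) is a hypothetical reference size."""
        scaled = 128 * max(1, dimm_bytes // base_dimm_bytes)
        return min(scaled, 512)

    # e.g. rank_pgflt_threshold(2 << 30) -> 256; (4 << 30) -> 512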

On an X4500, if and when we do fault a DIMM, fmadm faulty (on
OpenSolaris, and on more recent S10 updates) will include the DIMM
FRU label, such as "CPU 1 DIMM 2".

Hope that helps. If you need to know more, then mail the output of
fmdump -eV to me (or to the alias, if it's not too big).

Cheers

Gavin
Post by Paul B. Henson
Is that correct? Is there a better source of documentation on this?
Thanks...