Paul B. Henson
2009-09-16 04:39:42 UTC
I've been seeing a smattering of these in the error logs on an x4500:
Sep 11 15:12:08.7007 ereport.cpu.amd.nb.mem_ce
Sep 12 03:27:19.1011 ereport.cpu.amd.nb.mem_ce
Sep 12 09:34:49.3013 ereport.cpu.amd.nb.mem_ce
Sep 12 15:42:19.4911 ereport.cpu.amd.nb.mem_ce
Sep 12 21:49:59.6950 ereport.cpu.amd.nb.mem_ce
Sep 13 03:57:29.8841 ereport.cpu.amd.nb.mem_ce
Sep 13 10:05:00.0976 ereport.cpu.amd.nb.mem_ce
Sep 13 16:12:40.2817 ereport.cpu.amd.nb.mem_ce
Sep 13 22:20:10.4972 ereport.cpu.amd.nb.mem_ce
Sep 14 10:35:20.8700 ereport.cpu.amd.nb.mem_ce
Sep 14 22:50:21.2817 ereport.cpu.amd.nb.mem_ce
Sep 15 04:58:01.4661 ereport.cpu.amd.nb.mem_ce
Sep 15 11:05:31.6787 ereport.cpu.amd.nb.mem_ce
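To put a number on how regular these are, the gaps between consecutive ereports can be computed from the timestamps. This is only a rough sketch: the year 2009 is assumed from the date of this post (the log lines carry no year), the helper names are made up for illustration, and parsing "Sep" with %b assumes an English/C locale.

```python
from datetime import datetime

# The ereport lines quoted above, pasted verbatim.
LOG = """\
Sep 11 15:12:08.7007 ereport.cpu.amd.nb.mem_ce
Sep 12 03:27:19.1011 ereport.cpu.amd.nb.mem_ce
Sep 12 09:34:49.3013 ereport.cpu.amd.nb.mem_ce
Sep 12 15:42:19.4911 ereport.cpu.amd.nb.mem_ce
Sep 12 21:49:59.6950 ereport.cpu.amd.nb.mem_ce
Sep 13 03:57:29.8841 ereport.cpu.amd.nb.mem_ce
Sep 13 10:05:00.0976 ereport.cpu.amd.nb.mem_ce
Sep 13 16:12:40.2817 ereport.cpu.amd.nb.mem_ce
Sep 13 22:20:10.4972 ereport.cpu.amd.nb.mem_ce
Sep 14 10:35:20.8700 ereport.cpu.amd.nb.mem_ce
Sep 14 22:50:21.2817 ereport.cpu.amd.nb.mem_ce
Sep 15 04:58:01.4661 ereport.cpu.amd.nb.mem_ce
Sep 15 11:05:31.6787 ereport.cpu.amd.nb.mem_ce
"""


def parse_line(line, year=2009):
    """Split one log line into (timestamp, ereport class).

    The year is an assumption; the log lines do not include one.
    """
    month, day, clock, ereport_class = line.split()
    stamp = datetime.strptime(
        f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f"
    )
    return stamp, ereport_class


def gaps(log_text, year=2009):
    """Return the time differences between consecutive log lines."""
    stamps = [parse_line(l, year)[0] for l in log_text.splitlines() if l.strip()]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in gaps(LOG):
        print(gap)
```

On the lines quoted above, every gap comes out at either roughly 6 hours 7 minutes or roughly twice that.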
Nothing's been reported as faulted, but they are occurring on a pretty
regular basis. I was digging around trying to find out what these mean; the
only thing I really found was:
http://opensolaris.org/os/project/generic-mca/docs/portfolio/diagnosis/
I think they're correctable ECC errors? The link above indicates they are
diagnosed by "amd64.esc", but I haven't been able to find any details on
how many correctable errors have to occur before something is considered
faulted.
Elsewhere, the above page says "The number of such page_sb faults is
counted for each chip-select, and when any chip-select has more than 64
pages faulted in this way we fault the chip-select with a
fault.memory.generic-x86.dimm_sb", which I think indicates there has to be
more than 64 correctable failures before a fault is generated?
Is that correct? Is there a better source of documentation on this?
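To make my reading of that quoted sentence concrete, here is a minimal sketch of the counting rule as I understand it. It is not the actual amd64.esc logic, which I haven't seen: the class and method names are invented for illustration, counting distinct pages is my assumption, and the sketch says nothing about how many correctable errors it takes for a page to get a page_sb fault in the first place.

```python
from collections import defaultdict

# Taken from the quoted sentence: "more than 64 pages faulted".
PAGE_FAULT_LIMIT = 64


class ChipSelectCounter:
    """Hypothetical model of the per-chip-select page_sb count."""

    def __init__(self, limit=PAGE_FAULT_LIMIT):
        self.limit = limit
        # chip-select id -> set of pages that have a page_sb fault
        self.faulted_pages = defaultdict(set)

    def record_page_fault(self, chip_select, page):
        """Record a page_sb fault on a page.

        Returns True once the chip-select has more than `limit` distinct
        faulted pages, i.e. when the quoted rule would raise
        fault.memory.generic-x86.dimm_sb. Counting distinct pages is an
        assumption, not something the page states.
        """
        pages = self.faulted_pages[chip_select]
        pages.add(page)
        return len(pages) > self.limit


if __name__ == "__main__":
    counter = ChipSelectCounter()
    for page in range(65):
        tripped = counter.record_page_fault("cs0", page)
    print("dimm_sb fault after 65 pages:", tripped)
```

Read literally, the threshold in that sentence is on faulted pages per chip-select rather than on individual mem_ce ereports, so the two counts may not be the same thing.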
Thanks...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768