Discussion:
errlog growing out of control; PCIe errors on NHM/IOH mobo
Chris Worley
2009-07-08 18:56:59 UTC
(Sorry for the misleading "Subject" in the initial post. I would like
to know a more appropriate place to post, since fm is just the
messenger here.)

More to add: fmadm faulty may be saying something about a bad PCIe
slot or device (is there an "lspci" in OpenSolaris?):

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c PCIEX-8000-KP Major

Fault class : fault.io.pciex.device-interr-corr max 29%
fault.io.pciex.bus-linkerr-corr max 14%
Affects : dev:////***@0,0/pci8086,***@1/pci15d9,***@0
dev:////***@0,0/pci8086,***@1/pci15d9,***@0,1
dev:////***@0,0/pci8086,***@1
faulted but still in service
FRU : "MB"
(hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
faulty

Description : Too many recovered bus errors have been detected, which indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an unrecoverable
fault.
Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with
this fault

Action : If a plug-in card is involved check for badly-seated cards or
bent pins. Otherwise schedule a repair procedure to replace the
affected device. Use fmadm faulty to identify the device or
contact Sun for support.

How bad is this error? I need to put some adapters in, but it sounds
like the OS doesn't handle the NHM's IOH (or is it really detecting a
HW issue?).

It would also be nice to throttle the errlog so it doesn't fill the
disk an hour after boot. Is this possible?

Thanks,

Chris
Please tell me if this is the wrong group to post to (including a
better group to post to)...
http://supermicro.com/products/motherboard/QPI/5500/X8DTH-6F.cfm
...in order to get the latest igb driver to recognize the NIC.
The upgrade worked for that, but on boot, the cylon-stare
"OpenSolaris" splash screen doesn't go away w/o hitting "escape", and
I get a message "svc.startd: system/xvm/ipagent: default failed
repeatedly" and  "...failed to abandon contract 66: permission denied"
in the console.
"svcs -xv" returns nothing.
/var/fm/fmd/errlog is growing out of control, and "fmdump -e" is
Jul 08 11:17:04.3593 ereport.io.pciex.dl.btlp
Jul 08 11:17:05.0165 ereport.io.pci.fabric
Jul 08 11:17:04.3595 ereport.io.pciex.dl.rto
Jul 08 11:17:04.3595 ereport.io.pciex.rc.ce-msg
# fmdump  ;fmdump  -eVu 016cf20c-d572-42c1-f217-9eb8d439b73c
TIME                 UUID                                 SUNW-MSG-ID
Jul 07 07:55:42.6832 016cf20c-d572-42c1-f217-9eb8d439b73c PCIEX-8000-KP
TIME                           CLASS
/var/adm/messages doesn't show any errors.
I had other issues w/ the MGA driver. It worked before the upgrade,
but not after. Deleting the driver falls back to the vesa driver, which
works. I don't know if that's salient to this issue, but thought I'd
make sure to relay it.
Can anybody tell me what's wrong, how to fix it, or how I should
investigate further?
Thanks,
Chris
Erwin Tsaur
2009-07-08 19:10:24 UTC
Post by Chris Worley
How bad is this error? I need to put some adapters in, but it sounds
like the OS doesn't handle the NHM's IOH (or is it really detecting a
HW issue?).
The OS does handle these issues, and unfortunately it is a HW issue. This is
likely to eventually cause your system to panic or fill up your hard
drive, assuming you are seeing a lot of btlp and rto errors. If
anything, these errors are a performance killer. Not only is the RTO/BTLP
error telling you that many packets require retransmission, the OS also has
to constantly go out and scan and clean up the fabric.
Post by Chris Worley
It would also be nice to throttle the errlog so it doesn't fill the
disk an hour after boot. Is this possible?
No throttling is possible, but you could turn the reporting off, though
that is strongly discouraged; it's better to fix the issue. It really
could just be a badly seated card.
_______________________________________________
fm-discuss mailing list
Chris Worley
2009-07-08 19:56:22 UTC
This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.

The errors in OpenSolaris occur even if no cards are installed in the bus.

The other OSes don't report any errors w/ or w/o cards in the bus.
Post by Erwin Tsaur
No throttling is possible, but you could turn the reporting off, though
that is strongly discouraged; it's better to fix the issue. It really
could just be a badly seated card.
How do I disable the errors?

Thanks,

Chris
Erwin Tsaur
2009-07-08 20:42:55 UTC
Post by Chris Worley
This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.
The errors in OpenSolaris occur if no cards are installed in the bus.
The other OSes don't report any errors w/ or w/o cards in the bus.
This doesn't happen when there are no cards installed, since the error
is literally complaining about packets received between 2 devices.
Are you sure you are correctly identifying the right slot?

I believe only OpenSolaris even detects these errors, which is why the
other OSes don't report any errors. It doesn't mean that errors aren't
occurring though.
Post by Chris Worley
How do I disable the errors?
We need to figure out exactly what your error is first; please provide
the "fmdump -eV" log. If it is huge, the last 500-1000 lines
should be enough.
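Before sending a multi-million-line log, it can help to see which ereport classes dominate. The following is only a rough sketch: it assumes the two-column "fmdump -e" layout quoted earlier in this thread (a timestamp followed by the class as the last field), and the function name count_ereport_classes is made up for illustration.

```python
from collections import Counter

def count_ereport_classes(lines):
    """Count ereport classes in `fmdump -e` style lines (class is the last field)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and the "TIME CLASS" header; keep only ereport rows.
        if fields and fields[-1].startswith("ereport."):
            counts[fields[-1]] += 1
    return counts

# Sample rows copied from the output quoted in this thread.
sample = """\
TIME                 CLASS
Jul 08 11:17:04.3593 ereport.io.pciex.dl.btlp
Jul 08 11:17:05.0165 ereport.io.pci.fabric
Jul 08 11:17:04.3595 ereport.io.pciex.dl.rto
Jul 08 11:17:04.3595 ereport.io.pciex.rc.ce-msg
""".splitlines()

for cls, n in count_ereport_classes(sample).most_common():
    print(n, cls)
```

On a real system the same function could be fed the lines of "fmdump -e" output instead of the sample; a log dominated by dl.btlp and dl.rto would match the correctable link errors discussed above.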
Chris Worley
2009-07-08 20:52:19 UTC
Post by Erwin Tsaur
We need to figure out exactly what your error is first; please provide the
"fmdump -eV" log. If it is huge, the last 500-1000 lines should be enough.
Would that produce the same as the incantation shown earlier?:

fmdump ;fmdump -eVu 016cf20c-d572-42c1-f217-9eb8d439b73c
TIME UUID SUNW-MSG-ID
Jul 07 07:55:42.6832 016cf20c-d572-42c1-f217-9eb8d439b73c PCIEX-8000-KP
TIME CLASS

(... as the system is currently running benchmarks on another OS.)

Thanks,

Chris
Erwin Tsaur
2009-07-08 20:57:13 UTC
I think all the CEs would produce the same message. I also need to
know the exact device.
Chris Worley
2009-07-08 21:20:08 UTC
Last 1000 lines (of ~20 million) attached.

There are some boards in the bus at this time, but the same error
occurs w/o them, and their drivers are not yet loaded. Everything
else is built-in to the motherboard.

Thanks,

Chris
<snip>
Chris Worley
2009-07-08 21:36:43 UTC
Permalink
Post by Chris Worley
Post by Chris Worley
(Sorry for the misleading "Subject" in the initial post.  would like
to know a more appropriate place to post, since fm is just the
messenger here.)
More to add: fmadm faulty may be saying something about a bad PCIe
# fmadm faulty
--------------- ------------------------------------  --------------
---------
TIME            EVENT-ID                              MSG-ID
SEVERITY
--------------- ------------------------------------  --------------
---------
Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c  PCIEX-8000-KP  Major
Fault class : fault.io.pciex.device-interr-corr max 29%
           fault.io.pciex.bus-linkerr-corr max 14%
               faulted but still in service
FRU         : "MB"
(hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
               faulty
Description : Too many recovered bus errors have been detected, which indicates
           a problem with the specified bus or with the specified
           transmitting device. This may degrade into an unrecoverable
           fault.
           Refer to http://sun.com/msg/PCIEX-8000-KP for more
information.
Response    : One or more device instances may be disabled
Impact      : Loss of services provided by the device instances associated
with
           this fault
Action      : If a plug-in card is involved check for badly-seated cards
or
           bent pins. Otherwise schedule a repair procedure to replace
the
           affected device.  Use fmadm faulty to identify the device
or
           contact Sun for support.
How bad is this error?  I need to put some adapters in, but it sounds
like the OS doesn't handle the NHM's IOH (or is it really detaining a
HW issue?).
OS does handle these issues and unfortunately it is a HW issue.  This is
likely to eventually cause your system to panic or fill up your hard drive.
 Assuming you are seeing a lot of btlp and rto errors..  If anything these
errors are performance killer.  Not only is the RTO/BTLP error telling you
that many packets require retransmit, the OS also has to constantly go out
and scan and clean up the fabric.
This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.
The errors in OpenSolaris occur if no cards are installed in the bus.
The other OSes don't report any errors w/ or w/o cards in the bus.
This doesn't happen when there are no cards installed, since the error is
literally complaining about packets received between 2 devices.  Are you
sure you are correctly identifying the right slot?
I believe only OpenSolaris even detects these errors, which is why the other
OSes don't report any errors.  It doesn't mean that errors aren't occurring,
though.
Post by Chris Worley
It would also be nice to throttle the errlog so it doesn't fill the
disk an hour after boot.  Is this possible?
No throttling is possible, but you could turn it off. That is highly
discouraged, though; it's better to fix the issue.  It really could just be a
badly seated card.
How do I disable the errors?
We need to figure out exactly what your error is first, so please provide the
"fmdump -eV" log.  If it is huge, the last 500-1000 lines should be enough.
I think all the CEs would produce the same message.  I also need to know
the exact device.
Last 1000 lines (of ~20 million) attached.
There are some boards in the bus at this time, but the same error
occurs w/o them, and their drivers are not yet loaded.  Everything
else is built-in to the motherboard.
I rebooted w/o any adapters in any slots; the last 1000 lines are attached.

Thanks

Chris
Erwin Tsaur
2009-07-08 21:42:24 UTC
Permalink
ok I severely underestimated how much 1000 lines are, but I got the info
I needed, mostly. Now I'm wondering if there is an erratum in the Root
Port causing this.

/***@0,0/pci8086,***@1/pci15d9,10c9, I know it's a NIC device. Both
nodes are complaining of CE errors.

The best way to disable reporting those 2 CEs is to add the following
line in the driver's .conf file.

pcie_ce_mask=0x1040;

It requires reboot.

If you need to do it on a live system let me know, the instructions are
a bit more complicated.
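
For anyone wondering what 0x1040 actually selects, here is a minimal Python sketch that decodes the value. It assumes pcie_ce_mask uses the same bit layout as the PCIe AER Correctable Error Status/Mask registers, which the thread does not state outright. The names AER_CE_BITS and decode_ce_mask are made up for illustration.

```python
# Bit layout of the PCIe AER Correctable Error Status/Mask registers,
# as defined in the PCI Express Base Specification.
# ASSUMPTION: pcie_ce_mask uses this same layout; the thread does not
# say so explicitly. AER_CE_BITS and decode_ce_mask are illustrative names.
AER_CE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP (BTLP)",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout (RTO)",
    13: "Advisory Non-Fatal Error",
}


def decode_ce_mask(mask):
    """Return the names of the correctable errors selected by mask."""
    return [name for bit, name in sorted(AER_CE_BITS.items())
            if mask & (1 << bit)]


if __name__ == "__main__":
    for name in decode_ce_mask(0x1040):
        print(name)
```

Under that assumption, 0x1040 is bit 6 plus bit 12, i.e. Bad TLP and Replay Timer Timeout, which lines up with the BTLP and RTO errors mentioned earlier in the thread.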
Chris Worley
2009-07-08 22:03:56 UTC
Permalink
Post by Erwin Tsaur
ok I severely underestimated how much 1000 lines are, but I got the info I
needed, mostly.  Now I'm wondering if there is an erratum in the Root Port
causing this.
Both nodes are complaining of CE errors.
The best way to disable reporting those 2 CEs is to add the following
line in the driver's .conf file.
This is the igb driver (SUNWigb package). It doesn't have a conf file.

It's for the Intel 82576 Dual-Port GigE on-board NIC. This driver
didn't work prior to the update.

So, I made a .conf file thusly and rebooted:

/usr/kernel/drv# cat >igb.conf
pcie_ce_mask=0x1040;

... no difference. Still lots of errors reported (attached last 1000 lines).

Chris
Erwin Tsaur
2009-07-08 22:29:15 UTC
Permalink
Post by Chris Worley
This is the igb driver (SUNWigb package). It doesn't have a conf file.
It's for the Intel 82576 Dual-Port GigE on-board NIC. This driver
didn't work prior to the update.
/usr/kernel/drv# cat >igb.conf
pcie_ce_mask=0x1040;
... no difference. Still lots of errors reported (attached last 1000 lines).
It's not picking up the conf property...
According to the pkgdef, I think the correct place is /kernel/drv/igb.conf

It should be in the same place as the igb driver.
Chris Worley
2009-07-08 22:49:12 UTC
Permalink
Post by Erwin Tsaur
It's not picking up the conf property...
According to the pkgdef, I think the correct place is /kernel/drv/igb.conf
It should be in the same place as the igb driver.
Okay, it was there, and I changed it and rebooted:

***@opensolaris:~# tail /kernel/drv/igb.conf
# For example, if you see,
# "/***@0,0/pci10de,***@d/pci8086,***@0" 0 "igb"
# "/***@0,0/pci10de,***@d/pci8086,***@0,1" 1 "igb"
#
# name = "pciex8086,10a7" parent = "/***@0,0/pci10de,***@d" unit-address = "0"
# flow_control = 1;
# name = "pciex8086,10a7" parent = "/***@0,0/pci10de,***@d" unit-address = "0,1"
# flow_control = 3;
pcie_ce_mask=0x1040;

Still, no joy... the last 1K lines attached.

Thanks,

Chris
Erwin Tsaur
2009-07-08 23:02:45 UTC
Permalink
I didn't realize that the Root Port was seeing the same thing.. :(

Add the same line to pcie_pci.conf. This will mask 0x1040 on all the RPs.
You can also limit which RPs' CEs get turned off; see the "driver.conf"
man page.

Good news is that the leaf device is no longer spamming with CEs.

I have to warn again: though they are technically CEs and no damage was
done, there are probably performance impacts. Masking the CEs won't
correct the underlying problem, but it will save your hard drive and also
significantly improve performance, since the OS won't be interrupted
hundreds of times a second. Unfortunately I don't know of any fix, unless
there is a vendor-specific method to fix the underlying HW issue.
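
To see which devices are still reporting after a mask change (leaf NIC versus Root Port), the tailed log can be tallied per device and error class. This is a rough Python sketch. It assumes the "fmdump -eV" layout where each record starts with an unindented "timestamp class" line and carries an indented "device-path = ..." field. The function name tally_ereports is hypothetical.

```python
import sys
from collections import Counter


def tally_ereports(lines):
    """Count ereports per (device-path, class) in `fmdump -eV` style text.

    ASSUMPTION: each record begins with an unindented header line whose
    last token is the ereport class, and the detector's device path shows
    up on an indented "device-path = ..." line inside the record.
    """
    counts = Counter()
    cls, dev = None, None
    for line in lines:
        stripped = line.strip()
        # Skip blank lines, nvlist markers, and the column header.
        if not stripped or stripped.startswith(("nvlist version", "TIME ")):
            continue
        if not line[0].isspace():
            # New record header: flush the previous record first.
            if cls is not None:
                counts[(dev or "<unknown>", cls)] += 1
            cls, dev = stripped.split()[-1], None
        elif dev is None and stripped.startswith("device-path = "):
            dev = stripped.split(" = ", 1)[1]
    if cls is not None:
        counts[(dev or "<unknown>", cls)] += 1
    return counts


if __name__ == "__main__":
    for (dev, cls), n in tally_ereports(sys.stdin).most_common():
        print(f"{n:8d}  {dev}  {cls}")
```

Saved under a name of your choosing, it could be fed with something like "fmdump -eV | tail -1000" piped into the script. A device path that drops out of the tally after a reboot is no longer logging CEs.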
Chris Worley
2009-07-08 23:19:13 UTC
Permalink
Post by Erwin Tsaur
I didn't realize that the Root Port was seeing the same thing.. :(
Add the same line to pcie_pci.conf
I did, and rebooted:

# tail /kernel/drv/pcie_pci.conf
# Force load driver to support hotplug activity
#
ddi-forceattach=1;

#
# force interrupt priorities to be one
# otherwise this driver binds as bridge device with priority 12
#
interrupt-priorities=1;
pcie_ce_mask=0x1040;

... still errors being logged (see attached).
Post by Erwin Tsaur
Good news is that the leaf device is no longer spamming with CEs.
Yes, I've made a lot of progress so far on this, thanks!
Post by Erwin Tsaur
You can also limit which RPs' CEs get turned off; see the "driver.conf" man
page.  This will mask 0x1040 on all the RPs.
I have to warn again: though they are technically CEs and no damage was
done, there are probably performance impacts.  Masking the CEs won't
correct it but will save your hard drive and also significantly improve
performance since the OS won't be interrupted hundreds of times a second.
Unfortunately I don't know of any fix, unless there is a vendor-specific
method to fix the underlying HW issue.
Intel is usually pretty good about fixing issues like this (unless
it's caused by Supermicro's layout).

I'm not too worried about the NIC's performance, as long as it works...
I do need to measure system performance in other respects, so
decreasing the shower of interrupts (CPU overhead) is important.

Chris
Erwin Tsaur
2009-07-08 23:40:01 UTC
Permalink
Post by Chris Worley
Post by Erwin Tsaur
I didn't realize that the Root Port was seeing the same thing.. :(
Add the same line to pcie_pci.conf
# tail /kernel/drv/pcie_pci.conf
# Force load driver to support hotplug activity
#
ddi-forceattach=1;
#
# force interrupt priorities to be one
# otherwise this driver binds as bridge device with priority 12
#
interrupt-priorities=1;
pcie_ce_mask=0x1040;
... still errors being logged (see attached).
geesh.. yet another type of CE. Change the mask from 0x1040 to 0x1041.
If you get tired of this, change the mask to -1. :)

With this you shouldn't see any more ereports from these devices, unless
there was a UE from that link.
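
As a quick sanity check on the two values, a small Python sketch follows. It assumes the mask is treated as a 32-bit register value laid out like the PCIe AER Correctable Error Mask register, where bit 0 is Receiver Error, and that -1 is taken as "all bits set". The helper names are invented for illustration.

```python
OLD_MASK = 0x1040
NEW_MASK = 0x1041


def newly_masked_bits(old, new):
    """Return the bit positions that are set in new but not in old."""
    added = new & ~old
    return [bit for bit in range(32) if added & (1 << bit)]


def as_u32(value):
    """View a (possibly negative) integer as a 32-bit register value.

    ASSUMPTION: the mask is consumed as a 32-bit quantity, so -1 means
    every bit set; the thread only says -1 silences the ereports.
    """
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    print(newly_masked_bits(OLD_MASK, NEW_MASK))
    print(hex(as_u32(-1)))
```

So the step from 0x1040 to 0x1041 adds only bit 0, which is Receiver Error in the AER layout, while -1 would mask every correctable error bit at once.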
Post by Chris Worley
Intel is usually pretty good about fixing issues like this (unless
it's caused by Supermicro's layout).
Intel is pretty good about this. These are low level physical layer
errors, so it could very well be a layout issue.
FRU : "MB"
(hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
faulty
Description : Too many recovered bus errors have been detected,
which
indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an unrecoverable
fault.
Refer to http://sun.com/msg/PCIEX-8000-KP for more
information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances
associated
with
this fault
Action : If a plug-in card is involved check for badly-seated
cards
or
bent pins. Otherwise schedule a repair procedure to
replace
the
affected device. Use fmadm faulty to identify the
device
or
contact Sun for support.
How bad is this error? I need to put some adapters in, but it
sounds
like the OS doesn't handle the NHM's IOH (or is it really
detaining
a
HW issue?).
OS does handle these issues and unfortunately it is a HW issue.
This
is
likely to eventually cause your system to panic or fill up your hard
drive.
Assuming you are seeing a lot of btlp and rto errors.. If
anything
these
errors are performance killer. Not only is the RTO/BTLP error
telling
you
that many packets require retransmit, the OS also has to constantly
go
out
and scan and clean up the fabric.
This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.
The errors in OpenSolaris occur if no cards are installed in the bus.
The other OSes don't report any errors w/ or w/o cards in the bus.
This doesn't happen when there are no cards installed, since the error
is
literally complaining about a packets received between 2 devices.
Are
you
sure it's you are correctly identifying the right slot?
I believe only OpenSolaris even detects these errors, which is why the
other
OSes don't report any errors. It doesn't mean that errors aren't
occurring
though.
Post by Chris Worley
Post by Erwin Tsaur
Post by Chris Worley
It would also be nice to throttle the errlog so it doesn't fill the
disk an hour after boot. Is this possible?
no throttling possible, but you could turn it off, though highly not
recommended, it's better to fix the issue. It really could just
be
a
badly
seated card.
How do I disable the errors?
We need to figure out exactly what your error is first, please provide
the
"fmdump -eV" log. If it is huge, just tail the last 500-1000 lines
should
be enough.
I think all the CE's would produce the same message. I also need to know
the exact device.
Last 1000 lines (of ~20 million) attached.
There are some boards in the bus at this time, but the same error
occurs w/o them, and their drivers are not yet loaded. Everything
else is built-in to the motherboard.
Thanks,
Chris
Post by Erwin Tsaur
<snip>
Chris Worley
2009-07-08 23:45:31 UTC
Permalink
Post by Chris Worley
Post by Erwin Tsaur
I didn't realize that the Root Port was seeing the same thing.. :(
Add the same line to pcie_pci.conf
# tail /kernel/drv/pcie_pci.conf
# Force load driver to support hotplug activity
#
ddi-forceattach=1;
#
# force interrupt priorities to be one
# otherwise this driver binds as bridge device with priority 12
#
interrupt-priorities=1;
pcie_ce_mask=0x1040;
... still errors being logged (see attached).
geesh.. yet another type of CE.  Change the mask from 0x1040 to 0x1041.  If
you get tired of this, change the mask to -1. :)
With this you shouldn't see any more ereports from these devices, unless
there was a UE from that link.
Just so I might learn something, how are you correlating the errors to
the bitmask?

Thanks,

Chris
Post by Chris Worley
Post by Erwin Tsaur
Good news is that the leaf device is no longer spamming with CEs.
Yes, I've made a lot of progress so far on this, thanks!
Post by Erwin Tsaur
You can also limit which RP's CE's get turned off.  see "driver.conf" man
page.  This will mask 0x1040 on all the RPs.
I have to warn again, though they are technically CEs and no damage was
done, there are probably performance impacts.  Masking the CE's won't
correct it but will save your harddrive and also significantly improve
performance since the OS won't be interrupted hundreds of times a second.
 Unfortunately don't know of any fix, unless there is a vendor specific
method to fix the underlying HW issue.
Intel is usually pretty good about fixing issues like this (unless
it's caused by Supermicro's layout).
Intel is pretty good about this.  These are low level physical layer errors,
so it could very well be a layout issue.
Post by Chris Worley
I'm not to worried about the NIC's performance, as long as it works...
I do need to measure system performance in other respects, so
decreasing the shower of interrupts (CPU overhead) is important.
Chris
Post by Erwin Tsaur
Post by Chris Worley
Post by Erwin Tsaur
Post by Erwin Tsaur
ok I severely underestimated how much 1000 lines are, but I got the info
I
needed, mostly.  Now I'm wondering if there is an errata in the Root Port
causing this.
are complaining of CE errors.
The best way is to disable reporting those 2 CEs is to add the following
line in the driver's .conf file.
This is the igb driver (SUNWigb package).  It doesn't have a conf file.
It's for the Intel 82576 Dual-Port GigE on-board NIC.  This driver
didn't work prior to the update.
/usr/kernel/drv# cat >igb.conf
pcie_ce_mask=0x1040;
... no difference.  Still lots of errors reported (attached last 1000
lines).
It's not picking up the conf property...
According to the pkgdef, I think the correct place is
/kernel/drv/igb.conf
It should be in the same place as the igb driver.
# For example, if you see,
#
=
"0"
# flow_control = 1;
=
"0,1"
# flow_control = 3;
pcie_ce_mask=0x1040;
Still, no joy... the last 1K lines attached.
Thanks,
Chris
Post by Erwin Tsaur
Chris
Post by Erwin Tsaur
pcie_ce_mask=0x1040;
It requires reboot.
If you need to do it on a live system let me know, the instructions are
a
bit more complicated.
Post by Chris Worley
Post by Erwin Tsaur
Post by Chris Worley
Post by Erwin Tsaur
Post by Chris Worley
(Sorry for the misleading "Subject" in the initial post.
 would
like
to know a more appropriate place to post, since fm is just the
messenger here.)
More to add: fmadm faulty may be saying something about a bad
PCIe
# fmadm faulty
--------------- ------------------------------------
 --------------
---------
TIME            EVENT-ID                              MSG-ID
SEVERITY
--------------- ------------------------------------
 --------------
---------
Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c
 PCIEX-8000-KP
 Major
Fault class : fault.io.pciex.device-interr-corr max 29%
       fault.io.pciex.bus-linkerr-corr max 14%
           faulted but still in service
FRU         : "MB"
(hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
           faulty
Description : Too many recovered bus errors have been detected,
which
indicates
       a problem with the specified bus or with the specified
       transmitting device. This may degrade into an
unrecoverable
       fault.
       Refer to http://sun.com/msg/PCIEX-8000-KP for more
information.
Response    : One or more device instances may be disabled
Impact      : Loss of services provided by the device
instances
associated
with
       this fault
Action      : If a plug-in card is involved check for
badly-seated
cards
or
       bent pins. Otherwise schedule a repair procedure to
replace
the
       affected device.  Use fmadm faulty to identify the
device
or
       contact Sun for support.
How bad is this error?  I need to put some adapters in, but it
sounds
like the OS doesn't handle the NHM's IOH (or is it really
detaining
a
HW issue?).
OS does handle these issues and unfortunately it is a HW issue.
 This
is
likely to eventually cause your system to panic or fill up your
hard
drive.
 Assuming you are seeing a lot of btlp and rto errors..  If
anything
these
errors are performance killer.  Not only is the RTO/BTLP error
telling
you
that many packets require retransmit, the OS also has to constantly
go
out
and scan and clean up the fabric.
This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.
The errors in OpenSolaris occur if no cards are installed in the bus.
The other OSes don't report any errors w/ or w/o cards in the bus.
This doesn't happen when there are no cards installed, since the error
is
literally complaining about a packets received between 2 devices.
 Are
you
sure it's you are correctly identifying the right slot?
I believe only OpenSolaris even detects these errors, which is
why
the
other
OSes don't report any errors.  It doesn't mean that errors aren't
occurring
though.
Post by Chris Worley
Post by Erwin Tsaur
Post by Chris Worley
It would also be nice to throttle the errlog so it doesn't
fill
the
disk an hour after boot.  Is this possible?
no throttling possible, but you could turn it off, though
highly
not
recommended, it's better to fix the issue.  It really could
just
be
a
badly
seated card.
How do I disable the errors?
We need to figure out exactly what your error is first, please provide
the
"fmdump -eV" log.  If it is huge, just tail the last 500-1000
lines
should
be enough.
I think all the CE's would produce the same message.  I also need
to
know
the exact device.
Last 1000 lines (of ~20 million) attached.
There are some boards in the bus at this time, but the same error
occurs w/o them, and their drivers are not yet loaded.  Everything
else is built-in to the motherboard.
Thanks,
Chris
Post by Erwin Tsaur
<snip>
Erwin Tsaur
2009-07-08 23:55:52 UTC
Permalink
Post by Chris Worley
Post by Chris Worley
Post by Erwin Tsaur
I didn't realize that the Root Port was seeing the same thing.. :(
Add the same line to pcie_pci.conf
# tail /kernel/drv/pcie_pci.conf
# Force load driver to support hotplug activity
#
ddi-forceattach=1;
#
# force interrupt priorities to be one
# otherwise this driver binds as bridge device with priority 12
#
interrupt-priorities=1;
pcie_ce_mask=0x1040;
... still errors being logged (see attached).
geesh.. yet another type of CE. Change the mask from 0x1040 to 0x1041. If
you get tired of this, change the mask to -1. :)
With this you shouldn't see any more ereports from these devices, unless
there was a UE from that link.
Just so I might learn something, how are you correlating the errors to
the bitmask?
Sorry, I probably should have explained that earlier.
The "fabric" ereports are basically a register dump of all the error
registers of the device. In this case, look for pcie_ce_status and
pcie_ce_mask. The bits set in pcie_ce_status are the CEs that fired, and
pcie_ce_mask uses the same bit positions. Too many CEs in a given amount
of time cause FMD to fault the device, or in this case the link.

In the ereport you'll also find the dev path, as well as other
interesting information such as the BDF and the device/vendor ID.

For quick searching look for the "severity" flag. Any ereport with a
severity > 0x1 means there was some sort of error in it.
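
A rough sketch of that correlation, for anyone who wants to automate it: OR together every pcie_ce_status value found in the ereport text to get a candidate mask. It assumes the "fmdump -eV" output prints the field as `pcie_ce_status = 0x...`; check your own log before relying on it. The function name `suggest_ce_mask` is hypothetical.

```python
import re

# Assumed line format in "fmdump -eV" output: "pcie_ce_status = 0x1000"
CE_STATUS = re.compile(r"\bpcie_ce_status\s*=\s*(0x[0-9a-fA-F]+)")


def suggest_ce_mask(ereport_text):
    """OR together every pcie_ce_status value seen in the ereport text."""
    mask = 0
    for match in CE_STATUS.finditer(ereport_text):
        mask |= int(match.group(1), 16)
    return mask


# Made-up sample lines for illustration, not taken from a real log.
sample = """
    pcie_ce_status = 0x1000
    pcie_ce_mask = 0x0
    pcie_ce_status = 0x40
    pcie_ce_status = 0x1
"""
print(hex(suggest_ce_mask(sample)))
```

To run it against a real log, feed the saved "fmdump -eV" output to `suggest_ce_mask` as a string, for example from `sys.stdin.read()`. The result only covers CE types that have already shown up, which is why the mask in this thread had to be widened more than once.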
Post by Chris Worley
Thanks,
Chris
Post by Chris Worley
Post by Erwin Tsaur
Good news is that the leaf device is no longer spamming with CEs.
Yes, I've made a lot of progress so far on this, thanks!
Post by Erwin Tsaur
You can also limit which RP's CE's get turned off. see "driver.conf" man
page. This will mask 0x1040 on all the RPs.
I have to warn again, though they are technically CEs and no damage was
done, there are probably performance impacts. Masking the CE's won't
correct it but will save your harddrive and also significantly improve
performance since the OS won't be interrupted hundreds of times a second.
Unfortunately don't know of any fix, unless there is a vendor specific
method to fix the underlying HW issue.
Intel is usually pretty good about fixing issues like this (unless
it's caused by Supermicro's layout).
Intel is pretty good about this. These are low level physical layer errors,
so it could very well be a layout issue.
Post by Chris Worley
I'm not to worried about the NIC's performance, as long as it works...
I do need to measure system performance in other respects, so
decreasing the shower of interrupts (CPU overhead) is important.
Chris
Post by Erwin Tsaur
Post by Chris Worley
Post by Erwin Tsaur
Post by Chris Worley
Post by Erwin Tsaur
ok I severely underestimated how much 1000 lines are, but I got the info
I
needed, mostly. Now I'm wondering if there is an errata in the Root Port
causing this.
are complaining of CE errors.
The best way is to disable reporting those 2 CEs is to add the following
line in the driver's .conf file.
This is the igb driver (SUNWigb package). It doesn't have a conf file.
It's for the Intel 82576 Dual-Port GigE on-board NIC. This driver
didn't work prior to the update.
/usr/kernel/drv# cat >igb.conf
pcie_ce_mask=0x1040;
... no difference. Still lots of errors reported (attached last 1000 lines).
It's not picking up the conf property...
According to the pkgdef, I think the correct place is
/kernel/drv/igb.conf
It should be in the same place as the igb driver.
# For example, if you see,
#
=
"0"
# flow_control = 1;
=
"0,1"
# flow_control = 3;
pcie_ce_mask=0x1040;
Still, no joy... the last 1K lines attached.
Thanks,
Chris
Post by Erwin Tsaur
Post by Chris Worley
Chris
Post by Erwin Tsaur
pcie_ce_mask=0x1040;
It requires reboot.
If you need to do it on a live system let me know, the instructions are
a
bit more complicated.
Post by Chris Worley
Post by Erwin Tsaur
Post by Erwin Tsaur
Post by Chris Worley
Post by Erwin Tsaur
Post by Chris Worley
(Sorry for the misleading "Subject" in the initial post.
would
like
to know a more appropriate place to post, since fm is just the
messenger here.)
More to add: fmadm faulty may be saying something about a bad
PCIe
# fmadm faulty
--------------- ------------------------------------
--------------
---------
TIME EVENT-ID MSG-ID
SEVERITY
--------------- ------------------------------------
--------------
---------
Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c
PCIEX-8000-KP
Major
Fault class : fault.io.pciex.device-interr-corr max 29%
fault.io.pciex.bus-linkerr-corr max 14%
faulted but still in service
FRU : "MB"
(hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
faulty
Description : Too many recovered bus errors have been detected,
which
indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an
unrecoverable
fault.
Refer to http://sun.com/msg/PCIEX-8000-KP for more
information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances
associated
with
this fault
Action : If a plug-in card is involved check for
badly-seated
cards
or
bent pins. Otherwise schedule a repair procedure to
replace
the
affected device. Use fmadm faulty to identify the
device
or
contact Sun for support.
How bad is this error? I need to put some adapters in, but it
sounds
like the OS doesn't handle the NHM's IOH (or is it really
detaining
a
HW issue?).
OS does handle these issues and unfortunately it is a HW issue.
This
is
likely to eventually cause your system to panic or fill up your
hard
drive.
Assuming you are seeing a lot of btlp and rto errors.. If
anything
these
errors are performance killer. Not only is the RTO/BTLP error
telling
you
that many packets require retransmit, the OS also has to
constantly
go
out
and scan and clean up the fabric.
This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.
The errors in OpenSolaris occur if no cards are installed in the
bus.
The other OSes don't report any errors w/ or w/o cards in the bus.
This doesn't happen when there are no cards installed, since the
error
is
literally complaining about a packets received between 2 devices.
Are
you
sure it's you are correctly identifying the right slot?
I believe only OpenSolaris even detects these errors, which is
why
the
other
OSes don't report any errors. It doesn't mean that errors aren't
occurring
though.
Post by Chris Worley
Post by Erwin Tsaur
Post by Chris Worley
It would also be nice to throttle the errlog so it doesn't
fill
the
disk an hour after boot. Is this possible?
no throttling possible, but you could turn it off, though
highly
not
recommended, it's better to fix the issue. It really could just
be
a
badly
seated card.
How do I disable the errors?
We need to figure out exactly what your error is first, please
provide
the
"fmdump -eV" log. If it is huge, just tail the last 500-1000 lines
should
be enough.
I think all the CE's would produce the same message. I also need to
know
the exact device.
Last 1000 lines (of ~20 million) attached.
There are some boards in the bus at this time, but the same error
occurs w/o them, and their drivers are not yet loaded. Everything
else is built-in to the motherboard.
Thanks,
Chris
Post by Erwin Tsaur
Post by Erwin Tsaur
<snip>
Chris Worley
2009-07-09 00:02:43 UTC
Permalink
Post by Erwin Tsaur
Post by Chris Worley
Post by Chris Worley
Post by Erwin Tsaur
I didn't realize that the Root Port was seeing the same thing.. :(
Add the same line to pcie_pci.conf
# tail /kernel/drv/pcie_pci.conf
# Force load driver to support hotplug activity
#
ddi-forceattach=1;
#
# force interrupt priorities to be one
# otherwise this driver binds as bridge device with priority 12
#
interrupt-priorities=1;
pcie_ce_mask=0x1040;
... still errors being logged (see attached).
geesh.. yet another type of CE.  Change the mask from 0x1040 to 0x1041.  If
you get tired of this, change the mask to -1. :)
With this you shouldn't see any more ereports from these devices, unless
there was a UE from that link.
Just so I might learn something, how are you correlating the errors to
the bitmask?
sorry probably should have explained that earlier.
The "fabric" ereports are basically a register dump of all the error
registers of the device.  In this case look for pcie_ce_status and
pcie_ce_mask.  Too many CE errors in a given amount of time causes the FMD
to fault this device or link in this case.
In there ereport you'll also find the dev path, as well as other interesting
information such as the BDF and dev/ven id.
For quick searching look for the "severity" flag.  Any ereport with a
severity > 0x1 means there was some sort of error in it.
Thanks for all the help on this... I stopped at a setting of 0x10c1 in
the igb.conf file. There are a few errors at boot from the root port,
but they seem not to persist, so I kept pcie_pci.conf at 0x1041.

Chris
Post by Erwin Tsaur
Post by Chris Worley
Thanks,
Chris
Post by Chris Worley
Post by Erwin Tsaur
Good news is that the leaf device is no longer spamming with CEs.
Yes, I've made a lot of progress so far on this, thanks!
Post by Erwin Tsaur
You can also limit which RP's CE's get turned off.  see "driver.conf" man
page.  This will mask 0x1040 on all the RPs.
I have to warn again, though they are technically CEs and no damage was
done, there are probably performance impacts.  Masking the CE's won't
correct it but will save your harddrive and also significantly improve
performance since the OS won't be interrupted hundreds of times a second.
 Unfortunately don't know of any fix, unless there is a vendor specific
method to fix the underlying HW issue.
Intel is usually pretty good about fixing issues like this (unless
it's caused by Supermicro's layout).
Intel is pretty good about this.  These are low level physical layer errors,
so it could very well be a layout issue.
Post by Chris Worley
I'm not to worried about the NIC's performance, as long as it works...
I do need to measure system performance in other respects, so
decreasing the shower of interrupts (CPU overhead) is important.
Chris
Post by Erwin Tsaur
Post by Chris Worley
Post by Erwin Tsaur
Post by Erwin Tsaur
ok I severely underestimated how much 1000 lines are, but I got the info
I
needed, mostly.  Now I'm wondering if there is an errata in the
Root
Port
causing this.
 Both
notes
are complaining of CE errors.
The best way is to disable reporting those 2 CEs is to add the following
line in the driver's .conf file.
This is the igb driver (SUNWigb package).  It doesn't have a conf file.
It's for the Intel 82576 Dual-Port GigE on-board NIC.  This driver
didn't work prior to the update.
/usr/kernel/drv# cat >igb.conf
pcie_ce_mask=0x1040;
... no difference.  Still lots of errors reported (attached last 1000
lines).
It's not picking up the conf property...
According to the pkgdef, I think the correct place is
/kernel/drv/igb.conf
It should be in the same place as the igb driver.
# For example, if you see,
#
=
"0"
# flow_control = 1;
=
"0,1"
# flow_control = 3;
pcie_ce_mask=0x1040;
Still, no joy... the last 1K lines attached.
Thanks,
Chris
Post by Erwin Tsaur
Chris
Post by Erwin Tsaur
pcie_ce_mask=0x1040;
It requires reboot.
If you need to do it on a live system let me know, the instructions are
a
bit more complicated.
Post by Chris Worley
Post by Erwin Tsaur
On Wed, Jul 8, 2009 at 1:10 PM, Erwin
Post by Erwin Tsaur
Post by Chris Worley
(Sorry for the misleading "Subject" in the initial post.
 would
like
to know a more appropriate place to post, since fm is just
the
messenger here.)
More to add: fmadm faulty may be saying something about a bad
PCIe
# fmadm faulty
--------------- ------------------------------------
 --------------
---------
TIME            EVENT-ID                              MSG-ID
SEVERITY
--------------- ------------------------------------
 --------------
---------
Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c
 PCIEX-8000-KP
 Major
Fault class : fault.io.pciex.device-interr-corr max 29%
      fault.io.pciex.bus-linkerr-corr max 14%
          faulted but still in service
FRU         : "MB"
(hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
          faulty
Description : Too many recovered bus errors have been
detected,
which
indicates
      a problem with the specified bus or with the specified
      transmitting device. This may degrade into an
unrecoverable
      fault.
      Refer to http://sun.com/msg/PCIEX-8000-KP for more
information.
Response    : One or more device instances may be disabled
Impact      : Loss of services provided by the device
instances
associated
with
      this fault
Action      : If a plug-in card is involved check for
badly-seated
cards
or
      bent pins. Otherwise schedule a repair procedure to
replace
the
      affected device.  Use fmadm faulty to identify the
device
or
      contact Sun for support.
How bad is this error?  I need to put some adapters in, but
it
sounds
like the OS doesn't handle the NHM's IOH (or is it really
detaining
a
HW issue?).
OS does handle these issues and unfortunately it is a HW issue.
 This
is
likely to eventually cause your system to panic or fill up your
hard
drive.
 Assuming you are seeing a lot of btlp and rto errors..  If
anything
these
errors are performance killer.  Not only is the RTO/BTLP error
telling
you
that many packets require retransmit, the OS also has to
constantly
go
out
and scan and clean up the fabric.
This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.
The errors in OpenSolaris occur if no cards are installed in the
bus.
The other OSes don't report any errors w/ or w/o cards in the
bus.
This doesn't happen when there are no cards installed, since the
error
is
literally complaining about a packets received between 2 devices.
 Are
you
sure it's you are correctly identifying the right slot?
I believe only OpenSolaris even detects these errors, which is
why
the
other
OSes don't report any errors.  It doesn't mean that errors aren't
occurring
though.
Post by Erwin Tsaur
Post by Chris Worley
It would also be nice to throttle the errlog so it doesn't
fill
the
disk an hour after boot.  Is this possible?
no throttling possible, but you could turn it off, though
highly
not
recommended, it's better to fix the issue.  It really could
just
be
a
badly
seated card.
How do I disable the errors?
We need to figure out exactly what your error is first, please
provide
the
"fmdump -eV" log.  If it is huge, just tail the last 500-1000
lines
should
be enough.
I think all the CE's would produce the same message.  I also need
to
know
the exact device.
Last 1000 lines (of ~20 million) attached.
There are some boards in the bus at this time, but the same error
occurs w/o them, and their drivers are not yet loaded.  Everything
else is built-in to the motherboard.
Thanks,
Chris
Post by Erwin Tsaur
<snip>