Discussion:
"PCE-E fabric" and "interconnect;lfu-l" errors
Eric Sun
2008-04-09 17:56:07 UTC
Permalink
Hi,

Recently we got some errors from N2/VF systems. If anyone on this board
could give a pointer on where to take the FA further, it would be appreciated.

1. On N2 (Glendale), a system panic due to "PCIe fabric". On this system,
the FEM (fabric expansion module) is not installed.

2. On VF/Batoka, FMA logged
"fault.asic.ultraSPARC-T2plus.interconnect.lfu-f". Any suggestion as to
where FMA documentation on this error can be found?

Thanks.

Eric
Scott Davenport
2008-04-09 19:58:28 UTC
Permalink
Post by Eric Sun
Hi,
Recently we got some errors from N2/VF systems. If anyone on this board
could give a pointer on where to take the FA further, it would be appreciated.
1. On N2 (Glendale), a system panic due to "PCIe fabric". On this system,
the FEM (fabric expansion module) is not installed.
I don't know the specific layout of this system, but the T2
(Niagara-2) has a built-in PIU (PCIe interface unit). If it is
connected to at least one PLX switch, that constitutes a fabric.
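
If you want to confirm the layout, the FMA topology should show whether
a PLX switch sits under the root complex. A minimal check, assuming a
stock Solaris 10 install (the grep pattern is just what I'd look for,
not verified against your box):

  # pciexrc is the root complex; PLX/pciex switch nodes under it
  # mean there is a fabric for the px driver to diagnose
  /usr/lib/fm/fmd/fmtopo | grep pciex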

Do you have more info you can share? The panic itself would produce
a general FMA message (SUNOS-8000-0G, IIRC) - basically saying the
system had to panic. But if the panic was due to a HW problem, I would
expect subsequent telemetry and another diagnosis. Look for FMA
message codes in the /var/adm/messages file, check 'fmadm faulty',
or run an 'fmdump'.
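
Something along these lines, assuming a stock Solaris 10 FMA setup
(nothing here is specific to your box):

  # faults the diagnosis engines have already called out
  fmadm faulty
  # the fault log - one line per diagnosis, with its SUNW-MSG-ID
  fmdump -v
  # the raw error telemetry (ereports), fully expanded
  fmdump -eV
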
Post by Eric Sun
2. On VF/Batoka, FMA logged
"fault.asic.ultraSPARC-T2plus.interconnect.lfu-f". Any suggestion as to
where FMA documentation on this error can be found?
Each fault has a corresponding message id. That is printed to
the console and /var/adm/messages when the diagnosis is
issued. This particular one is http://www.sun.com/msg/SUN4V-8001-UH.
But in short you've had a single lane failure on a VF coherency
channel.
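
If you want the chip/channel/lane detail behind that diagnosis, you can
pull the event out of the fault log by UUID - the UUID below is only a
placeholder, take the real one from 'fmdump -v' or 'fmadm faulty':

  # expand the lfu-f diagnosis in full
  fmdump -V -u 2f3a9d64-aaaa-bbbb-cccc-0123456789ab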

-- scott
http://blogs.sun.com/sdaven
Eric Sun
2008-04-09 21:30:46 UTC
Permalink
Scott;

Below is the log; any info is appreciated.

Eric


=====

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x47e88814.0xa220262 (0xafd46d33c7d)
PLATFORM: SUNW,Sun-Blade-T6320, CSN: -, HOSTNAME: gd65-19-1
SOURCE: SunOS, REV: 5.10 glendale-on10-nightly_nightly:08/07/2007
DESC: Errors have been detected that require a reboot to ensure system
integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
IMPACT: The system will sync files, save a crash dump if needed, and reboot
REC-ACTION: Save the error summary below in case telemetry cannot be saved

ereport.io.fire.fabric ena=afd46c71b2e0a001 detector=[ version=0
scheme="dev"
device-path="/***@0/***@0/***@c/***@0" ] req_id=7a00 device_id=105e
vendor_id=8086 rev_id=6 dev_type=0 cap_off=e0 aer_off=100 sts_reg=4010
sts_sreg=0 pcix_sts_reg=0 pcix_bdg_sts_reg=0 dev_sts_reg=4 aer_ce=0 aer_ue=
40000 aer_sev=62011 aer_ctr=12 aer_h1=a002000 aer_h2=8200 aer_h3=7a010870
aer_h4=0 saer_ue=0 saer_sev=0 saer_ctr=0 saer_h1=0 saer_h2=0 saer_h3=0
saer_h4=0 remainder=2 severity=40

ereport.io.fire.fabric ena=afd46c71b2e0a001 detector=[ version=0
scheme="dev"
device-path="/***@0/***@0/***@c" ] req_id=360 device_id=8548 vendor_id=10b5
rev_id=aa dev_type=60 cap_off=68 aer_off=fb4 sts_reg=10 sts_sreg=0
pcix_sts_reg=0 pcix_bdg_sts_reg=0 dev_sts_reg=0 aer_ce=0 aer_ue=0 aer_sev=
62030 aer_ctr=1ff aer_h1=0 aer_h2=0 aer_h3=0 aer_h4=0 saer_ue=0 saer_sev=0
saer_ctr=0 saer_h1=0 saer_h2=0 saer_h3=0 saer_h4=0 remainder=1 severity=1

ereport.io.fire.fabric ena=afd46c71b2e0a001 detector=[ version=0
scheme="dev"
device-path="/***@0/***@0" ] req_id=200 device_id=8548 vendor_id=10b5
rev_id=
aa dev_type=50 cap_off=68 aer_off=fb4 sts_reg=10 sts_sreg=0 pcix_sts_reg=0
pcix_bdg_sts_reg=0 dev_sts_reg=0 aer_ce=0 aer_ue=0 aer_sev=62030
aer_ctr=1ff
aer_h1=0 aer_h2=0 aer_h3=0 aer_h4=0 saer_ue=0 saer_sev=0 saer_ctr=0
saer_h1=0
saer_h2=0 saer_h3=0 saer_h4=0 remainder=0 severity=1


panic[cpu40]/thread=2a10c289cc0: Fatal error has occured in: PCIe fabric.

000002a10c2b1c40 px:px_err_panic+174 (0, 1337c00, 2a10c2b1cf0, 41,
2a10c2b1cf1, 0)
%l0-3: 0000000000000034 00000000018fe000 0000000000000000 0000000000000001
%l4-7: 00000000018fe000 0000000000000000 0000000001846c00 ffffffffffffffff
000002a10c2b1d50 px:px_err_fabric_intr+b8 (300008e9e00, 33, 7a00,
300008d0210, 300008e9f50, 7a00000000000000)
%l0-3: 0000000000000001 0000060005ac4000 0000000000000020 0000060005bdccc0
%l4-7: 0000000000000041 0000000000000000 0000000000000000 0000000000000001
000002a10c2b1e40 px:px_msiq_intr+1c0 (300008e7ce8, 300008d0210, 132c9bc,
0, 300008e14f0, 60001c2a3e0) %l0-3: 0000000000000000 000002a10c2b1f10
0000000000000000 0000000000000003
%l4-7: 000002a10c2b1f40 00000600025fc000 0000000000000000 0000000000000033
000002a10c2b1f50 unix:current_thread+170 (16, 10000000000,
fffffefffffffeff, fffffefffffffeff, 0, 12) %l0-3: 000000000100985c
000002a10c289021 000000000000000e 000000000000003a
%l4-7: ffffffffffffffff 0000000000000000 0000000000000000 000002a10c2898d0
000002a10c289970 unix:cpu_halt+114 (30001b32000, 10000000000, 184d100,
28, 30001b32000, 60005b0b35c)
%l0-3: 0000000000000016 00000300005afa80 00000300005afa80 0000000000000000
%l4-7: 0000000000000000 0000000000000000 0000000000000001 0000000000000001
000002a10c289a20 unix:idle+128 (1819c00, 0, 30001b32000,
ffffffffffffffff, 29, 1818c00)
%l0-3: 0000060005b0b338 000000000000001b 0000000000000000 ffffffffffffffff
%l4-7: 0000060005b0b338 ffffffffffffffff 000000000184d100 000000000103afbc

syncing file systems...
======
Tarik Soydan - Sun BOS Software
2008-04-10 16:46:40 UTC
Permalink
Post by Eric Sun
Scott;
Below is the log; any info is appreciated.
Eric
=====
SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x47e88814.0xa220262 (0xafd46d33c7d)
PLATFORM: SUNW,Sun-Blade-T6320, CSN: -, HOSTNAME: gd65-19-1
SOURCE: SunOS, REV: 5.10 glendale-on10-nightly_nightly:08/07/2007
DESC: Errors have been detected that require a reboot to ensure system
integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
IMPACT: The system will sync files, save a crash dump if needed, and reboot
REC-ACTION: Save the error summary below in case telemetry cannot be saved
ereport.io.fire.fabric ena=afd46c71b2e0a001 detector=[ version=0
scheme="dev"
vendor_id=8086 rev_id=6 dev_type=0 cap_off=e0 aer_off=100 sts_reg=4010
sts_sreg=0 pcix_sts_reg=0 pcix_bdg_sts_reg=0 dev_sts_reg=4 aer_ce=0 aer_ue=
40000 aer_sev=62011 aer_ctr=12 aer_h1=a002000 aer_h2=8200 aer_h3=7a010870
aer_h4=0 saer_ue=0 saer_sev=0 saer_ctr=0 saer_h1=0 saer_h2=0 saer_h3=0
saer_h4=0 remainder=2 severity=40
The network device /***@0/***@0/***@c/***@0 detected a malformed
completion packet, according to the AER registers and the header log
registers. It must have been doing some DMA operation. I don't know
what is "malformed" about the packet, but for some reason the network
device didn't like it.

PCI Status Reg = signalled system error
AER_UE = malformed TLP

I would suspect the upstream switch device /***@0/***@0/***@c.

I would also expect there to be a fault diagnosed to that effect.
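
As a quick sanity check on that reading, the two registers from the
leaf ereport decode like this (plain ksh/bash arithmetic; bit positions
per the PCIe AER and PCI status register layouts):

  # aer_ue=40000 -> bit 18 of the AER Uncorrectable Error Status
  # register, which is Malformed TLP
  printf 'Malformed TLP: %d\n' $(( (0x40000 >> 18) & 1 ))
  # sts_reg=4010 -> bit 14 of the PCI Status register, which is
  # Signaled System Error (bit 4 is just Capabilities List)
  printf 'Signaled System Error: %d\n' $(( (0x4010 >> 14) & 1 ))
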
Post by Eric Sun
ereport.io.fire.fabric ena=afd46c71b2e0a001 detector=[ version=0
scheme="dev"
rev_id=aa dev_type=60 cap_off=68 aer_off=fb4 sts_reg=10 sts_sreg=0
pcix_sts_reg=0 pcix_bdg_sts_reg=0 dev_sts_reg=0 aer_ce=0 aer_ue=0 aer_sev=
62030 aer_ctr=1ff aer_h1=0 aer_h2=0 aer_h3=0 aer_h4=0 saer_ue=0 saer_sev=0
saer_ctr=0 saer_h1=0 saer_h2=0 saer_h3=0 saer_h4=0 remainder=1 severity=1
No errors.
Post by Eric Sun
ereport.io.fire.fabric ena=afd46c71b2e0a001 detector=[ version=0
scheme="dev"
rev_id=
aa dev_type=50 cap_off=68 aer_off=fb4 sts_reg=10 sts_sreg=0 pcix_sts_reg=0
pcix_bdg_sts_reg=0 dev_sts_reg=0 aer_ce=0 aer_ue=0 aer_sev=62030
aer_ctr=1ff
aer_h1=0 aer_h2=0 aer_h3=0 aer_h4=0 saer_ue=0 saer_sev=0 saer_ctr=0
saer_h1=0
saer_h2=0 saer_h3=0 saer_h4=0 remainder=0 severity=1
No errors.
Post by Eric Sun
panic[cpu40]/thread=2a10c289cc0: Fatal error has occured in: PCIe fabric.
000002a10c2b1c40 px:px_err_panic+174 (0, 1337c00, 2a10c2b1cf0, 41,
2a10c2b1cf1, 0)
%l0-3: 0000000000000034 00000000018fe000 0000000000000000 0000000000000001
%l4-7: 00000000018fe000 0000000000000000 0000000001846c00 ffffffffffffffff
000002a10c2b1d50 px:px_err_fabric_intr+b8 (300008e9e00, 33, 7a00,
300008d0210, 300008e9f50, 7a00000000000000)
%l0-3: 0000000000000001 0000060005ac4000 0000000000000020 0000060005bdccc0
%l4-7: 0000000000000041 0000000000000000 0000000000000000 0000000000000001
000002a10c2b1e40 px:px_msiq_intr+1c0 (300008e7ce8, 300008d0210, 132c9bc,
0, 300008e14f0, 60001c2a3e0) %l0-3: 0000000000000000 000002a10c2b1f10
0000000000000000 0000000000000003
%l4-7: 000002a10c2b1f40 00000600025fc000 0000000000000000 0000000000000033
000002a10c2b1f50 unix:current_thread+170 (16, 10000000000,
fffffefffffffeff, fffffefffffffeff, 0, 12) %l0-3: 000000000100985c
000002a10c289021 000000000000000e 000000000000003a
%l4-7: ffffffffffffffff 0000000000000000 0000000000000000 000002a10c2898d0
000002a10c289970 unix:cpu_halt+114 (30001b32000, 10000000000, 184d100,
28, 30001b32000, 60005b0b35c)
%l0-3: 0000000000000016 00000300005afa80 00000300005afa80 0000000000000000
%l4-7: 0000000000000000 0000000000000000 0000000000000001 0000000000000001
000002a10c289a20 unix:idle+128 (1819c00, 0, 30001b32000,
ffffffffffffffff, 29, 1818c00)
%l0-3: 0000060005b0b338 000000000000001b 0000000000000000 ffffffffffffffff
%l4-7: 0000060005b0b338 ffffffffffffffff 000000000184d100 000000000103afbc
syncing file systems...
======