Discussion:
self test failure on Intel X25-E SSD
Paul B. Henson
2009-06-09 00:57:00 UTC
I've been testing out an Intel X25-E SSD as a slog device. Functionality
wise, it works great -- performance on some of my testing approaches the
same level as completely disabling the zil.

Unfortunately, fma isn't very happy with the drive. It keeps saying the
drive has failed self test and marking it as faulty. This does not impact
functionality, the drive remains available and works fine. However, the
fault indicator light on the chassis and on the drive is lit, IPMI
management asserts a drive failure status and the logs are cluttered with
erroneous failure notifications.

I was wondering if this was a problem with the drive not supporting
whatever SMART functionality fma is looking for, or some type of
incompatibility.

Here's the fault report:

***@ike ~ # fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 02 17:59:18 53ce5eda-716e-e5b2-870b-d5d5d2828f81 DISK-8000-2J   Critical

Fault class : fault.io.disk.self-test-failure
Affects     : dev:///:devid=id1,***@SATA_____SSDSA2SH032G1GN___CVEM902600J6032HGN//***@2,0/pci1022,***@8/pci11ab,***@1/***@0,0
              degraded but still in service
FRU         : "HD_ID_4"
              (hc://:product-id=Sun-Fire-X4500:chassis-id=0819AMT059:server-id=ike:serial=CVEM902600J6032HGN:part=SSDSA2SH032G1GN-INTEL:revision=045C8626/bay=4/disk=0)
              faulty


Also, it does not look like the drive includes a temperature sensor, or at
least the value is not being reported correctly:

c5t0d0p0 2600J6032HGN ATA SSDSA2SH032G1GN 8626 255 C (491 F)
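
A reading of 255 is 0xFF, the largest value a single byte can hold, which suggests a placeholder rather than a measured temperature. The sketch below only checks the arithmetic behind "255 C (491 F)" and flags 0xFF as suspect. Treating 0xFF as "no valid reading" is an assumption here (it matches how the SCSI temperature log page encodes an unavailable sensor, but the thread does not say which page the tool reads), and the function names are hypothetical.

```python
def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def describe_temperature(raw):
    """Format a one-byte temperature reading.

    Assumption: 0xFF means "no valid reading" rather than 255 degrees.
    """
    if raw == 0xFF:
        return "not available (raw 0xFF)"
    return "%d C (%d F)" % (raw, round(celsius_to_fahrenheit(raw)))


print(describe_temperature(255))
print(describe_temperature(30))
```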


If there is no way to get this drive to play happily with fma, is there any
way to disable self tests on the drive to prevent the erroneous fault and
error notifications?

Thanks...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Eric Schrock
2009-06-09 01:15:58 UTC
Post by Paul B. Henson
I've been testing out an Intel X25-E SSD as a slog device. Functionality
wise, it works great -- performance on some of my testing approaches the
same level as completely disabling the zil.
Unfortunately, fma isn't very happy with the drive. It keeps saying the
drive has failed self test and marking it as faulty. This does not impact
functionality, the drive remains available and works fine. However, the
fault indicator light on the chassis and on the drive is lit, IPMI
management asserts a drive failure status and the logs are cluttered with
erroneous failure notifications.
I was wondering if this was a problem with the drive not supporting
whatever SMART functionality fma is looking for, or some type of
incompatibility.
This information comes from the self-test log page (section 7.2.10 of
SPC-3). In particular, it checks the most recent entry for the
Self-Test results field for values in the range 0x3-0x7. You can find
the code that analyzes this at logpage_selftest_analyze(). I'd
recommend writing a short test program that links against
/usr/lib/fm/libdiskstatus.so.1 and does a
disk_status_open()/disk_status_get()/nvlist_print(). This will dump out
the contents of the IE page as fmd is analyzing it (recent opensolaris
versions can also do this with fmtopo -m).
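
For reference, the check described above can be sketched in a few lines of Python. This is not the libdiskstatus code. It is a minimal illustration that assumes the 4-bit self-test results values defined for the SPC-3 self-test results log page, and the function names are made up for the example.

```python
# Self-test results values (low 4 bits of the field) as defined for the
# SPC-3 self-test results log page. Values 0x8-0xE are reserved.
SELF_TEST_RESULTS = {
    0x0: "completed without error",
    0x1: "aborted by SEND DIAGNOSTIC",
    0x2: "aborted by other means",
    0x3: "unknown error, unable to complete",
    0x4: "failed, segment unknown",
    0x5: "failed in first segment",
    0x6: "failed in second segment",
    0x7: "failed in another segment",
    0xF: "self-test in progress",
}


def describe_result(result_code):
    """Return a human-readable meaning for a self-test results value."""
    return SELF_TEST_RESULTS.get(result_code & 0xF, "reserved")


def in_failure_range(result_code):
    """True if the value falls in the 0x3-0x7 range mentioned above."""
    return 0x3 <= (result_code & 0xF) <= 0x7


for code in (0x0, 0x3, 0x7, 0x9):
    print(hex(code), describe_result(code), in_failure_range(code))
```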

Looking at the code, the fact that it treats mode 0x3 (incomplete) as a
failure is a bug. But it should also be the case that a more recent
self-test should be run and we should be picking the later entry, which
should report success. Perhaps there's another bug where we're not
selecting the most recent entry correctly. I'd recommend poking around
with a debugger and seeing why this function believes that the self-test
has failed.

Hope that helps,

- Eric
--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Paul B. Henson
2009-06-09 02:40:13 UTC
Post by Eric Schrock
/usr/lib/fm/libdiskstatus.so.1 and does a
disk_status_open()/disk_status_get()/nvlist_print().
self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0xe
timestamp = 0xea00
segment = 0x0
address = 0xea00ea00ea
(end self-test-failure)

If I'm reading this right, the self test result code is 0xe? Unless the
copy of the spec I found is out of date that's a reserved value and not
currently defined? Which would lead one to believe the fault lies with the
SSD.

My understanding is that the new X4540 SSD is a relabeled X25-E, presumably
it works correctly with fma. Anyone played with one of those yet? I wonder
what changes Sun might have made to the firmware. Doesn't look like there's
any firmware updates out yet for the X25-E from Intel.
Post by Eric Schrock
selecting the most recent entry correctly. I'd recommend poking around
with a debugger and seeing why this function believes that the self-test
has failed.
If I'm understanding the output correctly, it probably thinks it failed
because the self-test result is invalid, and I need to either RMA the drive
or go yell at Intel. Although another person I spoke with with an X25-E
says his is reported as failing selftest as well, which would indicate a
general issue with the drive and not a specific failure with my unit.

Thanks much for the help...
Eric Schrock
2009-06-09 02:47:06 UTC
Yes, that's quite strange. It's also possible that the code to walk the
individual log parameters is somehow getting out of sync, and we're
walking off into outer space. Certainly the predominance of '0xea' is
quite suspicious. I would set a breakpoint in
logpage_selftest_analyze() in MDB and do something (after stepping over
the pushl/movl to setup the stack frame) like:

$C
... get args to func ...
<arg1>,<arg2>::dump

Where 'arg0' is the first argument. This will dump out the raw data.
From there, we can walk over the parameter entries by hand and see if
they look legitimately strange. This may still be a software bug.
Someday it would also be nice to rewrite libdiskstatus to leverage
libscsi - it would eliminate a large amount of custom code that only
makes it more difficult.

- Eric
Paul B. Henson
2009-06-09 03:08:58 UTC
quite suspicious. I would set a breakpoint in logpage_selftest_analyze()
in MDB and do something (after stepping over the pushl/movl to setup the
Hmm... I'm afraid I'm not too familiar with mdb. Looks like most of the
quick intro guides are more geared towards debugging kernel crash dumps.
I'll play with it and see if I can get what you're asking for.

Thanks...
Eric Schrock
2009-06-09 03:15:35 UTC
Post by Paul B. Henson
quite suspicious. I would set a breakpoint in logpage_selftest_analyze()
in MDB and do something (after stepping over the pushl/movl to setup the
Hmm... I'm afraid I'm not too familiar with mdb. Looks like most of the
quick intro guides are more geared towards debugging kernel crash dumps.
I'll play with it and see if I can get what you're asking for.
Another thing you could do is set a breakpoint there and then run
"::gcore" which should dump a core that I could then look at. You'd
want to do:

$ mdb testprog
> ::bp logpage_selftest_analyze
> ::run <diskname>
-breakpoint hit-
> ::gcore
You can then send the core directly to me and I'll poke around.

- Eric
David Zhang
2009-06-09 02:51:13 UTC
Hi Paul,

Would you please take a look at:
http://developers.sun.com/solaris/articles/scsi_disk_fma2.html
It could give us steps to reproduce what happened on your SSD.

With fmdump -eV, we can find which ereport caused this issue and analyze
it further.

David
Paul B. Henson
2009-06-09 03:17:51 UTC
Post by David Zhang
http://developers.sun.com/solaris/articles/scsi_disk_fma2.html
It could give us steps to reproduce what was happened on your ssd disk.
With fmdump -eV, we can find what ereport cause this issue, and do
further analyze.
fmdump -V -u <uuid> worked fine, output at the bottom. However, the version
of fmdump in S10 doesn't have the '-n' option?

***@ike ~ # fmdump -ev -n ena=0xc7d80697a6402c01
fmdump: illegal option -- n

Is that something new in OpenSolaris? Any way under S10 to get the same
info?

Thanks...

***@ike ~ # fmdump -V -u 16e83c34-b478-60c0-964d-e9cbd7bf4422
TIME UUID SUNW-MSG-ID
Jun 04 17:45:55.9791 16e83c34-b478-60c0-964d-e9cbd7bf4422 DISK-8000-2J

TIME CLASS ENA
Jun 04 17:45:53.7297 ereport.io.scsi.disk.self-test-failure 0xc7d80697a6402c01

nvlist version: 0
version = 0x0
class = list.suspect
uuid = 16e83c34-b478-60c0-964d-e9cbd7bf4422
code = DISK-8000-2J
diag-time = 1244162755 947776
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = Sun Fire X4500
chassis-id = 0819AMT059
server-id = ike
(end authority)

mod-name = eft
mod-version = 1.16
(end de)

fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.io.disk.self-test-failure
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /***@2,0/pci1022,***@8/pci11ab,***@1/***@4,0
devid = id1,***@SATA_____SSDSA2SH032G1GN___CVEM902600J6032HGN
(end asru)

fru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
serial = CVEM902600J6032HGN
part = SSDSA2SH032G1GN-INTEL
revision = 045C8626
authority = (embedded nvlist)
nvlist version: 0
product-id = Sun-Fire-X4500
chassis-id = 0819AMT059
server-id = ike
(end authority)

hc-list-sz = 0x2
hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = bay
hc-id = 5
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = disk
hc-id = 0
(end hc-list[1])

(end fru)

resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
serial = CVEM902600J6032HGN
part = SSDSA2SH032G1GN-INTEL
revision = 045C8626
authority = (embedded nvlist)
nvlist version: 0
product-id = Sun-Fire-X4500
chassis-id = 0819AMT059
server-id = ike
(end authority)

hc-list-sz = 0x2
hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = bay
hc-id = 5
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = disk
hc-id = 0
(end hc-list[1])

(end resource)

location = HD_ID_5
(end fault-list[0])

fault-status = 0x1
__ttl = 0x1
__tod = 0x4a286ac3 0x3a5d5c48
Paul B. Henson
2009-06-09 21:45:45 UTC
Post by David Zhang
With fmdump -eV, we can find what ereport cause this issue, and do
further analyze.
Here is another example of diagnostic output I got from someone running
NexentaStor (based on b104 ON bits) who also has an X25-E reported as
failing self-test, if it provides any other clues to what's going on. His
x4500 doesn't have any failure lights lit though, not sure why.


----------
# ./diskstat /dev/rdsk/c6t4d0
nvlist version: 0
protocol = scsi
status = (embedded nvlist)
nvlist version: 0
command-length = 6
modepages = (embedded nvlist)
nvlist version: 0
informational-exceptions = (embedded nvlist)
nvlist version: 0
dexcpt = 0
logerr = 0
mrie = 0x6
test = 0
ewasc = 0
perf = 0
ebf = 0
interval-timer = 0x0
report-count = 0x0
changed = 0
(end informational-exceptions)

(end modepages)

logpages = (embedded nvlist)
nvlist version: 0
informational-exceptions = (embedded nvlist)
nvlist version: 0
length = 0x8
general = 1
(end informational-exceptions)

self-test = (embedded nvlist)
nvlist version: 0
length = 0x190
(end self-test)

(end logpages)

(end status)

predictive-failure = (embedded nvlist)
nvlist version: 0
additional-sense-code = 0x0
additional-sense-code-qualifier = 0x0
(end predictive-failure)

self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0xb
timestamp = 0xb400
segment = 0x0
address = 0xb400b400b4
(end self-test-failure)

faults = (embedded nvlist)
nvlist version: 0
predictive-failure = 0
self-test-failure = 0
(end faults)
----------

# fmdump
TIME UUID SUNW-MSG-ID
May 19 19:33:38.5442 065181f3-7de1-4e72-f1c9-9acfbf5cd1c4 DISK-8000-2J

# fmdump -V -u 065181f3-7de1-4e72-f1c9-9acfbf5cd1c4
TIME UUID SUNW-MSG-ID
May 19 19:33:38.5442 065181f3-7de1-4e72-f1c9-9acfbf5cd1c4 DISK-8000-2J

TIME CLASS ENA
May 19 19:33:38.3054 ereport.io.scsi.disk.self-test-failure 0xf85c5e4265e05401

nvlist version: 0
version = 0x0
class = list.suspect
uuid = 065181f3-7de1-4e72-f1c9-9acfbf5cd1c4
code = DISK-8000-2J
diag-time = 1242776018 496539
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = Sun Fire X4500
chassis-id = XXXXXXXXXX
server-id = brick1
(end authority)

mod-name = eft
mod-version = 1.16
(end de)

fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.io.disk.self-test-failure
certainty = 0x64
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
serial = CVEM91140085032HGN
part = SSDSA2SH032G1GN-INTEL
revision = 045C8790
authority = (embedded nvlist)
nvlist version: 0
product-id = Sun-Fire-X4500
chassis-id = XXXXXXXXXX
server-id = brick1
(end authority)

hc-list-sz = 0x3
hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = chassis
hc-id = 0
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = bay
hc-id = 0
(end hc-list[1])
(start hc-list[2])
nvlist version: 0
hc-name = disk
hc-id = 0
(end hc-list[2])

(end resource)

asru = (embedded nvlist)
nvlist version: 0
scheme = dev
version = 0x0
device-path = /***@1,0/pci1022,***@4/pci11ab,***@1/***@0,0
devid = id1,***@SATA_____SSDSA2SH032G1GN___CVEM91140085032HGN
(end asru)

fru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
serial = CVEM91140085032HGN
part = SSDSA2SH032G1GN-INTEL
revision = 045C8790
authority = (embedded nvlist)
nvlist version: 0
product-id = Sun-Fire-X4500
server-id = brick1
chassis-id = XXXXXXXXXX
(end authority)

hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = chassis
hc-id = 0
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = bay
hc-id = 0
(end hc-list[1])
(start hc-list[2])
nvlist version: 0
hc-name = disk
hc-id = 0
(end hc-list[2])

(end fru)

location = HD_ID_0
(end fault-list[0])

fault-status = 0x1
__ttl = 0x1
__tod = 0x4a1341d2 0x20714858


# fmdump -ev -n ena=0xf85c5e4265e05401
TIME CLASS ENA
May 19 19:33:38.3054 ereport.io.scsi.disk.self-test-failure 0xf85c5e4265e05401

# fmdump -eV -n ena=0xf85c5e4265e05401 | grep driver-assessment
# fmdump -eV -n ena=0xf85c5e4265e05401 | grep op-code
# fmdump -eV -n ena=0xf85c5e4265e05401 | grep key
# fmdump -eV -n ena=0xf85c5e4265e05401
TIME CLASS
May 19 2009 19:33:38.305479833 ereport.io.scsi.disk.self-test-failure
nvlist version: 0
class = ereport.io.scsi.disk.self-test-failure
version = 0x0
ena = 0xf85c5e4265e05401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
serial = CVEM91140085032HGN
part = SSDSA2SH032G1GN-INTEL
revision = 045C8790
authority = (embedded nvlist)
nvlist version: 0
product-id = Sun-Fire-X4500
server-id = brick1
chassis-id = XXXXXXXXXX
(end authority)

hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = chassis
hc-id = 0
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = bay
hc-id = 0
(end hc-list[1])
(start hc-list[2])
nvlist version: 0
hc-name = disk
hc-id = 0
(end hc-list[2])

(end detector)

result-code = 0x7
timestamp = 0x7a00
segment = 0x0
address = 0x7a007a007a
__ttl = 0x1
__tod = 0x4a1341d2 0x12354099
Chris Horne
2009-06-09 22:38:53 UTC
Paul

This fault is not based on driver generated telemetry.
This fault is associated with the following code,
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/fm/libdiskstatus/common/ds_scsi.c#logpage_selftest_analyze
looking at "log_sense" information obtained from the device.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/fm/libdiskstatus/common/ds_scsi.h#scsi_selftest_log_param

-Chris
Thanks

--Chris
Paul B. Henson
2009-06-10 01:58:22 UTC
Post by Chris Horne
This fault is not based on driver generated telemetry.
I'm not sure what you mean. Do you mean the fmdump output requested by
David isn't relevant?
Tarik Soydan
2009-06-10 18:14:21 UTC
Post by Paul B. Henson
Post by Chris Horne
This fault is not based on driver generated telemetry.
I'm not sure what you mean. Do you mean the fmdump output requested by
David isn't relevant?
There are two sources of disk related ereports/faults.

One source is a polling mechanism whereby FMD checks for firmware-sourced
errors such as self-test, overtemp, and predictive failures. The other is
the sd driver itself, which reports things like media and device errors.

I believe Chris is pointing out that these errors/faults are due to the
firmware polling mechanism and not the sd driver.

-tarik
Eric Schrock
2009-06-09 23:50:36 UTC
Post by Paul B. Henson
Post by David Zhang
With fmdump -eV, we can find what ereport cause this issue, and do
further analyze.
Here is another example of diagnostic output I got from someone running
NexentaStor (based on b104 ON bits) who also has an X25-E reported as
failing self-test, if it provides any other clues to what's going on. His
x4500 doesn't have any failure lights lit though, not sure why.
self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0xb
timestamp = 0xb400
segment = 0x0
address = 0xb400b400b4
(end self-test-failure)
As with the last one, this data doesn't make any sense. Besides the
curious repetitive 0xb4, the result-code of '0xb' is completely invalid.
I still have to take a look at the core you gave me, but I do notice
the code isn't quite structured as I would have expected. In
particular, if there is a failed self test followed by a successful one,
then we don't log the error. So I've identified at least two problems
with the current code, but neither one explains exactly what is going on.

- Eric
Eric Schrock
2009-06-10 00:05:09 UTC
Post by Eric Schrock
As with the last one, this data doesn't make any sense. Besides the
curious repetitive 0xb4, the result-code of '0xb' is completely invalid.
I still have to take a look at the core you gave me, but I do notice
the code isn't quite structured as I would have expected. In
particular, if there is a failed self test followed by a successful one,
then we don't log the error. So I've identified at least two problems
with the current code, but neither one explains exactly what is going on.
I took a look at the core you provided, and these devices are just
> 8071a4c::print scsi_selftest_log_param_t
{
st_hdr = {
lph_param = 0x100
lph_lp = 0x1
lph_lbin = 0x1
lph_tmc = 0
lph_etc = 0
lph_tsd = 0
lph_ds = 0
lph_du = 0
lph_length = 0x10
}
st_results = 0xe
__reserved1 = 0
st_testcode = 0
st_number = 0
st_timestamp = 0xea
st_lba = 0xea00ea00ea000000
st_sensekey = 0
__reserved2 = 0
st_asc = 0
st_ascq = 0
st_vendor = 0
}
> 8071a4c+0t20::print scsi_selftest_log_param_t
{
st_hdr = {
lph_param = 0x200
lph_lp = 0x1
lph_lbin = 0x1
lph_tmc = 0
lph_etc = 0
lph_tsd = 0
lph_ds = 0
lph_du = 0
lph_length = 0x10
}
st_results = 0xe
__reserved1 = 0
st_testcode = 0
st_number = 0
st_timestamp = 0xea
st_lba = 0xea00ea00ea000000
st_sensekey = 0
__reserved2 = 0
st_asc = 0
st_ascq = 0
st_vendor = 0
}
> 8071a74::print scsi_selftest_log_param_t
{
st_hdr = {
lph_param = 0x300
lph_lp = 0x1
lph_lbin = 0x1
lph_tmc = 0
lph_etc = 0
lph_tsd = 0
lph_ds = 0
lph_du = 0
lph_length = 0x10
}
st_results = 0xe
__reserved1 = 0
st_testcode = 0
st_number = 0
st_timestamp = 0xea
st_lba = 0xea00ea00ea000000
st_sensekey = 0
__reserved2 = 0
st_asc = 0
st_ascq = 0
st_vendor = 0
}

Note that some of the fields are plausible (lph_lp, lph_lbin,
lph_length), but others are just bogus (0xea being the most obvious).
Of particular note is the fact that the 'lph_param' code is
monotonically increasing with each entry, which makes no sense whatsoever.
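
To walk entries like these by hand, it can help to decode the raw 20-byte parameter directly. The sketch below is not the libdiskstatus parser. It assumes the self-test log parameter layout from SPC-3 (4-byte header, then results, test number, timestamp, address of first failure, and sense data), with multi-byte fields in big-endian order. The sample bytes are also an assumption: they are reconstructed from the first dump above by treating the values mdb printed as little-endian reads of the buffer on an x86 host, and the function name is hypothetical.

```python
import struct

# One self-test log parameter: 2-byte parameter code, control byte,
# length byte, then 16 bytes of data. All multi-byte fields big-endian.
PARAM = struct.Struct(">HBBBBHQBBBB")  # 20 bytes in total


def parse_selftest_param(raw):
    """Decode one 20-byte self-test log parameter into a dict."""
    (code, control, length, byte4, number, timestamp, lba,
     sense, asc, ascq, vendor) = PARAM.unpack(raw)
    return {
        "param_code": code,
        "control": control,
        "length": length,
        "test_code": byte4 >> 5,     # upper 3 bits of byte 4
        "results": byte4 & 0x0F,     # lower 4 bits of byte 4
        "number": number,
        "timestamp": timestamp,
        "lba": lba,
        "sense_key": sense & 0x0F,
        "asc": asc,
        "ascq": ascq,
        "vendor": vendor,
    }


# Assumed raw bytes for the first entry (reconstructed, not captured).
SAMPLE = bytes.fromhex("000103100e00ea00000000ea00ea00ea00000000")

entry = parse_selftest_param(SAMPLE)
for name, value in entry.items():
    print("%-10s = %#x" % (name, value))
```

Decoded this way, the sample gives a timestamp of 0xea00 and an address of 0xea00ea00ea, the same values diskstat printed earlier in the thread.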

Since this is a SATA disk, one possibility is that there is a bug in the
SATL translation being done by the kernel. Do you have any way to put
one of these disks in an X4540 or J4500/J4400, where the translation is
done in hardware?

- Eric
Paul B. Henson
2009-06-10 02:13:43 UTC
Post by Eric Schrock
Since this is a SATA disk, one possibility is that there is a bug in the
SATL translation being done by the kernel. Do you have any way to put
one of these disks in an X4540 or J4500/J4400, where the translation is
done in hardware?
Unfortunately not. I asked the other guy with these SSD's in an x4500 if he
happened to have access to one of those, haven't heard back yet. I could
hook the drive up to a linux box and see what smartctl has to say about it.
If it's an underlying issue with the drive itself presumably other
operating systems/utilities would display weird results. I'll try that out
tomorrow.

Maybe this is why Sun isn't certifying their OEM'd X25-E in x4500's ;)...
Phillip Steinbachs
2009-06-10 16:32:53 UTC
Eric,

I'm the other guy Paul has been mentioning. I added an Intel 32GB X25-E
to a J4400 this morning. It's attached to an X4240 via an LSI 1068E
based HBA. The system is also running NexentaStor (again, based on b104
ON bits). I added the SSD as a slog device and have been running iozone
to generate activity, and FMA hasn't flagged the drive yet.

# ./diskstat /dev/rdsk/c3t48d0
nvlist version: 0
protocol = scsi
status = (embedded nvlist)
nvlist version: 0
command-length = 6
modepages = (embedded nvlist)
nvlist version: 0
informational-exceptions = (embedded nvlist)
nvlist version: 0
dexcpt = 0
logerr = 0
mrie = 0x6
test = 0
ewasc = 0
perf = 0
ebf = 0
interval-timer = 0x8000000
report-count = 0x0
changed = 0
(end informational-exceptions)

(end modepages)

logpages = (embedded nvlist)
nvlist version: 0
informational-exceptions = (embedded nvlist)
nvlist version: 0
length = 0xc
general = 1
(end informational-exceptions)

self-test = (embedded nvlist)
nvlist version: 0
(end self-test)

(end logpages)

gltsd = 1
(end status)

predictive-failure = (embedded nvlist)
nvlist version: 0
additional-sense-code = 0x0
additional-sense-code-qualifier = 0x0
(end predictive-failure)

faults = (embedded nvlist)
nvlist version: 0
predictive-failure = 0
(end faults)



This system also has an Intel 32GB X25-E installed in one of the
X4240's SAS slots, connected to the internal Adaptec RAID controller. No
FMA complaints there either.

# ./diskstat /dev/rdsk/c2t5d0
nvlist version: 0
protocol = scsi
status = (embedded nvlist)
nvlist version: 0
command-length = 6
modepages = (embedded nvlist)
nvlist version: 0
(end modepages)

(end status)

faults = (embedded nvlist)
nvlist version: 0
(end faults)
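
For anyone comparing several of these diskstat dumps, a small parser can help. The following is only a sketch: it assumes the flattened layout shown above ("name = (embedded nvlist)" opens a block, "(end name)" closes it, everything else is "key = value"), and the function names are made up for illustration; they are not part of diskstat or libnvpair.

```python
def parse_nvlist_dump(text):
    """Parse a flattened diskstat-style nvlist dump into nested dicts.

    Values are kept as strings exactly as printed (e.g. "0x6").
    """
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("nvlist version:"):
            continue
        if line.startswith("(end ") and line.endswith(")"):
            if len(stack) > 1:
                stack.pop()
            continue
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # not a "key = value" line; ignore it
        if value == "(embedded nvlist)":
            child = {}
            stack[-1][key] = child
            stack.append(child)
        else:
            stack[-1][key] = value
    return root


def flagged_faults(parsed):
    """Return the names in the 'faults' block whose value is non-zero."""
    faults = parsed.get("faults", {})
    return [name for name, value in faults.items() if int(value, 0) != 0]


SAMPLE = """\
nvlist version: 0
protocol = scsi
status = (embedded nvlist)
nvlist version: 0
command-length = 6
logpages = (embedded nvlist)
nvlist version: 0
self-test = (embedded nvlist)
nvlist version: 0
(end self-test)
(end logpages)
gltsd = 1
(end status)
faults = (embedded nvlist)
nvlist version: 0
predictive-failure = 0
(end faults)
"""

parsed = parse_nvlist_dump(SAMPLE)
print(parsed["faults"], flagged_faults(parsed))
```

Keeping the values as strings avoids guessing at types; `int(value, 0)` is applied only where a number is actually needed.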


I'm in a position to run more tests if needed, as none of this hardware is in production yet.

-phillip
Eric Schrock
2009-06-10 16:45:52 UTC
Permalink
Post by Phillip Steinbachs
Eric,
I'm the other guy Paul has been mentioning. I added an Intel 32GB X25-E
to a J4400 this morning. It's attached to an X4240 via an LSI 1068E
based HBA. The system is also running NexentaStor (again, based on b104
ON bits). I added the SSD as a slog device and have been running iozone
to generate activity, and FMA hasn't flagged the drive yet.
That's very useful information. Is this the same drive that produces
bad data under diskstat when in an X4500, or is it just another drive of
the same type? If it's the former, then it's a clear indication that
there is a bug in the kernel SATL framework. I can try to come up with
a utility that constructs a SATA pass-through command to get the
underlying SATA data (this will only work in the J4400).

Thanks,

- Eric
--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Phillip Steinbachs
2009-06-10 19:52:38 UTC
Permalink
Post by Eric Schrock
That's very useful information. Is this the same drive that produces bad
data under diskstat when in an X4500, or is it just another drive of the same
type? If it's the former, then it's a clear indication that there is a bug
in the kernel SATL framework. I can try to come up with a utility that
constructs a SATA pass-through command to get the underlying SATA data (this
will only work in the J4400).
It was another drive of the same type. But I just swapped the "bad
data" disk from the X4500 into the J4400 and diskstat shows the same.
fmdump says the log is empty.

# ./diskstat /dev/rdsk/c3t49d0
nvlist version: 0
protocol = scsi
status = (embedded nvlist)
nvlist version: 0
command-length = 6
modepages = (embedded nvlist)
nvlist version: 0
informational-exceptions = (embedded nvlist)
nvlist version: 0
dexcpt = 0
logerr = 0
mrie = 0x6
test = 0
ewasc = 0
perf = 0
ebf = 0
interval-timer = 0x8000000
report-count = 0x0
changed = 0
(end informational-exceptions)

(end modepages)

logpages = (embedded nvlist)
nvlist version: 0
informational-exceptions = (embedded nvlist)
nvlist version: 0
length = 0xc
general = 1
(end informational-exceptions)

self-test = (embedded nvlist)
nvlist version: 0
(end self-test)

(end logpages)

gltsd = 1
(end status)

predictive-failure = (embedded nvlist)
nvlist version: 0
additional-sense-code = 0x0
additional-sense-code-qualifier = 0x0
(end predictive-failure)

faults = (embedded nvlist)
nvlist version: 0
predictive-failure = 0
(end faults)


I find it somewhat interesting that I had two of these in the X4500 as
part of a mirrored log but it only flagged one of them as bad. In case it
matters, I still have the other one in there and it shows:

...
self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0xb
timestamp = 0xb400
segment = 0x0
address = 0xb400b400b4
(end self-test-failure)

faults = (embedded nvlist)
nvlist version: 0
predictive-failure = 0
self-test-failure = 0
(end faults)

Presumably FMA will get around to flagging this one bad as well. I'll be
happy to run something on the J4400 system.
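
As a reading aid for the result-code values in these dumps, here is a hedged sketch of a decoder. It assumes that diskstat's "result-code" is the 4-bit SELF-TEST RESULTS field of the SCSI self-test results log page, which the thread does not confirm, and the descriptions are paraphrased from the SPC definition of that field as best I recall it; check the spec before relying on them. The function name is hypothetical.

```python
# Paraphrased meanings of the 4-bit SCSI self-test result field.
# Assumption: diskstat's "result-code" is this field.
SCSI_SELF_TEST_RESULTS = {
    0x0: "completed without error",
    0x1: "background test aborted via SEND DIAGNOSTIC",
    0x2: "aborted by some other method",
    0x3: "unknown error, test could not complete",
    0x4: "completed with a failure in an unknown test segment",
    0x5: "first test segment failed",
    0x6: "second test segment failed",
    0x7: "another test segment failed",
    0xF: "self-test in progress",
}


def describe_result_code(code):
    """Return a short description for a 4-bit self-test result code."""
    if not 0 <= code <= 0xF:
        raise ValueError("result code must fit in 4 bits")
    return SCSI_SELF_TEST_RESULTS.get(code, "reserved")


for code in (0x4, 0xB):
    print(hex(code), describe_result_code(code))
```

Under that assumption, 0xb is not a defined result at all, which would fit the picture of the field being filled from bad data rather than from a real test outcome.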

-phillip
Paul B. Henson
2009-06-11 00:41:39 UTC
Permalink
Post by Eric Schrock
bad data under diskstat when in a X4500, or is it just another drive of
the same type? If it's the former, then it's a clear indication that
there is a bug in the kernel SATL framework.
Okay, I pulled my SSD out of the X4500 and hooked it up to a Linux system
to see what it had to say about the SMART info.

All of the detailed output is at the bottom for anyone interested.

The initial query to the SSD indicated that the self test log was empty
and:

Self-test execution status: ( 32) The self-test routine was interrupted
by the host with a hard or soft reset.
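
The number in parentheses is the whole ATA self-test execution status byte printed in decimal: the upper four bits are the status and the lower four bits the remaining work in tens of percent. A minimal sketch of that split follows; the status strings are paraphrased from the ATA SMART definitions and the function name is made up for illustration.

```python
# Paraphrased ATA self-test execution status values (upper nibble).
ATA_SELF_TEST_STATUS = {
    0: "completed without error, or no self-test has ever been run",
    1: "aborted by the host",
    2: "interrupted by the host with a hard or soft reset",
    3: "fatal or unknown error, test could not complete",
    4: "completed with a failure in an unknown test element",
    5: "completed with an electrical element failure",
    6: "completed with a servo/seek element failure",
    7: "completed with a read element failure",
    8: "completed, handling damage suspected",
    15: "self-test in progress",
}


def decode_exec_status(byte):
    """Split the status byte into (status, percent_remaining, text)."""
    status = (byte >> 4) & 0x0F
    percent_remaining = (byte & 0x0F) * 10
    text = ATA_SELF_TEST_STATUS.get(status, "reserved or vendor specific")
    return status, percent_remaining, text


print(decode_exec_status(32))
```

So the value 32 (0x20) reported here is status 2 with nothing remaining, matching the text smartctl printed.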

I then initiated a couple of self tests, after which the data was:

Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%        68               -
# 2  Short offline     Completed without error  00%        68               -
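
To check a longer self-test log without reading it by eye, the rows can be parsed once each wrapped row is joined back onto a single line. This is a rough sketch, not part of smartmontools: the regular expression only knows the four test descriptions that appear in this thread, and the function name is hypothetical.

```python
import re

# One row of the smartctl self-test log, after rejoining wrapped lines.
# Assumption: only the test types seen in this thread are matched.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>(?:Short|Extended|Selective|Conveyance) offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def parse_selftest_rows(text):
    """Return a list of dicts, one per recognised self-test log row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header or unrelated line
        lba = match.group("lba")
        rows.append({
            "num": int(match.group("num")),
            "test": match.group("test"),
            "status": match.group("status"),
            "remaining_percent": int(match.group("remaining")),
            "lifetime_hours": int(match.group("hours")),
            "lba_of_first_error": None if lba == "-" else lba,
        })
    return rows


SAMPLE = """\
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 68 -
# 2 Short offline Completed without error 00% 68 -
"""

rows = parse_selftest_rows(SAMPLE)
print(all(r["status"] == "Completed without error" for r in rows))
```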

I reinstalled the disk back into the x4500 hoping that now that there were
valid self test records it might be happy, but unfortunately:

self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0x4
timestamp = 0x48a5
segment = 0x0
address = 0xa548a548a548
(end self-test-failure)

Now it looks like a bunch of repetitions of '0x48a5'; different, but still
wrong.
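
The repetition can be checked mechanically. The sketch below finds the shortest repeating unit in the hex digits of the address field and compares it with the timestamp field, both as printed and byte-swapped. The helper names are made up for illustration, and the check says nothing about where the bad data comes from.

```python
def smallest_period(digits):
    """Length of the shortest unit that, repeated, reproduces the string.

    A trailing partial repetition is allowed.
    """
    n = len(digits)
    for period in range(1, n + 1):
        if all(digits[i] == digits[i + period] for i in range(n - period)):
            return period
    return n


def swap16(value):
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | (value >> 8)


def compare_fields(timestamp, address):
    """Relate the repeating unit of 'address' to the 'timestamp' field."""
    digits = format(address, "x")
    unit = int(digits[:smallest_period(digits)], 16)
    return {
        "unit": unit,
        "same": unit == timestamp,
        "swapped": unit == swap16(timestamp),
    }


# Values from the two X4500 dumps in this thread.
print(compare_fields(0x48A5, 0xA548A548A548))
print(compare_fields(0xB400, 0xB400B400B4))
```

For the first pair the repeating unit is the byte-swapped timestamp; for the second it equals the timestamp as printed. Either way the two fields appear to carry the same two bytes over and over, which is not what a real timestamp and LBA would look like.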

According to smartctl, the Hitachi 1TB drive that had been in the system
for almost a year has never run a self test. Obviously fma looks for self
test logs, but does it ever actually initiate a self test? It seems kind of
a waste to review the logs for tests that are never made.

In any case, this definitely looks like a Solaris specific problem, not a
general problem with the drive. Please let me know if there are any other
tests you want me to try with the drive. I have a Sun service contract; if
it is a driver bug, would there be any benefit in opening a case about it?

Is there any way to disable the self-check test in fma? As it appears self
checks are not being initiated, there seems no benefit in checking the logs
for them on the stock sun drives, and that would keep my SSD from being
erroneously faulted until the problem is resolved.

Thanks much...




Initial smart info from SSD:

--------------------------------------------------------------------------
=== START OF INFORMATION SECTION ===
Device Model: SSDSA2SH032G1GN INTEL
Serial Number: CVEM902600J6032HGN
Firmware Version: 045C8626
User Capacity: 32,000,000,000 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Wed Jun 10 06:37:16 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection:
Disabled.
Self-test execution status: ( 32) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: ( 1) seconds.
Offline data collection
capabilities: (0x75) SMART execute Offline immediate.
No Auto Offline data collection
support.
Abort Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 5) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0000  100   000   000    Old_age  Offline In_the_past 0
  4 Start_Stop_Count        0x0000  100   000   000    Old_age  Offline In_the_past 0
  5 Reallocated_Sector_Ct   0x0002  100   100   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0002  100   100   000    Old_age  Always  -           67
 12 Power_Cycle_Count       0x0002  100   100   000    Old_age  Always  -           140
192 Power-Off_Retract_Count 0x0002  100   100   000    Old_age  Always  -           21
232 Unknown_Attribute       0x0003  100   100   010    Pre-fail Always  -           0
233 Unknown_Attribute       0x0002  099   099   000    Old_age  Always  -           0
225 Load_Cycle_Count        0x0000  200   200   000    Old_age  Offline -           50147

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

--------------------------------------------------------------------------

smart info after manually running self tests on SSD:

--------------------------------------------------------------------------

=== START OF INFORMATION SECTION ===
Device Model: SSDSA2SH032G1GN INTEL
Serial Number: CVEM902600J6032HGN
Firmware Version: 045C8626
User Capacity: 32,000,000,000 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Wed Jun 10 07:05:30 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection:
Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 1) seconds.
Offline data collection
capabilities: (0x75) SMART execute Offline immediate.
No Auto Offline data collection
support.
Abort Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 5) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0000  100   000   000    Old_age  Offline In_the_past 0
  4 Start_Stop_Count        0x0000  100   000   000    Old_age  Offline In_the_past 0
  5 Reallocated_Sector_Ct   0x0002  100   100   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0002  100   100   000    Old_age  Always  -           68
 12 Power_Cycle_Count       0x0002  100   100   000    Old_age  Always  -           140
192 Power-Off_Retract_Count 0x0002  100   100   000    Old_age  Always  -           21
232 Unknown_Attribute       0x0003  100   100   010    Pre-fail Always  -           0
233 Unknown_Attribute       0x0002  099   099   000    Old_age  Always  -           0
225 Load_Cycle_Count        0x0000  200   200   000    Old_age  Offline -           50147

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%        68               -
# 2  Short offline     Completed without error  00%        68               -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

-------------------------------------------------

x4500 SSD disk status after manual self tests:

-------------------------------------------------
nvlist version: 0
protocol = scsi
status = (embedded nvlist)
nvlist version: 0
command-length = 6
modepages = (embedded nvlist)
nvlist version: 0
informational-exceptions = (embedded nvlist)
nvlist version: 0
dexcpt = 0
logerr = 0
mrie = 0x6
test = 0
ewasc = 0
perf = 0
ebf = 0
interval-timer = 0x0
report-count = 0x0
changed = 0
(end informational-exceptions)

(end modepages)

logpages = (embedded nvlist)
nvlist version: 0
informational-exceptions = (embedded nvlist)
nvlist version: 0
length = 0x8
general = 1
(end informational-exceptions)

self-test = (embedded nvlist)
nvlist version: 0
length = 0x190
(end self-test)

(end logpages)

(end status)

predictive-failure = (embedded nvlist)
nvlist version: 0
additional-sense-code = 0x0
additional-sense-code-qualifier = 0x0
(end predictive-failure)

self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0x4
timestamp = 0x48a5
segment = 0x0
address = 0xa548a548a548
(end self-test-failure)

faults = (embedded nvlist)
nvlist version: 0
predictive-failure = 0
self-test-failure = 1
(end faults)
-------------------------------------------------

smart info from SSD after running selective test and filling up test log:

--------------------------------------------------
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: SSDSA2SH032G1GN INTEL
Serial Number: CVEM902600J6032HGN
Firmware Version: 045C8626
User Capacity: 32,000,000,000 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Wed Jun 10 09:38:56 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection:
Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 1) seconds.
Offline data collection
capabilities: (0x75) SMART execute Offline immediate.
No Auto Offline data collection
support.
Abort Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 5) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0000  100   000   000    Old_age  Offline In_the_past 0
  4 Start_Stop_Count        0x0000  100   000   000    Old_age  Offline In_the_past 0
  5 Reallocated_Sector_Ct   0x0002  100   100   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0002  100   100   000    Old_age  Always  -           68
 12 Power_Cycle_Count       0x0002  100   100   000    Old_age  Always  -           143
192 Power-Off_Retract_Count 0x0002  100   100   000    Old_age  Always  -           22
232 Unknown_Attribute       0x0003  100   100   010    Pre-fail Always  -           0
233 Unknown_Attribute       0x0002  099   099   000    Old_age  Always  -           0
225 Load_Cycle_Count        0x0000  200   200   000    Old_age  Offline -           50147

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error  00%        68               -
# 2  Short offline       Completed without error  00%        68               -
# 3  Short offline       Completed without error  00%        68               -
# 4  Short offline       Completed without error  00%        68               -
# 5  Short offline       Completed without error  00%        68               -
# 6  Short offline       Completed without error  00%        68               -
# 7  Short offline       Completed without error  00%        68               -
# 8  Short offline       Completed without error  00%        68               -
# 9  Short offline       Completed without error  00%        68               -
#10  Short offline       Completed without error  00%        68               -
#11  Short offline       Completed without error  00%        68               -
#12  Short offline       Completed without error  00%        68               -
#13  Short offline       Completed without error  00%        68               -
#14  Short offline       Completed without error  00%        68               -
#15  Short offline       Completed without error  00%        68               -
#16  Short offline       Completed without error  00%        68               -
#17  Short offline       Completed without error  00%        68               -
#18  Selective offline   Completed without error  00%        68               -
#19  Selective offline   Completed without error  00%        68               -
#20  Selective offline   Completed without error  00%        68               -
#21  Conveyance offline  Completed without error  00%        68               -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 20 30 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
--------------------------------------------------

smart info from Sun 1TB HD:

--------------------------------------------------

=== START OF INFORMATION SECTION ===
Device Model: HITACHI HUA7210SASUN1.0T 0814G8BRPF
Serial Number: GTF002PAJ8BRPF
Firmware Version: GKAOA90A
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Wed Jun 10 09:39:56 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection:
Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (15354) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off
support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b  100   100   016    Pre-fail Always  -           0
  2 Throughput_Performance  0x0005  130   130   054    Pre-fail Offline -           150
  3 Spin_Up_Time            0x0007  105   105   024    Pre-fail Always  -           665 (Average 663)
  4 Start_Stop_Count        0x0012  100   100   000    Old_age  Always  -           20
  5 Reallocated_Sector_Ct   0x0033  100   100   005    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000b  100   100   067    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0005  132   132   020    Pre-fail Offline -           33
  9 Power_On_Hours          0x0012  100   100   000    Old_age  Always  -           4161
 10 Spin_Retry_Count        0x0013  100   100   060    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age  Always  -           20
192 Power-Off_Retract_Count 0x0032  100   100   000    Old_age  Always  -           189
193 Load_Cycle_Count        0x0012  100   100   000    Old_age  Always  -           189
194 Temperature_Celsius     0x0002  157   157   000    Old_age  Always  -           38 (Lifetime Min/Max 22/40)
196 Reallocated_Event_Count 0x0032  100   100   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0022  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0008  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x000a  200   200   000    Old_age  Always  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
----------------------------------------------------
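
The "Offline data collection capabilities" byte differs between the two drives (0x75 for the SSD, 0x5b for the Hitachi), and the text lines under it are just that byte decoded bit by bit. A small sketch of the decoding follows. The bit assignments are inferred from the two dumps above and from the usual ATA SMART layout, so treat the table as an assumption; the names are made up for illustration.

```python
# (mask, text when the bit is set, text when it is clear)
# Assumption: bit meanings as implied by the smartctl output in this thread.
CAPABILITY_BITS = [
    (0x01, "SMART execute Offline immediate.",
           "No SMART execute Offline immediate."),
    (0x02, "Auto Offline data collection on/off support.",
           "No Auto Offline data collection support."),
    (0x04, "Abort Offline collection upon new command.",
           "Suspend Offline collection upon new command."),
    (0x08, "Offline surface scan supported.",
           "No Offline surface scan supported."),
    (0x10, "Self-test supported.",
           "No Self-test supported."),
    (0x20, "Conveyance Self-test supported.",
           "No Conveyance Self-test supported."),
    (0x40, "Selective Self-test supported.",
           "No Selective Self-test supported."),
]


def describe_offline_capabilities(value):
    """Return one description per capability bit, lowest bit first."""
    return [
        when_set if value & mask else when_clear
        for mask, when_set, when_clear in CAPABILITY_BITS
    ]


for name, value in (("X25-E", 0x75), ("Hitachi 1TB", 0x5B)):
    print(name, hex(value))
    for text in describe_offline_capabilities(value):
        print("   ", text)
```

Both drives advertise self-test support (bit 0x10), so the difference in behaviour under fma is not explained by this byte.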
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Dale Ghent
2009-06-11 01:41:20 UTC
Permalink
FWIW, smartctl does compile and work on Solaris 10, if that might be of
any help.
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
Offline data collection status: (0x00) Offline data collection activity
was never started.
Disabled.
Self-test execution status: ( 0) The previous self-test routine
completed
without error or no self-test has
ever
been run.
Total time to complete Offline
data collection: ( 1) seconds.
Offline data collection
capabilities: (0x75) SMART execute Offline
immediate.
No Auto Offline data collection
support.
Abort Offline collection upon new
command.
No Offline surface scan
supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging
supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 5) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.
SMART Attributes Data Structure revision number: 5
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0000 100   000   000    Old_age  Offline In_the_past 0
  4 Start_Stop_Count        0x0000 100   000   000    Old_age  Offline In_the_past 0
  5 Reallocated_Sector_Ct   0x0002 100   100   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0002 100   100   000    Old_age  Always  -           68
 12 Power_Cycle_Count       0x0002 100   100   000    Old_age  Always  -           143
192 Power-Off_Retract_Count 0x0002 100   100   000    Old_age  Always  -           22
232 Unknown_Attribute       0x0003 100   100   010    Pre-fail Always  -           0
233 Unknown_Attribute       0x0002 099   099   000    Old_age  Always  -           0
225 Load_Cycle_Count        0x0000 200   200   000    Old_age  Offline -           50147
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error  00%        68               -
# 2  Short offline       Completed without error  00%        68               -
# 3  Short offline       Completed without error  00%        68               -
# 4  Short offline       Completed without error  00%        68               -
# 5  Short offline       Completed without error  00%        68               -
# 6  Short offline       Completed without error  00%        68               -
# 7  Short offline       Completed without error  00%        68               -
# 8  Short offline       Completed without error  00%        68               -
# 9  Short offline       Completed without error  00%        68               -
#10  Short offline       Completed without error  00%        68               -
#11  Short offline       Completed without error  00%        68               -
#12  Short offline       Completed without error  00%        68               -
#13  Short offline       Completed without error  00%        68               -
#14  Short offline       Completed without error  00%        68               -
#15  Short offline       Completed without error  00%        68               -
#16  Short offline       Completed without error  00%        68               -
#17  Short offline       Completed without error  00%        68               -
#18  Selective offline   Completed without error  00%        68               -
#19  Selective offline   Completed without error  00%        68               -
#20  Selective offline   Completed without error  00%        68               -
#21  Conveyance offline  Completed without error  00%        68               -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 20 30 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
--------------------------------------------------
--------------------------------------------------
=== START OF INFORMATION SECTION ===
Device Model: HITACHI HUA7210SASUN1.0T 0814G8BRPF
Serial Number: GTF002PAJ8BRPF
Firmware Version: GKAOA90A
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Wed Jun 10 09:39:56 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Offline data collection status: (0x80) Offline data collection activity
was never started.
Enabled.
Self-test execution status: ( 0) The previous self-test routine
completed
without error or no self-test has
ever
been run.
Total time to complete Offline
data collection: (15354) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline
immediate.
Auto Offline data collection on/off
support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test
supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging
supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 100   100   016    Pre-fail Always  -           0
  2 Throughput_Performance  0x0005 130   130   054    Pre-fail Offline -           150
  3 Spin_Up_Time            0x0007 105   105   024    Pre-fail Always  -           665 (Average 663)
  4 Start_Stop_Count        0x0012 100   100   000    Old_age  Always  -           20
  5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000b 100   100   067    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0005 132   132   020    Pre-fail Offline -           33
  9 Power_On_Hours          0x0012 100   100   000    Old_age  Always  -           4161
 10 Spin_Retry_Count        0x0013 100   100   060    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           20
192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           189
193 Load_Cycle_Count        0x0012 100   100   000    Old_age  Always  -           189
194 Temperature_Celsius     0x0002 157   157   000    Old_age  Always  -           38 (Lifetime Min/Max 22/40)
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0022 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0008 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x000a 200   200   000    Old_age  Always  -           0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
----------------------------------------------------
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
California State Polytechnic University | Pomona CA 91768
_______________________________________________
fm-discuss mailing list
Tarik Soydan
2009-06-11 02:42:31 UTC
Post by Dale Ghent
FWIW, smartctl does compile and work on solaris 10, if that might be
of any help.
Post by Paul B. Henson
Post by Eric Schrock
bad data under diskstat when in a X4500, or is it just another drive of
the same type? If it's the former, then it's a clear indication that
there is a bug in the kernel SATL framework.
Okay, I pulled my SSD out of the x4500 and hooked it up to a linux system
to see what it had to say about the smart info.
All of the detailed output is at the bottom for anyone interested.
The initial query to the SSD indicated that the self test log was empty
Self-test execution status: (32) The self-test routine was
interrupted
by the host with a hard or soft
reset.
Self-test execution status: ( 0) The previous self-test routine
completed
without error or no self-test has
ever
been run.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error  00%        68               -
# 2  Short offline       Completed without error  00%        68               -
I reinstalled the disk back into the x4500 hoping that now that there were
self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0x4
timestamp = 0x48a5
segment = 0x0
address = 0xa548a548a548
(end self-test-failure)
Now it looks like a bunch of repetitions of '0x48a5'; different, but still
wrong.
According to smartctl, the Hitachi 1TB drive that had been in the system
for almost a year has never run a self test. Obviously fma looks for self
test logs, but does it ever actually initiate a self test?
No.
Post by Dale Ghent
Post by Paul B. Henson
It seems kind of
a waste to review the logs for tests that are never made.
From portfolio 2006/012.sfx4500-disk:

...
Finally, the results of the disk self-tests are analyzed (no
self-tests are triggered, since self-tests degrade performance
considerably -- we just analyze the previously-completed self-test
logs). If any of the self-test results indicate a self-test
failure, the DE generates an ereport.io.sata.disk.self-test-failure
error. Since SATA self-test errors are fatal, they are associated
with fault.io.disk.self-test-failure faults, with a critical severity.
...
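To make the quoted logic concrete, here is a rough Python sketch of that kind of log analysis. It is not the actual disk-transport code. The result-code table is an assumption based on the usual meaning of the SCSI self-test results log page, with wording borrowed from smartctl output, and the choice of which codes count as failures is likewise an assumption. SelfTestEntry, FAILURE_CODES and first_failure are hypothetical names; the sample values are the ones from the fmd payload in this thread.

```python
from dataclasses import dataclass

# Assumed meaning of the 4-bit self-test result code (not taken from fmd source).
RESULT_TEXT = {
    0x0: "Completed without error",
    0x1: "Aborted by host command",
    0x2: "Aborted by reset",
    0x3: "Unknown error, incomplete",
    0x4: "Completed, segment failed",
    0x5: "Failed in first segment",
    0x6: "Failed in second segment",
    0x7: "Failed in another segment",
    0xF: "Self-test in progress",
}

# Assumption: these are the codes a diagnosis engine would treat as a failure.
FAILURE_CODES = frozenset({0x3, 0x4, 0x5, 0x6, 0x7})


@dataclass(frozen=True)
class SelfTestEntry:
    result_code: int
    timestamp: int
    segment: int
    address: int


def first_failure(entries):
    """Return the first logged entry whose result code denotes a failure, else None.

    Only previously completed log entries are inspected; no test is started.
    """
    for entry in entries:
        if entry.result_code in FAILURE_CODES:
            return entry
    return None


# Values reported for the X25-E in the x4500, and an empty log like the Hitachi's.
ssd_log = [SelfTestEntry(0x4, 0x48A5, 0x0, 0xA548A548A548)]
hitachi_log = []

for name, log in (("ssd", ssd_log), ("hitachi", hitachi_log)):
    hit = first_failure(log)
    if hit is None:
        print(name, "no self-test failure logged")
    else:
        print(name, "fault:", RESULT_TEXT.get(hit.result_code, "unknown result code"))
```

Under these assumptions a single bogus entry with result code 0x4 is enough to raise the fault, while a drive with an empty log is never flagged.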
Post by Dale Ghent
Post by Paul B. Henson
In any case, this definitely looks like a Solaris specific problem, not a
general problem with the drive. Please let me know if there are any other
tests you want me to try with the drive. I have a sun service contract, if
it is a driver bug, would there be any benefit in opening a case about it?
Is there any way to disable the self-check test in fma?
You can unload the disk transport module.

# fmadm unload disk-transport

Or more permanently, you can move it so it won't get loaded the next time
FMD starts.

# mv /usr/lib/fm/fmd/plugins/disk-transport.so <some-where-else>

Note that this will disable the predictive-failure and over-temp checks
as well.

-tarik
Post by Dale Ghent
Post by Paul B. Henson
As it appears self
checks are not being initiated, there seems no benefit in checking the logs
for them on the stock sun drives, and that would keep my SSD from being
erroneously faulted until the problem is resolved.
Thanks much...
Paul B. Henson
2009-06-11 03:45:09 UTC
FWIW, smartctl does compile and work on solaris 10, if that might be of
any help.
Well, smartctl sees the same erroneous output that fma is getting:

Serial number: CVEM902600J6032HGN
Device type: disk
Local Time is: Wed Jun 10 20:37:48 2009 PDT
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature: <not available>

Error Counter logging not supported

SMART Self-test log
Num  Test         Status                     segment  LifeTime  LBA_first_err    [SK ASC ASQ]
     Description                             number   (hours)
# 1  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
# 2  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
# 3  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
# 4  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
# 5  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
# 6  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
# 7  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
# 8  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
# 9  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#10  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#11  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#12  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#13  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#14  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#15  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#16  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#17  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#18  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#19  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
#20  Default      Completed, segment failed  -        18597     181731429229896  [0x4 0x40 0x84]
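
A quick cross-check of the numbers above, as a minimal Python sketch. The decimal LifeTime and LBA_first_err values that smartctl prints on Solaris look like the same values as the hex timestamp and address fields in the fmd self-test-failure payload quoted elsewhere in this thread, and the address is the byte pair a5 48 repeated three times. The constant and function names are made up for illustration; only the four numbers come from the thread.

```python
# Values copied from the two reports in this thread.
NVLIST_TIMESTAMP = 0x48A5          # fmd payload: timestamp
NVLIST_ADDRESS = 0xA548A548A548    # fmd payload: address
SMARTCTL_LIFETIME = 18597          # smartctl on Solaris: LifeTime (hours)
SMARTCTL_LBA = 181731429229896     # smartctl on Solaris: LBA_first_err


def repeats_pattern(value: int, width_bytes: int, pattern: bytes) -> bool:
    """Return True if value, written as big-endian bytes, is pattern repeated."""
    raw = value.to_bytes(width_bytes, "big")
    reps, rem = divmod(width_bytes, len(pattern))
    return rem == 0 and raw == pattern * reps


same_timestamp = SMARTCTL_LIFETIME == NVLIST_TIMESTAMP
same_address = SMARTCTL_LBA == NVLIST_ADDRESS
address_is_fill = repeats_pattern(NVLIST_ADDRESS, 6, bytes([0xA5, 0x48]))

print(hex(SMARTCTL_LIFETIME), hex(SMARTCTL_LBA))
print(same_timestamp, same_address, address_is_fill)
```

Both tools are therefore reporting the same bad log entry, which fits the idea that the data is already wrong by the time it leaves the kernel rather than being misparsed by fmd.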


It seems happy with the sun 1TB drive though:

Serial number: GTF002PAJ893DF
Device type: disk
Local Time is: Wed Jun 10 20:42:35 2009 PDT
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature: 28 C

Error Counter logging not supported
No self-tests have been logged


Seems like there's a lot less info available under Solaris than the linux
version.
Paul B. Henson
2009-06-11 03:12:04 UTC
Is this document available publicly anywhere? All I was able to track down
with Google was the presumably internal only link:

http://fma.eng/documents/engineering/portfolios/2006/012.sfx4500-disk
Post by Tarik Soydan
Finally, the results of the disk self-tests are analyzed (no
self-tests are triggered, since self-tests degrade performance
considerably -- we just analyze the previously-completed self-test
logs).
Is there some assumption that somebody would initiate self tests? I don't
recall seeing any documentation about scheduling self tests. Does Solaris
include any utility allowing interaction with the smart interface on the
disk?
Post by Tarik Soydan
You can unload the disk transport module.
# fmadm unload disk-transport
Or more permanently, you can move it so it won't get loaded the next time
the system boots or FMD restarts.
# mv /usr/lib/fm/fmd/plugins/disk-transport.so <some-where-else>
Thanks for the information...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Tarik Soydan
2009-06-11 18:35:48 UTC
Post by Paul B. Henson
Is this document available publicly anywhere?
Not that I know of.
Post by Paul B. Henson
All I was able to track down
http://fma.eng/documents/engineering/portfolios/2006/012.sfx4500-disk
Post by Tarik Soydan
Finally, the results of the disk self-tests are analyzed (no
self-tests are triggered, since self-tests degrade performance
considerably -- we just analyze the previously-completed self-test
logs).
Is there some assumption that somebody would initiate self tests?
I think the assumption is simply that someone may have initiated self
tests in the past, and the results are still valid.
Post by Paul B. Henson
I don't
recall seeing any documentation about scheduling self tests. Does Solaris
include any utility allowing interaction with the smart interface on the
disk?
I don't know what disk utilities are available under Solaris.
Hopefully someone (more disk savvy than me) will answer this question.

-tarik
Post by Paul B. Henson
Post by Tarik Soydan
You can unload the disk transport module.
# fmadm unload disk-transport
Or more permanently, you can move it so it won't get loaded the next time
the system boots or FMD restarts.
# mv /usr/lib/fm/fmd/plugins/disk-transport.so <some-where-else>
Thanks for the information...
Paul B. Henson
2009-06-12 02:19:46 UTC
Post by Paul B. Henson
Is there some assumption that somebody would initiate self tests?
I think the assumption is simply that someone may have initiated self
tests in the past, and the results are still valid.
[...]
Post by Paul B. Henson
I don't know what disk utilities are available under Solaris.
Hopefully someone (more disk savvy than me) will answer this question.
I don't think it is very likely anyone has initiated self tests on
x4500/x4540 systems. I tried smartctl, but evidently initiating self tests
is not supported under Solaris.

The smartmontools package under linux allows you to schedule self tests.
While extended self tests can impact performance, short self tests don't
really. It would be nice for a similar utility under Solaris to allow
scheduling of different self test types, perhaps a short one daily and an
extended one weekly or monthly during a low load period. That would
definitely make the self test log analysis currently performed by fma a
lot more useful. Perhaps an RFE is in order :)...
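
The policy suggested above (a short test daily, an extended one weekly in a quiet period) could be sketched roughly as follows. This is only an illustration of the scheduling decision, not an existing Solaris utility. The function name, the default hours and the choice of Sunday are all assumptions; on Linux the returned kind would correspond to running `smartctl -t short` or `smartctl -t long` against the device.

```python
from datetime import datetime


def due_self_test(now, short_hour=2, long_weekday=6, long_hour=3):
    """Return "long", "short" or None for the hour containing `now`.

    Weekdays follow datetime.weekday(): Monday is 0, Sunday is 6.
    All defaults are illustrative assumptions, not recommendations.
    """
    if now.weekday() == long_weekday and now.hour == long_hour:
        return "long"    # weekly extended test in a low-load window
    if now.hour == short_hour:
        return "short"   # daily short test
    return None          # nothing due this hour


# June 10, 2009 was a Wednesday and June 14, 2009 a Sunday.
for when in (datetime(2009, 6, 10, 2), datetime(2009, 6, 14, 3), datetime(2009, 6, 10, 12)):
    print(when.isoformat(), due_self_test(when))
```

A cron-style job calling something like this once an hour would keep the self-test log populated, which is what the fma log analysis needs in order to be useful.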
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Eric Schrock
2009-06-11 17:18:12 UTC
Post by Paul B. Henson
Post by Eric Schrock
bad data under diskstat when in a X4500, or is it just another drive of
the same type? If it's the former, then it's a clear indication that
there is a bug in the kernel SATL framework.
Okay, I pulled my SSD out of the x4500 and hooked it up to a linux system
to see what it had to say about the smart info.
All of the detailed output is at the bottom for anyone interested.
The initial query to the SSD indicated that the self test log was empty
Self-test execution status: (32) The self-test routine was
interrupted
by the host with a hard or soft
reset.
Self-test execution status: ( 0) The previous self-test routine
completed
without error or no self-test has
ever
been run.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error  00%        68               -
# 2  Short offline       Completed without error  00%        68               -
I reinstalled the disk back into the x4500 hoping that now that there were
self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0x4
timestamp = 0x48a5
segment = 0x0
address = 0xa548a548a548
(end self-test-failure)
Now it looks like a bunch of repetitions of '0x48a5'; different, but still
wrong.
According to smartctl, the Hitachi 1TB drive that had been in the system
for almost a year has never run a self test. Obviously fma looks for self
test logs, but does it ever actually initiate a self test? It seems kind of
a waste to review the logs for tests that are never made.
In any case, this definitely looks like a Solaris specific problem, not a
general problem with the drive. Please let me know if there are any other
tests you want me to try with the drive. I have a sun service contract, if
it is a driver bug, would there be any benefit in opening a case about it?
I will open a CR to track this issue. Based on the incredibly useful
data you have provided, my best guess is that this is a bug in the SATA
translation being done by the sata module in the kernel. I'm going to
consult the SATL spec and see if I can figure out what is going on. I
may give you another binary to run that will pull the raw SATA data via
a pass-through command so that I can then walk through whatever the
kernel is doing.

Thanks for all your help, I'll let you know what I figure out.

- Eric
Post by Paul B. Henson
Is there any way to disable the self-check test in fma? As it appears self
checks are not being initiated, there seems no benefit in checking the logs
for them on the stock sun drives, and that would keep my SSD from being
erroneously faulted until the problem is resolved.
Thanks much...
--------------------------------------------------------------------------
=== START OF INFORMATION SECTION ===
Device Model: SSDSA2SH032G1GN INTEL
Serial Number: CVEM902600J6032HGN
Firmware Version: 045C8626
User Capacity: 32,000,000,000 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Wed Jun 10 06:37:16 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
Offline data collection status: (0x00) Offline data collection activity
was never started.
Disabled.
Self-test execution status: ( 32) The self-test routine was
interrupted
by the host with a hard or soft
reset.
Total time to complete Offline
data collection: ( 1) seconds.
Offline data collection
capabilities: (0x75) SMART execute Offline immediate.
No Auto Offline data collection
support.
Abort Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 5) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.
SMART Attributes Data Structure revision number: 5
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED
WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0000 100 000 000 Old_age Offline
In_the_past 0
4 Start_Stop_Count 0x0000 100 000 000 Old_age Offline
In_the_past 0
5 Reallocated_Sector_Ct 0x0002 100 100 000 Old_age Always
- 0
9 Power_On_Hours 0x0002 100 100 000 Old_age Always
- 67
12 Power_Cycle_Count 0x0002 100 100 000 Old_age Always
- 140
192 Power-Off_Retract_Count 0x0002 100 100 000 Old_age Always
- 21
232 Unknown_Attribute 0x0003 100 100 010 Pre-fail Always
- 0
233 Unknown_Attribute 0x0002 099 099 000 Old_age Always
- 0
225 Load_Cycle_Count 0x0000 200 200 000 Old_age Offline
- 50147
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective Self-Test Log Data Structure Revision Number (0) should be
1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure
revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
=== START OF INFORMATION SECTION ===
Device Model: SSDSA2SH032G1GN INTEL
Serial Number: CVEM902600J6032HGN
Firmware Version: 045C8626
User Capacity: 32,000,000,000 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Wed Jun 10 07:05:30 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
Offline data collection status: (0x00) Offline data collection activity
was never started.
Disabled.
Self-test execution status: ( 0) The previous self-test routine
completed
without error or no self-test has
ever
been run.
Total time to complete Offline
data collection: ( 1) seconds.
Offline data collection
capabilities: (0x75) SMART execute Offline immediate.
No Auto Offline data collection
support.
Abort Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 5) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.
SMART Attributes Data Structure revision number: 5
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED
WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0000 100 000 000 Old_age Offline
In_the_past 0
4 Start_Stop_Count 0x0000 100 000 000 Old_age Offline
In_the_past 0
5 Reallocated_Sector_Ct 0x0002 100 100 000 Old_age Always
- 0
9 Power_On_Hours 0x0002 100 100 000 Old_age Always
- 68
12 Power_Cycle_Count 0x0002 100 100 000 Old_age Always
- 140
192 Power-Off_Retract_Count 0x0002 100 100 000 Old_age Always
- 21
232 Unknown_Attribute 0x0003 100 100 010 Pre-fail Always
- 0
233 Unknown_Attribute 0x0002 099 099 000 Old_age Always
- 0
225 Load_Cycle_Count 0x0000 200 200 000 Old_age Offline
- 50147
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours)
LBA_of_first_error
# 1 Extended offline Completed without error 00% 68
-
# 2 Short offline Completed without error 00% 68
-
SMART Selective Self-Test Log Data Structure Revision Number (0) should be
1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure
revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
-------------------------------------------------
--------------------------------------------------
=== START OF INFORMATION SECTION ===
Device Model: HITACHI HUA7210SASUN1.0T 0814G8BRPF
Serial Number: GTF002PAJ8BRPF
Firmware Version: GKAOA90A
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Wed Jun 10 09:39:56 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (15354) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       150
  3 Spin_Up_Time            0x0007   105   105   024    Pre-fail  Always       -       665 (Average 663)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       33
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       4161
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       189
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       189
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Lifetime Min/Max 22/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
----------------------------------------------------
--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Paul B. Henson
2009-06-12 03:10:53 UTC
Permalink
Post by Eric Schrock
I will open a CR to track this issue.
Cool, thanks. Please let me know what the number is; once it is
resolved I hope to get it backported to S10.

Interestingly enough, so far fma hasn't complained about self test failures
on the SSD since I initiated the self tests under Linux and filled up the
log. I don't know why, while previously there was garbage being returned,
currently the wrong status being returned indicates there is actually a
failure. Dunno, but I'm happy that it's not being faulted and I didn't have
to kludge it by moving the shared library for the module out of the way.
Post by Eric Schrock
I may give you another binary to run that will pull the raw SATA data via
a pass-through command so that I can then walk through whatever the
kernel is doing.
Sure thing, let me know what else you need to get this worked out.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Eric Schrock
2009-06-18 17:36:36 UTC
Permalink
Hey Paul -

Sorry for not getting back to you. Can you try running the attached
script while reading the selftest log parameters? This will let me know
what the sata translation code is doing and what the raw SATA data is.

Thanks,

- Eric
--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Paul B. Henson
2009-06-18 20:28:16 UTC
Permalink
Post by Eric Schrock
Can you try running the attached script while reading the selftest log
parameters? This will let me know what the sata translation code is
doing and what the raw SATA data is.
Seems now the random pattern of the day is '0086' 8-/.


-------------------------------------------------------------
Tracing sata self test queries...
sata_ext_smart_selftest_read_log succeeded

       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
   0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  10: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  20: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  30: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  40: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  50: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  60: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  70: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  80: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  90: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  a0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  b0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  c0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  d0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  e0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
  f0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 100: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 110: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 120: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 130: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 140: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 150: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 160: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 170: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 180: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 190: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 1a0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 1b0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 1c0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 1d0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 1e0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
 1f0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86  ................
[sata_ext_smart_selftest_read_log succeeded five more times, each returning the identical 512 bytes shown above]
-------------------------------------------------------------
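
For anyone who wants to check a dump like this mechanically rather than by eye, here is a minimal Python sketch. It assumes the trace rows look like the ones above (a hex offset, a colon, then sixteen byte values); `parse_dump` and `shortest_period` are illustrative names and are not part of the script attached earlier in the thread.

```python
import re

# One dump row: hex offset, a colon, then sixteen two-digit hex bytes.
# Anything after the sixteenth byte (e.g. an ASCII column) is ignored.
ROW = re.compile(r"^\s*([0-9a-f]+):((?:\s+[0-9a-f]{2}){16})")


def parse_dump(lines):
    """Collect the bytes of the first sector found in the trace rows."""
    data = bytearray()
    for line in lines:
        m = ROW.match(line.lower())
        if not m:
            continue  # column header, ASCII-only line, or status line
        offset = int(m.group(1), 16)
        if offset != len(data):
            break  # offsets restarted, so a new dump has begun
        data += bytes.fromhex("".join(m.group(2).split()))
    return bytes(data)


def shortest_period(data):
    """Return the smallest p such that data[i] == data[i % p] for all i."""
    for p in range(1, len(data) + 1):
        if all(data[i] == data[i % p] for i in range(len(data))):
            return p
    return len(data)
```

Fed the trace above, `parse_dump` yields 512 bytes and `shortest_period` reports 2, which is another way of saying the sector is nothing but 00 86 repeated.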
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Eric Schrock
2009-06-18 20:55:10 UTC
Permalink
Post by Paul B. Henson
Can you try running the attached script while reading the selftest log
parameters? This will let me know what the sata translation code is
doing and what the raw SATA data is.
Seems now the random pattern of the day is '0086' 8-/.
OK, I think I have some understanding about what's going on here. The
primary thing is that this drive is completely busted - it's reporting
totally invalid data in response to the ATA READ LOG EXT command for log
0x07 (Extended SMART self-test log). The spec requires that byte 0
be 0x01 and that byte 1 be reserved. You can see this from your
previous smartctl output from Linux:

---
SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure
revision number = 1
---

This is apparently causing us to trip up in strange ways. I don't know
how the hardware SATL translation is not getting tripped up. Some more
investigation is necessary, but it's clear the firmware on this drive is
quite broken.

- Eric
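
To make the header rule concrete, here is a rough Python sketch of that sanity check. The byte 0 test comes straight from the description above. The checksum test (all 512 bytes summing to zero modulo 256) is taken from the ATA SMART log layout rather than from this thread, so treat it as an assumption. The function name `check_ext_selftest_log` is hypothetical, and this is not the code the Solaris sata module runs.

```python
def check_ext_selftest_log(sector):
    """Return a list of problems found in a 512-byte log 0x07 sector."""
    problems = []
    if len(sector) != 512:
        problems.append("expected 512 bytes, got %d" % len(sector))
        return problems
    # Byte 0 is the log structure revision and must be 0x01.
    # Byte 1 is reserved, so it is deliberately not checked.
    if sector[0] != 0x01:
        problems.append("revision byte is 0x%02x, expected 0x01" % sector[0])
    # Assumption: byte 511 is a checksum chosen so that the unsigned sum
    # of all 512 bytes is zero modulo 256.
    if sum(sector) % 256 != 0:
        problems.append("checksum mismatch")
    return problems


if __name__ == "__main__":
    bogus = bytes([0x00, 0x86]) * 256  # the pattern seen in the trace
    for problem in check_ext_selftest_log(bogus):
        print(problem)
```

Note that a sector filled with the 00 86 pattern happens to satisfy the checksum, because 256 copies of any byte value sum to a multiple of 256. Only the revision byte check flags it.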
................
10: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
20: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
30: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
40: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
50: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
60: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
70: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
80: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
90: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
a0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
b0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
c0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
d0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
e0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
f0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
100: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
110: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
120: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
130: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
140: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
150: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
160: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
170: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
180: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
190: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
1a0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
1b0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
1c0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
1d0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
1e0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
1f0: 00 86 00 86 00 86 00 86 00 86 00 86 00 86 00 86
................
-------------------------------------------------------------
--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Paul B. Henson
2009-06-19 01:10:23 UTC
Post by Eric Schrock
totally invalid data in response to the ATA READ LOG EXT command for log
0x07 (Extended SMART self-test log). The spec defines that byte 0
must be 0x1 and that byte 1 is reserved.
Yes, I had noticed that.
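For anyone who wants to run the same check against a captured page, here is a minimal Python sketch of the rule quoted above. It only tests that byte 0 is 0x01 and deliberately ignores byte 1, since the spec marks it reserved. The function name is my own, chosen for illustration, and the 512-byte page size is taken from the dumps (offsets 0x000 through 0x1ff); this is not what fma or the Solaris sata driver actually does.

```python
SECTOR_SIZE = 512  # one log page, offsets 0x000-0x1ff as in the dumps


def check_ext_selftest_log(sector: bytes) -> list[str]:
    """Return a list of problems found in a page of log 0x07 (empty if none)."""
    problems = []
    if len(sector) != SECTOR_SIZE:
        problems.append(f"expected {SECTOR_SIZE} bytes, got {len(sector)}")
        return problems
    # Byte 0 must be 0x01; byte 1 is reserved, so it is not checked here.
    if sector[0] != 0x01:
        problems.append(f"byte 0 is 0x{sector[0]:02x}, expected 0x01")
    return problems


# The page the drive returned in the dumps: "00 86" repeated 256 times.
x25e_page = bytes([0x00, 0x86]) * (SECTOR_SIZE // 2)

# A hypothetical page with a valid first byte and everything else zeroed.
ok_page = bytes([0x01]) + bytes(SECTOR_SIZE - 1)

print(check_ext_selftest_log(x25e_page))
print(check_ext_selftest_log(ok_page))
```

The page from the dumps fails on the very first byte, which matches what Eric described.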
Post by Eric Schrock
This is apparently causing us to trip up in strange ways. I don't know
how the hardware SATL translation is not getting tripped up. Some more
investigation is necessary, but it's clear the firmware on this drive is
quite broken.
You don't happen to have a good contact at Intel I could complain to :)? I
suspect my chances of getting anywhere by cold calling their support line
with this issue are slim to none :(.

smartctl evidently works around this issue. In fact, on reviewing its
documentation, it looks like a *lot* of drives aren't exactly spec
compliant, and it has numerous workarounds to try to do the right thing.
Is this something you think you would work around in Solaris code, or would
a final resolution require Intel to fix their buggy firmware?

Fortunately, after initiating the self tests under Linux, the incorrect
data being returned no longer causes a fault. And since nothing is
initiating self tests under Solaris, you don't really lose anything from
invalid self test results.
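Seeding the log from Linux can be sketched roughly as follows, assuming smartmontools is installed. The device path is a placeholder, the choice of a short test is an assumption (presumably any completed self test populates the log), and the helper names are hypothetical.

```python
import subprocess


def selftest_commands(device):
    """Build the smartctl invocations: start a short test, then read the log."""
    return [
        ["smartctl", "-t", "short", device],     # start a short offline self test
        ["smartctl", "-l", "selftest", device],  # print the SMART self-test log
    ]


def seed_selftest_log(device, runner=subprocess.run):
    """Run each command in order; `runner` can be swapped out for a dry run."""
    for argv in selftest_commands(device):
        runner(argv, check=False)


# Dry run: print the commands instead of executing them.
seed_selftest_log("/dev/sdX", runner=lambda argv, check=False: print(" ".join(argv)))
```

The self test runs in the background on the drive, so in practice you would wait for it to finish before reading the log rather than running the two commands back to back.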

Thanks again, and let me know if you need anything else.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Paul B. Henson
2009-07-28 22:53:47 UTC
Just wondering if you've made any further progress on handling the buggy
Intel firmware. So far I haven't had any further fma issues with the SSD,
but on general principle it would be nice if everything worked the way it's
supposed to :). Until then, I'll be sure to seed the self test log before
putting a new SSD in...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Paul B. Henson
2009-08-20 02:53:00 UTC
Hmm, I rebooted this server for the first time since I was testing the SSD,
and it marked the SSD faulty again :( --

***@ike ~ # fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 19 19:46:15 091fd12e-0e26-49c4-87df-85e6b46d78fd DISK-8000-2J   Critical

Fault class : fault.io.disk.self-test-failure
Affects :
dev:///:devid=id1,***@SATA_____SSDSA2SH032G1GN___CVEM902600J6032HGN//***@2,0/pci1022,***@8/pci11ab,***@1/***@0,0
faulted but still in service
FRU : "HD_ID_4"
(hc://:product-id=Sun-Fire-X4500:chassis-id=0819AMT059:server-id=ike:serial=CVEM902600J6032HGN:part=SSDSA2SH032G1GN-INTEL:revision=045C8626/bay=4/disk=0)
faulty

I'm going to mark it as repaired and see if it gets marked faulty again.
I never heard back from you about a possible resolution to this. Any
progress?

Thanks...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768