[smartmontools-support] General questions: self-tests / ATA attributes / SCSI sense / smart return status

Michael Woon

2016-11-23 17:58:00 UTC

Hi Smartmontools devs,

I'm writing for a bit of clarity about self-tests and what records they
produce, and the smart health call.

As I understand from the documentation, calling "smartctl -H":

-returns the result of the SMART RETURN STATUS command -or- checks if any
ATA attributes exceed thresholds in ATA drives
-checks for any error codes in the SCSI sense buffer.

As I understand from the documentation, fore- and back-ground checks update
the self-test error log and certain ATA attributes, when they run.

(does a self-test also update the SCSI sense buffer or any kind of stored
values for SCSI devices?)

My main questions are:

-what does SMART RETURN STATUS evaluate?

alternatively stated:
-does the command -only- look at ATA attributes stored in the table and
error codes in the SCSI sense buffer? or is the content of the self-test
error log also a factor?

bottom line:
-If I want to be sure of the health of a disk, can I trust the smart health
status (to include the result of the self-tests) or do I have to look at
-both- the health status and the self-test error log?

or do I have the wrong angle on this:
-simply watch for a '0' exit code for an "all okay"?

minor questions about the exit codes:
-is it possible to have a set bit 3 (device failing) without a set bit 4
(attributes over threshold), and vice versa?
-at what point does a SCSI drive set the 6th bit in the error code? I have
drives (SAS) that have some errors in their smartctl output, but don't set
this bit when smartctl is run on them.
-does bit 7 really only work for SATA drives? (SCSI drives have a self test
log too)

I'm monitoring on the order of 100 devices, and they're a range of things,
from a pair of 200G SATA SSDs behind a RAID controller to a shelf full of
4TB SAS platters.
I've been looking at a few different nagios plugins and notice that about
half of them check only attributes and health status, the other half also
look at the error log, and only one watches the exit code.

CentOS 7, 3.10 kernel, smartmontools 6.2.4

Thanks!
Michael

Christian Franke

2016-11-26 13:29:07 UTC

Post by Michael Woon
Hi Smartmontools devs,
I'm writing for a bit of clarity about self-tests and what records
they produce, and the smart health call.
-returns the result of the SMART RETURN STATUS command -or- checks
if any ATA attributes exceed thresholds in ATA drives

Yes.

Post by Michael Woon
-checks for any error codes in the SCSI sense buffer.

It checks the ASC/ASCQ in the SCSI IE (Informational Exceptions) log page
(if supported) or in the result of a REQUEST SENSE command.

Most remaining answers are for ATA/SATA only. SCSI/SAS differs
considerably. Some SCSI expert on this list might want to answer.

Post by Michael Woon
As I understand from the documentation, fore- and back-ground checks
update the self-test error log and certain ATA attributes, when they run.

There is no ATA "self-test error log". On completion of a self-test, a
new entry is usually added to the ATA self-test log(s). The ATA error
log(s) are typically not updated on read errors found during a self-test.

Post by Michael Woon
-what does SMART RETURN STATUS evaluate?

Anything the author of the drive firmware decided to evaluate :-)

Recent versions of ATA ACS standards say:
"The SMART RETURN STATUS command causes the device to communicate the
reliability status of the device to the host."
If the command returns failure (0x2c,0xf4): "The device has detected a
threshold exceeded condition."
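
To make the two outcomes concrete, here is a minimal sketch of how a host
might classify the register pair the device hands back. The function name
and the returned labels are made up for this example. The 0x4f/0xc2 pair for
"no threshold exceeded" comes from the ATA standard; the failure pair is the
one quoted above, read as (LBA high, LBA mid).

```python
def classify_smart_status(lba_mid: int, lba_high: int) -> str:
    """Classify the LBA mid/high register values returned by SMART RETURN STATUS.

    Hypothetical helper for illustration; not part of smartmontools.
    """
    if (lba_mid, lba_high) == (0x4F, 0xC2):
        # Device reports that no threshold has been exceeded.
        return "ok"
    if (lba_mid, lba_high) == (0xF4, 0x2C):
        # Device has detected a threshold exceeded condition.
        return "threshold_exceeded"
    # Anything else is outside the two values the standard defines.
    return "unknown"
```

Note that this only says what the firmware chose to report; it says nothing
about which internal conditions the firmware actually evaluated.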

Note that ATA SMART Attributes are not part of the standard. The SMART
READ THRESHOLDS command was declared obsolete in ATA-4 (1998).

Post by Michael Woon
-does the command -only- look at ATA attributes stored in the table
and error codes in the SCSI sense buffer? or is the content of the
self-test error log also a factor?

SMART RETURN STATUS does not return failure just because a read/write
error occurred. It usually returns failure if the number of spare blocks
for reallocation falls below some threshold.

Post by Michael Woon
-If I want to be sure of the health of a disk, can I trust the smart
health status (to include the result of the self-tests) or do I have
to look at -both- the health status and the self-test error log?

If you want to proactively replace drives, I would recommend watching
the number of reallocated sectors (e.g. use smartd with the '-R 5! -r 5!'
directives). A failing SMART STATUS may occur (too?) late.
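
For a monitoring plugin, the same idea can be sketched by hand: pull the raw
reallocated-sector count out of the "smartctl -A" attribute table and compare
it with the previous poll. This is only a rough sketch. It assumes the drive
reports reallocated sectors as attribute 5 (the usual, but vendor-defined, ID),
the function names are hypothetical, and the two sample tables are invented
for illustration rather than taken from a real drive.

```python
from typing import Optional

# Illustrative sample tables in the "smartctl -A" column layout (not real data).
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("-       0\n", "-       8\n")


def raw_value(table: str, attr_id: int = 5) -> Optional[int]:
    """Return the leading integer of RAW_VALUE for attr_id, or None if absent."""
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with the numeric ID.
        if len(fields) >= 10 and fields[0] == str(attr_id) and fields[9].isdigit():
            return int(fields[9])
    return None


def raw_increased(previous: Optional[int], current: Optional[int]) -> bool:
    """True only when both polls produced a value and the count went up."""
    return previous is not None and current is not None and current > previous
```

With the samples above, raw_increased(raw_value(SAMPLE_BEFORE),
raw_value(SAMPLE_AFTER)) flags the change, even though the normalized value
and the overall health status would still look fine.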

Post by Michael Woon
-simply watch for a '0' exit code for an "all okay"?

It depends, see above.

Post by Michael Woon
-is it possible to have a set bit 3 (device failing) without a set bit
4 (attributes over threshold), and vice versa?

Yes, for example if SMART RETURN STATUS returned failure but no attribute
is <= threshold in the SMART DATA block, or if the SMART READ THRESHOLDS
command did not work, etc.

Post by Michael Woon
-at what point does a SCSI drive set the 6th bit in the error code? I
have drives (SAS) that have some errors in their smartctl output, but
don't set this bit when smartctl is run on them.

For some historic reason, bit 6 was never implemented for SCSI.

Post by Michael Woon
-does bit 7 really only work for SATA drives? (SCSI drives have a self
test log too)

Yes, it works "better" for ATA because a newer long self-test that
completes without error clears the bit.
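
Since several of the questions are about individual exit status bits, here is
a small sketch of decoding the smartctl exit status into its bits instead of
only testing for 0. The bit meanings are paraphrased from the smartctl man
page; the names EXIT_BITS and decode_exit_status are made up for this example.

```python
# Bit meanings paraphrased from the "EXIT STATUS" section of the smartctl man page.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, no IDENTIFY data, or device in low-power mode",
    2: "a SMART or other ATA command failed, or checksum error in SMART data",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found <= threshold",
    5: "SMART status OK, but attributes were <= threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> list:
    """Return the description of every bit set in a smartctl exit status."""
    return [text for bit, text in EXIT_BITS.items() if status & (1 << bit)]


# Possible use from a plugin (not executed here; the device path is an assumption):
#   import subprocess
#   rc = subprocess.run(["smartctl", "-H", "/dev/sda"]).returncode
#   for problem in decode_exit_status(rc):
#       print(problem)
```

Keep in mind the caveats above: bit 6 is not implemented for SCSI, so a clear
bit 6 on a SAS drive does not mean its error counters are clean.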

Thanks,
Christian

------------------------------------------------------------------------------

Michael Woon

2017-01-23 14:36:18 UTC

I realised I didn't reply to the list with this question.

I just have a tiny clarification to ask for on this point:

- If SMART RETURN STATUS returns "OK" and attributes are >= threshold, does
smartctl still report health as "OK / PASSED"? (and in this case, returned
with bit 3 not set and bit 4 set?)

I've put the rest of the questions in another, more accurately named thread.

Thanks!
Michael

Hi Christian,
Thanks for the quick and excellent answer, that really cleared up a lot of
things for me.
- If SMART RETURN STATUS returns "OK" and attributes are >= threshold,
does smartctl still report health as "OK / PASSED"? (and in this case,
returned with bit 3 not set and bit 4 set?)
- How do I proactively watch drive health for replacement? With SATA, I
watch reallocated sectors, amongst other things, and there's generally a
lot of documentation and discussion about this, but with SCSI, I ______?
(really couldn't find anything at all)
- I've been asking all these questions with the assumption that smartctl
is the tool for this job. Could I be wrong on this?
Thanks again!
Michael
