Michael Woon
2016-11-23 17:58:00 UTC
Hi Smartmontools devs,
I'm looking for some clarity about self-tests, the records they produce,
and the SMART health check.
As I understand from the documentation, calling "smartctl -H":
-returns the result of the SMART RETURN STATUS command -or- checks if any
ATA attributes exceed thresholds in ATA drives
-checks for any error codes in the SCSI sense buffer.
As I understand from the documentation, foreground and background
self-tests update the self-test error log and certain ATA attributes when
they run. (Does a self-test also update the SCSI sense buffer, or any kind
of stored values, for SCSI devices?)
My main questions are:
-what does SMART RETURN STATUS evaluate?
alternatively stated:
-does the command -only- look at ATA attributes stored in the table and
error codes in the SCSI sense buffer? or is the content of the self-test
error log also a factor?
bottom line:
-If I want to be sure of the health of a disk, can I trust the SMART health
status (to include the result of the self-tests) or do I have to look at
-both- the health status and the self-test error log?
or do I have the wrong angle on this:
-simply watch for a '0' exit code for an "all okay"?
minor questions about the exit codes:
-is it possible to have bit 3 set (device failing) without bit 4 set
(attributes over threshold), and vice versa?
-at what point does a SCSI drive set bit 6 in the exit code? I have
drives (SAS) that have some errors in their smartctl output, but don't set
this bit when smartctl is run on them.
-does bit 7 really only work for SATA drives? (SCSI drives have a self test
log too)
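For context, here is a minimal sketch of how a monitoring plugin might decode the exit status into individual bits. The bit meanings are paraphrased from the EXIT STATUS section of the smartctl man page; the names EXIT_BITS and decode_exit_status are made up for illustration and are not part of smartmontools.

```python
# Bit meanings paraphrased from the EXIT STATUS section of the smartctl(8)
# man page. EXIT_BITS and decode_exit_status are illustrative names only.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART or other command failed, or checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "status OK, but attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status):
    """Return a list of (bit, meaning) pairs for every bit set in status."""
    return [(bit, text) for bit, text in EXIT_BITS.items() if status & (1 << bit)]
```

With this, a status of 0 decodes to an empty list, and a status of 192 decodes to bits 6 and 7 only, i.e. both logs report errors while bits 3 and 4 stay clear. The status itself could come from something like subprocess.run(["smartctl", "-a", dev]).returncode.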
I'm monitoring on the order of 100 devices, and they're a range of things,
from a pair of 200 GB SATA SSDs behind a RAID controller to a shelf full of
4 TB SAS platters.
I've been looking at a few different Nagios plugins and notice that about
half of them check only attributes and health status, the other half also
look at the error log, and only one watches the exit code.
CentOS 7, 3.10 kernel, smartmontools 6.2.4
Thanks!
Michael