[smartmontools-support] SAS preventive disk replacement

Discussion:

Gandalf Corvotempesta

2016-11-26 18:04:00 UTC

I'm asking this question again because nobody replied when I posted it a
month ago.

On SAS disks, which values should I look at when preventively replacing
a disk?

Should I look at "Elements in grown defect list"?
Should I look at the uncorrected errors in the table below reporting
reads/writes/verifies?

Should I look at something else in the "-x" output?

Can someone explain this to me?
The docs on the smartmontools page are not detailed about SAS.

Gandalf Corvotempesta

2016-11-27 09:36:54 UTC

Do SAS disks have the same or similar fields for 'smartmon'
as ATA/SATA disks? FWIW -- nobody may have responded because,
like me, no one really knew an authoritative answer. That said,
I'll just spout off "unauthoritatively"... (YMMV)...

No, totally different output:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.10.0+2] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor: SEAGATE
Product: ST300MP0005
Revision: VT31
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Logical Unit id: 0x5000c500962d6707
Serial number: S7K0XLW9
Device type: disk
Transport protocol: SAS
Local Time is: Sun Nov 27 10:34:14 2016 CET
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature: 35 C
Drive Trip Temperature: 60 C
Manufactured in week 03 of year 2016
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 13
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 242
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 2780454742
Blocks received from initiator = 93770794
Blocks read from cache and sent to initiator = 1077886413
Number of read and write commands whose size <= segment size = 386420548
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 5519,43
number of minutes until next internal SMART test = 20

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2030615210        1         0  2030615211          1     266092,169           0
write:           0        0         0           0          0      11228,247           0
verify: 4273860022        0         0  4273860022          0      10059,776           0

Non-medium error count: 3
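
The counters asked about in this thread can be pulled out of text output
like the above with a short script. This is only a minimal sketch: the
function name parse_sas_smart and the SAMPLE string are made up for
illustration, and it assumes each read/write/verify row sits on a single
line with "total uncorrected errors" as the last column. It extracts the
numbers; it does not decide when a disk should be replaced, since nobody
in the thread states a threshold.

```python
import re

# Hypothetical sample, trimmed from the output posted in this thread.
SAMPLE = """\
Elements in grown defect list: 0
Error counter log:
read:   2030615210        1         0  2030615211          1     266092,169           0
write:           0        0         0           0          0      11228,247           0
verify: 4273860022        0         0  4273860022          0      10059,776           0

Non-medium error count:        3
"""


def parse_sas_smart(text):
    """Extract a few counters from smartctl text output for a SAS disk."""
    result = {}

    m = re.search(r"Elements in grown defect list:\s*(\d+)", text)
    if m:
        result["grown_defects"] = int(m.group(1))

    m = re.search(r"Non-medium error count:\s*(\d+)", text)
    if m:
        result["non_medium_errors"] = int(m.group(1))

    # Assumption: the last column of each row is "total uncorrected errors".
    uncorrected = {}
    for op in ("read", "write", "verify"):
        m = re.search(rf"^\s*{op}:\s+(.+)$", text, re.MULTILINE)
        if m:
            uncorrected[op] = int(m.group(1).split()[-1])
    result["uncorrected"] = uncorrected

    return result


print(parse_sas_smart(SAMPLE))
```

Logging these values periodically and watching for any increase is one
possible use, but that is a suggestion, not an authoritative rule.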

Good luck in finding your answers. Did you google? ;-)

Googled a lot, without success.

------------------------------------------------------------------------------

Håkon Alstadheim

2016-11-28 21:15:02 UTC

Post by Gandalf Corvotempesta
Should I look for "elements in grown defect list"?

Post by Gandalf Corvotempesta
Should I look for the uncorrected errors in the table below reporting
writes/reads/verifies?
Should I look for something else in the "-x" output?

----
Dang... that's one thing about smartmon, is that for better
or worse, it makes the "call" based on its recorded data. If SAS
doesn't have similar, someone would have to know how the various
parameters collected affect failure rate.
I think I read a report by Google that said the single biggest
correlating factor in failed disks was temperature -- though I don't
know if it was 'max temperature' or 'daily-max-averaged' or what...

If you run your drives within the temperature tolerance, then what
matters most is temperature /variability/. I have some Seagate SAS
drives that have a max temperature gradient (10 deg./hour?) specified.
You should be able to keep temperature changes way lower than that.
Other than that, total max-min span in temperature could also be meaningful.

In addition to Google, reading the specs for your drives could give some
insight.
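
The two temperature figures mentioned above (the steepest change per hour
and the total max-min span) can be computed from a series of logged
readings. A rough sketch, assuming you already collect (hours, degrees C)
samples yourself; the function name temperature_stats and the sample
values are invented for illustration, and the 10 deg./hour limit is the
unconfirmed figure from this post, not a checked specification.

```python
def temperature_stats(samples):
    """Return (span, max_gradient) for time-sorted (hours, temp_c) samples.

    span is max minus min temperature in degrees C; max_gradient is the
    steepest change between consecutive samples in degrees C per hour.
    """
    temps = [t for _, t in samples]
    span = max(temps) - min(temps)

    max_gradient = 0.0
    for (h0, t0), (h1, t1) in zip(samples, samples[1:]):
        if h1 > h0:  # skip duplicate or out-of-order timestamps
            max_gradient = max(max_gradient, abs(t1 - t0) / (h1 - h0))

    return span, max_gradient


# Invented readings: (hours since start, drive temperature in C).
readings = [(0.0, 30), (0.5, 33), (1.0, 35), (2.0, 31)]
span, gradient = temperature_stats(readings)

ASSUMED_MAX_GRADIENT = 10.0  # deg./hour, unconfirmed figure from the post
print(span, gradient, gradient <= ASSUMED_MAX_GRADIENT)
```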

------------------------------------------------------------------------------

Gandalf Corvotempesta

2016-11-29 07:20:08 UTC

Post by Håkon Alstadheim
If you run your drives within the temperature tolerance, then what
matters most is temperature /variability/. I have some Seagate SAS
drives that have a max temperature gradient (10 deg./hour?) specified.
You should be able to keep temperature changes way lower than that.
Other than that, total max-min span in temperature could also be meaningful.
In addition to Google, reading the specs for your drives could give some
insight.

Temperature is OK.
So, the grown defect list or the uncorrected errors reported by SMART
aren't useful for proactive replacement?

I also see some strangeness in the extended output (-x) that I
don't know how to interpret.

If someone can give me some advice...