[smartmontools-support] Which attributes can estimate number of bad blocks?

Discussion:

Norman Diamond

2016-07-25 07:28:12 UTC

To get a quick estimate of the number of bad blocks that have already been detected, I add the raw values of attributes 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector).

Should I also add the raw values of attributes 187 (Reported_Uncorrect) and 198 (Offline_Uncorrectable)? Or are these redundant, already included in attributes 5 and 197?

Are there others that I should add?

Of course this estimate isn't as reliable as ... well actually, other methods of testing aren't particularly reliable either[*]. Anyway the reason for getting a quick estimate is that if we know quickly that a drive belongs on the scrap heap, it saves time.

[* I have a drive with particularly bad firmware. Reading the entire drive causes Current_Pending_Sector to increase. Writing the bad sectors doesn't cause Reallocated_Sector_Ct to increase but does cause Current_Pending_Sector to go down to 0. I guess the drive thinks that a write operation fixed the sector so the sector doesn't need reallocation, but the next attempt to read shows that the sector didn't get fixed. If the drive were in warranty, I might not have it any more :-) ]
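
The quick estimate described in this post (raw value of attribute 5 plus raw value of attribute 197) could be scripted roughly as follows. This is only a sketch: the names `parse_raw_values` and `quick_bad_block_estimate` are made up for illustration, the sample rows contain invented values, and the parser assumes the usual `smartctl -A` table layout in which the raw value is the tenth whitespace-separated field and starts with a plain integer.

```python
import re

def parse_raw_values(smartctl_text):
    """Map attribute ID -> raw value from `smartctl -A` style text.

    Assumes each attribute row has at least ten whitespace-separated
    fields (ID, name, flag, value, worst, thresh, type, updated,
    when_failed, raw) and that the raw field starts with an integer.
    """
    raw = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Skip headers and any line that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        match = re.match(r"\d+", fields[9])
        if match:
            raw[int(fields[0])] = int(match.group())
    return raw

def quick_bad_block_estimate(raw, ids=(5, 197)):
    """Sum the raw values of the given attributes; missing ones count as 0."""
    return sum(raw.get(attr_id, 0) for attr_id in ids)

# Invented sample rows, for illustration only (not from a real drive).
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       3
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(quick_bad_block_estimate(parse_raw_values(SAMPLE)))
```

Whether other attributes belong in `ids` is exactly the open question of this thread, so the tuple is left as a parameter rather than hard-coded.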

------------------------------------------------------------------------------

r***@spotswood-computer.net

2016-07-28 14:34:28 UTC

Current Pending Sector can indicate one of 3 things [on platter drives]:

* A bad sector that hasn't been reallocated yet, i.e. hasn't been written
to (a hard error).
* A soft error. One or 2 sectors might indicate this problem. Every drive
I've seen so far with 8+ turned out to be a bad drive.
* The most insidious of all, a medium error. The sector can't hold its
charge for long. You can write to it then immediately read from it OK,
but usually in less than an hour, it goes corrupt again. You'll see that
the same sector(s) keep failing. This is a drive you should trash.

It sounds like you have a drive with medium error(s).

As for trashing a drive, it varies depending on who you ask, but at 5+
(raw value) reallocated, I trash. At 8+ Pending, I trash.

In regards to 187 and 197, I will retire a drive with any raw value more
than 0. But this is based more on Backblaze's published results than my
own observations. Usually I don't see those, or the drive is already
failing based on Reallocated and Pending.
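
The rules of thumb in this post could be written down as a small check like the one below. It is a sketch only: `triage` is a hypothetical name, the thresholds are this poster's personal limits rather than anything defined by smartmontools or a drive vendor, and the "any raw value more than 0" rule is applied here to 187 and 198 on the assumption that those are the attributes meant (the post literally says "187 and 197").

```python
def triage(raw):
    """Return (verdict, reasons) using the personal thresholds from the post.

    `raw` maps SMART attribute ID -> raw value. Missing attributes are
    treated as 0, which is an assumption of this sketch.
    """
    reasons = []
    if raw.get(5, 0) >= 5:
        reasons.append("5 Reallocated_Sector_Ct >= 5")
    if raw.get(197, 0) >= 8:
        reasons.append("197 Current_Pending_Sector >= 8")
    # Assumption: the zero-tolerance rule covers 187 and 198.
    for attr_id, name in ((187, "Reported_Uncorrect"),
                          (198, "Offline_Uncorrectable")):
        if raw.get(attr_id, 0) > 0:
            reasons.append(f"{attr_id} {name} > 0")
    return ("replace" if reasons else "keep", reasons)

if __name__ == "__main__":
    print(triage({5: 0, 197: 2}))
    print(triage({5: 6, 187: 1}))
```

Returning the reasons alongside the verdict makes it easy to see which rule fired when a drive is flagged.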

------------------------------------------------------------------------------
_______________________________________________
Smartmontools-support mailing list
https://lists.sourceforge.net/lists/listinfo/smartmontools-support

------------------------------------------------------------------------------

Gandalf Corvotempesta

2016-07-28 14:47:09 UTC

CUT

Post by r***@spotswood-computer.net
As for trashing a drive, it varies depending on who you ask, but at 5+
(raw value) reallocated, I trash. At 8+ Pending, I trash.
In regards to 187 and 197, I will retire a drive with any raw value more
than 0. But this is based more on Backblaze's published results than my
own observations. Usually I don't see those, or the drive is already
failing based on Reallocated and Pending.

r***@spotswood-computer.net

2016-08-01 19:48:41 UTC

I don't have enough experience with SAS drives to answer your question.

All I can tell you is while 1 or 2 bad sectors (hard errors, not medium)
don't mean it's a bad drive, every time a sector gets reallocated, the
[platter] drive gets more fragmented. You can't defrag this. That's one of
the reasons why when I hear complaints about a slow system or hard drive,
the first thing I check is the SMART attributes. I'd say in at least 50%
of the cases, it is indeed the hard drive going bad.

Post by Gandalf Corvotempesta
CUT

side question: what about "Elements in grown defect list" with SAS drives?
Will you trash when this value is different from 0?
It should indicate the number of bad sectors, like "reallocated sector
count" (SMART #5) for SATA drives.

------------------------------------------------------------------------------

Tim Small

2016-08-01 20:33:37 UTC

Post by r***@spotswood-computer.net
All I can tell you is while 1 or 2 bad sectors (hard errors, not medium)
don't mean it's a bad drive, every time a sector gets reallocated, the
[platter] drive gets more fragmented. You can't defrag this. That's one of
the reasons why when I hear complaints about a slow system or hard drive,
the first thing I check is the SMART attributes. I'd say in at least 50%
of the cases, it is indeed the hard drive going bad.

Given that the reallocated sectors are such a minuscule percentage of
the total number of sectors on a modern drive, I think that even with
1000 sectors reallocated, the impact of fragmentation is likely to be
very slight.

On the other hand, a high number of reallocated sectors frequently goes
hand-in-hand with numerous read-retries (these are normally successful,
but time consuming if they are the 'offline' type), and this is more
likely to significantly hit performance.

In other words, the poor performance and the reallocated sectors may
both be symptoms of widespread (and potentially deteriorating) problems
on the drive, but I don't think the former is likely to be a direct
result of the latter...

Personally, I run monthly long self tests with smartd (these can both
catch bad sectors that you otherwise wouldn't have known about, and also
effectively "scrub" weak sectors, so are definitely worth doing), and
pro-actively replace drives which have rapidly or just steadily
increasing reallocated sector counts.

Although it's relatively rare, I've had drives which have been happily
operating for years with a reallocated sector count of a few tens,
without any further problems. I've put these down to a one-off event or
very localised manufacturing defect within the drive.
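
The distinction drawn above, between a count that keeps climbing and one that has sat at a few tens for years, could be checked with something like this sketch. The function name `reallocation_trend` and the three-reading window are arbitrary choices made for illustration; the input is assumed to be a list of Reallocated_Sector_Ct raw values collected over time, for example one per monthly long self-test.

```python
def reallocation_trend(samples, min_points=3):
    """Classify a history of Reallocated_Sector_Ct raw values.

    `samples` is ordered oldest to newest. Only the last `min_points`
    readings are compared, so an old one-off jump followed by a flat
    period is reported as stable.
    """
    if len(samples) < min_points:
        return "insufficient data"
    recent = samples[-min_points:]
    if recent[-1] > recent[0]:
        return "growing"
    return "stable"

if __name__ == "__main__":
    print(reallocation_trend([0, 30, 30, 30]))  # one-off event long ago
    print(reallocation_trend([2, 5, 9]))        # steadily increasing
```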

HTH,

Tim.

--
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53 http://seoss.co.uk/ +44-(0)1273-808309

------------------------------------------------------------------------------

Gandalf Corvotempesta

2016-08-02 06:59:47 UTC

Post by Tim Small
Although it's relatively rare, I've had drives which have been happily
operating for years with a reallocated sector count of a few tens,
without any further problems. I've put these down to a one-off event or
very localised manufacturing defect within the drive.

Currently I have a SAS drive whose grown defect list has been at 16 for some months.
Everything else seems to be stable. Would you replace this drive with 16
stable bad sectors?

With SATA disks, I have some drives with 3 or 4 errors logged 4 years ago and
nothing more after that. Would you replace them?

I also have a SATA disk with a raw read error rate of 84. The SMART long test
running weekly doesn't catch any issue and completes properly every
time.
No reallocations or bad sectors, only this raw read error rate of 84.
The RAID controller is able to run weekly consistency checks and patrol reads
(which read the whole RAID looking for bad sectors) with no issue at all.
Would you replace this disk?

Bruce Allen

2016-08-02 11:29:44 UTC

Hi Gandalf,

Post by Gandalf Corvotempesta

Currently I have a SAS drive whose grown defect list has been at 16 for some months.
Everything else seems to be stable. Would you replace this drive with 16 stable bad sectors?

I would not replace this, if the number of bad sectors is not growing. 16 is not a large number.
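
For SAS drives, the "is it growing?" test can be applied to the "Elements in grown defect list" line that smartctl prints, as quoted earlier in the thread. The sketch below assumes that line appears in the captured output exactly in that form; the helper names `grown_defect_count` and `is_growing` are invented for illustration.

```python
import re

# Matches the line quoted in this thread, e.g.
# "Elements in grown defect list: 16"
GROWN_DEFECTS = re.compile(r"Elements in grown defect list:\s*(\d+)")

def grown_defect_count(smartctl_text):
    """Return the grown defect list length, or None if the line is absent."""
    match = GROWN_DEFECTS.search(smartctl_text)
    return int(match.group(1)) if match else None

def is_growing(previous, current):
    """True only when both readings exist and the count has increased."""
    return previous is not None and current is not None and current > previous

if __name__ == "__main__":
    earlier = grown_defect_count("Elements in grown defect list: 16\n")
    latest = grown_defect_count("Elements in grown defect list: 16\n")
    print(earlier, latest, is_growing(earlier, latest))
```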

Post by Gandalf Corvotempesta
With SATA disks, I have some drives with 3 or 4 errors logged 4 years ago and nothing more after that. Would you replace them?

Probably not. What types of errors?

Post by Gandalf Corvotempesta
I also have a SATA disk with a raw read error rate of 84. The SMART long test running weekly doesn't catch any issue and completes properly every time. No reallocations or bad sectors, only this raw read error rate of 84.

I would not worry about this. It could have to do with the normalization of this raw value, which might render the actual value meaningless. Do you have other drives with identical hardware and firmware to compare with?

Post by Gandalf Corvotempesta
The RAID controller is able to run weekly consistency checks and patrol reads (which read the whole RAID looking for bad sectors) with no issue at all.
Would you replace this disk?

No, I would not.

Cheers,
Bruce

Gandalf Corvotempesta

2016-08-02 12:50:25 UTC

Post by Bruce Allen
I would not replace this, if the number of bad sectors is not growing. 16 is not a large number.

ok.

Post by Bruce Allen
Probably not. What types of errors?

I'm not able to interpret the smartctl output to find the errors.
This is the whole "-x" output:

# smartctl -x /dev/sg3
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: WDC WD6000HLHX-01JJPV0
Serial Number: WD-WXG1E2155577
Firmware Version: 04.05G04
User Capacity: 600,127,266,816 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Aug 2 14:46:38 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
Error SMART Status command failed
Please get assistance from http://smartmontools.sourceforge.net/
Register values returned from SMART Status command are:
ERR=0x00, SC=0x00, LL=0x00, LM=0x00, LH=0x00, DEV=0x00, STS=0x00
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (8280) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 88) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       53
  3 Spin_Up_Time            0x0027   233   233   021    Pre-fail  Always       -       3325
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       42590
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   115   108   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

General Purpose Logging (GPL) feature set supported
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 5 sectors [Comprehensive SMART error log]
GP Log at address 0x03 has 6 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
GP Log at address 0x07 has 1 sectors [Extended self-test log]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
GP Log at address 0x10 has 1 sectors [NCQ Command Error]
GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x84 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x85 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x86 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x87 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x88 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x89 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8f has 16 sectors [Host vendor specific log]
GP/S Log at address 0x90 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x91 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x92 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x93 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x94 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x95 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x96 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x97 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x98 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x99 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9f has 16 sectors [Host vendor specific log]
GP/S Log at address 0xa0 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa1 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa2 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa3 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa4 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa5 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa6 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa7 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa8 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xa9 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xaa has 1 sectors [Device vendor specific log]
GP/S Log at address 0xab has 1 sectors [Device vendor specific log]
GP/S Log at address 0xac has 1 sectors [Device vendor specific log]
GP/S Log at address 0xad has 1 sectors [Device vendor specific log]
GP/S Log at address 0xae has 1 sectors [Device vendor specific log]
GP/S Log at address 0xaf has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb0 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb1 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb2 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb3 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb4 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb5 has 1 sectors [Device vendor specific log]
GP Log at address 0xb6 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb7 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xbd has 1 sectors [Device vendor specific log]
GP/S Log at address 0xc0 has 1 sectors [Device vendor specific log]
GP Log at address 0xc1 has 24 sectors [Device vendor specific log]
GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status]
GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer]

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 26697 hours (1112 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
01 -- 51 00 18 00 00 00 f0 8c d0 40 00

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 18 00 00 00 00 00 f0 8c d0 40 00 39d+04:52:53.462 READ FPDMA QUEUED
e5 00 00 00 00 00 00 00 00 00 00 00 00 39d+04:52:53.462 CHECK POWER MODE
2f 00 00 00 01 00 00 00 00 00 10 40 00 39d+04:52:53.461 READ LOG EXT
60 00 18 00 00 00 00 00 f0 8c d0 40 00 39d+04:52:52.135 READ FPDMA QUEUED
61 00 01 00 00 00 00 45 dc 6f af 40 00 39d+04:52:52.060 WRITE FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 26697 hours (1112 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 18 00 00 00 f0 8c d0 40 00

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 18 00 00 00 00 00 f0 8c d0 40 00 39d+04:52:52.135 READ FPDMA QUEUED
61 00 01 00 00 00 00 45 dc 6f af 40 00 39d+04:52:52.060 WRITE FPDMA QUEUED
61 00 08 00 00 00 00 45 dc 6f b2 40 00 39d+04:52:52.055 WRITE FPDMA QUEUED
60 00 06 00 00 00 00 45 dc 6f b4 40 00 39d+04:52:52.054 READ FPDMA QUEUED
61 00 08 00 00 00 00 45 dc 6f b2 40 00 39d+04:52:52.050 WRITE FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 26697 hours (1112 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 18 00 00 00 f0 8c d0 40 00

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 18 00 00 00 00 00 f0 8c d0 40 00 39d+04:52:50.694 READ FPDMA QUEUED
60 00 40 00 00 00 00 01 21 fb 40 40 00 39d+04:52:50.692 READ FPDMA QUEUED
60 00 08 00 00 00 00 01 21 fb 30 40 00 39d+04:52:50.686 READ FPDMA QUEUED
60 00 08 00 00 00 00 00 4b 2a 28 40 00 39d+04:52:50.682 READ FPDMA QUEUED
60 00 08 00 00 00 00 00 3b 55 90 40 00 39d+04:52:50.679 READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 42577 -
# 2 Short offline Completed without error 00% 42553 -
# 3 Short offline Completed without error 00% 42529 -
# 4 Extended offline Completed without error 00% 42508 -
# 5 Short offline Completed without error 00% 42505 -
# 6 Extended offline Completed without error 00% 42493 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 35 Celsius
Power Cycle Min/Max Temperature: 30/42 Celsius
Lifetime Min/Max Temperature: 30/42 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (427)

Index Estimated Time Temperature Celsius
428 2016-08-02 06:49 35 ****************
... ..(112 skipped). .. ****************
63 2016-08-02 08:42 35 ****************
64 2016-08-02 08:43 34 ***************
65 2016-08-02 08:44 35 ****************
... ..(203 skipped). .. ****************
269 2016-08-02 12:08 35 ****************
270 2016-08-02 12:09 36 *****************
271 2016-08-02 12:10 35 ****************
... ..( 66 skipped). .. ****************
338 2016-08-02 13:17 35 ****************
339 2016-08-02 13:18 36 *****************
340 2016-08-02 13:19 36 *****************
341 2016-08-02 13:20 35 ****************
... ..( 17 skipped). .. ****************
359 2016-08-02 13:38 35 ****************
360 2016-08-02 13:39 36 *****************
361 2016-08-02 13:40 35 ****************
... ..( 65 skipped). .. ****************
427 2016-08-02 14:46 35 ****************

SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)

SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000a 2 5 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x8000 4 46869469 Vendor specific
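
For anyone trying to read the three entries in the extended error log above, the "registers were" rows can be decoded mechanically. The sketch below is an illustration under stated assumptions: `decode_error_registers` is a made-up helper, the row is assumed to have exactly the 13-field layout printed in this output (ER, a "--" placeholder, ST, two COUNT bytes, three LBA_48 bytes, LH, LM, LL, DV, DC), and only the error-register bits I am confident about from the ATA command set (ICRC, UNC, IDNF, ABRT) are named.

```python
# Error register bits named here; other bits are left undecoded.
ERROR_BITS = (
    (0x80, "ICRC"),  # interface CRC error
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # ID not found
    (0x04, "ABRT"),  # command aborted
)

def decode_error_registers(line):
    """Decode one 'registers were' row of the extended error log.

    Expected layout:
    ER -- ST COUNT(2 bytes) LBA_48(3 bytes) LH LM LL DV DC
    """
    fields = line.split()
    if len(fields) != 13:
        raise ValueError("expected 13 whitespace-separated fields")
    error = int(fields[0], 16)
    status = int(fields[2], 16)
    count = int("".join(fields[3:5]), 16)
    # The six LBA bytes are printed most significant first, so joining
    # them yields the 48-bit LBA directly.
    lba = int("".join(fields[5:11]), 16)
    flags = [name for bit, name in ERROR_BITS if error & bit]
    return {"error": error, "status": status, "count": count,
            "lba": lba, "flags": flags}

if __name__ == "__main__":
    # Row taken from "Error 2" and "Error 1" in the output above.
    print(decode_error_registers("40 -- 51 00 18 00 00 00 f0 8c d0 40 00"))
```

All three logged errors carry the same LBA bytes (f0 8c d0), so this decoding points at a single location, LBA 15764688, rather than at three separate ones.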

Post by Bruce Allen
I would not worry about this. It could have to do with the normalization of this raw value, which might render the actual value meaningless. Do you have other drives with identical hardware and firmware to compare with?

I have 6 identical disks in this server. Only one has a nonzero raw read
error rate and some errors logged many years ago (about 194 hours since
power on). This has nothing to do with the smartctl output posted above;
it is a different disk in a different server.

------------------------------------------------------------------------------

Robert S

2016-08-02 12:32:59 UTC

My thresholds are based mostly on the fact I don't get to monitor
most of the hard drives. I get to see them once a month, if that. I also
know the data is not backed up often, if at all, so I tend to be very
conservative. It is cheaper to replace than to do data recovery.

And for those I can monitor regularly, they are in servers where
scheduling downtime is painful, and involves unpaid overtime for me
(salaried). Having a server fail in the middle of production due to a
dying hard drive is not good career-wise. Again, a whiff of problems and
I replace. Since these are hot-pluggable in a RAID, I don't have to do
unpaid overtime. Things just get a little slow while the RAID rebuilds.

If the drives are personal and you have multiple backups, then I might
let a few more errors go, especially if it is stable.


Gandalf Corvotempesta

2016-08-02 12:54:16 UTC

Post by Robert S
My thresholds are based mostly on the fact I don't get to monitor most
of the hard drives. I get to see them once a month, if that. I also know the
data is not backed up often, if at all, so I tend to be very conservative.
It is cheaper to replace than to do data recovery.
And for those I can monitor regularly, they are in servers where scheduling
downtime is painful, and involves unpaid overtime for me (salaried). Having
a server fail in the middle of production due to a dying hard drive is not
good career-wise. Again, a whiff of problems and I replace. Since these are
hot-pluggable in a RAID, I don't have to do unpaid overtime. Things just get
a little slow while the RAID rebuilds.

These disks are on production servers.
Yes, I have backups, but like everybody's, they are never updated in
real time, and recovering a failed 3TB RAID-6 would be a real pain in the
ass, as I have 22 virtual machines on it.
But rebuilding a 600GB disk in this RAID-6 would take 28 hours, and
these are very expensive disks.
Replacing a working disk with another would mean throwing away about
$300-400, and I want to be sure that this disk really has issues.

------------------------------------------------------------------------------

Norman Diamond

2016-07-29 00:45:40 UTC

Thank you for explaining that a current pending sector isn't always going to be reallocated by the drive. I don't see how soft error differs from bad sector though. Also a medium error is a bad sector but the drive thinks a write fixed the sector.

However, you didn't address my question. I'm trying to estimate the number of sectors already found to be bad. I was adding the raw value of attributes 5 and 197 to get a quick estimate. My question is whether to add raw values of attributes 187 and 198 as well, or are these redundant (already included in) attributes 5 and 197?

If the estimated number is below some psychological limit (not set by me) then I proceed to more intensive testing. If the estimate is over the psychological limit then the drive can be assigned to scrap without needing any further testing. So I'm trying to make a reasonable estimate.


------------------------------------------------------------------------------

r***@spotswood-computer.net

2016-08-01 19:42:26 UTC

If you ASSUME that pending sectors are bad, then yes, add reallocated and
pending. Generally though, if there are only 1 or 2, they are not bad.

I wouldn't add 187 and 198, as they can indicate unrelated problems.


------------------------------------------------------------------------------