[smartmontools-support] Trouble finding bad block on WDC WD1600SB-01KBA0

Discussion:

mathog

2017-01-10 01:12:49 UTC

Hi,

One system has a WDC WD1600SB-01KBA0 which shows

197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 1

Sadly, it does not list the block number in any of the test results
(or the system log files).

Tried these steps to find the pending sector...

reboot (to clear cache)
# log in once it came back up
dd if=/dev/sda of=/dev/null bs=512

which completed without error. Then tried

smartctl -t long /dev/sda

and that also completed without error.

However, "smartctl -a" still shows a pending sector.

Is there some other trick to find the thing?
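
For watching just that counter between attempts, here is a minimal sketch that pulls the raw value out of the attribute table. It assumes the usual ten-column `smartctl -A` layout with each attribute on a single line (not wrapped as in the mail above). `smart_raw_value` is a made-up helper name, not part of smartmontools.

```shell
#!/bin/sh
# Print the RAW_VALUE column (10th field) of one SMART attribute,
# selected by its numeric ID. Reads `smartctl -A` style text on stdin.
# Assumption: one attribute per line, columns separated by whitespace.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $10; exit }'
}
```

On a live system this would be used as `smartctl -A /dev/sda | smart_raw_value 197`, run as root.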

Thanks,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

Carlos E. R.

2017-01-10 10:07:49 UTC

Post by mathog
Hi,
One system has a WDC WD1600SB-01KBA0 which shows
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 1
Sadly, it does not list the block number in any of the test results
(or the system log files).
Tried these steps to find the pending sector...
reboot (to clear cache)

There is a way to clear it without reboot. Let me see... [...]

Post by mathog
To free pagecache: echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches

Issue "sync" first, since drop_caches only frees clean (already written) objects.

Post by mathog
/sbin/sysctl -q -w vm.drop_caches=3
Using /sbin/sysctl is equivalent to the "echo >/proc/sys/..." line
above.
# log in once it came back up
dd if=/dev/sda of=/dev/null bs=512
which completed without error. Then tried
smartctl -t long /dev/sda
and that also completed without error.
However "smartctl -a " still shows a pending sector.

The same thing happened to me recently.

Post by mathog
Is there some other trick to find the thing?

I ran "badblocks" with the intention of locating the pending sectors,
and they disappeared...

--
Cheers / Saludos,

Carlos E. R.
(from 42.2 x86_64 "Malachite" at Telcontar)

mathog

2017-01-11 01:15:16 UTC

Post by Carlos E. R.

I ran "badblocks" with the intention of locating them, and they
disappeared...

Carlos E. R.

2017-01-11 02:14:48 UTC

Post by mathog

Post by Carlos E. R.
I ran "badblocks" with the intention of locating them, and they
disappeared...

Yes, same thing here. I don't remember what value that parameter had.

Post by mathog
So that worked.
Now, for the next time, is there a command one can use
while the OS is running and the disk mounted that can do something
similar?

Previously I worked out the location from the point at which the long
self-test stopped.
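
When a self-test does stop on a read failure, the address shows up in the last column of the self-test log (LBA_of_first_error). A rough sketch for extracting it follows. `selftest_error_lbas` is a hypothetical helper name, and the column layout is assumed to match what `smartctl -l selftest` normally prints for ATA drives. It would not have helped in David's case, since his long test completed without error.

```shell
#!/bin/sh
# Print the LBA_of_first_error of every self-test log entry that has one.
# Reads `smartctl -l selftest` style text on stdin.
# Log entries start with "#"; entries that completed without error show
# "-" in the last column and are skipped.
selftest_error_lbas() {
    awk '/^#/ && $NF ~ /^[0-9]+$/ { print $NF }'
}
```

Usage would be `smartctl -l selftest /dev/sda | selftest_error_lbas`. Self-tests run inside the drive, so they can generally be started while the filesystem is mounted.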

Post by mathog
badblocks -n isn't happy running on mounted disks, and that badblocks
command took ~4.5 hours.

Yes, it runs for a very long time. I'm unsure whether my disk was
mounted or not.
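
For a disk that is in use, a plain read-only pass does not need the filesystem unmounted. Below is a sketch, not a replacement for badblocks: it reads a device or file one chunk at a time with dd and prints the index of every chunk that fails, which narrows the search to a range. `scan_chunks` and its arguments are names invented for the example. As this thread shows, a read-only pass may also find nothing at all.

```shell
#!/bin/sh
# Read TARGET in NCHUNKS pieces of CHUNK_BYTES each and print the index
# (0-based) of every chunk that dd fails to read. Prints nothing when
# all chunks read cleanly. Nothing is ever written to TARGET.
scan_chunks() {
    target=$1
    chunk_bytes=$2
    nchunks=$3
    i=0
    while [ "$i" -lt "$nchunks" ]; do
        # skip= is counted in units of bs, so chunk i starts at i * chunk_bytes
        dd if="$target" of=/dev/null bs="$chunk_bytes" skip="$i" count=1 \
            2>/dev/null || echo "$i"
        i=$((i + 1))
    done
}
```

For example, `scan_chunks /dev/sda 1048576 1000` would read the first 1000 MiB. Starting one dd per chunk is slow, so large chunks make sense for a first pass. Reads may be served from the page cache; with GNU dd, adding iflag=direct to the dd line is one way around that.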

--
Cheers / Saludos,

Carlos E. R.
(from 42.2 x86_64 "Malachite" at Telcontar)

Bruce Allen

2017-01-11 09:41:14 UTC

David, I think UDMA_CRC_ERROR_COUNT refers to a CRC error on the
data bus. If that is right then a single count can be safely ignored; if it
is recurring I would try cleaning and replugging the data connections to
the drive. Cheers, Bruce

Post by mathog

Post by Carlos E. R.

I ran "badblocks" with the intention of locating them, and they
disappeared...

Rebooted that node into PLD Rescue CD over the network, ssh'd into it
and ran
badblocks -nvs /dev/sda >/tmp/bb.log 2>&1 &
Somewhere along the line the pending sector cleared, but there was no
message giving the block number, and it said there were no errors. The
UDMA_CRC_ERROR_COUNT is still 1.
So that worked.
Now, for the next time, is there a command one can use while the OS is
running and the disk mounted that can do something similar?
badblocks -n isn't happy running on mounted disks, and that badblocks
command took ~4.5 hours.
Thanks,
David Mathog
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Smartmontools-support mailing list
https://lists.sourceforge.net/lists/listinfo/smartmontools-support

r***@spotswood-computer.net

2017-01-11 15:25:34 UTC

I've also seen UDMA_CRC_ERROR_COUNT driven up by a bad SATA cable. With
a count of only 1, I wouldn't sweat it. In the case I remember, the count
was in the hundreds. It stopped climbing after I switched out the cable.
The existing one was noticeably frayed.

Post by Bruce Allen
David, I think the UDMA_CRC_ERROR_COUNT is referring to a crc error on the
data bus. If that is right then it can be safely ignored; if it is
recurring I would try and clean and replug the data connections to the
drive. Cheers, Bruce

--------------------------------------------------------------------
Bruce Allen, Adjunct Professor of Physics
Leonard E. Parker Center for Gravitation, Cosmology and Astrophysics
Physics Department
University of Wisconsin - Milwaukee
3135 N Maryland Ave
Milwaukee, 53211 USA
Tel: +1 414-229-4474
Fax: +1 414-229-5589

mathog

2017-01-11 17:42:14 UTC

Post by r***@spotswood-computer.net
I've also seen the UDMA_CRC_ERROR_COUNT caused by a bad SATA cable.

OK, I will make a note to look at the cable if this unit ever has
problems like this again. (It could have been a gamma ray or something
hitting a gate, right?)

Now I'm trying to understand what happened here. My best guess is that
it went something like this (leaving out a few steps):

1. Some issue with cable, connectors, radiation etc. arose.
2. A write to a specific block, presumably with new data, ran
   into (1) and failed.
3. ??? The disk shuffled that data off to a temporary location
   (spare physical block, flash, or ?) and set the pending count and
   UDMA_CRC_ERROR_COUNT.
4. A read of the entire disk found no errors because the disk retrieved
   either the [??? old or new] contents without problems.
5. The system was powered down for several minutes and started back up.
   The pending block and UDMA_CRC_ERROR_COUNT were still set. Presumably
   this means the pending data was stored in a nonvolatile location.
6. badblocks -nvs read the bad block [??? old or new] data and then
   wrote it back to disk. It saw no errors while doing so because
   (1) was no longer a problem. This time the write succeeded
   and the pending block count was reset to 0. The reallocated block
   count stayed 0. Either it didn't reallocate the block, or it did and
   didn't increment the counter.

So the question is - _which_ data is in that iffy block now? Is it the
data which caused the failed write in the first place, or whatever was
there before the write? Hopefully it is the former!

Thanks,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

Carlos E. R.

2017-01-11 20:58:54 UTC

Post by mathog

Post by r***@spotswood-computer.net
I've also seen the UDMA_CRC_ERROR_COUNT caused by a bad SATA cable.

OK, I will make a note to look at the cable if this unit ever has
problems like this again. (It could have been a gamma ray or something
hitting a gate, right?)
Now I'm trying to understand what happened here. My best guess is that
1. some issue with cable, connectors, radiation etc. arose.
2. a write to a specific block, presumably with new data, ran
into (1) and failed.

If this happens during a write, the sector is reallocated. If it happens
during a read, reallocation is postponed and the sector is noted. I don't
know whether the drive keeps a list, or how to read such a list.

Reallocation happens during an attempted write, to another permanent
location.

I don't know what happened during the badblocks run, because it is a read
operation.

--
Cheers / Saludos,

Carlos E. R.
(from 42.2 x86_64 "Malachite" at Telcontar)

mathog

2017-01-11 21:12:39 UTC

Post by Carlos E. R.
If this happens during a write, the sector is reallocated. If it happens
during a read, reallocation is postponed and the sector noted. I don't
know if it creates a list and how to read that list.

Subsequent reads of the whole disk did not log any errors, nor did
they clear the "current pending sector" count. That's the odd part:
the disk had somewhere stored "there is some problem with block N" and
incremented the pending sector count, but it seems that several reads
from block N (wherever that was) which completed without error were not
enough to change its mind.

It seems like a major shortcoming in the SMART protocol that there is no
"list the pending sectors" command. The disk must have this
information, otherwise we cannot explain the way it behaved in this
case.

Post by Carlos E. R.
Reallocation happens during an attempted write, to another permanent
location.

Agreed.

Post by Carlos E. R.
I don't know what happened during the badblock run, because it is a read
operation.

With -nvs there is also a write after the read: -n selects the
non-destructive read-write mode. It presumably read the
iffy block successfully (for about the 6th time) and when it wrote it
back the flag finally cleared. It may or may not have been reallocated,
but if it was, the counter did not increment. It seems about as likely
that the disk just cleared the flag. Perhaps the firmware at that point
did a couple of read/write tests on its own and decided all was now OK.
We can't really know what the disk does "underneath" the level we
interact with.

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

Bruce Allen

2017-01-12 01:38:18 UTC

Hi David,

Post by mathog
It seems like a major shortcoming in the SMART protocol that there is no
"list the pending sectors" command. The disk must have this
information, otherwise we cannot explain the way it behaved in this case.

I agree.

The truth is that the entire SMART protocol is something of a hack. It was first implemented by a couple of vendors, then turned into an SFF "specification" which was subsequently actively withdrawn (meaning: the industry did its best to destroy every copy of the document in existence). Then VERY limited parts of that were included in the ATA specification, which were then gradually morphed into something with a different intent (on and off-line testing, rather than monitoring and failure prediction). All in all, SMART is useful, but it's also very flawed.

My personal hope is that over the coming ten years, the SSD will replace the HDD, and the devices and algorithms that underlie the SSD will become reliable enough that almost all of the SMART protocol and features become irrelevant and fade away. Time will tell.

Cheers,
Bruce

--------------------------------------------------------------------------
Prof. Dr. Bruce Allen, Director
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Callinstrasse 38
D-30167 Hannover, Germany
Tel +49-511-762-17145
Fax +49-511-762-17182
Email: ***@aei.mpg.de

Dan Lukes

2017-01-12 00:10:00 UTC

Post by Carlos E. R.
If this happens during a write, the sector is reallocated.
Reallocation happens during an attempted write, to another permanent
location.

Note that the reallocation may not occur if the write request doesn't
cover the entire physical sector (this can happen on an "advanced format"
disk). An error may simply be returned instead.

This behavior has been observed on WDC disk (but I don't remember the
exact model and firmware version).

So physical sector size and location need to be taken into consideration.
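
To make that concrete: on a drive with 512-byte logical and 4096-byte physical sectors, each physical sector holds 8 logical blocks, so a rewrite has to start on a multiple of 8 and cover 8 blocks. A small sketch of that arithmetic follows. `physical_sector_span` is a hypothetical helper, and it assumes physical sectors are aligned to LBA 0 (no alignment offset).

```shell
#!/bin/sh
# Given a logical block address and the logical and physical sector sizes
# in bytes, print "START COUNT": the first LBA of the enclosing physical
# sector and the number of logical blocks it contains.
# Assumption: physical sectors begin at LBA 0 with no alignment offset.
physical_sector_span() {
    lba=$1
    logical=$2
    physical=$3
    per=$((physical / logical))      # logical blocks per physical sector
    start=$((lba - lba % per))       # round down to the sector boundary
    echo "$start $per"
}
```

For LBA 1003 on a 512/4096 drive this gives start 1000 and count 8. On Linux the two sizes can be read from /sys/block/sda/queue/logical_block_size and /sys/block/sda/queue/physical_block_size.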

Dan

L.A. Walsh

2017-01-11 18:42:08 UTC

Post by mathog
Hi,
One system has a WDC WD1600SB-01KBA0 which shows
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always
198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always
Sadly, it does not list the block number in any of the test results
(or the system log files).
Is there some other trick to find the thing?

----
Most of the time you can't find the exact sector, but it's an
indication that the disk may have had to move the data to a spare sector.

Modern hard disks usually have 'tracks' of spare sectors that they can
reallocate (up to and including reallocating entire tracks) when they start
to detect weak and/or unreliable signals on _READ_. Such reallocations are
a sign that the disk is nearing the end of its useful life.

The smart diagnostics are not intended to be exact diagnostics but an
_Early_Warning_ system -- meaning that you had better move that data off to
a safer location.

Before the advent of the SMART diags, you could often hear a disk going
bad, as what were supposed to be sequential, linear reads weren't. You
could hear the excess seeking as the disk had to seek over to the
replacement sectors and back again.

Really -- you should be ready to replace this disk "soon" (as soon
as you can) and use any remaining life in it to make sure everything on
it is backed up.