Discussion:
[smartmontools-support] HD dying?
David Niklas
2017-04-06 00:26:12 UTC
Hello,
First of all, I *have* been backing up my data.
I'm going to post LOTS of details here, feel free to skim.

My problem is that once upon a time my drive failed < 6 months after I
bought my laptop. Sending it to a professional did not help, nor did
replacing the PCB; it was dead.
The symptom leading up to the event was a sudden freeze of the OS. I was
not too bright about Linux at the time, so I thought that perhaps X froze.
Now I'm getting the identical thing, a sudden freeze. I can ping the
kernel, but I cannot restore the frame buffer, sync, or umount the
file systems. My syslog, metalog, records no messages during this period;
it is set to sync the dmesg messages. I cannot ssh, but I can use SysRq
to reboot. I'm using OpenRC.

This has happened two or three times.
I just ran a self-test and it says PASSED; I'm not seeing anything that
stands out.

smartmontools-6.4
Gentoo Linux 4.9.x

Below is my S.M.A.R.T. data. BTW: it is unwrapped.
What do you think?
Thanks, David


=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue Mobile
Device Model: WDC WD7500BPVX-22JC3T0
Serial Number: WD-WXC1A14E1823
LU WWN Device Id: 5 0014ee 209f3d675
Firmware Version: 01.01A01
User Capacity: 750,156,374,016 bytes [750 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Apr 4 14:47:54 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (13920) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 157) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   196   179   021    Pre-fail  Always       -       1166
  4 Start_Stop_Count        0x0032   058   058   000    Old_age   Always       -       42367
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13532
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1695
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       124
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       158
193 Load_Cycle_Count        0x0032   183   183   000    Old_age   Always       -       51774
194 Temperature_Celsius     0x0022   107   091   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     13529         -
# 2  Extended offline    Completed without error       00%     12288         -
# 3  Extended offline    Completed without error       00%      9247         -
# 4  Extended offline    Completed without error       00%      7609         -
# 5  Extended offline    Completed without error       00%      5469         -
# 6  Short offline       Completed without error       00%         0         -
# 7  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Carlos E. R.
2017-04-06 10:09:38 UTC
Post by David Niklas
What do you think?
No evidence of problem here, that I can see.

If it were the disk, you would typically see the kernel complaining
in "dmesg".
--
Cheers / Saludos,

Carlos E. R.

(from 42.2 x86_64 "Malachite" (Minas Tirith))
r***@spotswood-computer.net
2017-04-06 14:29:17 UTC
I'll second "Carlos E. R."'s verdict. I see nothing wrong either. However,
that does not guarantee there isn't something wrong. Somewhere I read a
study that said SMART only predicts about 60% of hard drive failures. The
other 40% give no warning.

Backups are always a good idea. They protect not only against hard drive
failures, but also against accidental or malicious data loss. Now, have you
tested those backups? I remember, when I was freelancing, going to a brand
new client (first visit). They needed me to do a restore. OK, I got their
backup media (it was back in the Zip disk days). Every disk was
write-protected and blank. Needless to say, that day didn't go well.
_______________________________________________
Smartmontools-support mailing list
https://lists.sourceforge.net/lists/listinfo/smartmontools-support
Robin H. Johnson
2017-04-06 20:51:51 UTC
Post by David Niklas
The symptom leading up to the event was a sudden freeze of the OS. I was
not too bright about Linux at the time, so I thought that perhaps X froze.
Now I'm getting the identical thing, a sudden freeze. I can ping the
kernel, but I cannot restore the frame buffer, sync, or umount the
file systems. My syslog, metalog, records no messages during this period;
it is set to sync the dmesg messages. I cannot ssh, but I can use SysRq
to reboot. I'm using OpenRC.
Metalog would only be useful if writes to disk were succeeding. It's
certainly possible for the kernel to hang in such a state that there is
kernel panic, and writes to disk are not happening (this includes
sending the sysrq-sync command).

That you can ping the kernel simply says that there's enough left
running for the kernel to handle ICMP without going to userspace.

That you can't SSH says something in userspace failed (which could be a
myriad of reasons).

Just because the system seems to freeze does not mean that the drive is
faulty. It's also entirely possible that there is a logged drive event in
dmesg that you can't see.

If you can repeat it, consider some of the following to get a better
insight as to what's going on.
- set up serial kernel console or network kernel console logging.
- set up kdump or similar.

That's not to say that the drive isn't the source of the problem, just
that it's not likely based on the output you've shown.

You say this is a laptop, and the drive by power hours has racked up
~1.5 years of usage, so it possibly hasn't been opened in at least that
long. How much dust has built up inside it? Overheating of the graphics
CAN cause the symptoms you've described.
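
As a rough check of the "~1.5 years" figure, the raw counters from the
posted report can be converted with a few lines of Python. This is only an
illustrative sketch: the constant and function names are made up here, and
a year is taken as 365 days.

```python
# Raw values copied from the SMART attribute table in the original post.
POWER_ON_HOURS = 13532   # attribute 9, Power_On_Hours
LOAD_CYCLES = 51774      # attribute 193, Load_Cycle_Count
POWER_CYCLES = 1695      # attribute 12, Power_Cycle_Count

HOURS_PER_YEAR = 24 * 365  # assumption: leap days ignored


def hours_to_years(hours: int) -> float:
    """Convert powered-on hours to years of continuous operation."""
    return hours / HOURS_PER_YEAR


def per_hour(count: int, hours: int) -> float:
    """Average number of events per powered-on hour."""
    return count / hours


print(f"powered on for about {hours_to_years(POWER_ON_HOURS):.2f} years")
print(f"about {per_hour(LOAD_CYCLES, POWER_ON_HOURS):.1f} head load cycles per hour")
print(f"about {POWER_ON_HOURS / POWER_CYCLES:.1f} powered-on hours per power cycle")
```

These are averages over the drive's whole life, so they say nothing about
when the freezes happened; they only put the raw counters in proportion.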
--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail : ***@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
David Niklas
2017-04-07 20:28:43 UTC
On Thu, 6 Apr 2017 20:51:51 +0000
Post by Robin H. Johnson
Post by David Niklas
The symptom leading up to the event was a sudden freeze of the OS. I
was not too bright about Linux at the time, so I thought that perhaps
X froze. Now I'm getting the identical thing, a sudden freeze. I can
ping the kernel, but I cannot restore the frame buffer, sync, or umount
the file systems. My syslog, metalog, records no messages during this
period; it is set to sync the dmesg messages. I cannot ssh, but I can
use SysRq to reboot. I'm using OpenRC.
Metalog would only be useful if writes to disk were succeeding. It's
certainly possible for the kernel to hang in such a state that there is
kernel panic, and writes to disk are not happening (this includes
sending the sysrq-sync command).
That you can ping the kernel simply says that there's enough left
running for the kernel to handle ICMP without going to userspace.
That you can't SSH says something in userspace failed (which could be a
myriad of reasons).
Just because the system seems to freeze does not mean that the drive is
faulty. It's also entirely possible that there is a logged drive event in
dmesg that you can't see.
If you can repeat it, consider some of the following to get a better
insight as to what's going on.
- set up serial kernel console or network kernel console logging.
- set up kdump or similar.
No, it's random so far.
Post by Robin H. Johnson
That's not to say that the drive isn't the source of the problem, just
that it's not likely based on the output you've shown.
Why not?
What else causes all writes to the drive to stop except a problem with
the drive or MB (my laptop has no cabling)?
Post by Robin H. Johnson
You say this is a laptop, and the drive by power hours has racked up
~1.5 years of usage, so it possibly hasn't been opened in at least that
long. How much dust has built up inside it? Overheating of the graphics
CAN cause the symptoms you've described.
The laptop is my primary way to get online; it's not been left off for more
than 2 days unless its HW failed (the original drive died).


So, I'm not misreading the S.M.A.R.T. data? No values that ought to be
interpreted in HEX, OCTAL or something?


Thanks,
David
Robin H. Johnson
2017-04-07 20:55:01 UTC
On Fri, Apr 07, 2017 at 04:28:43PM -0400, David Niklas wrote:
...
Post by David Niklas
Post by Robin H. Johnson
If you can repeat it, consider some of the following to get a better
insight as to what's going on.
- set up serial kernel console or network kernel console logging.
- set up kdump or similar.
No, it's random so far.
OK, get yourself network console logging: since networking was still
working, you can just let the kernel send a copy of all klog entries
over the network.

In the kernel sources, see Documentation/networking/netconsole.txt,
or the examples in the Ubuntu & Arch wikis.
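
On the receiving machine, anything that listens on the chosen UDP port and
writes to its own disk will do; netcat or a remote syslog daemon are the
usual choices. Purely as an illustration, a minimal Python listener might
look like the sketch below. It assumes netconsole's default target port of
6666 (change it if the sender is configured differently), and the function
names are made up for this example.

```python
import socket
from typing import List

NETCONSOLE_PORT = 6666  # assumption: netconsole's default target port


def open_listener(host: str = "0.0.0.0", port: int = NETCONSOLE_PORT) -> socket.socket:
    """Bind a UDP socket on which kernel log datagrams can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_lines(sock: socket.socket, count: int) -> List[str]:
    """Block until `count` datagrams arrive; return them tagged with the sender IP."""
    lines = []
    while len(lines) < count:
        data, (sender, _port) = sock.recvfrom(65535)
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        lines.append(f"{sender}: {text}")
    return lines
```

Calling `receive_lines(open_listener(), 1)` in a loop and appending each
result to a file on the receiving machine would leave a record that
survives the laptop's freeze, which is the whole point of logging off-box.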
Post by David Niklas
Post by Robin H. Johnson
That's not to say that the drive isn't the source of the problem, just
that it's not likely based on the output you've shown.
Why not?
What else causes all writes to the drive to stop except a problem with
the drive or MB (my laptop has no cabling)?
Most failure modes of a spinning drive would cause various error
counters to be incremented. The few that I could think of that wouldn't
involve specific component failures on the drive PCB.

Drive PCB-originating failures should NOT cause your video to lock up,
but may stop the logging to disk of any errors.

I can start up a Linux system running off a SATA drive, open a
terminal, suddenly disconnect the drive, and still be able to run dmesg
and/or see live kernel log entries (provided that dmesg itself is
already cached and running it doesn't need anything to be read off
disk).

So what we're looking for as root cause is some manner of error that
causes both video & drive to become unresponsive, but the kernel to
still respond to ICMP ping (ergo network stack is operational).

That root cause COULD have other effects (like a power spike that then
damages the drive PCB), but it's the root cause we care about.

One candidate is overheating causing a component fault (like a capacitor
going out of tolerance or failing) on one of the PCI/PCIe buses, thereby
affecting the drive & graphics. The networking might be on a different
bus and continue to function.
Post by David Niklas
Post by Robin H. Johnson
You say this is a laptop, and the drive by power hours has racked up
~1.5 years of usage, so it possibly hasn't been opened in at least that
long. How much dust has built up inside it? Overheating of the graphics
CAN cause the symptoms you've described.
The laptop is my primary way to get online; it's not been left off for more
than 2 days unless its HW failed (the original drive died).
So, I'm not misreading the S.M.A.R.T. data? No values that ought to be
interpreted in HEX, OCTAL or something?
No, the drive data seems good, and representative of a healthy &
well-used drive. No reallocated sectors, no other issues, not that many
power cycles even for a laptop drive w/ aggressive power saving.
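
On the HEX/OCTAL question: in this table only the FLAG column is
hexadecimal; VALUE, WORST and THRESH are zero-padded decimal, and the raw
values shown here are plain decimal. The reading above can be sketched as a
small Python check. This is a hedged illustration, not a smartmontools
feature: the names are invented here, and the choice of attributes 5, 197
and 198 as the raw counters to watch is an assumption of this example.

```python
from typing import List, NamedTuple


class Attribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value (decimal, zero-padded in the report)
    worst: int
    thresh: int
    raw: int


# A few rows copied from the posted report (columns separated by whitespace).
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 13532
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

# Assumption of this sketch: sector-health counters whose raw value should stay 0.
WATCH_RAW = {5, 197, 198}


def parse_attributes(text: str) -> List[Attribute]:
    """Parse attribute rows, skipping anything that is not a 10-column row."""
    attrs = []
    for line in text.splitlines():
        f = line.split()
        if len(f) < 10 or not f[0].isdigit():
            continue
        attrs.append(Attribute(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), int(f[9])))
    return attrs


def find_warnings(attrs: List[Attribute]) -> List[str]:
    """Report values at or below a nonzero threshold, and nonzero watched raw counters."""
    found = []
    for a in attrs:
        if a.thresh > 0 and a.value <= a.thresh:
            found.append(f"{a.name}: value {a.value} at or below threshold {a.thresh}")
        if a.attr_id in WATCH_RAW and a.raw > 0:
            found.append(f"{a.name}: raw count {a.raw}")
    return found


print(find_warnings(parse_attributes(SAMPLE)) or "nothing flagged")
```

For the rows above nothing is flagged, which matches the reading of the
report as a healthy drive. Raw values on other drives are not always a
single decimal number, so the parser would need adjusting there.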
--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail : ***@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136