[Smartmontools-support] Multi_Zone_Error_Rate Attribute is in state FAILING_NOW

Roberto Nibali

2003-09-03 19:34:25 UTC

Hello list,

I'm not subscribed (yet) so please cc me.

I became suspicious when I saw a message from my kernel telling me to
back up my data because my disk was going to die. I had a look at the
SMART parameters and was presented with the following output.

smartctl version 5.1-18 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: FUJITSU MHR2040AT
Serial Number: NJ41T291391J
Firmware Version: 40BA
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 1
Local Time is: Wed Sep 3 21:14:32 2003 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Off-line data collection status: (0x00) Offline data collection activity was
                                        never started.
                                        Auto Off-line Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete off-line
data collection:                 ( 468) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Automatic timer ON/OFF support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       52665100205
  2 Throughput_Performance  0x0005   100   100   020    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0003   094   093   025    Pre-fail  Always       -       24321
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       159
  5 Reallocated_Sector_Ct   0x0033   099   099   024    Pre-fail  Always       -       31
  7 Seek_Error_Rate         0x000f   100   100   047    Pre-fail  Always       -       131071
  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       4746660
 10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       149
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   074   074   000    Old_age   Always       -       94875
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       54
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       9931171
196 Reallocated_Event_Count 0x0032   099   099   000    Old_age   Always       -       30
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000f   036   033   060    Pre-fail  Always   FAILING_NOW 2883583
203 Run_Out_Cancel          0x0002   091   091   000    Old_age   Always       -       2589894639599

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged

I'm trying to understand what the following means, as there is no
indication that this drive is failing on me:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
200 Multi_Zone_Error_Rate   0x000f   036   033   060    Pre-fail  Always   FAILING_NOW 2883583

I think this has to do with the issues described in the following paper:

http://www.isi.edu/netstation/zcav/zcav.html

Now, I've been skimming through the net and the ATA/ATAPI-6 (d1410r1.pdf)
documentation, but I couldn't find any hint about what exactly is happening
to my hard disk. I also didn't find any useful information in the
source code. To make it short :), I would like to get some information
about this problem of mine, and if that is not possible, I'd like to get a
contact.

Thanks in advance for any pointers in that matter.

Best regards,
Roberto Nibali, ratz

--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc

Bruce Allen

2003-09-04 04:56:34 UTC

Hi Roberto,

Post by Roberto Nibali
I'm not subscribed (yet) so please cc me.

OK. Please continue to copy our correspondence to the list so that there
is a record of it in the archive.

Post by Roberto Nibali
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.

Take this warning seriously, please. In my experience (with other
vendors, however) in almost all cases it means that your disk will die
soon.

Post by Roberto Nibali
5 Reallocated_Sector_Ct 0x0033 099 099 024 Pre-fail
Always - 31

Your drive has already reallocated 31 bad sectors.

Post by Roberto Nibali
9 Power_On_Hours 0x0032 092 092 000 Old_age
Always - 4746660

You need the -v 9,seconds option. Your drive is 4746660 seconds = 1318
hours old. It should still be under warranty. Since it has failing SMART
status, Fujitsu should replace it.
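
The conversion above can be checked with a few lines of Python. This is
only a sketch of the arithmetic: the names are illustrative, and the
assumption that the raw value of Attribute 9 counts seconds on this drive
comes from the -v 9,seconds suggestion.

```python
# Raw value of Attribute 9 (Power_On_Hours) from the posted smartctl output.
RAW_POWER_ON = 4746660


def seconds_to_hours_and_days(raw_seconds):
    """Return (whole hours, fractional days) for a raw value assumed to be seconds."""
    hours = raw_seconds // 3600
    days = raw_seconds / 86400
    return hours, days


hours, days = seconds_to_hours_and_days(RAW_POWER_ON)
print(f"{hours} hours, about {days:.1f} days")  # prints "1318 hours, about 54.9 days"
```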

Post by Roberto Nibali
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always
- 54

Is this a laptop disk? (It's hot.)

Post by Roberto Nibali
196 Reallocated_Event_Count 0x0032 099 099 000 Old_age Always
- 30

This might mean that on one occasion, two sectors had to be reallocated at
the same time.

Post by Roberto Nibali
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always
- 0

Good -- no unreadable data yet.

Post by Roberto Nibali
200 Multi_Zone_Error_Rate 0x000f 036 033 060 Pre-fail Always
FAILING_NOW 2883583

I'm pretty sure that you also need
-v 200,writeerrorcount
because on this disk Attribute 200 is not Multi Zone Error Rate but Write
Error Count. If so, you should see the raw value incrementing over time.
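
Whatever the Attribute is called, the reason it is flagged is that its
normalized VALUE (036) has dropped below THRESH (060). The sketch below
applies that comparison to a few rows copied from the posted output. The
"less than or equal" rule and the In_the_past label are my understanding
of how the WHEN_FAILED column is derived, not a quote from the
smartmontools source, and the tuple layout and function name are made up
for this example.

```python
# (id, name, value, worst, thresh) copied from the posted smartctl output.
ATTRIBUTES = [
    (5, "Reallocated_Sector_Ct", 99, 99, 24),
    (197, "Current_Pending_Sector", 100, 100, 0),
    (200, "Multi_Zone_Error_Rate", 36, 33, 60),
]


def when_failed(value, worst, thresh):
    """Classify one Attribute using the assumed value <= threshold rule."""
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"


# Names of the Attributes that are at or below their threshold right now.
failing = [name for _, name, value, worst, thresh in ATTRIBUTES
           if when_failed(value, worst, thresh) == "FAILING_NOW"]
print(failing)
```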

Post by Roberto Nibali
SMART Self-test log structure revision number 1
No self-tests have been logged

You can get some additional confidence (that something is or is not
wrong) by running self-tests. Use -t short and -t long, then use -l
selftest to read the self-test log afterwards.

Warning: this read scans the disk. It's quite likely that if the disk is
failing it will provoke catastrophic failure.

Since these are write errors, though, a read scan of the disk, which is
what self-tests normally do, might not detect them.
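
For reference, the smartctl invocations mentioned so far can be collected
in one place. The sketch below only builds and prints the command lines;
it does not run them. The device node /dev/hda is an assumption, and the
-v 200,writeerrorcount spelling is the one given above for smartctl 5.1,
so check the man page of the installed version before relying on it.

```python
import shlex

DEVICE = "/dev/hda"  # assumption: replace with the actual device node


def smartctl_argv(*options, device=DEVICE):
    """Build an argument list for one smartctl call (nothing is executed)."""
    return ["smartctl", *options, device]


COMMANDS = [
    # Full report, with Attributes 9 and 200 interpreted as suggested above.
    smartctl_argv("-a", "-v", "9,seconds", "-v", "200,writeerrorcount"),
    # Self-tests; the long test is the read scan that the warning refers to.
    smartctl_argv("-t", "short"),
    smartctl_argv("-t", "long"),
    # Read the self-test log once a test has finished.
    smartctl_argv("-l", "selftest"),
]

for argv in COMMANDS:
    print(shlex.join(argv))
```

Keeping the commands as argument lists makes it easy to hand them to
subprocess.run later without worrying about shell quoting.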

Post by Roberto Nibali
I'm trying to understand what the following means, as there is no
indication that this drive is failing on me:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
UPDATED WHEN_FAILED RAW_VALUE
200 Multi_Zone_Error_Rate 0x000f 036 033 060 Pre-fail Always
FAILING_NOW 2883583

The disk firmware is trying hard to tell you that something serious is
going wrong with the disk's functioning.

Post by Roberto Nibali
http://www.isi.edu/netstation/zcav/zcav.html

As I said above, I don't think that for this disk it's multi-zone error
rate.

Post by Roberto Nibali
Now, I've been skimming through the net and the ATAPI 6 (d1410r1.pdf)
documentation but I couldn't find any hint on what exactly is happening
to my harddisk. I also didn't find any valuable information in the
source code. To make it short :), I would like to get some information
about this problem of mine and if this is not possible I'd like to get a
contact.

The individual Attributes and their meanings are NOT specified by the
ATA standard, so you won't learn much from that. What it does say is this
(search for "SMART RETURN STATUS" in the .pdf file):

This command is used to communicate the reliability status of the device
to the host at the host's request.

and:

6.14.4 Threshold exceeded condition: This condition occurs when the
device's SMART reliability status indicates an impending degrading or
fault condition.

You can read SFF-8035i to get a bit more insight.

I strongly recommend that you back up any data that is only on this disk,
and if it's possible, ask Fujitsu to replace the drive. They may ask you
to run one of their utilities, but it will also report the bad SMART
status and might give you a "secret error code" that will persuade Fujitsu
to replace the disk right away.

Cheers,
Bruce

Roberto Nibali

2003-09-04 20:28:09 UTC

Hi Bruce,

Post by Bruce Allen
OK. Please continue to copy our correspondence to the list so that there
is a record of it in the archive.

Sure thing.

Post by Bruce Allen

Post by Roberto Nibali
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.

Take this warning seriously, please. In my experience (with other
vendors, however) in almost all cases it means that your disk will die
soon.

I've been getting this message in my logs since July 10; I only noticed
it yesterday as I was browsing through the log files. This hard disk
should have died on me 2 months ago.

Post by Bruce Allen

Post by Roberto Nibali
5 Reallocated_Sector_Ct 0x0033 099 099 024 Pre-fail
Always - 31

Your drive has already reallocated 31 bad sectors

Post by Roberto Nibali
9 Power_On_Hours 0x0032 092 092 000 Old_age
Always - 4746660

You need the -v 9,seconds options. Your drive is 4746660 seconds = 1318

Yes, this improves legibility quite a bit, thanks ;).

Post by Bruce Allen
hours old. It should still be under warranty. Since it has failing SMART
status, Fujitsu should replace it.

The problem is that I bought it last year, after the first disk in my
laptop started dying. But I somehow managed to prolong the life of the
former disk by shutting the laptop down regularly. I did not have the
option of a backup as I was travelling from symposium to symposium. It
was only 55 days ago that I installed the new disk.

Post by Bruce Allen

Post by Roberto Nibali
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always
- 54

Is this a laptop disk (it's hot)

Yes, it's a laptop disk. Maybe you want to add it to the smartctl
regexp? It's a FUJITSU MHR2040AT.

[snip explanations]

Post by Bruce Allen

Post by Roberto Nibali
200 Multi_Zone_Error_Rate 0x000f 036 033 060 Pre-fail Always
FAILING_NOW 2883583

I'm pretty sure that you also need
-v 200,writeerrorcount
so on this disk Attribute 200 is not Multi Zone Error Rate but Write Error
Count. If so, you should see the raw value incrementing in time.

Thanks for this hint.

Post by Bruce Allen

Post by Roberto Nibali
SMART Self-test log structure revision number 1
No self-tests have been logged

You can get some additional confidence (that something is or is not
wrong) by running self-tests. Use -t short and -t long, then use -l
selftest to read the self-test log afterwards.

Yes, that's what I've seen in the code but ...

Post by Bruce Allen
Warning: this read scans the disk. It's quite likely that if the disk is
failing it will provoke catastrophic failure.

... that's what I feared too.

Post by Bruce Allen
Since these are write errors, though, read scanning the disk might not
detect them, which is what self-tests normally do.

Right.

Post by Bruce Allen
As I said above, I don't think that for this disk it's multi-zone error
rate.

Ok.

Post by Bruce Allen

The individual Attributes and their meanings are NOT specified by the
ATA standard. So you won't learn much from that. What it does say is this
This command is used to communicate the reliability status of the device
to the host at the host's request.
6.14.4 Threshold exceeded condition This condition occurs when the
device's SMART reliability status indicates an impending degrading or
fault condition.
You can read SFF-8035i to get a bit more insight.

I see, thanks!

Post by Bruce Allen
I strongly recommend that you back up any data that is only on this disk,
and if it's possible, ask Fujitsu to replace the drive. They may ask you

Unfortunately, I do not remember when and where I bought this disk. I'm
far away from home, where I might find the invoice, and calling up
various possible vendors (Dell, COS and Fujitsu) has not yet changed
anything with regard to my problem. I only hope that the disk will make
it another couple of days until I get back and can find out where I
bought it.

Post by Bruce Allen
to run one of their utilities, but it will also report the bad SMART
status and might give you a "secret error code" that will persuade Fujitsu
to replace the disk right away.

I doubt that I can contact them directly to replace my disk without an
invoice, right?

I thank you for your insights and wish you a nice day,
Roberto Nibali, ratz

--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc

Bruce Allen

2003-09-04 21:18:16 UTC

Post by Bruce Allen

Post by Roberto Nibali
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.

Take this warning seriously, please. In my experience (with other
vendors, however) in almost all cases it means that your disk will die
soon.

I've been getting this message since July 10 in my logs, I only noticed
yesterday as I was browsing through the log files. This harddisk should
have died on me 2 months ago.

Fujitsu's firmware may be very conservative. In my experience with Maxtor
and IBM disks, failing SMART status means "the end is near".

a backup as I was travelling from Symposium to Symposium. It was only 55
days ago when I installed the new disk.

Post by Bruce Allen

Post by Roberto Nibali
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always
- 54

Is this a laptop disk (it's hot)

Is the cooling fan in your laptop dead? That might be why the disks are
failing.

Yes, it's a laptop disk. Maybe you want to add it to the smartctl
regexp? It's a FUJITSU MHR2040AT.

I'll do this, sure.

Post by Bruce Allen
I'm pretty sure that you also need
-v 200,writeerrorcount
so on this disk Attribute 200 is not Multi Zone Error Rate but Write Error
Count. If so, you should see the raw value incrementing in time.

Thanks for this hint.

I'd keep an eye on the raw value -- if it starts to climb again that's a
bad sign.
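
One way to do that is to log the raw value of Attribute 200 at intervals
and look at the differences. A minimal sketch follows. Only the first
reading (2883583) comes from the output posted in this thread; the later
readings are invented placeholders, and the function name is illustrative.

```python
def raw_value_increases(readings):
    """Return the positive jumps between consecutive raw-value readings."""
    return [later - earlier
            for earlier, later in zip(readings, readings[1:])
            if later > earlier]


# First value is from the posted smartctl output; the rest are made-up
# placeholders standing in for later readings of the same Attribute.
readings = [2883583, 2883583, 2883590]

jumps = raw_value_increases(readings)
if jumps:
    print(f"raw value climbed {len(jumps)} time(s), total +{sum(jumps)}")
else:
    print("raw value unchanged")
```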

Post by Bruce Allen
I strongly recommend that you back up any data that is only on this disk,
and if it's possible, ask Fujitsu to replace the drive. They may ask you

I've never dealt with Fujitsu so I don't know if they'd be helpful or
not. But the disk is living on borrowed time.

Post by Bruce Allen
to run one of their utilities, but it will also report the bad SMART
status and might give you a "secret error code" that will persuade Fujitsu
to replace the disk right away.

I doubt that I can contact them directly to replace my disk without an
invoice, right?

I'm not sure. Anyway, if it has managed like this since July 10th, the
odds are good that it will live a few more days at least!

Cheers,
Bruce