[smartmontools-support] Long test taking ages
Gandalf Corvotempesta
2016-10-08 07:15:48 UTC
Hi all,
I have a server with a SMART background long test scheduled every
Saturday at 03:00.
This server has only four 300GB SAS 15K disks.

The SMART long test is taking ages and causing very high load on the
server (iowait above 30).

For example, currently I have:

Self-test execution status:             0% of test remaining
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   80     NOW                 - [-   -    -]


The background test has been running since 03:00. That is 6 hours, and
it is still running.

I had to cancel the other 3 background tests to keep the load at
reasonable values.

Some questions:
1) Is this normal?
2) AFAIK, a SMART test should not interfere with server load and should
be "transparent".
3) Why 0% remaining and "self test in progress" at the same time?

This doesn't happen on all servers, only on a couple of them (same
operating system, same kind of disks).
Carlos E. R.
2016-10-08 13:37:59 UTC
Post by Gandalf Corvotempesta
Hi to all
I have a server with scheduled SMART background long test every
saturday at 03 AM
This server has only 4 300GB SAS 15K disks.
The smart long test is taking ages and making very high load on server
(iowait more than 30)
Hi,

I'm just a user, but I think I can answer some of your questions.

The test runs completely inside the hard disk, using its own code in the
disk firmware. I.e., smartmontools can start/stop the test, but has no
influence on what the test does or how.

The test is transparent, but it does influence the operation of the disk,
making computer disk operations slower, and conversely, those operations
make the test slower. This is especially true while the disk is testing
for surface errors.

Thus you should schedule the long test to happen when the hard disk is
basically idle. Probably not all disks at the same time. Perhaps a
region of the disk at a time. Different hard disk brands may handle
this differently.
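
A minimal sketch of the "not all disks at the same time" idea, assuming the tests are scheduled through smartd's -s directive, whose regular expression is matched against a T/MM/DD/d/HH string (d is the day of week, 1 = Monday to 7 = Sunday). The helper name, the device names, and the choice of one disk per night are illustrative assumptions, not something taken from this thread.

```python
from typing import List


def staggered_long_tests(devices: List[str], first_day: int = 6,
                         hour: int = 3) -> List[str]:
    """Build smartd.conf lines that run one long self-test per night.

    first_day uses smartd's numbering (1 = Monday .. 7 = Sunday), so the
    default of 6 puts the first disk on Saturday. With more than seven
    devices the days wrap around and some disks share a night.
    """
    if not 1 <= first_day <= 7:
        raise ValueError("first_day must be in 1..7")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    lines = []
    for index, device in enumerate(devices):
        day = (first_day - 1 + index) % 7 + 1  # next disk, next weekday
        # "L" selects the long self-test; ".." matches any month and day.
        lines.append(f"{device} -a -s L/../../{day}/{hour:02d}")
    return lines


if __name__ == "__main__":
    # Hypothetical device names for a four-disk server.
    for line in staggered_long_tests(["/dev/sda", "/dev/sdb",
                                      "/dev/sdc", "/dev/sdd"]):
        print(line)
```

With four disks this yields one test each on Saturday, Sunday, Monday and Tuesday at 03:00, so a slow test on one disk does not coincide with a test on another.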


As for your other questions, I can't say.
--
Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
Gandalf Corvotempesta
2016-10-08 14:24:03 UTC
Post by Carlos E. R.
The test is transparent, but does influence the operation of the disk,
making computer disk operation slower, and conversely, those operations
make the test slower. This is especially true while the disk is testing
for surface errors.
Thus you should program the long test to happen when the hard disk is
basically idle. Probably not all disks at the same time. Perhaps a
region of the disk at a time. It may be possible that different hard
disk brand handle differently.
Thank you for the response.
My biggest concern is that this is happening on only some of my servers.
I have about a hundred servers, and on only 2 or 3 of them the SMART test
is causing very high load.
And these servers aren't the most heavily loaded ones.
Carlos E. R.
2016-10-08 14:45:31 UTC
Post by Gandalf Corvotempesta
Post by Carlos E. R.
The test is transparent, but does influence the operation of the disk,
making computer disk operation slower, and conversely, those operations
make the test slower. This is especially true while the disk is testing
for surface errors.
Thus you should program the long test to happen when the hard disk is
basically idle. Probably not all disks at the same time. Perhaps a
region of the disk at a time. It may be possible that different hard
disk brand handle differently.
Thank you for the response.
My biggest concern is that this is happening on only some of my servers.
I have about a hundred servers, and on only 2 or 3 of them the SMART test
is causing very high load.
And these servers aren't the most heavily loaded ones.
I don't think the amount of load matters so much as whether the load is
constant. Whenever the load issues read/write requests to the disk, the
surface test has to pause, and it doesn't pause instantly.

I guess the disk brand, or version of its firmware, matters too.

I would compare the exact hardware and type of load on those machines
to find what they have in common.
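
One possible way to do the hardware half of that comparison, sketched under the assumption that you have saved the output of "smartctl -i" for each disk and that it contains the "Vendor:", "Product:" and "Revision:" lines smartctl prints for SCSI/SAS devices. The function names and the "host:/device" labels are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches e.g. "Revision:             A001" at the start of a line.
FIELD = re.compile(r"^(Vendor|Product|Revision):[ \t]*(.+?)[ \t]*$",
                   re.MULTILINE)

Identity = Tuple[str, str, str]


def disk_identity(info_text: str) -> Identity:
    """Extract (vendor, product, firmware revision) from smartctl -i text.

    Fields that are missing from the text are reported as "?".
    """
    fields = dict(FIELD.findall(info_text))
    return (fields.get("Vendor", "?"),
            fields.get("Product", "?"),
            fields.get("Revision", "?"))


def group_by_identity(reports: Dict[str, str]) -> Dict[Identity, List[str]]:
    """Group labels such as "host:/dev/sda" by the disk identity found
    in their saved smartctl -i output."""
    groups: Dict[Identity, List[str]] = defaultdict(list)
    for label, text in reports.items():
        groups[disk_identity(text)].append(label)
    return dict(groups)
```

If the slow servers all land in one group (same model and firmware revision) and the healthy ones in another, that points at the disks rather than at the workload.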
--
Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
Gandalf Corvotempesta
2016-10-08 17:32:31 UTC
Post by Carlos E. R.
I would compare the exact hardware and type of load on those machines
trying to find coincidences.
OK, I'll check. Keep in mind that when this happens, SMART reports 0%
remaining for many, many hours.

Another question: do you know the meaning of "Require Write or
Reassign Blocks command",
shown when running "smartctl -x" on a SAS disk?
Carlos E. R.
2016-10-08 18:17:08 UTC
Post by Gandalf Corvotempesta
Post by Carlos E. R.
I would compare the exact hardware and type of load on those machines
trying to find coincidences.
Ok, i'll check. Keep in mind that when this happens, smart reports 0%
remaining for many many hours.
I don't know about that.
Post by Gandalf Corvotempesta
Another question: do you know the meaning of: "Require Write or
Reassign Blocks command"
shown doing "smartctl -x" on a SAS disk?
I'm not familiar with "-x", I normally use "-a". But let me guess. It
may be related to the Current_Pending_Sector and Offline_Uncorrectable
lines in the table of SMART attributes.

When the disk hardware detects an error in a sector, it remaps that
sector on the next write to it, using a spare area that was reserved
during manufacture for that purpose. Typically the error is detected on
a read, so the disk waits for a write, and in the meantime a non-zero
value appears on those lines. My brute force approach is to write zeroes
to the entire partition (or disk) in order to force the reallocation,
then restore from backup.

If this is what is happening to you, it would also explain your problem.
When the disk hits a bad sector, the system tries several times to read
it, forcing everything else to wait. It would also explain why the other
computers in your installation are fine: they have no bad sectors.
--
Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
Gandalf Corvotempesta
2016-10-08 18:44:29 UTC
Post by Carlos E. R.
I'm not familiar with "-x", I normally use "-a". But let me guess. It
may be related to the Current_Pending_Sector and Offline_Uncorrectable
lines in the table of smart attributes.
I don't have them on SAS. SMART for SAS is different from SMART for SATA.
Post by Carlos E. R.
If this is what happens to you, it would also explain your problem. When
the disk hits a bad sector, the system tries several times to read it,
forcing everything else to wait. Thus other computers in your
installation would be fine, no bad sectors.
SAS has a "grown list" that should indicate the number of remapped sectors.
In my case, the grown list is 0, but I still have some of those strings
in the "-x" output
and I don't know what they mean.

I've opened a thread on the smartmon mailing list but nobody is answering.
Carlos E. R.
2016-10-08 19:07:38 UTC
Post by Gandalf Corvotempesta
Post by Carlos E. R.
I'm not familiar with "-x", I normally use "-a". But let me guess. It
may be related to the Current_Pending_Sector and Offline_Uncorrectable
lines in the table of smart attributes.
I don't have them on SAS. SMART for SAS is different from SMART for SATA.
Post by Carlos E. R.
If this is what happens to you, it would also explain your problem. When
the disk hits a bad sector, the system tries several times to read it,
forcing everything else to wait. Thus other computers in your
installation would be fine, no bad sectors.
SAS has a "grown list" that should indicate the number of remapped sectors.
In my case, the grown list is 0, but I still have some of those strings
in the "-x" output
and I don't know what they mean.
Could you paste the result here? Look at these values in your case:

  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

Ah, wait. You said that SAS disks don't have these values? I don't have
any SAS disks. I am a professional, but I don't maintain any installation
currently, mostly machines at home. Sorry, I can't help with interpreting
SAS output.

My guess is that your hardware has detected the presence of bad sectors
and is waiting for a chance to relocate them. This happens on write
attempts, or perhaps via a special command, which is what the text you
posted seems to say.

There must be some value that counts the sectors waiting to be
relocated.

The problem comes when something tries to read any of those bad sectors,
as happens during a test. The entire computer has to wait, which causes
big slowdowns.


You need to evaluate how many bad sectors there are on that disk, find
out whether the figure is stable or growing and how old the disk is
(thousands of hours), then decide whether to keep the disk or replace
it. A few bad sectors are normal on any disk, but not if the figure keeps
increasing. Some people replace a disk at the first error. I think that's
excessive, but it depends on how important the data is and on the
purchasing power you have.
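
A rough sketch of the "stable or growing" check, assuming the figure being tracked is the "Elements in grown defect list" line that smartctl prints for SCSI/SAS disks (the "grown list" mentioned earlier in the thread). Whether that counter also reflects sectors still waiting for reassignment is not established here, and the function names are hypothetical.

```python
import re
from typing import List, Optional

GROWN = re.compile(r"Elements in grown defect list:\s*(\d+)")


def grown_defects(smartctl_text: str) -> Optional[int]:
    """Return the grown defect count from saved smartctl output,
    or None if the line is not present."""
    match = GROWN.search(smartctl_text)
    return int(match.group(1)) if match else None


def is_growing(history: List[Optional[int]]) -> bool:
    """history holds counts from successive runs, oldest first.

    Missing readings (None) are skipped. The disk is flagged only if
    the newest reading is higher than the oldest one.
    """
    counts = [count for count in history if count is not None]
    return len(counts) >= 2 and counts[-1] > counts[0]
```

Feeding it one saved report per week, for example, gives a simple yes/no answer to whether the count has risen over the observed period; how large a rise justifies replacing the disk remains a judgment call.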
Post by Gandalf Corvotempesta
I've opened a thread on smartmon mailing list but nobody is answering.
You mean this list? I'm new here. I jumped into the thread when I saw
there were no answers. It seems few people are answering.
--
Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)