Smartmon error message in the installer

BitJam
Developer
Posts: 2283
Joined: Sat Aug 22, 2009 11:36 pm

#1 Post by BitJam »

As kmathern explained:
m_pav recommended issuing a warning for the SMART attributes with 'Reallocated', 'Pending' or 'Uncorrect' in their names, that have a non-zero RAW VALUE.
It now appears this new mechanism may have issued some false alarms. We can't know for sure whether any particular warning is a false alarm, but on average most of them will be (I will explain this below). Here is a shot of the error message, which says:
The disk you selected for installation appears to be failing, as the disk health indicators (S. M. A. R. T.) warning above indicates.

We recommend you abort the installation and have the disk checked and replaced.
Typically what has happened (as reported in the forums) is that the user then runs smartctl or an equivalent tool, and it says the disk is fine.

I looked around for information on this and the two most useful things I found were Wikipedia and this 13-page research paper from Google written in 2007. I don't claim these are the absolute best resources. Perhaps other people can find better ones.

Figure 7 on page 7 of the paper shows the difference in failure rate between drives that have zero reallocation errors and drives that have one or more. With no errors, the failure rate is always under 5%. If there are one or more errors, the failure rate varies between 7% and 20%. Certainly, there is a strong correlation between one or more errors in this particular parameter and a higher failure rate.

But it is possible our warning message is too dire, for two reasons. First, we are flagging parameters based on name, and there could be parameters with matching names that don't correlate well with imminent failure. Second, even if we take the very worst case from Figure 7 and assume a non-zero value corresponds to a 20% failure rate, then 80% of the time we issue a warning it will be a false alarm (in the upcoming year at least).
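
To make the name-matching concern concrete, here is a minimal Python sketch of a name-based check. It is not the installer's actual code. The sample rows are invented, laid out in the columns that `smartctl -A` prints, and `parse_rows` and `flag_by_name` are hypothetical helper names.

```python
# Hypothetical attribute rows in the column layout printed by `smartctl -A`
# (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw).
# The values are invented for illustration.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

# Substrings the warning rule looks for in attribute names.
KEYWORDS = ("Reallocated", "Pending", "Uncorrect")


def parse_rows(text):
    """Yield (attribute_id, name, raw_value) for each well-formed row."""
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything without a numeric ID and numeric raw value.
        if len(fields) < 10 or not fields[0].isdigit() or not fields[9].isdigit():
            continue
        yield int(fields[0]), fields[1], int(fields[9])


def flag_by_name(text):
    """Return rows whose name contains a keyword and whose raw value is non-zero."""
    return [
        (attr_id, name, raw)
        for attr_id, name, raw in parse_rows(text)
        if raw > 0 and any(keyword in name for keyword in KEYWORDS)
    ]


print(flag_by_name(SAMPLE))
```

With this sample, the only row flagged is attribute 187 (`Reported_Uncorrect`): substring matching picks up any attribute whose name happens to contain one of the keywords, whether or not it is a good predictor of failure.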

One problem I've seen is that some users don't realize that this error is associated with the entire drive (for example sda) and not a particular partition (sda1, sda2, etc). Confusion is compounded by the fact that the results of the test done by the installer do not correlate with the results of the smartmon tools the user is told to run. If the Google paper is the most definitive research on this subject then we might be able to fix things by simply changing our error message to better correspond to what the paper actually says. Perhaps something like:
The smartmon tools report that the disk you selected for installation might have a higher than average failure rate in the upcoming year.
One question we need to answer is what to suggest to people if they get this error in the installer but the smartmon tools say the drive is healthy. It is possible the false alarm rate will end up making this new feature more trouble than it is worth. We should certainly wait to hear from m_pav before jumping to this conclusion.

lucky9
Posts: 475
Joined: Wed Jul 12, 2006 5:54 am

Re: Smartmon error message in the installer

#2 Post by lucky9 »

Remove the check. The health of a disk should be the concern of the person who's using it. No distro should be issuing warnings.

If it hasn't shown anything else this has shown that a 'feature' can be counter-productive.
Yes, even I am dishonest. Not in many ways, but in some. Forty-one, I think it is.
--Mark Twain

Adrian
Developer
Posts: 8267
Joined: Wed Jul 12, 2006 1:42 am

Re: Smartmon error message in the installer

#3 Post by Adrian »

Thanks for the post, BitJam.

No, I don't agree that the feature is counter-productive and needs to be removed. It needs to be fixed to avoid false positives, and that's what I did in the code. The next time anticapitalista builds the ISO, it will have the fix.

chrispop99
Global Moderator
Posts: 3174
Joined: Tue Jan 27, 2009 3:07 pm

Re: Smartmon error message in the installer

#4 Post by chrispop99 »

There has been at least one other distro that did a HDD check upon install, but I'm unable to recall which. (I try upwards of two a week, so it's easy to lose track!)

I think it is a feature worth keeping as long as it works as intended, and the end user is able to understand the consequences of continuing with a failing drive.

Chris
MX Facebook Group Administrator.
Home-built desktop - Core i5 9400, 970 EVO Plus, 8GB
DELL XPS 15
Lots of test machines

BitJam
Developer
Posts: 2283
Joined: Sat Aug 22, 2009 11:36 pm

Re: Smartmon error message in the installer

#5 Post by BitJam »

Adrian, I may be wrong but if my interpretation of the Google paper is correct, you only removed one form of false positive. If you look at, for example, Figure 7 of the Google paper, even if the field being checked is one of the valid ones, the chances of a false positive are still roughly 90%.

Another way of seeing it is that you are assuming the people who wrote the smart monitor tools are not as smart as you. Otherwise they would use similar criteria for flagging failing drives. They don't.

It is true that a raw count greater than zero for one of the valid fields does indicate that the chances of failure are greater than average, but the chances are still (usually) less than 10% according to the Google paper. m_pav seems rather adamant that this is an important check to make. When I tried to find documentation that supports it, all I found was an echo chamber that was based on incorrect interpretations of the Google paper.

My perspective is explained in a classic introductory problem in Bayesian Probability as explained here. The reason it is used as an introductory problem is because most people's intuition is totally wrong when it comes to problems like this.

I will say it again, the Google paper says a non-zero count in the valid fields indicates the failure rate in the coming year will roughly double. But the failure rate is still low, on the order of 10%. So nine times out of ten, when you report an error on a valid field, it will be a false positive. IOW, if you double a low error rate, you still have a low error rate, just not as low as before.
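
The arithmetic behind that argument can be sketched in a few lines of Python. The 5% baseline is an assumed round number chosen so that doubling it gives the "on the order of 10%" figure above; it is not a value read off the paper, and `warning_outcomes` is a hypothetical helper name.

```python
def warning_outcomes(flagged_drives, failure_rate_if_flagged):
    """Split flagged drives into expected failures and false alarms over one year."""
    failures = round(flagged_drives * failure_rate_if_flagged)
    return failures, flagged_drives - failures


baseline_rate = 0.05              # assumed yearly failure rate with a zero count
flagged_rate = 2 * baseline_rate  # "roughly double" for a non-zero count

failures, false_alarms = warning_outcomes(1000, flagged_rate)
print(f"Of 1000 warned users, about {failures} see a failure within the year")
print(f"and about {false_alarms} do not ({false_alarms / 1000:.0%} false alarms).")
```

Under these assumptions, 900 of every 1000 warnings concern a drive that survives the year, which is the "nine times out of ten" figure.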

I know m_pav has a lot of experience in the field and I respect his opinion. Maybe there are studies that have been done that make the Google results obsolete. Maybe I'm misreading the Google paper. But if you look at their conclusion, they say that their data does not justify the kind of simple smartmon test that is performed by the installer.

kmathern
Developer
Posts: 2406
Joined: Wed Jul 12, 2006 2:26 pm

Re: Smartmon error message in the installer

#6 Post by kmathern »

BitJam wrote:Adrian, I may be wrong but if my interpretation of the Google paper is correct, you only removed one form of false positives. ...
We discussed this in the "Development Team" forum (which you aren't a member of for some strange reason).

Initially he was only eliminating the false positive for attributes that had "Reported_Uncorrect" in their name (that was attribute #187 -- I think).

With the latest change it's been modified to only check attributes 5, 196, 197 & 198.
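
A minimal Python sketch of an ID-based check along those lines follows. It is an illustration under stated assumptions, not the installer's actual code: the sample rows are invented in the column layout that `smartctl -A` prints, and `parse_rows` and `flag_by_id` are hypothetical helper names.

```python
# Hypothetical attribute rows in the column layout printed by `smartctl -A`
# (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw).
# The values are invented for illustration.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# The attribute IDs named in this post.
CHECKED_IDS = {5, 196, 197, 198}


def parse_rows(text):
    """Yield (attribute_id, name, raw_value) for each well-formed row."""
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything without a numeric ID and numeric raw value.
        if len(fields) < 10 or not fields[0].isdigit() or not fields[9].isdigit():
            continue
        yield int(fields[0]), fields[1], int(fields[9])


def flag_by_id(text):
    """Return rows with a checked ID and a non-zero raw value."""
    return [
        (attr_id, name, raw)
        for attr_id, name, raw in parse_rows(text)
        if attr_id in CHECKED_IDS and raw > 0
    ]


print(flag_by_id(SAMPLE))
```

With this sample, attribute 197 is flagged, while the non-zero count on attribute 187 is ignored because its ID is not in the checked set.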

Jerry3904
Administrator
Posts: 21937
Joined: Wed Jul 19, 2006 6:13 am

Re: Smartmon error message in the installer

#7 Post by Jerry3904 »

Will make him a member when I get home.
Production: 5.10, MX-23 Xfce, AMD FX-4130 Quad-Core, GeForce GT 630/PCIe/SSE2, 16 GB, SSD 120 GB, Data 1TB
Personal: Lenovo X1 Carbon with MX-23 Fluxbox and Windows 10
Other: Raspberry Pi 5 with MX-23 Xfce Raspberry Pi Respin

Adrian
Developer
Posts: 8267
Joined: Wed Jul 12, 2006 1:42 am

Re: Smartmon error message in the installer

#8 Post by Adrian »

I will say it again, the Google paper says a non-zero count in the valid fields indicates the failure rate in the coming year will roughly double.
We should probably change the language to what you proposed, I think that is clear and the user is at least warned that there might be some issues with their drive.

How about:
The smartmon tools report that the disk you selected for installation might have a higher than average failure rate in the upcoming year.
We recommend you install on a different disk.

Do you want to abort?

Utopia
Administrator
Posts: 3250
Joined: Sun Apr 29, 2007 11:53 am

Re: Smartmon error message in the installer

#9 Post by Utopia »

The smartmon tools report that the disk you selected for installation might have a higher than average failure rate in the upcoming year.
Isn't this enough? Higher than average is hardly a reason for aborting the installation.
The test is good, but the presentation of the results shouldn't have a slightly hysterical tone.
Henry

Adrian
Developer
Posts: 8267
Joined: Wed Jul 12, 2006 1:42 am

Re: Smartmon error message in the installer

#10 Post by Adrian »

OK, this is how it looks:

Code:

Smartmon tool output:

<smartmon output>

The smartmon tools report that the disk you selected for installation might have a higher than average failure rate in the upcoming year.

Do you want to continue?
Is this OK?
