Sunday, February 25, 2007

Failure Trends in a Large Disk Drive Population

Engineers at Google have published a paper titled Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years, they report some interesting findings:

While drive manufacturers often quote yearly failure rates below 2%,
user studies have seen rates as high as 6%.

We find, for example, that after their first scan error, drives are 39
times more likely to fail within 60 days than drives with no such errors. First
errors in reallocations, offline reallocations, and probational counts are also
strongly correlated to higher failure probabilities.

Six percent is higher than I expected. I must say (and I am knocking on wood as I write this) that I have only seen a drive die once, on a blade server within a month of deploying it. The only major problem I had was while consulting for a client in NYC. They had a SQL Server box that had been running for 2 years without a problem. We upgraded the machine to an active/passive cluster, and a week later the motherboard died (downtime: 20 seconds ;-) ). Talk about good timing...
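To put the 2% and 6% figures in perspective, here is a quick back-of-the-envelope sketch in Python. The fleet size of 100 drives and the 8-drive array are made-up numbers for illustration, and the second function assumes drives fail independently of one another, which drives from the same batch or chassis may well not do.

```python
def expected_failures(drive_count, annual_failure_rate):
    """Expected number of drive failures in one year."""
    return drive_count * annual_failure_rate


def prob_at_least_one_failure(drive_count, annual_failure_rate):
    """Chance that at least one drive fails within a year.

    Assumes failures are independent, which is a simplification.
    """
    return 1 - (1 - annual_failure_rate) ** drive_count


fleet_size = 100  # hypothetical fleet, not a number from the paper
array_size = 8    # hypothetical array, not a number from the paper

# 2% is the manufacturer-style figure, 6% the high end seen in user studies.
for rate in (0.02, 0.06):
    failures = expected_failures(fleet_size, rate)
    risk = prob_at_least_one_failure(array_size, rate)
    print(f"rate {rate:.0%}: about {failures:.0f} failures per year "
          f"in {fleet_size} drives, {risk:.0%} chance of at least one "
          f"failure in an {array_size}-drive array")
```

The gap between the two rates looks small on paper, but under these assumptions it triples the number of replacement drives you should expect to need in a year.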

So what failure rates do you see? Does stuff break down a lot?

