While drive manufacturers often quote yearly failure rates below 2%,
user studies have seen rates as high as 6%.
We find, for example, that after their first scan error, drives are 39
times more likely to fail within 60 days than drives with no such errors. First
errors in reallocations, offline reallocations, and probational counts are also
strongly correlated to higher failure probabilities.
Six percent is higher than I expected. I must say (and I am knocking on wood as I write this) that I have only seen a drive die once, on a blade server within a month of deployment. The only major problem I had was when consulting for a client in NYC. They had a SQL Server box that had been running for 2 years without a problem. We upgraded the machine to an active/passive cluster, and a week later the motherboard died (downtime: 20 seconds ;-) ). Talk about good timing...
So what failure rates do you see? Does stuff break down a lot?