In this study we report on the failure characteristics of
consumer-grade disk drives. To our knowledge, the
study is unprecedented in that it uses a much larger
population size than has been previously reported and
presents a comprehensive analysis of the correlation between
failures and several parameters that are believed to
affect disk lifetime. Such analysis is made possible by
a new highly parallel health data collection and analysis
infrastructure, and by the sheer size of our computing
deployment.

One of our key findings has been the lack of a consistent
pattern of higher failure rates for higher-temperature
drives or for drives at higher utilization levels.
Such correlations have been repeatedly highlighted
by previous studies, but we are unable to confirm them
by observing our population. Although our data do not
allow us to conclude that there is no such correlation,
they provide strong evidence to suggest that other effects
may be more prominent in affecting disk drive reliability
in the context of a professionally managed data center
deployment.

Our results confirm the findings of previous smaller-population
studies, which suggest that some of the SMART
parameters are well correlated with higher failure probabilities.
We find, for example, that after their first scan
error, drives are 39 times more likely to fail within 60
days than drives with no such errors. First errors in reallocations,
offline reallocations, and probational counts
are also strongly correlated with higher failure probabilities.
Despite those strong correlations, we find that
failure prediction models based on SMART parameters
alone are likely to be severely limited in their prediction
accuracy, given that a large fraction of our failed drives
have shown no SMART error signals whatsoever. This
result suggests that SMART models are more useful in
predicting trends for large aggregate populations than for
individual components. It also suggests that powerful
predictive models need to make use of signals beyond
those provided by SMART.
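
As an illustration only (this is not the analysis pipeline used in the study), the two quantities discussed above can be sketched in a few lines of Python: the relative risk of failure within a window after a first scan error, and the ceiling on how many failures a SMART-only predictor could possibly flag. The `Drive` record, its field names, the helper functions, and the toy population are all hypothetical, and the error-free baseline is simplified to a window starting at day 0.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Drive:
    """One drive's history. All field names are hypothetical."""
    first_scan_error_day: Optional[int]  # None if no scan error was observed
    failure_day: Optional[int]           # None if the drive did not fail
    any_smart_signal: bool               # True if any SMART error counter fired


def failed_within(drive: Drive, start_day: int, window_days: int) -> bool:
    """True if the drive failed in [start_day, start_day + window_days]."""
    return (drive.failure_day is not None
            and start_day <= drive.failure_day <= start_day + window_days)


def relative_risk(drives: List[Drive], window_days: int = 60) -> float:
    """Failure probability after a first scan error, divided by the
    failure probability of drives that never showed a scan error."""
    with_error = [d for d in drives if d.first_scan_error_day is not None]
    without = [d for d in drives if d.first_scan_error_day is None]
    fails_with = sum(
        failed_within(d, d.first_scan_error_day, window_days) for d in with_error
    )
    # Simplifying assumption: error-free drives are observed from day 0.
    fails_without = sum(failed_within(d, 0, window_days) for d in without)
    if not with_error or fails_without == 0:
        raise ValueError("need drives in both groups and a baseline failure")
    # (fails_with / n_with) / (fails_without / n_without), kept in integers
    # until the final division.
    return (fails_with * len(without)) / (fails_without * len(with_error))


def smart_recall_ceiling(drives: List[Drive]) -> float:
    """Fraction of failed drives that showed any SMART signal at all.
    No SMART-only predictor can flag more failures than this."""
    failed = [d for d in drives if d.failure_day is not None]
    if not failed:
        raise ValueError("no failed drives in the population")
    return sum(d.any_smart_signal for d in failed) / len(failed)


def toy_population() -> List[Drive]:
    """Made-up numbers for illustration; not data from the study."""
    return (
        [Drive(100, 130, True)] * 4       # scan error, failed 30 days later
        + [Drive(100, None, True)] * 6    # scan error, survived
        + [Drive(None, 30, False)] * 2    # failed with no SMART signal
        + [Drive(None, None, False)] * 98  # healthy, no signal
    )
```

In this made-up population, a strong relative risk coexists with a limited recall ceiling, because some drives fail without ever raising a SMART signal. That is the same tension the conclusions describe between strong correlations and weak per-drive prediction.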
