Resilience and energy efficiency at
scale. As advanced computing and data
analysis systems grow ever larger, the
assumption of fully reliable operation
becomes much less credible. Although
the mean time between failures for individual
components continues to increase
incrementally, the large overall
component count for these systems
means the systems themselves will fail
more frequently. To date, experience
has shown that failures can be managed,
but only with improved techniques for
detecting and understanding component
failures.
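
The scaling argument above can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that components fail independently with identical, constant failure rates and that any single component failure counts as a system failure, so the failure rates add and the system-level mean time between failures is the component figure divided by the component count. The function name and the 25-year component figure are hypothetical, chosen only to show the trend.

```python
def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """Estimate the MTBF of a system that fails when any one component fails.

    Assumption (not from the article): components are independent and share
    the same constant failure rate, so the individual failure rates add.
    """
    if component_mtbf_hours <= 0 or component_count <= 0:
        raise ValueError("MTBF and component count must be positive")
    return component_mtbf_hours / component_count


if __name__ == "__main__":
    # Hypothetical figures, chosen only for illustration.
    hours_per_year = 24 * 365
    component_mtbf = 25 * hours_per_year  # one component: 25 years

    for count in (1_000, 100_000, 1_000_000):
        estimate = system_mtbf_hours(component_mtbf, count)
        print(f"{count:>9,} components: system MTBF of about {estimate:.2f} hours")
```

Under these assumptions, a component that fails on average once every 25 years still yields a system-level failure roughly every 2.2 hours once 100,000 such components are combined. Real systems depart from this simple model (failures are often correlated, and not every component failure takes the system down), so the numbers indicate the trend rather than a prediction.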