Data from commercial cloud data
centers suggests some long-held assumptions
about component failures
and lifetimes are incorrect [11, 20–22]. A
2009 Google study [22] showed DRAM
error rates were orders of magnitude
higher than previously reported,
with over 8% of DIMMs affected by
errors in a year. Equally surprising,
these were predominantly hard errors, rather than
soft errors correctable via error-correcting
code (ECC).
In addition to resilience, scale also
brings new challenges in energy management
and thermal dissipation. Today’s
advanced computing and data-analysis
systems consume megawatts
of power, and cooling capability and
peak power loads limit where many
systems can be placed geographically.
As commercial cloud operators have
learned, energy infrastructure and power
are a substantial fraction of total system
cost at scale, necessitating new infrastructure
approaches and operating
models, including low-power designs