Everything fails eventually. You can’t prevent a hard drive from failing. You
can give it perfect, vendor-recommended cooling and power, and it will still
fail eventually. You can’t stop an HBA from failing. Now and then, a bit
being transmitted down a cable gets hit by a gamma ray and is reversed. If
you have eight hard drives, the likelihood that one will fail tomorrow is roughly
eight times greater than if you had only one. The more hardware you have, the
more likely a failure. That sounds depressing, but there is good news: there are
techniques for managing failures that can achieve whatever level of reliability is required.
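
The arithmetic behind the eight-drive claim can be sketched in a few lines of Python. This is a minimal illustration, assuming drives fail independently and using a made-up daily failure probability; the function name p_any_failure and the value 0.0001 are assumptions for the example, not measured figures.

```python
def p_any_failure(p_single: float, n_drives: int) -> float:
    """Probability that at least one of n_drives fails in the period,
    assuming each drive fails independently with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_drives


# Assumed, purely illustrative daily failure probability for one drive.
P_DAILY = 0.0001

one_drive = p_any_failure(P_DAILY, 1)
eight_drives = p_any_failure(P_DAILY, 8)

# For small probabilities the ratio is very close to, but slightly under, 8.
ratio = eight_drives / one_drive
print(f"one drive:    {one_drive:.6f}")
print(f"eight drives: {eight_drives:.6f}")
print(f"ratio:        {ratio:.3f}")
```

The ratio is only approximately eight because the exact probability of at least one failure is 1 - (1 - p)^n, which falls a little short of n times p. For the small per-drive probabilities typical of hardware, the difference is negligible.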
The key is to decouple a component failure from an outage. If you have
one hard drive, its failure results in an outage: a 1:1 ratio of failures to
outages. However, if you have eight hard drives in a RAID 5 configuration,
a single failure does not result in an outage. Two failures, the second occurring
before a hot spare can be activated, are required to cause an outage. We
have successfully decoupled component failure from service outages. (A similar
strategy can be applied to networks, computing, and other aspects of system
administration.)
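
To make the decoupling concrete, the following sketch compares a rough outage probability for a single drive with that of an eight-drive RAID 5 set. It is a simplified model under stated assumptions, not a sizing tool: failures are treated as independent with a constant rate, an outage requires a second failure inside the rebuild window, and the 3% annual failure rate and 24-hour rebuild window are invented for illustration. Real arrays also face correlated failures and read errors during rebuild, which this model ignores.

```python
import math

HOURS_PER_YEAR = 365 * 24


def p_fail_within(hours: float, annual_failure_rate: float) -> float:
    """Probability one drive fails within `hours`, assuming a constant
    failure rate derived from the given annual failure rate."""
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * hours)


def p_outage_single(hours: float, afr: float) -> float:
    """One drive, no redundancy: every failure is an outage."""
    return p_fail_within(hours, afr)


def p_outage_raid5(hours: float, afr: float, n_drives: int,
                   rebuild_hours: float) -> float:
    """Rough estimate: some drive fails during the period, then one of the
    remaining drives fails before the rebuild onto the spare completes."""
    p_first = 1.0 - (1.0 - p_fail_within(hours, afr)) ** n_drives
    p_second = 1.0 - (1.0 - p_fail_within(rebuild_hours, afr)) ** (n_drives - 1)
    return p_first * p_second


# Assumed, illustrative inputs.
AFR = 0.03            # 3% annual failure rate per drive
REBUILD_HOURS = 24.0  # time to activate the hot spare and rebuild

single = p_outage_single(HOURS_PER_YEAR, AFR)
raid5 = p_outage_raid5(HOURS_PER_YEAR, AFR, n_drives=8,
                       rebuild_hours=REBUILD_HOURS)
print(f"single drive outage probability per year: {single:.4f}")
print(f"8-drive RAID 5 outage probability per year: {raid5:.6f}")
```

Under these assumptions the RAID 5 set experiences more component failures per year than the single drive, yet far fewer outages, because an outage now depends on two failures coinciding within the short rebuild window.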
The configuration of a storage service can increase its reliability. In particular,
certain RAID levels increase reliability, and NASs can also be configured
to increase overall reliability.
The benefit of centralized storage (NAS or SAN) is that the extra cost of
reliability is amortized over all users of the service.