any machine that needs to be shut down for hardware maintenance/upgrades
or moving has to be shut down before the work on the critical machines
starts. It is important to shut down the machines in the right order, to avoid
wasting time bringing machines back up so that other machines can be shut
down cleanly. The boot sequence is also critical to the comprehensive system
testing performed at the end of the maintenance window.
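
One way to keep the shutdown and boot sequences consistent is to derive both from a single record of which machine depends on which. The following is a minimal sketch under stated assumptions, not a prescribed tool: the host names and the DEPENDS_ON table are hypothetical, and it assumes Python 3.9 or later for the standard graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each machine maps to the machines it depends on.
# A machine can only run cleanly while everything it depends on is up.
DEPENDS_ON = {
    "web1": {"db1", "nfs1"},
    "db1": {"nfs1"},
    "nfs1": {"dns1"},
    "dns1": set(),
}


def boot_order(depends_on):
    """Return machines so that each one boots after its dependencies."""
    # static_order() yields a node only after all of its predecessors;
    # it raises graphlib.CycleError if the dependencies form a loop.
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Return machines so that each one shuts down before its dependencies."""
    return list(reversed(boot_order(depends_on)))


if __name__ == "__main__":
    print("Shutdown:", " -> ".join(shutdown_order(DEPENDS_ON)))
    print("Boot:    ", " -> ".join(boot_order(DEPENDS_ON)))
```

Keeping one dependency table and reversing the boot order avoids maintaining two lists that can drift apart. A dependency loop surfaces as an error when the order is computed rather than in the middle of the maintenance window.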
The shutdown sequence can be used as part of a larger emergency power
off (EPO) procedure. An EPO is a decision and action plan for emergencies
that demand a response faster than management approval can be
obtained. Think of it as precompiling decisions
for later execution. An EPO should list the situations that require
its activation, such as fire, flood, or overheating with no response from
facilities, and give instructions on how to verify them. A decision tree is the
best way to record this information. The EPO should then give instructions
on how to migrate services to other data centers, whom to notify, and so on.
Document a process for situations where there is time to copy critical data out
of the data center and a process for when there is not. Finally, the EPO should
use the shutdown sequence to power off machines. In the case of overheating, one
might document ways to shut down some machines or put machines into low-power
mode so they generate less heat by running slower but still provide services.
The steps should be documented such that they can be performed by any
SA on the team. Having such a plan can save hardware, services, and revenue.
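
The decision tree in an EPO can be recorded as data so that every SA walks the same questions in the same order. The sketch below is illustrative only: the questions, the steps, and the names Question, Action, and decide are assumptions made up for this example, and a real EPO would use the site's own conditions and verification instructions.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Action:
    """Leaf of the tree: the ordered steps to carry out."""
    steps: List[str]


@dataclass
class Question:
    """Internal node: a yes/no check an SA can verify on the spot."""
    text: str
    yes: "Node"
    no: "Node"


Node = Union[Action, Question]

# Hypothetical questions; a real plan would also say how to verify each one.
Q_FIRE = "Is there a verified fire or flood in the data center?"
Q_TIME = "Is there time to copy critical data out of the data center?"
Q_HEAT = "Is the room overheating with no response from facilities?"

POWER_OFF = "Power off machines using the shutdown sequence"

EPO_TREE = Question(
    Q_FIRE,
    yes=Question(
        Q_TIME,
        yes=Action(["Copy critical data out", "Migrate services",
                    "Notify the contact list", POWER_OFF]),
        no=Action(["Migrate services", "Notify the contact list", POWER_OFF]),
    ),
    no=Question(
        Q_HEAT,
        yes=Action(["Put designated machines into low-power mode",
                    "Shut down designated noncritical machines",
                    "Notify the contact list"]),
        no=Action(["Do not activate the EPO; follow normal escalation"]),
    ),
)


def decide(node: Node, answers: Dict[str, bool]) -> List[str]:
    """Walk the tree using the SA's yes/no answers and return the steps."""
    while isinstance(node, Question):
        node = node.yes if answers[node.text] else node.no
    return node.steps


if __name__ == "__main__":
    for step in decide(EPO_TREE, {Q_FIRE: True, Q_TIME: False}):
        print("-", step)
```

Storing the tree as data rather than prose makes it easy to review, print as a checklist, and rehearse during drills.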