The final stage of a maintenance window is comprehensive system testing. If
the window has been short, you may need to test only the few components
that you worked on. However, if you have spent your weekend-long maintenance
window taking apart various complicated pieces of machinery and
then putting them back together, all under a time constraint, you should
plan on spending all day Sunday doing system testing.
Sunday system testing begins with shutting down all of the machines in
the data center, so that you can then step through your ordered boot sequence.
Assign an individual to each machine on the reboot list. The flight director
announces the stages of the shutdown sequence over the radio, and each individual
responds when the machine under their responsibility has completely
shut down. When all the machines at the current stage have shut down, the
flight director announces the next stage. When everything is down, the order
is reversed, and the flight director steps everyone through the boot stages.
If any problems occur with any machine at any stage, the entire sequence is
halted until they are debugged and fixed. Each person assigned to a machine
is responsible for ensuring that it has shut down completely before responding
and that all services have started correctly before calling it in as booted and
operational.
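The staged sequence above can be sketched in code. This is a minimal illustration, not a tool the procedure depends on: the machine names, the BOOT_STAGES list, and the confirm callback (standing in for the person who answers over the radio) are all hypothetical.

```python
from typing import Callable, List

# Hypothetical boot order: every machine in a stage must be up before
# the next stage starts. The names are placeholders, not real hosts.
BOOT_STAGES: List[List[str]] = [
    ["dns1", "auth1"],
    ["fileserver1"],
    ["app1", "app2"],
]


class SequenceHalted(Exception):
    """Raised when a machine is not confirmed, halting the whole sequence."""


def run_stages(stages: List[List[str]], action: str,
               confirm: Callable[[str, str], bool]) -> List[str]:
    """Step through the stages in order; halt on the first unconfirmed machine."""
    log: List[str] = []
    for number, stage in enumerate(stages, start=1):
        for machine in stage:
            # confirm() stands in for the assigned person reporting that
            # the machine is fully down, or fully booted with services up.
            if not confirm(machine, action):
                raise SequenceHalted(
                    f"stage {number}: {machine} did not confirm {action}")
            log.append(f"{action} {machine}")
    return log


def shutdown_then_boot(boot_stages: List[List[str]],
                       confirm: Callable[[str, str], bool]) -> List[str]:
    """Shut down in reverse boot order, then boot in boot order."""
    shutdown_stages = list(reversed(boot_stages))
    log = run_stages(shutdown_stages, "shutdown", confirm)
    log += run_stages(boot_stages, "boot", confirm)
    return log


if __name__ == "__main__":
    # Pretend every machine is confirmed immediately.
    for line in shutdown_then_boot(BOOT_STAGES, lambda machine, action: True):
        print(line)
```

Raising an exception on the first unconfirmed machine mirrors the rule that the entire sequence halts until the problem is debugged and fixed. A real run would resume from the halted stage rather than start over.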
Finally, when all the machines in the data center have been successfully
booted in the correct order, the flight director splits the SA team into groups.
Each group has a team leader and is assigned an area in one of the campus
buildings. The teams are given instructions about which machines they
are responsible for and which tests to perform on them. The instructions
always include rebooting every desktop machine to make sure that it comes
up cleanly. The tests could also include logging in, checking for a particular
service, or trying to run a particular application. Each person
in the group has a stack of colored sticky tabs used for marking offices and
cubicles that have been completed and verified as working. The SAs also have
a stack of sticky tabs of a different color to mark cubicles that have a problem.
When SAs run across a problem, they spend a short time trying to fix it
before calling it in to the central core of people assigned to stay in the main
building to help debug problems. As each team finishes its area, it is assigned
to a new area or to help another team complete an area, until the whole
campus has been covered.
Meanwhile, the flight director and the senior SA troubleshooters keep
track of problems on a whiteboard and decide who should tackle each problem,
based on the likely cause and who is available. By the end of testing,
all offices and cubicles should have tabs, preferably all indicating success. If
any offices or cubicles still have tabs indicating a problem, a note should be
left for that customer, explaining the problem; someone should be assigned
to meet with that person to try to resolve it first thing in the morning.
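If the team wants an electronic record alongside the sticky tabs, the end-of-testing check can be sketched as follows. The office labels, the Tab enum, and the sweep_summary function are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Tab(Enum):
    """The two tab colors: verified as working, or has a problem."""
    OK = "working"
    PROBLEM = "problem"


def sweep_summary(offices: List[str],
                  tabs: Dict[str, Tab]) -> Tuple[List[str], List[str]]:
    """Return (offices with no tab yet, offices still marked as a problem)."""
    untabbed = [office for office in offices if office not in tabs]
    follow_up = [office for office in offices if tabs.get(office) is Tab.PROBLEM]
    return untabbed, follow_up


if __name__ == "__main__":
    # Hypothetical office labels.
    offices = ["B1-101", "B1-102", "B1-103"]
    tabs = {"B1-101": Tab.OK, "B1-102": Tab.PROBLEM}
    untabbed, follow_up = sweep_summary(offices, tabs)
    print("Not yet visited:", untabbed)
    print("Leave a note and assign a morning follow-up:", follow_up)
```

An empty first list means the whole campus has been covered. Every office in the second list needs a note left for the customer and someone assigned to follow up.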
This systematic approach helps to find problems before people come in
to work the next day. If there is a bad network segment connection, a failed
software depot push, or problems with a service, you’ll have a good chance
to fix it before anyone else is inconvenienced. Be warned, however, that some
machines may not have been working in the first place. The reboot teams
should always make sure to note when a machine did not look operational
before they rebooted it. They can still take time to try to fix it, but that
work is lower on the priority list and does not have to happen before the end
of the maintenance window.
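One way to keep that priority rule visible on the problem list is to record whether each problem predates the reboot and sort on that flag. This is a sketch only: the Problem record, its fields, and the sample entries are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    machine: str
    description: str
    preexisting: bool  # True if the machine looked broken before the reboot


def triage(problems: List[Problem]) -> List[Problem]:
    """Put problems caused during the window ahead of pre-existing ones."""
    # False sorts before True, and sorted() is stable, so the order in
    # which problems were reported is kept within each group.
    return sorted(problems, key=lambda problem: problem.preexisting)


if __name__ == "__main__":
    # Hypothetical entries from the whiteboard.
    reported = [
        Problem("desk-17", "monitor dead before reboot", preexisting=True),
        Problem("desk-04", "cannot mount home directory", preexisting=False),
        Problem("desk-22", "license server unreachable", preexisting=False),
    ]
    for problem in triage(reported):
        print(problem.machine, "-", problem.description)
```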
Ideally, the system testing and sitewide rebooting should be completed
sometime on Sunday afternoon. This gives the SA team time to rest after a
stressful weekend before coming into work the next day.