Most of the principles described here for maintenance windows at a corporate
site also apply at high-availability sites.
• They need to schedule the maintenance window so that it has the least
impact on their customers. For example, ISPs often choose 2 AM (local
time) midweek; e-commerce sites need to choose a time when they do
the least business. These windows will typically be quite frequent, such
as once a week, and shorter than at a corporate site, perhaps 4 to 6 hours.
• They need to let their customers know when maintenance windows
are scheduled. For ISPs, this means sending an email to the customers.
For an e-commerce site, this means having a banner on the site. In
both cases, the notice should go only to those customers who may be
affected. It should warn that small outages or degraded service may
occur during the maintenance window and give the times of that window.
There should be only a single message about the window.
• Planning and doing as much as possible beforehand is critical because
the maintenance windows should be as short as possible.
• There must be a flight director who coordinates the scheduling and
tracks the progress of the tasks. If the windows are weekly, this may be
a quarter-time or half-time job.
• Each item should have a change proposal. The change proposal should
list the redundant systems and include a test to verify that the redundant
systems have kicked in and that service is still available.
• They need to plan the maintenance window tightly. Their maintenance
windows are typically smaller in scope and shorter in duration than those
at a corporate site. Items scheduled
by different people for a given window should not have dependencies
on each other. There must be a small master plan that shows who has
what tasks and their completion times.
• The flight director must be very strict about the deadlines for change
completion.
• Everything must be fully tested before it is declared complete.
• Remote KVM and console access benefit all sites.
• The SAs need to have a strong presence when the site approaches and
enters its busy time. They need to be prepared to deal quickly with any
problems that may arise as a result of the maintenance.
• A brief postmortem the next day is useful for discussing any remaining
problems or issues that arose.
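
The master-plan rules in the list above (every task fits inside the window, and items scheduled by different people do not depend on each other) are mechanical enough that a flight director could check them with a short script. The following is only a minimal sketch under stated assumptions: the Task record, its field names, and the check_master_plan function are hypothetical names invented for illustration, not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Task:
    """One line of the master plan (hypothetical structure for illustration)."""
    name: str
    owner: str
    start: datetime
    duration: timedelta
    depends_on: Tuple[str, ...] = ()  # names of other tasks in the same plan

    @property
    def finish(self) -> datetime:
        return self.start + self.duration


def check_master_plan(tasks: List[Task],
                      window_start: datetime,
                      window_end: datetime) -> List[str]:
    """Return a list of problems found; an empty list means the plan passes."""
    by_name = {t.name: t for t in tasks}
    problems = []
    for t in tasks:
        # Rule 1: the task must start and finish inside the window.
        if t.start < window_start or t.finish > window_end:
            problems.append(f"{t.name}: falls outside the maintenance window")
        # Rule 2: no dependencies on items scheduled by a different person.
        for dep in t.depends_on:
            other = by_name.get(dep)
            if other is None:
                problems.append(f"{t.name}: unknown dependency {dep}")
            elif other.owner != t.owner:
                problems.append(
                    f"{t.name}: depends on {dep}, owned by {other.owner}")
    return problems
```

Running such a check when the plan is assembled, rather than during the window, is in keeping with the advice to do as much as possible beforehand. Any problem it reports is a prompt for the flight director to reschedule or split the item, not an automatic rejection.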