This company’s service is used primarily by the consumer market in the United
States. Because the company is on the west coast of the United States, its peak usage
times are after 2 PM on Monday to Friday---that is, after 5 PM on the east coast of the
United States---and all day during the weekend. The time specified in a change proposal
is typically the following morning, before peak time. The change-management
meeting also decides whether the change should go ahead regardless
of ‘‘the weather’’ (the operating status of the service) or wait until there is
‘‘good weather.’’ In other words, some changes are approved on the condition that the service
is operating normally at the time the SA or engineer wants to make the change. Other
changes are considered so critical that they are made regardless of how well or badly
the service is functioning.
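The scheduling rules just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the company's actual tooling: the names `ChangeProposal`, `is_peak`, and `may_proceed` are hypothetical, all times are assumed to be naive datetimes in the company's local (west coast) time, and the rule that no change starts during peak hours is an assumption drawn from the off-peak scheduling described above.

```python
from dataclasses import dataclass
from datetime import datetime

# Peak usage starts at 2 PM local (west coast) time on weekdays, per the text.
PEAK_START_HOUR = 14


def is_peak(local_time: datetime) -> bool:
    """Return True if local_time falls within the peak usage window."""
    if local_time.weekday() >= 5:  # Saturday (5) and Sunday (6): peak all day
        return True
    return local_time.hour >= PEAK_START_HOUR


@dataclass
class ChangeProposal:
    """Hypothetical record of a change approved at the meeting."""
    description: str
    scheduled_for: datetime       # naive datetime, company-local time
    requires_good_weather: bool   # False for changes deemed critical


def may_proceed(change: ChangeProposal, now: datetime,
                service_is_normal: bool) -> bool:
    """Decide whether the change may start at time `now`."""
    if is_peak(now):
        # Assumption: changes are never started during peak hours.
        return False
    if now < change.scheduled_for:
        # Too early: the approved time has not arrived yet.
        return False
    if change.requires_good_weather and not service_is_normal:
        # Conditional approval: wait until the service is operating normally.
        return False
    return True
```

For example, a proposal with `requires_good_weather=True` scheduled for a Tuesday morning would be held back while the service is degraded, whereas one marked critical (`requires_good_weather=False`) would go ahead at the same time regardless of the service's status.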
This approach is unusual for a couple of reasons. It is certainly a step better than
having no change-management, because there is at least a minimal review process, a
defined off-peak time in which changes are performed, and a process for postponing
some tasks to avoid possibly introducing extra problems when the system is unstable.
However, the frequency of the meetings and of the changes to the service network
makes it difficult to see the big picture of what is going on with the service
network. Entropy and many small instabilities that may interact with one another
are constantly introduced, and the SAs and the engineers are not encouraged to
plan ahead. Changes may be made quickly, without being properly thought out. It
is also unusual that changes are permitted to happen while the service network that
is the company’s revenue stream is unstable. Changes at such a time can make debugging
existing problems much more difficult, particularly if the instabilities take a
few days to debug. The formal process of checking with the operations group before
making a change and giving them the ability to prevent at least some changes from
happening is valuable, however.
Although the site was often successful in handling a large number of transactions,
for a while it was known for having stability problems and large, costly outages, the
source of which was hard to trace because of the rapidly changing nature of the network.
It was not possible to draw a line in time and say ‘‘the problems started after
this set of changes, which were approved in that change-management meeting.’’ That
would have enabled them to narrow their search and perhaps find the problem more
quickly.