Systems monitoring can be used to detect and fix problems, identify the source
of problems, predict and avoid future problems, and provide data on SAs’
achievements. The two primary ways to monitor systems are to (1) gather
historical data related to availability and usage and (2) perform real-time
monitoring to ensure that SAs are notified of failures.
Historical monitoring is used for recording long-term uptime, usage, and
performance statistics. It has two components: collecting the data and
viewing the data. The results of historical monitoring are conclusions, such
as “The web service was up 99.99 percent of the time last year, up from the
previous year’s 99.9 percent.” Utilization data is used for capacity planning.
For example, you might view a graph of bandwidth utilization gathered for
the past year for an Internet connection. The graph might visually depict a
growth rate indicating that the pipe will be full in 4 months. Cricket and Orca
are commonly used historical monitoring tools.
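To make the two uses of historical data concrete, here is a minimal Python sketch. It is not how Cricket or Orca work internally; the function names, the one-sample-per-month layout, and the straight-line growth model are all assumptions made for illustration. The first function converts an availability percentage into the downtime it allows, and the second fits a line to utilization samples to estimate when a link fills up.

```python
from datetime import timedelta
from typing import Optional, Sequence


def allowed_downtime(availability_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime implied by an availability figure over a period."""
    return period * (1 - availability_percent / 100)


def months_until_full(samples: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate months from the last sample until utilization reaches capacity.

    Assumes one sample per month, evenly spaced, and linear growth
    (both are simplifying assumptions). Returns None if usage is not growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Ordinary least-squares slope and intercept for x = 0, 1, ..., n-1.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    month_full = (capacity - intercept) / slope
    # Clamp at zero: the link may already be at or over capacity.
    return max(0.0, month_full - (n - 1))


if __name__ == "__main__":
    print("99.99% allows", allowed_downtime(99.99), "of downtime per year")
    print("99.9%  allows", allowed_downtime(99.9), "of downtime per year")
    # Hypothetical data: a 100 Mbps link growing 4 Mbps per month from 40 Mbps.
    usage_mbps = [40 + 4 * month for month in range(12)]
    print("months until full:", months_until_full(usage_mbps, capacity=100))
```

With the hypothetical samples shown, the fitted line reaches capacity about four months after the last sample. Real traffic rarely grows in a straight line, so treat a result like this as a prompt to look at the graph, not as a deadline.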
Real-time monitoring alerts the SA team to a failure as soon as it happens
and has two components: a monitoring component that notices failures and
an alerting component that alerts someone to the failure. There is no point in
a system’s knowing that something has gone down unless it alerts someone
to the problem. The goal is for the SA team to notice outages before customers
do. This shortens outages, lets problems be fixed before customers notice
them, and builds the team’s reputation for maintaining high-quality service.
Nagios and Big Brother are commonly used real-time monitoring systems.
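The split between a monitoring component and an alerting component can be sketched in a few lines of Python. This is an illustrative toy, not the design of Nagios or Big Brother: the `Monitor` class, the `tcp_check` helper, and the choice to alert only when a service changes state are assumptions of this sketch.

```python
import socket
from typing import Callable, Dict

CheckFn = Callable[[], bool]    # monitoring component: returns True if healthy
AlertFn = Callable[[str], None]  # alerting component: delivers a message


def tcp_check(host: str, port: int, timeout: float = 3.0) -> CheckFn:
    """Build a check that passes if a TCP connection can be opened."""
    def check() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return check


class Monitor:
    """Runs checks and alerts only when a service changes state."""

    def __init__(self, checks: Dict[str, CheckFn], alert: AlertFn) -> None:
        self.checks = checks
        self.alert = alert
        # Assume every service starts healthy, so the first failure alerts.
        self.last_ok: Dict[str, bool] = {name: True for name in checks}

    def poll_once(self) -> None:
        for name, check in self.checks.items():
            try:
                ok = bool(check())
            except Exception:
                # A check that crashes is treated as a failed check.
                ok = False
            if ok != self.last_ok[name]:
                self.alert(f"{name} is {'UP' if ok else 'DOWN'}")
            self.last_ok[name] = ok
```

A caller would construct something like `Monitor({"web": tcp_check("www.example.com", 80)}, alert=print)` and call `poll_once()` on a timer; the host name here is a placeholder. Keeping the alert function separate from the checks is what lets the same failure go to a pager, email, or a log without touching the monitoring code.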
Typically, the two types of monitoring are performed by different systems.
The tasks involved in each type of monitoring are very different. After reading
this chapter, you should have a good idea of how they differ and know what
to look for in the software that you choose for each task.
But first, a few words of warning. Monitoring uses network bandwidth,
so make sure that it doesn’t use too much. Monitoring also uses CPU and
memory resources, so make sure that it doesn’t degrade the service it is
watching. Security is important for monitoring systems.
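The bandwidth warning is easy to check with back-of-the-envelope arithmetic before deploying anything. The Python sketch below is one way to do it; the function names and every number in the example (host count, bytes per poll, polling interval, link speed) are hypothetical and should be replaced with measurements from your own site.

```python
def polling_load_bps(hosts: int, metrics_per_host: int,
                     bytes_per_poll: int, interval_seconds: float) -> float:
    """Average bits per second generated by polling every metric once per interval."""
    total_bytes = hosts * metrics_per_host * bytes_per_poll
    return total_bytes * 8 / interval_seconds


def link_share_percent(load_bps: float, link_bps: float) -> float:
    """Percentage of a link's capacity consumed by the given load."""
    return 100 * load_bps / link_bps


if __name__ == "__main__":
    # Hypothetical site: 500 hosts, 20 metrics each, roughly 200 bytes per
    # request/response pair, polled every 5 minutes over a 1.544 Mbps link.
    load = polling_load_bps(hosts=500, metrics_per_host=20,
                            bytes_per_poll=200, interval_seconds=300)
    share = link_share_percent(load, link_bps=1_544_000)
    print(f"polling load: {load:.0f} bps ({share:.1f}% of the link)")
```

This gives only an average. Pollers that query every host at the start of each interval produce bursts well above it, so spreading polls across the interval is worth considering on slow links.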