How can a system administrator monitor a large number of machines and services to proactively address problems before anyone else suffers from them?
The answer is Nagios.
Nagios is an open source network monitoring tool. It is free, powerful and flexible. It can be tricky to learn and implement, but it can greatly reduce the time required to keep track of how your organization's IT infrastructure is performing.
I'll cover the usefulness and architecture of Nagios in part one of this two-part column.
To understand the usefulness of Nagios, consider a typical IT infrastructure that one or more system administrators are responsible for. Even a small company may have a number of pieces of hardware with many services and software packages running on them. Larger companies may have hundreds or even thousands of items to keep up and running. Both small and large companies may have decentralized operations, implying a decentralized IT infrastructure, with no ability to physically see many of the machines at all.
Naturally, each piece of hardware will have a unique set of software products running on it. Faced with a multitude of hardware and software to monitor, administrators cannot pay attention to each specific item; the default posture in this kind of situation is to respond to service outages on a reactive basis. Worse, awareness of the problem usually comes only after an end-user complains.
Beyond the obvious public relations problem, there are also inefficiencies inherent in reactive problem solving. Problems that might have only taken a few minutes to address if caught early can become much more time-consuming if addressed later. For example, a database that is running out of disk space for its logs might be easy to fix before the last byte of disk is consumed, but fixing the problem once the system is hung due to inability to write log records is much harder to do.
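The disk-space example can be sketched as a simple threshold check. This is only an illustration of the "catch it early" idea, not how Nagios itself is implemented; the function names and the 80 percent and 95 percent thresholds are assumptions chosen for the example.

```python
import shutil

def classify_usage(used_fraction, warn=0.80, crit=0.95):
    """Map a disk-usage fraction (0.0 to 1.0) to a status label.

    The warn/crit defaults are illustrative assumptions, not Nagios defaults.
    """
    if used_fraction >= crit:
        return "CRITICAL"
    if used_fraction >= warn:
        return "WARNING"
    return "OK"

def check_disk(path="/", warn=0.80, crit=0.95):
    """Measure how full the filesystem holding `path` is and classify it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    return classify_usage(used_fraction, warn, crit), used_fraction

if __name__ == "__main__":
    status, fraction = check_disk("/")
    print(f"{status}: {fraction:.0%} of disk used")
```

The point of the WARNING band is that an administrator hears about the log volume while there is still room to clean it up, rather than after the database has hung.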
Automated tools that assist with system administration can therefore be extremely helpful. These tools go by the generic name of network management software, and all share the capability to:
- Keep track of all the services and machines running in the infrastructure;
- Raise alerts before small problems become large ones;
- Run from a central location to reduce the need to physically go to each machine; and,
- Provide a visual representation of system-wide status, outstanding problems, etc.
Two main problems keep network management software from being more widely used:
- It tends to be extremely expensive; and,
- It requires significant work to configure for a given environment.
Nagios is an open source network management tool that solves the first problem. It, too, requires a fair amount of configuration, but later in this article I offer a couple of suggestions for reducing that burden.
The Nagios architecture
The Nagios application runs on a central server, either Linux or Unix. Each piece of hardware that must be monitored runs a Nagios daemon that communicates with the central server. Following the instructions in the configuration files it reads, the central server will "reach out and touch" the remote daemon and instruct it to run the necessary check. While the application itself must run on Linux or Unix, the remote machines may be any hardware the server can communicate with.
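The central-server pattern described above can be sketched roughly as follows. This is a simplified model, not Nagios code: the `CONFIG` mapping, the host names, and the `poll_hosts` and `fake_agent` functions are all hypothetical, and the network call to the remote daemon is replaced by a stand-in function so the example runs on its own.

```python
# Hypothetical configuration: host name -> checks the server should request.
# The host names and check labels are illustrative only.
CONFIG = {
    "web01": ["check_http", "check_disk"],
    "db01": ["check_disk"],
}

def poll_hosts(config, ask_agent):
    """Ask each host's agent to run its configured checks and collect results.

    `ask_agent(host, check)` stands in for the network request to the
    remote daemon; it returns a status string or raises OSError.
    """
    results = []
    for host, checks in config.items():
        for check in checks:
            try:
                status = ask_agent(host, check)
            except OSError:
                # The agent could not be reached at all.
                status = "UNREACHABLE"
            results.append((host, check, status))
    return results

def fake_agent(host, check):
    """Stand-in for a real remote daemon, used only for this demonstration."""
    if host == "db01":
        raise ConnectionError("agent not responding")
    return "OK"

if __name__ == "__main__":
    for host, check, status in poll_hosts(CONFIG, fake_agent):
        print(f"{host} {check}: {status}")
```

Passing the transport in as a function keeps the scheduling logic separate from how a given remote machine is actually reached, which mirrors the idea that the monitored machines can be any hardware the server can talk to.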
Nagios then takes an appropriate action based on the remote machine's response, again according to its configuration. Depending on the test required, Nagios either uses a native machine capability (e.g., testing whether a file exists) or runs a custom test program, called a plugin, to test something more specific (e.g., checking whether a particular set of values has been placed into a database). If a check returns an incorrect value, Nagios raises an alert via one or several methods, again according to how it has been configured.
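To give a feel for what a plugin looks like, here is a minimal sketch of the file-existence test mentioned above. It follows the standard Nagios plugin convention of printing one line of status text and reporting the result through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function names and message wording are my own assumptions for illustration.

```python
import os

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_file_exists(path):
    """Return (exit_code, message) for a simple existence check on `path`."""
    if not path:
        return UNKNOWN, "FILE UNKNOWN - no path given"
    if os.path.exists(path):
        return OK, f"FILE OK - {path} exists"
    return CRITICAL, f"FILE CRITICAL - {path} is missing"

def run_plugin(argv):
    """Print the one-line status message and return the plugin exit code.

    argv[1], if present, is the path to check (a hypothetical interface).
    """
    path = argv[1] if len(argv) > 1 else ""
    code, message = check_file_exists(path)
    print(message)
    return code
```

To use this as a standalone script, you would finish the file with `import sys` and `sys.exit(run_plugin(sys.argv))`, so that the status reaches the caller as the process exit code. That exit code is what tells the monitoring server whether the check passed and whether an alert is needed.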
Now, let's move on to part two, where I provide an example of a Nagios configuration.
Bernard Golden is CEO of Navica Inc., a systems integration firm specializing in open source software. He writes a column for SearchEnterpriseLinux.com called Golden's Rules and answers questions about open source software issues.
This was first published in April 2005