Nagios: Balancing passive and active checks

So you've managed to get Nagios, an open source network monitoring tool, up and running, and now you need to fine-tune its performance. The author of O'Reilly's Network Monitoring with Nagios advises using active checks only when absolutely necessary, and making sure that your escalation rules include all individuals who should be notified at that time, not just the additional contact.

In this interview, author Taylor Dondich suggests a workaround for balancing active and passive checks and explains the impact of service failures.

If you've already deployed Nagios in your IT environment, what are some tricks you can use to enhance and improve performance?

Dondich: As your IT environment grows, the number of monitored devices will grow with it. As it continues to grow, you may see performance degrade or the bandwidth in your network saturate with the number of checks Nagios is performing. The thing I can't stress enough is to use active checks only when necessary and to really leverage passive checking.

Active checks occur when Nagios itself is responsible for checking the status of a device at regular intervals. A passive check, by contrast, is one in which the device (or an agent running on its behalf) reports its status to Nagios, typically only when that status changes.
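As a rough illustration of the passive side, the sketch below builds a PROCESS_SERVICE_CHECK_RESULT line, which is the external command Nagios accepts for passive service results, and writes it to the external command file. The file path, host name, and service name are assumptions for illustration; the real path is whatever command_file is set to in your nagios.cfg, and the service must have passive checks enabled.

```python
import time

# Assumed location of the external command file; check the command_file
# setting in your own nagios.cfg, since this path varies by install.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line.

    return_code follows the plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    if timestamp is None:
        timestamp = int(time.time())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


def submit_passive_result(line, command_file=COMMAND_FILE):
    """Write the command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as fifo:
        fifo.write(line)


if __name__ == "__main__":
    # "web01" and "HTTP" are hypothetical; they must match a host and
    # service that are already defined in your Nagios configuration.
    print(format_passive_result("web01", "HTTP", 2, "Connection refused"), end="")
```

A device-side script would call submit_passive_result only when it detects a change, so no polling traffic is generated while the service is healthy.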

Increasing the number of passive checks you use instead of active checks will increase the number of devices you can monitor with Nagios. Beyond that, you may need to start looking into using a distributed Nagios implementation. This requires separate Nagios instances communicating with a central Nagios system. It's tough to maintain the configuration files, since each Nagios instance requires its own set, but in the end, it'll do the job.

Won't continual host checks become a performance drain in Nagios?

Dondich: You need to strike a balance between passive and active checks. For example, I may wish to use passive checks for most of the services on a device, but still verify its reachability with an active check: a PING check using check_ping.

If a service monitored by Nagios recovers from an error (i.e., a soft recovery), administrators won't be informed. Is this important?

Dondich: When a service fails for the first time, Nagios will put that service in a "soft" state. Nagios will then check the service a configured number of times to see if it comes back up. If it does not come back up within that preconfigured number of checks, then Nagios will put the service in a "hard" state and notifications will be sent out. If the service recovers within those checks, Nagios will not send out notifications. So why do this?

Well, event handlers can be used to perform actions based on a status change, whether it is a soft or hard state. For example, if you have an Apache Web service which fails, an event handler may be run to attempt to restart the Apache service. If the service comes back up while Nagios is checking it, then there's probably no real reason to send out notifications.

But if the attempt to restart Apache fails, then Nagios will eventually put the service in a hard state, causing the notification to be sent out. The number of checks Nagios performs before setting a "hard" state is controlled by the max_check_attempts parameter, which exists for both hosts and services. If you want notifications to always be sent out, set it to 1.
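The soft and hard state logic described above can be sketched as a small state machine. This is a simplified model written for illustration, not Nagios's actual implementation: the ServiceState class and process_result function are hypothetical names, and the model ignores repeat notifications and changes between different hard problem states.

```python
from dataclasses import dataclass

OK = 0  # plugin return code for a healthy service


@dataclass
class ServiceState:
    """Simplified model of the check state Nagios tracks per service."""
    max_check_attempts: int
    state: int = OK
    state_type: str = "HARD"
    attempt: int = 1


def process_result(svc, return_code):
    """Apply one check result; return True if a notification would go out."""
    if return_code == OK:
        # Only a recovery from a hard problem notifies; a soft recovery
        # is silent.
        notify = svc.state != OK and svc.state_type == "HARD"
        svc.state, svc.state_type, svc.attempt = OK, "HARD", 1
        return notify

    if svc.state == OK:
        svc.attempt = 1           # first failure starts the retry count
    elif svc.state_type == "SOFT":
        svc.attempt += 1          # another failed retry
    else:
        svc.state = return_code   # already in a hard problem state
        return False              # simplification: no repeat notifications

    svc.state = return_code
    if svc.attempt >= svc.max_check_attempts:
        svc.state_type = "HARD"
        return True
    svc.state_type = "SOFT"
    return False


if __name__ == "__main__":
    svc = ServiceState(max_check_attempts=3)
    for rc in (2, 2, 0):          # fails twice, then recovers: stays silent
        print(rc, process_result(svc, rc), svc.state_type)
```

With max_check_attempts set to 1 in this model, the very first failure is already a hard state, which is why that setting produces a notification every time.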

What are some common mistakes that occur when configuring problem escalation?

Dondich: I think the biggest one would be that of multiple escalation levels for a device or service. For example, say you have the initial contact for a Web server as Bob. Bob gets notified about a problem which is occurring. Bob is lazy, doesn't fix the problem, so the notifications escalate properly to the network admin, Tim.

Tim is working on something and doesn't initially see his notifications. So it gets escalated again, this time sending notifications to Tim's boss, Rich. But something strange has happened. Rich is getting notifications about the problem, but Tim is no longer receiving them. It's a common problem, and to put it plainly, you need to make sure that your escalation rules include all individuals who should be notified at that time, not just the additional contact.
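The Bob, Tim, and Rich scenario can be modeled in a few lines. The sketch below is a simplified model of how escalation ranges select contacts, assuming that contacts from every matching escalation are notified and that the default contacts are used only when no escalation matches. The Escalation class, the contacts_for function, and the notification numbers (3 to 5 for Tim, 6 onward for Rich) are hypothetical choices for illustration; a last_notification of 0 means no upper limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Escalation:
    """One escalation rule covering a range of notification numbers."""
    first_notification: int
    last_notification: int    # 0 means "no upper limit"
    contacts: frozenset

    def matches(self, n):
        return n >= self.first_notification and (
            self.last_notification == 0 or n <= self.last_notification
        )


def contacts_for(n, default_contacts, escalations):
    """Return who is notified for the n-th notification of a problem."""
    matched = [e for e in escalations if e.matches(n)]
    if not matched:
        return set(default_contacts)
    notified = set()
    for e in matched:
        notified |= e.contacts
    return notified


DEFAULT = {"bob"}

# Each level lists only the additional contact, so earlier contacts drop off.
broken = [
    Escalation(3, 5, frozenset({"tim"})),
    Escalation(6, 0, frozenset({"rich"})),
]

# Each level lists everyone who should be notified at that point.
fixed = [
    Escalation(3, 5, frozenset({"bob", "tim"})),
    Escalation(6, 0, frozenset({"bob", "tim", "rich"})),
]

if __name__ == "__main__":
    for n in (1, 3, 6):
        print(n, sorted(contacts_for(n, DEFAULT, broken)),
              sorted(contacts_for(n, DEFAULT, fixed)))
```

In the broken rule set, the sixth notification reaches only Rich; in the fixed set it reaches Bob, Tim, and Rich.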

Some good online documentation describes this escalation scenario.

This was first published in December 2006
