Managing high-volume CPU processes with Bash

Have you ever had a process claim all CPU time without you knowing it? In this article, you'll discover how a simple Bash script can be used as a solution for these situations.

To monitor greedy processes, you can use a shell script that checks the most CPU-hungry process and takes action if its CPU load goes too high. Normally, only a process whose CPU load has gone beyond 50% is worth monitoring. Since only one process at a time can claim more than 50% of the CPU, it is enough to list current usage sorted so that the process with the highest CPU load comes first, and then keep just the top line. The following command does that:

ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1

On my test system where a script with the name "stress" is causing a huge CPU load, the resulting line looks like this:

99.6   9965  stress
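
To see why the numeric reverse sort puts the busiest process on top, here is a minimal sketch that runs the same sort and head stages on a few made-up lines instead of live ps output. The sample values and the variable names sample and top_line are assumptions for illustration. The header line is included on the assumption that ps still prints column headers for pcpu and pid.

```shell
# Made-up lines in the "pcpu pid comm" layout (illustrative values only).
sample=' 0.3   812  sshd
99.6  9965  stress
12.0  4410  mysqld
%CPU   PID'

# Numeric (-n), reverse (-r) sort on the first field (-k1), then keep line 1.
# The non-numeric header counts as 0, so it sinks to the bottom.
top_line=$(printf '%s\n' "$sample" | sort -k1 -n -r | head -1)
echo "$top_line"
```

Because the header sorts below every real process, it never ends up as the top line.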

In this output line, we need the first field because it contains the current CPU load. If we want to send an email to the administrator saying which process is causing the excessive CPU load, we also need the value of the third field. To get these values, we'll use some temporary variables and command substitution to assign the output of a command (the command that returns the current CPU load, or the command that returns the name of the process) to each variable. To be able to check later that we are still looking at the same process, we'll also create a variable that contains the PID of the culprit process. You can accomplish this by using the following lines of code:

USAGE=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{ print $1 } '`
PID=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $2 }'`
PNAME=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $3 }'`
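
The three assignments above each start a new ps, so in principle the values could come from three different snapshots. As a sketch of one alternative (not part of the original script), the top line can be captured once and split with read. The helper name parse_top_line is hypothetical.

```shell
# parse_top_line is a hypothetical helper: it splits one "pcpu pid comm"
# line into USAGE, PID and PNAME with a single read.
parse_top_line() {
    read -r USAGE PID PNAME <<EOF
$1
EOF
}

# In the real script the argument would come from one ps call, for example:
#   parse_top_line "$(ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1)"
parse_top_line "99.6   9965  stress"
echo "usage=$USAGE pid=$PID name=$PNAME"
```

Since PNAME is the last variable given to read, it receives the rest of the line, so a command name containing spaces is kept whole.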

Now that we have all the information we need about the process that is causing the high load, we need to notify someone about it. For example, if the CPU load claimed by one process goes beyond 80%, we could send an email to the server administrator. In this example, we'll use a command that sends an email with the subject line "CPU load of XX is above 80%" to the user root. This command will run only if the CPU load caused by that process really goes beyond 80%. To do that check, we'll compare the value of the USAGE variable to the value 80:

[ $USAGE -gt 80 ] && mail -s "CPU load of $PNAME is above 80%" root < /dev/null

It is very likely that this command runs into problems: the integer comparison in Bash cannot handle floating-point numbers. To prevent issues here, we need to redefine the USAGE variable to get rid of the fractional part. Let's do this by using a pattern-matching operator that removes everything from the first dot to the end of the value of the variable USAGE, so that a usage figure like 99.6 simply becomes 99. The line we need for that is:

USAGE=${USAGE%%.*}

We insert it after the first time that the USAGE variable is defined:

 USAGE=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{ print $1 } '`
 USAGE=${USAGE%%.*}
 PID=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $2 }'`
 PNAME=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $3 }'`
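
To check what the %% operator does before trusting it in the script, the truncation can be tried on a few literal values. This is only a sketch; the helper name to_int and the fallback to 0 for an empty result are assumptions, not part of the original script.

```shell
# to_int is a hypothetical helper: strip everything from the first dot on.
to_int() {
    value=${1%%.*}
    # A value such as ".5" leaves an empty string; fall back to 0 so that
    # a later integer comparison still gets a number.
    echo "${value:-0}"
}

to_int 99.6   # prints 99
to_int 7      # prints 7 (no dot, so nothing is removed)

# With the fraction gone, the integer test works as intended.
[ "$(to_int 99.6)" -gt 80 ] && echo "above threshold"
```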

With this fixed, we have all the basics we need in the script. The next step is to make sure the check repeats automatically. In this case, we'll have it run every 60 seconds:

while true
do
 sleep 60
 USAGE=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{ print $1 } '`
 USAGE=${USAGE%%.*}
 PID=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $2 }'`
 PNAME=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $3 }'`

 [ $USAGE -gt 80 ] && mail -s "CPU load of $PNAME is above 80%" root < /dev/null
done

Cool script so far, but there is one problem: on some systems it is normal for a service to peak for a short period of time. We don't want to be bothered with messages in those instances, since it makes more sense to get a message only if a process is causing high CPU utilization over an extended period of time.

To tell a sustained high load from a normal short peak, we need to run the check twice. If, in the first run, the CPU load of one particular process is above 80%, we save its details in variables such as USAGE1. A few seconds later, we check whether the same process is still causing a high CPU load. You only want the administrator to receive an email if it is. To do this, you can use the following code:

while true
do
 sleep 60
 USAGE=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{ print $1 } '`
 USAGE=${USAGE%%.*}
 PID=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $2 }'`
 PNAME=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $3 }'`

 if [ $USAGE -gt 80 ] 
 then
  USAGE1=$USAGE
  PID1=$PID
  PNAME1=$PNAME
  sleep 7
  USAGE2=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{ print $1 } '`
  USAGE2=${USAGE2%%.*}
  PID2=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $2 }'`
  PNAME2=`ps -eo pcpu,pid -o comm= | sort -k1 -n -r | head -1 | awk '{print $3 }'`
  
  # Now we have variables with the old process information and with the
  # new information

  [ $USAGE2 -gt 80 ] && [ $PID1 = $PID2 ] && mail -s "CPU load of $PNAME is above 80%" root < /dev/null
 fi
done

The above script does the job and checks the CPU load of the top active process every minute. Only if that load is greater than 80% is a second check performed seven seconds later, and only if the same process is still causing a CPU load higher than 80% does the administrator get a message.
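
The decision itself (both samples above the limit, same PID both times) can be isolated and tried with literal numbers, without waiting for a real CPU hog. The following is a minimal sketch under that assumption; the helper name still_busy and the THRESHOLD variable are hypothetical and do not appear in the script above.

```shell
THRESHOLD=80   # the same limit the script uses

# still_busy is a hypothetical helper.
# Arguments: first usage, first PID, second usage, second PID
# (usages already truncated to integers).
# It succeeds only when both samples exceed the threshold for the same PID.
still_busy() {
    [ "$1" -gt "$THRESHOLD" ] && [ "$3" -gt "$THRESHOLD" ] && [ "$2" = "$4" ]
}

if still_busy 99 9965 97 9965; then
    echo "alert: PID 9965 stayed above ${THRESHOLD}%"
fi
```

Keeping the test in one function means the limit can be changed in a single place.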

In this script, you've seen how to monitor process utilization automatically. Start the script once, and it will keep monitoring your top active process until it is stopped. To make sure that the script is started automatically when you boot your server, you can put it in the /etc/init.d/boot.local file (the exact name and location can differ between distributions).

The example script is just a proof of concept, which you can modify to do anything you'd like when a process is misbehaving. I hope you'll find it useful and that it helps make your mission-critical servers more stable.

This was first published in March 2007
