Considerate Automation – Avoid Running Shell Scripts If System State Is Undesirable

One popular way to automate tasks is to schedule shell scripts via cron. However, changes in the underlying system state may cause a script to do more harm than good to the host. For example, automated script executions may overlap and cause resource contention or race conditions, or push an already over-utilized host past its critical point.

When troubleshooting problems like this, the typical areas to look into are:

  • bugs/errors in the script (e.g. too many forks)
  • incorrect cron scheduling (e.g. too frequent/infrequent)
  • changes in the underlying system state (e.g. new process consuming a lot of resources)

Sometimes our initial assumptions (and past configurations) are no longer accurate – for example, a website becomes popular overnight and the web server’s log files are two orders of magnitude larger. Ideally in such cases, we want these scripts to not execute, and instead log a special message to be picked up by a downstream log processor and flagged to the duty operator.

In this post, I will share some ways to detect system state and avoid starting a script when certain conditions are undesirable.
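
As a rough sketch of the "log a special message" idea, the checks in this post could share a small helper like the one below. The function name report_skip and the tag considerate-cron are hypothetical names chosen for illustration; the tag is simply something a downstream log processor could match on.

```shell
#!/bin/sh
# Hypothetical syslog tag that a downstream log processor could filter on.
SKIP_TAG="considerate-cron"

# report_skip REASON
# Logs REASON to syslog under SKIP_TAG and prints a short notice to stderr.
# It does not exit; the caller decides what to do next.
report_skip() {
  # logger may be missing or unable to reach syslog (e.g. in a minimal
  # container), so a failure here is deliberately ignored
  logger -t "$SKIP_TAG" -p user.warning "skipped: $1" 2>/dev/null || true
  echo "Skipping run: $1" >&2
}
```

A script would then call something like report_skip "load average too high" followed by exit 1 whenever one of its pre-flight checks fails.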

Check System Load Average Is Below Threshold

The system load average (typically expressed in 1m, 5m, 15m intervals) can be found at: /proc/loadavg

$ cat /proc/loadavg 
3.70 2.47 1.74 3/874 18067

These are the load averages across all processors on the system. Based on this guide from DataCadamia, the load average on a multi-processor machine is relative to the number of processors available, and in most cases should not exceed 0.7 per processor. However, you should adjust this threshold based on several factors such as the type of work being done, and whether the host is running a critical process (e.g. web server).

Since we know the load average across all processors, the next step is to get the number of processors on the system (credit):

$ grep -c ^processor /proc/cpuinfo

With these two pieces of information, we can check and exit at the start of the script if the current load average is above our defined threshold:


NUM_PROCESSORS=$(grep -c ^processor /proc/cpuinfo)
# the 1-minute load average is the first field
LOADAVG_CURRENT=$(cut -d' ' -f1 /proc/loadavg)
# per-processor threshold; this example uses a conservative 0.1 rather than
# the 0.7 guideline above, so adjust it to suit your host
LOADAVG_THRESHOLD=$(echo "scale=2; $NUM_PROCESSORS * 0.1" | bc)

# bc prints 0 for false, 1 for true
if [ "$(echo "$LOADAVG_CURRENT > $LOADAVG_THRESHOLD" | bc)" -gt 0 ]; then
  # log to syslog
  logger "Load average exceeds threshold! 1min loadavg should be below $LOADAVG_THRESHOLD but was $LOADAVG_CURRENT"
  # log to stderr
  >&2 echo "Load average exceeds threshold, see syslog for details"
  # exit with non-zero error code
  exit 1
fi

# continue with remaining script

Of course, this calculation is only an estimate: the way load averages are derived (see Brendan Gregg’s post), as well as multi-core and multiprocessor CPU architectures, can skew this metric.
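
The check above depends on bc, which is not installed on every host. As an alternative sketch, the same comparison can be done with awk alone. The function names load_exceeds and current_load_exceeds are my own, and the per-processor threshold is passed in as an argument so that it stays easy to tune.

```shell
#!/bin/sh
# load_exceeds LOAD NUM_PROCESSORS PER_CPU_THRESHOLD
# Returns 0 (true) when LOAD is above NUM_PROCESSORS * PER_CPU_THRESHOLD,
# and 1 (false) otherwise.
load_exceeds() {
  awk -v load="$1" -v cpus="$2" -v per_cpu="$3" \
    'BEGIN { exit !(load > cpus * per_cpu) }'
}

# current_load_exceeds PER_CPU_THRESHOLD
# Runs the same check against this host's live 1-minute load average.
current_load_exceeds() {
  load_exceeds "$(cut -d' ' -f1 /proc/loadavg)" \
               "$(grep -c ^processor /proc/cpuinfo)" "$1"
}
```

With this in place, the guard at the top of a script could be as short as: if current_load_exceeds 0.7; then exit 1; fi. Keeping the pure comparison in load_exceeds also makes it possible to try out thresholds with made-up numbers before relying on them.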

Check for Sufficient Free Memory

Another possible resource constraint could be the amount of available main memory (i.e. RAM). Active processes may take up most of the memory on the system, leaving an insufficient amount to run the script. In the best case, the system will use its swap space with some performance degradation (due to increased I/O). In the worst case, thrashing or an unresponsive system might occur.

The usage statistics of memory can be found in /proc/meminfo, and can be displayed in a friendly format by using the free command:

$ cat /proc/meminfo 
MemTotal:        8033384 kB
MemFree:         3814276 kB
MemAvailable:    4764328 kB
Buffers:          186256 kB
Cached:          1746828 kB
SwapCached:            0 kB
Active:          2597952 kB
Inactive:        1016312 kB
Active(anon):    1886120 kB
Inactive(anon):   674636 kB
Active(file):     711832 kB
Inactive(file):   341676 kB
Unevictable:      203512 kB
Mlocked:              48 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:               456 kB
Writeback:             0 kB
AnonPages:       1884576 kB
Mapped:           823248 kB
Shmem:            879580 kB
KReclaimable:     205660 kB
Slab:             281660 kB
SReclaimable:     205660 kB
SUnreclaim:        76000 kB
KernelStack:       11824 kB
PageTables:        62832 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6113840 kB
Committed_AS:    9849568 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       27244 kB
VmallocChunk:          0 kB
Percpu:             1248 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      308472 kB
DirectMap2M:     7960576 kB
DirectMap1G:           0 kB
$ free
              total        used        free      shared  buff/cache   available
Mem:        8033384     2102184     3752344      918912     2178856     4703092
Swap:       2097148           0     2097148

As in the previous section, we want to base the exit decision on the percentage of available memory. Luckily, I found this Stack Overflow thread that explains how to do just that.

We can incorporate this available memory check into our shell script like this:


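# in the "Mem:" row of the free output above, column 2 is "total" and
# column 7 is "available" (column 4 is "free", which ignores reclaimable cache)
AVAILABLE_MEMORY=$(free | awk '/^Mem:/ {print $7/$2 * 100.0}')
AVAILABLE_MEMORY_THRESHOLD=25 # need at least 25% of total memory available

# bc prints 0 for false, 1 for true
if [ "$(echo "$AVAILABLE_MEMORY < $AVAILABLE_MEMORY_THRESHOLD" | bc)" -gt 0 ]; then
  # log to syslog
  logger "Too little memory to run script! Available memory should be above $AVAILABLE_MEMORY_THRESHOLD% but was $AVAILABLE_MEMORY%"
  # log to stderr
  >&2 echo "Too little memory to run script, see syslog for details"
  # exit with non-zero error code
  exit 1
fi

# continue with remaining script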
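
The column layout of free has changed between versions, so another option is to read the values straight from /proc/meminfo. The sketch below assumes a kernel that exposes the MemAvailable field (as in the output shown earlier); the function name mem_available_pct and its optional file argument are my own additions, the latter only so the function can be tried against a saved copy of meminfo.

```shell
#!/bin/sh
# mem_available_pct [MEMINFO_FILE]
# Prints MemAvailable as a whole-number percentage of MemTotal.
# Reads /proc/meminfo unless another file is given.
# Exits non-zero if MemTotal is missing or zero.
mem_available_pct() {
  awk '$1 == "MemTotal:"     { total = $2 }
       $1 == "MemAvailable:" { avail = $2 }
       END {
         if (total > 0) { print int(avail * 100 / total) }
         else           { exit 1 }
       }' "${1:-/proc/meminfo}"
}
```

Because the result is an integer, it can be compared with a plain test, e.g. [ "$(mem_available_pct)" -lt 25 ], without needing bc.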

Check for Concurrent Runs

Sometimes we do not want concurrent runs of a script. Possible reasons are: 1) race conditions when scripts mutate the same files/state, or 2) multiple concurrent script executions may starve other processes.

To ensure that only one instance of the script runs, we would need a semaphore to act as a shared lock between multiple script runs. There are several possible solutions – I’ll summarize and provide links to their original posts:

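  • Wrap the cron command in flock, which gives up immediately (-w 0) if the lock is already held:

$ crontab -l
* * * * * /usr/bin/flock -w 0 /path/to/cron.lock /usr/bin/php /path/to/cron.php

  • Create a lock file with the shell's noclobber option set, so that the redirection fails if the file already exists:

if ( set -o noclobber; echo "locked" > "$lockfile" ) 2> /dev/null; then
  : # locking succeeded
else
  : # locking failed
fi

  • Create a lock directory with mkdir, which fails if the directory already exists:

if mkdir /var/lock/mylock; then
  : # locking succeeded
else
  : # locking failed
fi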

Shell environment variables are not a feasible lock, as it’s possible to have interleaving set/unset operations.
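
One thing the lock file and lock directory approaches leave open is cleanup: if the script never removes its lock, every later run will refuse to start. Here is a minimal sketch of the mkdir approach with a release step. The function names acquire_lock and release_lock and the path in the usage comments are hypothetical.

```shell
#!/bin/sh
# acquire_lock LOCK_DIR
# mkdir either creates the directory or fails because it already exists,
# so only one caller can succeed. Returns 0 on success.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# release_lock LOCK_DIR
# Removes the (empty) lock directory so the next run can proceed.
release_lock() {
  rmdir "$1" 2>/dev/null
}

# Possible usage near the top of a script (hypothetical path):
#   LOCK_DIR="/tmp/myscript.lock"
#   acquire_lock "$LOCK_DIR" || exit 1
#   trap 'release_lock "$LOCK_DIR"' EXIT
```

The trap releases the lock when the script exits normally or with an error. It cannot run if the process is killed with SIGKILL, so a stale lock directory is still possible and may need manual removal.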

Advisory vs Mandatory Locking

One important thing to note is that flock’s locks are advisory – i.e. they only work when every process cooperates to acquire/release the locks appropriately. There is no OS/kernel level enforcement, and other processes can choose to ignore (or even delete!) these locks.

It is possible to set up mandatory file locking, but that involves:

  • Mounting the partition with the mand mount option
  • Setting the set-group-ID bit on the lock file
  • Unsetting the group-execute bit on the lock file
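
The two permission steps can be sketched with a single chmod call. The function name mark_mandatory_lock and the mount point in the comment are hypothetical, and the remount step needs root so it is shown only as a comment. As far as I know, newer kernels (Linux 5.15 onwards) have dropped mandatory locking support altogether, so treat this as an illustration for older systems rather than something to rely on.

```shell
#!/bin/sh
# Step 1 (as root, hypothetical mount point):
#   mount -o remount,mand /path/to/mountpoint

# mark_mandatory_lock FILE
# Steps 2 and 3: set the set-group-ID bit and clear group-execute on FILE.
mark_mandatory_lock() {
  chmod g+s,g-x "$1"
}
```

On a file that started out with mode 644, ls -l should afterwards show a capital S in the group permissions, which indicates set-group-ID without group-execute.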

When done correctly, the lock file will have mandatory locking enabled and behave accordingly when used with the right system calls. However, the kernel docs do have an interesting warning:

Not even root can override a mandatory lock, so runaway processes can wreak havoc if they lock crucial files. The way around it is to change the file permissions (remove the setgid bit) before trying to read or write to it. Of course, that might be a bit tricky if the system is hung :-(

In addition to the kernel docs link above, here are several other posts with more in-depth explanations for those who are interested:

Hope these tips help to achieve “considerate” shell scripts that do not negatively impact the host (e.g. through resource contention) when they are scheduled to execute.
