How can I handle system signals in my job?

On UNIX and UNIX-like systems (such as Linux, which runs on BYU's supercomputers), processes can be sent signals, and they can optionally handle those signals to change their behavior. A few signals cannot be caught by a signal handler, but most can. Which signals are sent depends on the situation.

What are some common signals that can be sent to a process?

The following signals are commonly sent to a process:

  • INT
    • This signal is the most common way to interrupt a process. For example, if you are running a script, and you realize you made a mistake, and you use CTRL-C to cancel it, this is the signal that gets sent to your script.
  • TERM
    • This signal is sent when you use the kill command to kill a process that you own. It is also the first signal that is sent by the scheduling system to jobs that are being removed because one of the following has occurred:
      • The job has exceeded its walltime and will be killed
      • The job is running, and the user has decided to remove it using a command like scancel (or qdel on PBS-style schedulers)
  • KILL
    • This is an unconditional termination signal that can be sent to a process. It cannot be caught and handled, so it causes processes to end immediately. If you have a job that has already been sent a TERM signal, and hasn't exited within a few seconds, the job will be sent a KILL signal instead.
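
To make the difference between catchable and uncatchable signals concrete, here is a minimal Python 3 sketch that tries to install a handler for each of the three signals above. The helper name can_catch is made up for illustration and is not part of any library; the sketch assumes a Linux system, where the attempt for KILL is refused with an OSError.

```python
import signal


def can_catch(signum):
    """Return True if a handler can be installed for signum (illustrative helper)."""
    previous = signal.getsignal(signum)
    try:
        signal.signal(signum, lambda s, f: None)
    except OSError:
        # The operating system refuses handlers for signals such as KILL
        return False
    # Put back whatever handler was in place before the probe
    if previous is not None:
        signal.signal(signum, previous)
    return True


for name in ("SIGINT", "SIGTERM", "SIGKILL"):
    signum = getattr(signal, name)
    status = "can be caught" if can_catch(signum) else "cannot be caught"
    print(name, int(signum), status)
```

Because KILL can never be handled, any cleanup has to happen in response to an earlier signal such as TERM.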

What can I do to clean up a process when it gets sent a signal?

A few signals cannot be caught, but for the most part you can set up a job script or another process to catch and handle common signals and do something different. For example, when a job is killed via a signal, you might want to copy back some in-progress files that you wouldn't normally care about.

Signals are caught by signal handlers, which are usually functions that are associated with a signal. The function is not called until the signal is received. How this is done will depend on the programming language, but a few examples are shown below for common job scripting languages.
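
The point that a handler is only associated with a signal, and not called until the signal arrives, can be sketched in a few lines of Python 3. This is an illustrative sketch rather than a job script: the process sends TERM to itself with os.kill, and the list named received is an assumption introduced only to record what the handler saw.

```python
import os
import signal

received = []  # signal numbers seen by the handler


def term_handler(signum, frame):
    # Runs only when the signal is delivered, not when it is registered
    received.append(signum)


# Associate the handler with TERM; nothing is called yet
signal.signal(signal.SIGTERM, term_handler)
print("after registration:", received)

# Send TERM to this very process, as the kill command would from a shell
os.kill(os.getpid(), signal.SIGTERM)
print("after signal:", received)
```

Unlike the job script examples, this handler does not exit, so the process keeps running after the signal has been handled.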

Examples

Bash

#!/bin/bash
#SBATCH --time=03:00:00  # walltime
#SBATCH --ntasks=8  # number of processor cores (i.e. tasks)
#SBATCH --nodes=1  # number of nodes
#SBATCH --mem-per-cpu=1024M  # memory per CPU core
#SBATCH --qos=test

# define the handler function
# note that this is not executed here, but rather
# when the associated signal is sent
term_handler()
{
        echo "function term_handler called.  Exiting"
        # do whatever cleanup you want here
        # exit statuses range from 0 to 255, so use a positive nonzero value
        exit 1
}

# associate the function "term_handler" with the TERM signal
trap 'term_handler' TERM

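# Do your normal work here.
# Bash defers a trap until the current foreground command finishes,
# so run long commands in the background and "wait" for them, e.g.:
#   ./my_program &   # hypothetical program name
#   wait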

Perl

#!/usr/bin/perl -w
#SBATCH --time=03:00:00  # walltime
#SBATCH --ntasks=8  # number of processor cores (i.e. tasks)
#SBATCH --nodes=1  # number of nodes
#SBATCH --mem-per-cpu=1024M  # memory per CPU core
#SBATCH --qos=test
use strict;


# define the handler function
# note that this is not executed here, but rather
# when the associated signal is sent
sub TERM_handler {
        print "TERM_handler called.  Exiting\n";
        # do other stuff to cleanup here
        exit(1);
}

# associate the function "TERM_handler" with the TERM signal
$SIG{'TERM'} = \&TERM_handler;

# Do your normal work here

Python

#!/usr/bin/env python3
#SBATCH --time=03:00:00  # walltime
#SBATCH --ntasks=8  # number of processor cores (i.e. tasks)
#SBATCH --nodes=1  # number of nodes
#SBATCH --mem-per-cpu=1024M  # memory per CPU core
#SBATCH --qos=test
import signal
import sys


# define the handler function
# note that this is not executed here, but rather
# when the associated signal is sent
def term_handler(signum, frame):
        print("TERM signal handler called.  Exiting.")
        # do other stuff to cleanup here
        sys.exit(1)

# associate the function "term_handler" with the TERM signal
signal.signal(signal.SIGTERM, term_handler)

# Do your normal work here
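
As one possible way to fill in the cleanup step, the following Python 3 sketch copies in-progress files back when TERM arrives. The directory names scratch_dir and results_dir and the helper copy_back are hypothetical; temporary directories stand in for real job paths so that the sketch runs anywhere.

```python
import shutil
import signal
import sys
import tempfile
from pathlib import Path

# Hypothetical locations; replace them with your job's real directories
scratch_dir = Path(tempfile.mkdtemp(prefix="scratch_"))
results_dir = Path(tempfile.mkdtemp(prefix="results_"))


def copy_back():
    """Copy every file in scratch_dir to results_dir and return the copied names."""
    copied = []
    for path in sorted(scratch_dir.iterdir()):
        if path.is_file():
            shutil.copy2(path, results_dir / path.name)
            copied.append(path.name)
    return copied


def term_handler(signum, frame):
    # Keep this short: KILL follows TERM if the job has not exited in time
    print("TERM received, copied:", copy_back())
    sys.exit(1)


# Associate the handler with the TERM signal
signal.signal(signal.SIGTERM, term_handler)

# Stand-in for real work: write a partial result into scratch
(scratch_dir / "partial.dat").write_text("step 1 done\n")
```

Since a KILL signal follows if the job does not exit within a few seconds of TERM, the handler should copy only what is needed and then exit.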