What's the best way to submit lots of very short jobs?

It's becoming very common for users to submit large numbers of very short jobs. Under certain circumstances, this can cause some problems, which are described below along with some techniques for dealing with them.

Problems

Number of Jobs

As the total number of jobs that the scheduling system is tracking increases, the scheduler software has to spend more time processing all of those jobs. Under certain circumstances, this can slow down the responsiveness of commands like squeue.

Additionally, as the total number of jobs increases, the scheduler software's memory utilization increases in proportion. If the host that runs the scheduler runs out of memory as a result, the operating system will kill the scheduler process. We do have a monitoring tool that will automatically try to restart the scheduler within a few minutes, but if it runs out of memory again, it will be killed again, and so on.

Length of Jobs

If the actual running time of each job is too short, that causes problems as well. The scheduler processes a JOBSTART and a JOBEND event for every job, and if the number of these events per minute gets too high, the software develops the same responsiveness problems described above. Additionally, since the scheduler works on a periodic cycle and doesn't necessarily start jobs immediately, every job carries a fixed amount of scheduling overhead, which becomes significant when the jobs themselves are short.

These problems occur when jobs are very short, most often when the job's actual running time is less than 10 minutes. To leave a comfortable margin above that threshold, we recommend that all jobs run for at least 30 minutes.

Solutions

The solution to these issues is relatively simple: run fewer, longer-running jobs. This can be accomplished by aggregating existing processing tasks into longer-running jobs, as described below.

Aggregating jobs together

The most common scenario involves users who have hundreds or thousands of very short, 1-processor jobs. Their job scripts usually look something like this:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:02:00
#SBATCH --mem-per-cpu=2G
#SBATCH --job-name=job_case_1224

cd ~/compute/case_1224/
./run_case

Most of the time, the only thing that changes from case to case is the case number, so it's possible to do something more like this:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:40:00
#SBATCH --mem-per-cpu=2G
#SBATCH --job-name=job_case_1201-1250

for casenumber in {1201..1250}
do

    echo "starting case number $casenumber"

    cd ~/compute/case_$casenumber
    ./run_case

    echo "completed case number $casenumber"

done

In this example, we take advantage of the fact that the case numbers are contiguous, which lets us run 50 cases in a single job. Notice that the requested walltime has been increased to cover all 50 cases.
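The walltime arithmetic is easy to check. The sketch below assumes each case takes about 2 minutes, matching the --time value from the original single-case script:

```shell
# Back-of-the-envelope walltime for the aggregated job, assuming each
# case takes about 2 minutes (the --time value from the original script)
cases=50
minutes_per_case=2
total=$((cases * minutes_per_case))

# 100 minutes formatted as HH:MM:SS for the --time directive
printf -- '--time=%02d:%02d:00\n' "$((total / 60))" "$((total % 60))"
```

In practice it's wise to pad this estimate somewhat, since individual cases may run longer than expected.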

Additional Example

In the following example, the cases are all in subdirectories named case_ followed by a number, but the numbers are not necessarily contiguous. As a result, we can't use the range method from the example above. Instead, we submit the jobs using a variable that contains the list of cases to run.

Job Script

In this example, the job script that runs the cases looks something like this:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=2G

# Note that caselist is defined
# at submission time
for case in $caselist
do

    echo "starting case $case"

    cd ~/compute/$case
    ./run_case

    echo "completed case $case"

done

In this example, we assume the variable caselist has been populated with a space-separated list of case directories. Notice that we don't actually define this variable inside the script. We'll do that when we submit the job instead, as shown below.

Submitting the jobs

To submit these jobs, we define the caselist variable so that sbatch can pass it to the job along with the rest of the environment. This can be done by exporting the variable and then calling sbatch:

export caselist="case_1 case_3 case_6"
sbatch jobscript

...or by specifying it on the same line, before the sbatch command:

caselist="case_1 case_3 case_6" sbatch jobscript

Then, when the job runs, the caselist variable will contain the list "case_1 case_3 case_6".
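The one-line form relies on the shell's per-command environment assignment: the variable is placed in the environment of that single command only. A quick demonstration, with a child bash process standing in for sbatch:

```shell
# An assignment placed before a command is visible in that command's
# environment only, not in the rest of the shell session
caselist="case_1 case_3 case_6" bash -c 'echo "child sees: $caselist"'
echo "parent sees: [$caselist]"
```

The second line prints an empty value (assuming caselist wasn't already exported in the current shell), which shows the assignment didn't leak into the submitting session.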

In addition, we can use a simple script to gather the cases into groups of whatever size we want and submit them. Something like this would work, although you might need to adjust it for your situation:

#!/bin/bash

# Adjust this value to set the maximum number of cases per job
maxcasesperjob=12

mycaselist=""
count=0

# Loop through all files/directories named "case_*"
# (note: if nothing matches, the literal pattern "case_*" is used)
for case in case_*
do
        if [ -z "$mycaselist" ]  # if the case list is empty
        then
                # start a new list with the current case
                mycaselist=$case

                # since this is the first case in the group,
                # also start building the job name
                myjobname="job_$case-"
        else
                # otherwise, append the current case to the list
                mycaselist="$mycaselist $case"
        fi

        # increment the number of cases processed
        count=$((count + 1))

        # see if we have enough cases to submit a job yet
        if [ $((count % maxcasesperjob)) -eq 0 ]
        then
                # we now have $maxcasesperjob cases
                # collected in $mycaselist

                # finish building the job name
                myjobname="$myjobname$case"

                # submit the job
                caselist="$mycaselist" sbatch --job-name="$myjobname" jobscript

                # clear the list for the next group
                mycaselist=""
        fi
done

# handle any cases that are left over
if [ -n "$mycaselist" ]
then
        caselist="$mycaselist" sbatch --job-name="${myjobname}END" jobscript
fi
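Before pointing a script like this at the real scheduler, it can help to dry-run the grouping logic with the sbatch call replaced by echo. The sketch below uses a hypothetical hard-coded case list (the names are assumptions, not real directories) and a group size of 3:

```shell
# Dry run of the grouping logic: print the sbatch commands that would
# be submitted, using a hypothetical hard-coded list of case names
maxcasesperjob=3
mycaselist=""
count=0
for case in case_1 case_3 case_6 case_8 case_9 case_12 case_14
do
        if [ -z "$mycaselist" ]
        then
                mycaselist=$case
                myjobname="job_$case-"
        else
                mycaselist="$mycaselist $case"
        fi

        count=$((count + 1))
        if [ $((count % maxcasesperjob)) -eq 0 ]
        then
                # echo instead of submitting
                echo "caselist=\"$mycaselist\" sbatch --job-name=$myjobname$case jobscript"
                mycaselist=""
        fi
done
if [ -n "$mycaselist" ]
then
        echo "caselist=\"$mycaselist\" sbatch --job-name=${myjobname}END jobscript"
fi
```

With 7 cases and a group size of 3, this prints two full-group submissions and one leftover submission, so you can confirm the grouping and job names before submitting for real.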