Brigham Young University

PBS Batch System Help

Overview

This document describes the steps necessary to prepare and run batch jobs using the TORQUE/PBS resource manager with the Moab scheduler. It applies to the systems named in the examples below: marylou0, maryloux, and marylou4.

For information about how to query the resource manager and scheduler, or how to submit or delete jobs, see our scheduler commands page.

Script files

Like most batch systems, PBS requires you to write a script that defines what the job needs to do. The script may be written in any interpreted language that uses "#" as its comment character, including Perl, bash, and tcsh. It defines the attributes of the job, including its resource requirements, and specifies the tasks that the job needs to accomplish.

Once the script is written, submit it to the scheduler with the qsub command, for example: qsub myscript.sh

To help illustrate the principles at work here, consider the following example, and the accompanying explanation.

Example 1: simple serial job script
#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=24:00:00
#PBS -N test_program
#PBS -m abe
#PBS -M user@byu.edu
PROG=/fslhome/username/compute/test_program

PROGARGS=""
cd $PBS_O_WORKDIR
$PROG $PROGARGS
exit 0

For the above example, we will explain each line:

  • The first line, #!/bin/bash, defines the language and the interpreter used to interpret the job script.
  • The next four lines, which all start with #PBS, are treated by the interpreter (bash in this case) as comments. However, the batch system reads these special comments as directives that define the job's attributes: its resource requirements, its name, and its notification settings.
  • The line that begins #PBS -l (lowercase "L") defines the resources that your job is requesting, as well as other job instructions.
  • The next line, which begins #PBS -N, defines the name applied to the job.
  • The next two lines, which begin with #PBS -m and #PBS -M, define the notifications that the system will send to you about your job.
  • The remainder of the script lists the commands the job will actually execute when it launches. This syntax is mostly outside the scope of this document, but the examples on this page will most likely be sufficient. For further information, we recommend the Advanced Bash-Scripting Guide.
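
The difference between what the batch system reads and what bash runs can be seen with a short experiment. This is only an illustrative sketch: the helper name list_pbs_directives and the throwaway demo script are hypothetical, and the sketch assumes that directives begin with #PBS at the start of a line.

```shell
#!/bin/bash
# Print the lines of a job script that begin with "#PBS".
list_pbs_directives() {
    grep '^#PBS' "$1"
}

# Build a throwaway script that reuses the header of Example 1.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=24:00:00
#PBS -N test_program
#PBS -m abe
#PBS -M user@byu.edu
echo "running"
EOF

# What the batch system would look at: the four directive lines.
directives=$(list_pbs_directives "$demo_script")
echo "$directives"

# What bash does: it skips the comments and runs only the echo.
demo_output=$(bash "$demo_script")
echo "$demo_output"

rm -f "$demo_script"
```

Listing the directives this way before submitting can be a quick check that none of them was mistyped.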

Requesting Resources

In order for the system to schedule your job appropriately, your job script must state the resources the job needs (the #PBS -l line in your script). Currently, we schedule primarily by the number of processors you request. When using a single-node machine, such as marylou0, use the ncpus=n syntax to specify the number of processors needed. For example:

Example 2: Requesting processors on marylou0
#PBS -l ncpus=8,walltime=24:00:00

When using a cluster, such as maryloux or marylou4, use the nodes=n:ppn=n syntax to specify the number of nodes and processors per node (ppn) needed. For further information, see SMP vs. MPI.

Example 3: Requesting nodes and processors on maryloux or marylou4
#PBS -l nodes=2:ppn=2,walltime=24:00:00

If you need special resources, like Infiniband on marylou4, or Myrinet on maryloux, you need to request those resources as well. This is done by appending the appropriate tag to the request line. For example, to request 8 nodes, using 2 processors per node, for 72 hours on marylou4, you might use #PBS -l nodes=8:ppn=2,walltime=72:00:00, but if you needed them to be Infiniband-enabled nodes, you would use #PBS -l nodes=8:ppn=2:ib,walltime=72:00:00. Similarly, on maryloux, you can append :myri to the resource request to use Myrinet nodes.

Example 4: Requesting special resources - Infiniband and Myrinet
# Need 8 infiniband nodes on marylou4
#PBS -l nodes=8:ppn=4:ib,walltime=12:00:00

OR

# Need 2 myrinet nodes on maryloux
#PBS -l nodes=2:ppn=2:myri,walltime=16:00:00
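
The pieces of a cluster request line fit together in a fixed order: nodes, processors per node, an optional tag, then the walltime. The following sketch assembles them; the function name build_request is hypothetical and exists only for illustration.

```shell
#!/bin/bash
# build_request NODES PPN WALLTIME [TAG]
# Prints a "#PBS -l" line in the nodes=n:ppn=n[:tag],walltime=hh:mm:ss form.
build_request() {
    local nodes=$1 ppn=$2 walltime=$3 tag=${4:-}
    local spec="nodes=${nodes}:ppn=${ppn}"
    # Append the optional tag (ib or myri in the examples above).
    if [ -n "$tag" ]; then
        spec="${spec}:${tag}"
    fi
    echo "#PBS -l ${spec},walltime=${walltime}"
}

build_request 8 4 12:00:00 ib     # the Infiniband request from Example 4
build_request 2 2 16:00:00 myri   # the Myrinet request from Example 4
build_request 2 2 24:00:00        # no tag, as in Example 3
```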

The system also needs to know the total amount of time you expect the job to run, known as walltime. The system will terminate your job if it exceeds the specified walltime, so don't set it too low. On the other hand, setting the walltime too high causes problems for the scheduler, and shorter jobs have a greater chance of starting sooner through a technique known as backfilling. The walltime is specified in the format hh:mm:ss, where hh is the number of hours, mm the minutes, and ss the seconds. In Example 1 above, the first #PBS line requests walltime=24:00:00, meaning 24 hours, 0 minutes, and 0 seconds. If you don't specify a walltime, the system will assume a default, currently 1 hour. The system will also adjust your job's start priority based on your historical walltime accuracy.
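
To make the hh:mm:ss format concrete, the sketch below converts a walltime string to seconds. The function name walltime_to_seconds is hypothetical and is not a scheduler command; it only illustrates how the three fields combine.

```shell
#!/bin/bash
# walltime_to_seconds HH:MM:SS
# Prints the total number of seconds in a walltime string.
walltime_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # The 10# prefix forces base 10 so that fields like "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

walltime_to_seconds 24:00:00   # prints 86400
walltime_to_seconds 01:30:00   # prints 5400
```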

Scratch Space

Many of the systems have a separate disk area dedicated to scratch work. This area is usually optimized for large amounts of input and output, and is not backed up. We recommend that you copy your job's input data files to some location there, and arrange for temporary processing files to be written there as well, copying only the final results back to your home directory. One method of doing this is shown below in Examples 5, 6, and 8.

All of our systems provide scratch space in /fslhome/username/compute. On marylou4 this space is shared among all the compute nodes. On maryloux, this space is local to each compute node, and is not shared.
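
The copy-in, compute, copy-out pattern can be tried outside the batch system with the sketch below. The function names stage_in and stage_out are hypothetical, temporary directories stand in for $PBS_O_WORKDIR and the scratch location, and a trivial tr command stands in for the real program.

```shell
#!/bin/bash
# stage_in SRC SCRATCH: create SCRATCH and copy the contents of SRC into it.
stage_in() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# stage_out SCRATCH DEST: copy everything in SCRATCH back to DEST.
stage_out() {
    cp -r "$1"/. "$2"/
}

# Stand-ins for the submission directory and the scratch directory.
submit_dir=$(mktemp -d)
scratch_dir=$(mktemp -d)/job_scratch
echo "input data" > "$submit_dir/input.txt"

stage_in "$submit_dir" "$scratch_dir"

# The "job": all reading and writing happens inside scratch.
( cd "$scratch_dir" && tr 'a-z' 'A-Z' < input.txt > result.txt )

stage_out "$scratch_dir" "$submit_dir"
cat "$submit_dir/result.txt"
```

Quoting the directory variables keeps the copies working when a path contains spaces. In a real job you might copy back only the result files rather than the whole scratch directory.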

Job Naming

Each job can be assigned a name, which is used when generating its output files. The scheduler also assigns the job a number, which is printed when you submit the job. The name is assigned using the #PBS -N line, as shown in Example 1 above. If you don't specify a name, the system will use the basename of your job script.

Job Output

The system automatically creates two files for each of your jobs, one containing the job's standard output and one containing its standard error. These files are placed in the directory you were in when you submitted the job, and are named using the job's name and the job ID number assigned by the system. For example, if your job was named test_job and was given the number 243, you would get a file named test_job.o243 containing the job's standard output and one named test_job.e243 containing its standard error.
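
The naming rule can be written down as a small helper. This is a sketch only: the function name job_output_files is hypothetical, and it assumes that a job ID may carry a server suffix after a dot (for example 243.servername), in which case only the leading number appears in the file names.

```shell
#!/bin/bash
# job_output_files NAME JOBID
# Prints the expected standard-output and standard-error file names.
job_output_files() {
    local name=$1 jobid=$2
    # Keep only the part before the first dot (assumed server suffix).
    local number=${jobid%%.*}
    echo "${name}.o${number}"
    echo "${name}.e${number}"
}

job_output_files test_job 243
```

For the job in the example above, this prints test_job.o243 followed by test_job.e243.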

NOTE: Standard output and error files are spooled on the compute nodes, then copied back to your home directory when the job finishes. If you want these files to remain in your home directory while the job runs, you can specify this in your job script with the -k option, for example: #PBS -k oe

Notification

If you request it, the system can attempt to email you when your job begins execution, ends execution, or aborts. This is done using the #PBS -m and #PBS -M lines shown in Example 1 above.

The line that begins #PBS -m specifies the events you wish to be notified about: a for aborted execution, b for the beginning of execution, and e for the end of execution. You may use any combination of these letters, in any order. If you don't want any notifications, simply use the line #PBS -m n, or leave the notification directives out entirely.

The line that begins #PBS -M defines the email address at which you want to receive the notifications.
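
As a quick sanity check on a -m value, the rule described above (any combination of a, b, and e, or n alone) can be sketched as follows. The function name valid_mail_events is hypothetical, and the check is an illustration rather than something the batch system provides.

```shell
#!/bin/bash
# valid_mail_events VALUE
# Succeeds if VALUE is "n" alone or any combination of the letters a, b, and e.
valid_mail_events() {
    [[ $1 == n || $1 =~ ^[abe]+$ ]]
}

for events in abe ea n xyz; do
    if valid_mail_events "$events"; then
        echo "$events: ok"
    else
        echo "$events: not a valid -m value"
    fi
done
```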

SMP vs. MPI

The Supercomputing Lab manages two types of systems: large single-node systems and clusters. Each should be used in a specific way. A single-node system is one machine that contains a large number of processors, shared memory, etc.; currently the only such system is marylou0. All that is required for a program to use these systems correctly is for the program to be multi-threaded and launched on the system as described above. For this reason, we prefer that the resource request use the ncpus=n syntax, which assumes a single-node request. This type of processing is referred to as "SMP", or "Symmetric Multi-processing".

Clusters, like maryloux and marylou4, operate somewhat differently. When using a cluster, the program must be designed to run on multiple computers, or nodes, simultaneously, while coordinating the efforts of all of its instances. The easiest way to accomplish this is to use a communication framework like MPI (Message Passing Interface). This means, however, that the job needs to request a number of nodes and a number of processors per node, using the nodes=n:ppn=n syntax. For example, marylou4 has 4 processors available per node. Therefore, if your job needs 32 total processors, you can request nodes=32:ppn=1, nodes=16:ppn=2, or nodes=8:ppn=4. If you request more processors per node than are available, the system will not be able to schedule your job. This is why we prefer the nodes=n:ppn=n syntax over the ncpus=n syntax on the clusters: ncpus=n assumes one node. If you request ncpus=32 on a cluster, for example, the system will try to find a 32-processor node and, being unable to do so, will simply fail to schedule your job.
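
The arithmetic behind the 32-processor example can be sketched as a loop over the possible ppn values. The function name list_layouts is hypothetical; it prints every nodes=n:ppn=n combination that gives exactly the requested total without exceeding the processors available per node.

```shell
#!/bin/bash
# list_layouts TOTAL MAX_PPN
# Prints each nodes/ppn pair whose product is TOTAL, with ppn <= MAX_PPN.
list_layouts() {
    local total=$1 max_ppn=$2 ppn
    for (( ppn = 1; ppn <= max_ppn; ppn++ )); do
        # Only keep ppn values that divide the total evenly.
        if (( total % ppn == 0 )); then
            echo "nodes=$(( total / ppn )):ppn=${ppn}"
        fi
    done
}

# 32 processors on a cluster with 4 processors per node
list_layouts 32 4
```

This prints nodes=32:ppn=1, nodes=16:ppn=2, and nodes=8:ppn=4, the same three requests listed above. A ppn of 3 is skipped because 32 is not divisible by 3.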

Using MPI

In order to launch an MPI process, you have to use an MPI launcher. The most common of these is called mpirun. However, another launcher named mpiexec, available on both marylou4 and maryloux, is preferred.

Using mpiexec

The mpiexec job launcher is currently available on marylou4 and maryloux. It is a more robust and more easily managed launcher than the traditional mpirun. See the following example:

Example 5: 8 processor mpiexec job
#!/bin/bash
#PBS -l nodes=4:ppn=2,walltime=72:00:00
#PBS -N test_program

#PBS -m abe
#PBS -M user@byu.edu

PROG=/fslhome/username/compute/test_program
PROGARGS=""
SCRATCH_DIR=/fslhome/username/compute/$USER/$PBS_JOBID


#make sure the scratch directory is created
mkdir -p $SCRATCH_DIR

#copy datafiles from directory where I typed qsub, to scratch directory
cp -r $PBS_O_WORKDIR/* $SCRATCH_DIR/

#change to the scratch directory
cd $SCRATCH_DIR

# Execute the mpi job
/opt/mpiexec/bin/mpiexec $PROG $PROGARGS

#copy data back from scratch directory to directory where I typed qsub
cp -r $SCRATCH_DIR/* $PBS_O_WORKDIR/

exit 0

mpiexec assumes that you are using standard Ethernet (Gigabit on marylou4, 100Mbps on maryloux) as your transport by default. If you need to use Infiniband on marylou4, simply add -comm ib. Similarly, if you need to use Myrinet on maryloux, add -comm gm. See these examples:

Example 6: 8 processor Infiniband mpiexec job
#!/bin/bash
#PBS -l nodes=4:ppn=2:ib,walltime=72:00:00
#PBS -N test_program

#PBS -m abe
#PBS -M user@byu.edu

PROG=/fslhome/username/compute/test_program
PROGARGS=""
SCRATCH_DIR=/fslhome/username/compute/$USER/$PBS_JOBID


#make sure the scratch directory is created
mkdir -p $SCRATCH_DIR

#copy datafiles from directory where I typed qsub, to scratch directory
cp -r $PBS_O_WORKDIR/* $SCRATCH_DIR/

#change to the scratch directory
cd $SCRATCH_DIR

# Execute the mpi job
/opt/mpiexec/bin/mpiexec -comm ib $PROG $PROGARGS

#copy data back from scratch directory to directory where I typed qsub
cp -r $SCRATCH_DIR/* $PBS_O_WORKDIR/

exit 0

Example 7: 8 processor Myrinet mpiexec job
#!/bin/bash
#PBS -l nodes=4:ppn=2:myri,walltime=72:00:00
#PBS -N test_program

#PBS -m abe
#PBS -M user@byu.edu

PROG=$HOME/compute/test_program
PROGARGS=""

#cd into the directory where I typed qsub
cd $PBS_O_WORKDIR

# Execute the mpi job
/opt/mpiexec/bin/mpiexec -comm gm $PROG $PROGARGS

exit 0

Using mpirun

Several versions of mpirun are installed, and the one you use determines the transport your job runs over. If you want to use standard Ethernet as your transport medium on either marylou4 or maryloux, we recommend /opt/mpich/gnu/bin/mpirun. If you need to use Infiniband on marylou4, use /ibrix/topspin/mpi/mpich/bin/mpirun. For Myrinet on maryloux, use /opt/mpich/myrinet/gnu/bin/mpirun.
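
Unlike mpiexec, mpirun must be told how many processes to start. Example 8 below sets NP by hand; as an alternative, the count can be derived from the node file. The sketch below assumes that the file named by $PBS_NODEFILE contains one line per requested processor, which is how TORQUE normally writes it. The function name count_slots and the sample node names are hypothetical, and a temporary file stands in for the real node file so the sketch can be run outside a job.

```shell
#!/bin/bash
# count_slots NODEFILE
# Prints the number of non-empty lines, i.e. the number of processor slots.
count_slots() {
    grep -c . "$1"
}

# Stand-in for $PBS_NODEFILE with nodes=4:ppn=2 (each node listed twice).
sample_nodefile=$(mktemp)
printf '%s\n' node1 node1 node2 node2 node3 node3 node4 node4 > "$sample_nodefile"

NP=$(count_slots "$sample_nodefile")
echo "NP=$NP"

rm -f "$sample_nodefile"
```

Inside a real job script the call would be NP=$(count_slots "$PBS_NODEFILE"), which keeps NP in step with the #PBS -l line if the request is later changed.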

Further Examples

Example 8: 8 processor MPI job
#!/bin/bash
#PBS -l nodes=4:ppn=2,walltime=72:00:00
#PBS -N test_program

#PBS -m abe
#PBS -M user@byu.edu

PROG=/fslhome/username/compute/test_program
PROGARGS=""
SCRATCH_DIR=/fslhome/username/compute/$USER/$PBS_JOBID


#make sure the scratch directory is created
mkdir -p $SCRATCH_DIR

#copy datafiles from directory where I typed qsub, to scratch directory
cp -r $PBS_O_WORKDIR/* $SCRATCH_DIR/

#change to the scratch directory
cd $SCRATCH_DIR

# NP should always be nodes * ppn from the #PBS -l directives above

NP=8

# Execute the mpi job
/opt/mpich/gnu/bin/mpirun \
	-machinefile $PBS_NODEFILE \
	-np $NP \
	$PROG $PROGARGS

#copy data back from scratch directory to directory where I typed qsub
cp -r $SCRATCH_DIR/* $PBS_O_WORKDIR/

exit 0

Example 9: 4 processor SMP job
#!/bin/bash

#PBS -l ncpus=4,walltime=12:00:00
#PBS -N test_program
#PBS -k oe
#PBS -m abe
#PBS -M user@byu.edu

$HOME/compute/test_program


Copyright © 1994-2008. Brigham Young University. All Rights Reserved.