Note: this page is deprecated; BYU's supercomputers now use SLURM.
General Batch Information
Overview / Jobs
Like many high-performance computing systems, BYU's Supercomputers are managed by a batch scheduling system. This means that in order to use the supercomputer, users must encapsulate the workload into a non-interactive job, and submit that job to the scheduling system. The job can specify a number of parameters to the scheduling system, including the following:
(* indicates a required attribute)
- Either the number of nodes and processors per node, or the total number of processors *
- Memory/RAM needed *
- Expected running time (or "walltime") *
- Specific node features/attributes (GPUs, processor types, etc.)
- Local disk space needed
- Events to notify on (job abort, begin, end)
- Email address to send notifications to
- Job name (used for output files)
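These parameters are typically given as directives at the top of the job script itself. Below is a minimal sketch assuming the Torque/PBS `#PBS` directive syntax; the job name, resource amounts, and email address are illustrative, and the Script Generator should be consulted for the exact syntax used on these systems.

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=4          # 1 node, 4 processors per node
#PBS -l walltime=01:00:00      # expected running time of 1 hour
#PBS -l mem=4gb                # memory/RAM needed
#PBS -N myjob                  # job name, used for the output files
#PBS -m abe                    # notify on job abort, begin, and end
#PBS -M user@example.com       # address to send notifications to

# The scheduler starts the script in the home directory; change to the
# directory the job was submitted from (PBS_O_WORKDIR is set by Torque/PBS,
# so we fall back to "." when running outside the scheduler).
cd "${PBS_O_WORKDIR:-.}"
echo "Job started on $(hostname)"
# The job's real program would be launched here.
```

Because the `#PBS` lines are shell comments, the script is still an ordinary bash script; the scheduler reads the directives, and the shell ignores them.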
Jobs are created by building a job script, usually written in bash, tcsh, perl, or python, that specifies the parameters described above and then launches the program that will do the job's work.
The scheduling system keeps track of the current state of all resources and jobs, and decides, based on current conditions and the policies we have configured, where and when to start jobs. Jobs are started in priority order until no further jobs are eligible to start or no appropriate resources remain. The factors included in the priority calculation are:
- Historical usage patterns (e.g. how much has been used recently)
  - Per user
  - Per research group
- Historical wallclock accuracy
- Total time queued
When the scheduling system chooses to start a job, it assigns it one or more nodes/processors, as requested by the job, and launches the provided job script on the first node in the list of those assigned. Taking advantage of all the nodes/processors assigned to the job is left up to the job script. Please do not request more resources unless you know you can use them.
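For example, an MPI program might be launched across every assigned processor as sketched below. The launcher name and flags depend on the installed MPI stack, and `my_mpi_program` is a hypothetical executable; the `if` guard simply makes the snippet harmless outside the scheduler.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=8,walltime=04:00:00,mem=8gb

cd "${PBS_O_WORKDIR:-.}"
# Under Torque/PBS, $PBS_NODEFILE names a file containing one line per
# assigned processor, so its line count is the job's total processor count.
if [ -r "${PBS_NODEFILE:-}" ]; then
    NPROCS=$(wc -l < "$PBS_NODEFILE")
    mpiexec -np "$NPROCS" ./my_mpi_program   # hypothetical MPI executable
fi
```

Non-MPI workloads need some other mechanism (e.g. a launcher that starts one worker per assigned processor) to make use of more than the first node.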
Every job in the system works its way through a series of states:
- Running: the job has been assigned a set of resources and is running.
- Eligible: the job is being considered for running, and would already be running if enough resources were available.
- Blocked: the job is not being considered for running, because of either a policy violation (maximum processors in use per user, etc.) or an unfulfilled condition (e.g. an unfulfilled job dependency).
When jobs are submitted, they are given a unique number, which can be used to check on the job's status, look at output, etc.
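With the Torque/PBS query commands, checking on a job might look like the sketch below; `1234` stands in for a real job number, and the guard keeps the snippet harmless on machines without the scheduler tools. See the Summary of Commands page for the full command list.

```shell
#!/bin/bash
JOBID=1234   # hypothetical job number, printed by qsub at submission
if command -v qstat >/dev/null 2>&1; then
    qstat "$JOBID"       # show this job's current state
    qstat -u "$USER"     # list all of your queued/running jobs
fi
```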
Any output a job writes to its STDOUT or STDERR streams is collected by the system and copied back to the files JOBNAME.oJOBNUMBER (for STDOUT) and JOBNAME.eJOBNUMBER (for STDERR), where JOBNUMBER is the unique job number assigned at submission, and JOBNAME is either the name assigned with the -N parameter in the script or, if no name is specified, the basename of the script used to submit the job. Since the output filenames include the unique job number, they are guaranteed to be unique.
Many people also want to see the output during the job's execution. This can be accomplished using shell redirection inside the job script.
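A minimal sketch of that redirection is below; the log filename is illustrative, and the placeholder function stands in for the job's real executable.

```shell
#!/bin/bash
cd "${PBS_O_WORKDIR:-.}"

# Placeholder standing in for the job's real program.
my_program() { echo "step 1 of 3 complete"; }

# Redirect STDOUT and STDERR to a file in the working directory; the file
# can be watched while the job runs (e.g. with `tail -f myjob.live.log`)
# instead of waiting for the scheduler to copy output back at job end.
my_program > myjob.live.log 2>&1
```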
Summary of Commands
A summary of commands can be found under
Documentation > Job Submission > Summary of Commands
For information about resource requests, we recommend exploring the Script Generator, as well as the corresponding YouTube videos.
All jobs must request memory; see Memory Requests.