Why won't my job start?

The whypending command shows general information about why a job is pending:

whypending $jobid

If Slurm has assigned a start time for the job, it will appear at the bottom of the output:

The job is expected to start by 2017-11-03T11:47:22

Of course, it may not actually start at that time. Here are a few reasons the estimate can change:

  • Current jobs may terminate before their scheduled walltime, allowing you to run sooner.
  • Another user with higher priority may submit new jobs before your start time, pushing your jobs back.
  • Your job may backfill, allowing you to run sooner.
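
You can also ask Slurm directly for its current estimate. A quick sketch (the job id 12345 is just an example):

squeue --start -j 12345
scontrol show job 12345 | grep StartTime

The first command reports the expected start time for a pending job; the second prints the line of the job record containing the StartTime field.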

Your job may be held

If you see a reason of JobHeldUser or JobHeldAdmin, then the job is in the held state. This may be because it was submitted with -H or --hold, or because an administrator or account manager ran scontrol hold or scontrol uhold on it. If the state is JobHeldAdmin, an FSL employee must release the job. If the state is JobHeldUser, you can release it yourself with scontrol release $jobid.
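
For example, assuming a job id of 12345, you can check the job's state and reason and then clear a user hold:

squeue -j 12345 -o '%i %T %r'
scontrol release 12345

Remember that scontrol release only works for user holds; an admin hold must be released by FSL.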

OUTDATED INFORMATION BELOW

There can be many reasons why a job won't run. Below are some of the same steps that FSL admins follow when investigating a stuck job. Use checkjob -v 12345 to diagnose most issues (where 12345 is the job id). Note the "-v" parameter.

If your job won't submit at all, take a look at this page: Why won't my job submit?

Is there an upcoming maintenance period?

Check our calendar site to determine whether an upcoming maintenance period is getting in your way. In general, we create scheduler reservations for maintenance periods, starting around 8:00 AM local time on the first day of the maintenance. With such a reservation in place, if the scheduler cannot complete your job (based on its requested walltime) before the reservation begins, the job will not start. For example, if your job requests 10 days of walltime but the maintenance period starts 5 days from now, your job won't start until the maintenance is over.
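
If you want to check for reservations yourself, Moab's showres command lists active and upcoming reservations; a maintenance reservation should show up there (output details vary by configuration):

showres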

Are resources available?

Go to the script generator and enter parameters (nodes, ppn, features) similar to what you requested. Look at the box "Resources available with requested features". If it says "Free processors: 0/(bignumber)", then the problem is likely that the system is full. Note that the script generator does not take memory requests into account.

How much memory is the job requesting?

If you submitted using "mem=", the amount of memory you requested is for the entire job (meaning each processor gets totalmem/#procs). If you chose "pmem=", the amount of memory is per processor. Thus if you chose "nodes=1:ppn=8,pmem=48GB", you requested 8 * 48GB = 384GB of memory, and your job will never run. See https://marylou.byu.edu/resources for details.
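
As a sketch (Torque syntax; adjust the numbers for your own job), here are two ways to request 48GB in total for a single 8-core node:

#PBS -l nodes=1:ppn=8,mem=48gb
#PBS -l nodes=1:ppn=8,pmem=6gb

The first requests 48GB for the whole job; the second requests 6GB per processor, which also totals 48GB across 8 processors.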

A simple sanity check is to run "checkjob -v 12345". Look at the line that says "Dedicated Resources Per Task:" and the following line "TasksPerNode:". Multiply the MEM value by TasksPerNode. If the result is larger than the memory on any Supercomputing node, you are out of luck.
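
A hypothetical worked example, reusing the numbers above (illustrative values, not literal checkjob output):

MEM per task:  48G
TasksPerNode:   8
48G * 8 = 384G per node

384GB is more memory than our nodes offer, so a job with these values would never start.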

Run checkjob -v 12345 | egrep 'MSG|Message'. If that returns:

  • job 12345 violates active SOME POLICY NAME limit of 1400 (or something similar): you already have many jobs in the running or eligible state and must simply wait for some of your other jobs to finish.
  • job violates class configuration 'wclimit too high for class 'batch' (3596400 > 1382400): the walltime= value that you chose was too high. The numbers shown are in seconds (here the limit, 1382400 seconds, is 16 days).
  • PBS_Server System error: please contact us ASAP if you see this. It likely affects multiple users.

MPI jobs

If your MPI job starts but doesn't appear to do anything, please see FAQ entry: My MPI job appears to run but never produces output. Is it actually running?