Slurm
Slurm is a resource manager originally developed at Lawrence Livermore National Laboratory and now maintained primarily by Moe Jette and Danny Auble of SchedMD.
Quick Reference
Commands
- squeue lists your jobs in the queue
- sinfo lists the state of all machines in the cluster
- sbatch submits batch jobs (use srun for an interactive job on the blades or Blue Gene)
- sprio lists the relative priorities of pending jobs in the queue and how they are calculated
- sacct displays accounting and submission data for jobs
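For example, a few common invocations of these commands (the job ID, partition, and script names are placeholders):
squeue -u $USER
sinfo
sbatch -p cluster script.sh
sprio -j 12345
sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed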
Queues
See the individual system pages (List of Available Systems).
FAQ
Please see the FAQ in the official Slurm documentation: http://www.schedmd.com/slurmdocs/faq.html
Resource specification
Options of interest (see the manual page for sbatch for a complete list):
- -n, --ntasks=ntasks: number of tasks to run
- -N, --nodes=N: number of nodes on which to run (N = min[-max])
- -c, --cpus-per-task=ncpus: number of CPUs required per task
- --ntasks-per-node=n: number of tasks to invoke on each node
- -i, --input=in: file for the batch script's standard input
- -o, --output=out: file for the batch script's standard output
- -e, --error=err: file for the batch script's standard error
- -p, --partition=partition: partition requested
- -t, --time=minutes: time limit
- -D, --chdir=path (--workdir=directory in older Slurm versions): working directory for the batch script
- --mail-type=type: notify on state change: BEGIN, END, FAIL, or ALL
- --mail-user=user: who to send email notification for job state changes
Note that any of the above can be specified in a batch file by preceding the option with #SBATCH. All options defined this way must appear at the top of the batch file, before any other commands, with nothing separating them. For example, the following will send the job's output to a file called joboutput.<the job's ID>:
#SBATCH -o joboutput.%J
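For instance, a minimal sketch of a batch script header combining several of the options above (the partition name, node and task counts, and time limit are placeholders to adjust for your system):
#!/bin/bash
#SBATCH -p cluster
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
#SBATCH -t 60
#SBATCH -o joboutput.%J
#SBATCH --mail-type=ALL
#SBATCH --mail-user=example@rpi.edu
srun ./a.out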
Example job submission scripts
See also: Modules for any additional options/requirements of specific MPI implementations. Typically, it is necessary to load the same modules at runtime (before calling srun) that were used when building a binary.
Simple (non-MPI)
A simple (non-MPI) job can be started by just calling srun:
#!/bin/bash -x
srun ./a.out
For example, the above job could be submitted to run 16 tasks on 1 node, in the partition "cluster", with the current working directory set to /foo/bar, email notification of the job's state turned on, a time limit of four hours (240 minutes), and STDOUT redirected to /foo/bar/baz.out as follows (where script.sh is the script):
sbatch -p cluster -N 1 -n 16 --mail-type=ALL --mail-user=example@rpi.edu -t 240 -D /foo/bar -o /foo/bar/baz.out ./script.sh
Note: In a simple, non-MPI case, running multiple tasks will create multiple instances of the same binary.
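A quick way to see this behavior, run from inside a job allocation, is a sketch such as:
srun -n 4 hostname
which prints the hostname of the node running each task, once per task.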
Interactive
Interactive jobs are supported. See the srun command manual page for details. Remember to always specify a partition (-p). Here is a usage example launching xterm on the compute node allocated to an interactive session:
salloc -p cluster xterm -e 'ssh -X `srun hostname`'
Alternatively:
salloc -p opterons ssh -X `srun -s hostname`
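Another common pattern is to request a pseudo-terminal shell directly with srun (a sketch; the partition name and time limit are placeholders):
srun -p cluster -N 1 -t 60 --pty bash -i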
MPICH
Example job script slurmMpich.sh:
#!/bin/bash -x
module load <compilerModuleName> mpich
srun ./a.out
MVAPICH2
Example job script slurmMvapich2.sh:
#!/bin/bash -x
module load mvapich2
srun ./a.out
Note that users with applications needing MPI_THREAD_MULTIPLE support must set the environment variable MV2_ENABLE_AFFINITY to 0 before running:
export MV2_ENABLE_AFFINITY=0
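For example, placed in the job script from the MVAPICH2 example above, before the srun call:
#!/bin/bash -x
module load mvapich2
export MV2_ENABLE_AFFINITY=0
srun ./a.out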
OpenMPI
Example job batch script slurmOpenMpi.sh:
#!/bin/bash -x
module load openmpi
srun ./a.out
IBM Spectrum MPI or Mellanox HPC-X
These implementations do not have direct Slurm support, so it is necessary to use mpirun. You must have passwordless SSH keys set up for mpirun to work.
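A minimal sketch of creating passwordless SSH keys on a shared home filesystem (paths are the OpenSSH defaults; adjust to local policy):
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys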
Example job batch script slurmSpectrum.sh:
#!/bin/bash -x
srun hostname -s > /tmp/hosts.$SLURM_JOB_ID
if [ "x$SLURM_NPROCS" = "x" ]
then
    if [ "x$SLURM_NTASKS_PER_NODE" = "x" ]
    then
        SLURM_NTASKS_PER_NODE=1
    fi
    SLURM_NPROCS=`expr $SLURM_JOB_NUM_NODES \* $SLURM_NTASKS_PER_NODE`
fi
mpirun -hostfile /tmp/hosts.$SLURM_JOB_ID -np $SLURM_NPROCS ./a.out
rm /tmp/hosts.$SLURM_JOB_ID
GPU-Direct
To enable GPU-Direct ('CUDA-aware MPI'), pass the -gpu flag to mpirun.
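For example, a sketch modifying the mpirun line from the slurmSpectrum.sh example above:
mpirun -gpu -hostfile /tmp/hosts.$SLURM_JOB_ID -np $SLURM_NPROCS ./a.out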
Job arrays/Many small jobs
For many small jobs running simultaneously or in quick succession, it is often better to submit one large job rather than many small jobs. On some systems, doing otherwise leads to resource fragmentation and poor scheduler performance. Example:
#!/bin/sh
#SBATCH --job-name=TESTING
#SBATCH -t 04:00:00
#SBATCH -D /gpfs/u/<home or barn or scratch>/<project>/<user>
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email>
srun -N8 -o testing.log ./my-executable <options> &
srun -N8 -o testing2.log ./my-executable <options> &
srun -N8 -o testing3.log ./my-executable <options> &
srun -N8 -o testing4.log ./my-executable <options> &
wait
The differences are the addition of an ampersand (&) at the end of each srun command and the wait command at the end of the script. This will run all 4 jobs in parallel within the allocation and wait until all 4 are complete. For this example, the batch script should be submitted as sbatch -N32 <script> to ensure enough nodes are allocated for all the jobs that will run in parallel. This can also be done with individual tasks to fill a minimal number of nodes (replace -N with -n in the sbatch/srun calls).
On a cluster that uses consumable resources, such as the ERP cluster, it is important to also specify a subset of the resources for each srun command. Otherwise, the first srun command will use all resources assigned to the job and the next srun command will wait until they are released, printing the warning "Job step creation temporarily disabled".
By default, -c, --cpus-per-task=1, so the option can be left out if tasks only require one core. However, more complex job arrays where some processes require fewer or more CPUs will need the option supplied on each srun command.
Example, submitted with sbatch -n4 --mem=16G --cpu:
#!/bin/sh
#SBATCH --job-name=TESTING
#SBATCH -t 04:00:00
#SBATCH -D /gpfs/u/<home or barn or scratch>/<project>/<user>
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email>
srun -n1 --mem=4G -o testing.log ./my-executable <options> &
srun -n1 --mem=4G -o testing2.log ./my-executable <options> &
srun -n1 --mem=4G -o testing3.log ./my-executable <options> &
srun -n1 --mem=4G -o testing4.log ./my-executable <options> &
wait
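For more heterogeneous cases, the per-step -c option can be varied. A sketch (CPU counts, memory sizes, and log names are placeholders; the sbatch request must cover the sum of the per-step resources):
srun -n1 -c4 --mem=8G -o big.log ./my-executable <options> &
srun -n1 -c1 --mem=4G -o small.log ./my-executable <options> &
wait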
Matlab
Multi-node Matlab scripts will require unique configuration; please contact support for more information. A single-node (multi-threaded) script may look like the following:
#!/bin/bash
module load matlab
srun matlab -nodisplay -nosplash -nodesktop -nojvm -r example
# or matlab -nodisplay -nosplash -nodesktop -nojvm < example.m
CCI customizations
slurm-account-usage
The slurm-account-usage tool queries the Slurm database to report project usage for a given system. Running the tool without any arguments outputs the number of allocations granted (via sbatch, salloc, or an interactive srun) and the total number of core-hours used by the invoking user's project (i.e., all allocations and core-hours by all members of the project). It can be supplied with optional start and end dates to narrow the result.
Note that srun commands run from within an allocation are not counted towards the number of records the tool reports.
For customers/partners with many projects, a user can be designated to view information from other projects under the umbrella organization. For more information please contact support.
Usage
Log in to the front-end node of the system you wish to retrieve usage for (bgrs01, drpfen01, amos, etc.) and run the following command:
slurm-account-usage [START_DATE [END_DATE]]
The optional START_DATE and END_DATE define the inclusive period to retrieve usage from. The date for each field should be specified as YYYY-MM-DD.
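For example, to report usage for a specific period (the dates are placeholders):
slurm-account-usage 2020-01-01 2020-06-30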