Batch – Guide on Slurm Submission

Our Slurm batch system lets us share our GPU, CPU, WS and Desktop compute fairly among all IPPP members.

Available Resources

For information about the systems available please see the following pages:

Simple SBATCH Submission Example Script

Save the following example as mfj.sh, replacing $USER with your username:

#!/bin/bash
# Name:   My First Job
 
# These are SBATCH commands (SBATCH doesn't take shell variables)
#SBATCH --job-name="MyFirstJob"                  # Job Name
#SBATCH --get-user-env                           # User Environment
#SBATCH --error="job-%j.err"                     # Redirect STDERR (Error output) to this file %j is a variable for JobID
#SBATCH --output="job-%j.out"                    # Redirect STDOUT (Normal output) to this file %j is a variable for JobID
#SBATCH --mem=2G                                 # Requested memory for the Job (Default is 2G)
#SBATCH --export=ALL                             # Export Current Environment Variables (Default ALL)
#SBATCH -D /mt/batch/$USER                       # Put all output on batch storage (--chdir is bugged)
 
# The rest is similar to a standard shell script
echo "My First Job is Running"
hostname
 
# Remember to exit cleanly
exit 0;

Lines beginning with #SBATCH are directives read by sbatch and are NOT comments. All other lines starting with # are comments.

Then make the file executable and submit it:

user@desktop~: chmod +x mfj.sh
user@desktop~: sbatch mfj.sh
Submitted batch job 1234

The above example writes the text “My First Job is Running” and the hostname of the node the job ran on to a file named job-1234.out (from the --output pattern job-%j.out). The file appears in your batch working directory, which is the directory given with -D (/mt/batch/$USER in the script above).
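
Because #SBATCH lines do not take shell variables, the username has to be written into the script before it is submitted. One way to avoid editing it by hand is to generate the script. The following is a minimal sketch only: write_job_script is a hypothetical helper name, not a site-provided tool, and the /mt/batch path is simply copied from the example above.

```shell
#!/bin/bash
# Sketch: write a job script with the username already expanded in the
# #SBATCH lines. write_job_script is a hypothetical helper, not part of Slurm.
write_job_script() {
    local user="$1" outfile="$2"
    # Unquoted heredoc delimiter, so ${user} is expanded while the file is written.
    cat > "$outfile" <<EOF
#!/bin/bash
#SBATCH --job-name="MyFirstJob"
#SBATCH --error="job-%j.err"
#SBATCH --output="job-%j.out"
#SBATCH --mem=2G
#SBATCH -D /mt/batch/${user}

echo "My First Job is Running"
hostname
exit 0
EOF
    chmod +x "$outfile"
}

# Example usage (left commented out so the sketch runs without Slurm):
# write_job_script "$(id -un)" mfj.sh
# sbatch mfj.sh
```

The generated file contains a literal path in the -D line, so sbatch never sees an unexpanded variable.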

For larger, more IO-intensive jobs, please modify your script to write its output to /scratch/$USER, the local scratch partition, and copy any scripts and input files there before running. When the job finishes, copy the results back to batch storage. This helps avoid network bottlenecks and the aggregate IO slowdown on the disk servers.

For example, a high-IO job could write its output like this:

myHighOutputJob >/scratch/$USER/std.out 2>/scratch/$USER/std.err
mv /scratch/$USER/std.{out,err} /mt/batch/$USER/
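
If the job command exits with an error, it is still useful to get the logs back from scratch. The following is a rough sketch of that pattern under stated assumptions: run_staged is a hypothetical helper name, and the scratch and destination directories are passed in as arguments rather than taken from any site configuration.

```shell
#!/bin/bash
# Sketch: run a command with its output on local scratch, then move the logs
# to batch storage whether or not the command succeeded.
# run_staged is a hypothetical helper; both directories are arguments.
run_staged() {
    local scratch_dir="$1" dest_dir="$2"
    shift 2
    mkdir -p "$scratch_dir" "$dest_dir" || return 1

    # Run the payload and remember its exit status instead of stopping here.
    local status=0
    "$@" >"$scratch_dir/std.out" 2>"$scratch_dir/std.err" || status=$?

    # Move both logs back, then report the payload's own exit status.
    mv "$scratch_dir/std.out" "$scratch_dir/std.err" "$dest_dir/" || return 1
    return "$status"
}

# Example usage inside a job script:
# run_staged "/scratch/$USER/$SLURM_JOBID" "/mt/batch/$USER" myHighOutputJob
```

Returning the payload's status means Slurm still records a failed job as failed, even though the logs were copied back.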

Checking your Jobs

Check your currently queued jobs

squeue -u $USER

Check a particular job

scontrol show job <jobid>

for example

scontrol show job 1234

Cancel a job

scancel <jobid>

for example

scancel 1234
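
Both scontrol and scancel need the job ID that sbatch printed at submission. When submitting from a script, the ID can be pulled out of the "Submitted batch job 1234" line shown earlier. This is a small sketch rather than an official tool: job_id_from_sbatch is a hypothetical helper name. sbatch also has a --parsable option that prints the job ID without the surrounding text, which may be simpler where it suits.

```shell
#!/bin/bash
# Sketch: extract the numeric job ID from sbatch's confirmation line.
# job_id_from_sbatch is a hypothetical helper, not a Slurm command.
job_id_from_sbatch() {
    local line="$1" id
    # Only accept lines of the form "Submitted batch job <jobid>".
    case "$line" in
        "Submitted batch job "*) id="${line#Submitted batch job }" ;;
        *) return 1 ;;
    esac
    id="${id%% *}"    # drop any trailing text after the ID
    # Reject empty or non-numeric IDs.
    case "$id" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$id"
}

# Example usage (left commented out so the sketch runs without Slurm):
# jobid=$(job_id_from_sbatch "$(sbatch mfj.sh)") && scontrol show job "$jobid"
```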

Check the systems available

If your jobs don’t seem to be starting, you can check what systems are available/offline/draining

sinfo -a

Check the current queue

squeue -a
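
When many jobs are queued, a per-state summary can be easier to read than the full squeue listing. The sketch below counts state names read from standard input. It assumes the input is one state per line, as produced by squeue with the -h (no header) and -o "%T" (state only) options; count_job_states is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: count how many jobs are in each state.
# Reads one state name per line on stdin and prints "<STATE> <count>" lines.
# count_job_states is a hypothetical helper, not a Slurm command.
count_job_states() {
    # Tally the first field of each non-empty line, then sort for stable output.
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example usage (left commented out so the sketch runs without Slurm):
# squeue -h -u "$USER" -o "%T" | count_job_states
```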

Advanced SBATCH Submission Example Script

Save the following example as maj.sh, replacing $USER with your username. This script is not tested, and you should add extra checks to ensure you don’t delete anything important if parts fail.

#!/bin/bash
# Name:   My Advanced Job
 
# These are SBATCH commands (SBATCH doesn't take shell variables)
#SBATCH --job-name="MyAdvancedJob"                  # Job Name
#SBATCH --get-user-env                              # User Environment
#SBATCH --error="job-%j.err"                        # Redirect STDERR (Error output) to this file %j is a variable for JobID
#SBATCH --output="job-%j.out"                       # Redirect STDOUT (Normal output) to this file %j is a variable for JobID
#SBATCH --mem=2G                                    # Requested memory for the Job (Default is 2G)
#SBATCH --export=ALL                                # Export Current Environment Variables (Default ALL)
#SBATCH -D /mt/batch/$USER                          # Put all output on batch storage by default (--chdir is bugged)
 
# The rest is similar to a standard shell script

# Build our job env
JOBDIR=/scratch/$USER/$SLURM_JOBID
echo "JOB $SLURM_JOBID - Building Environment"
mkdir -p "$JOBDIR"
cp -r /mt/batch/$USER/my_payload "$JOBDIR"/

# We move here so anything run relatively will save locally rather than batch
cd "$JOBDIR" || exit 1

echo "JOB $SLURM_JOBID - Running"
hostname 1>"$JOBDIR/stdout.log" 2>"$JOBDIR/stderr.log"
# Append (>>) so the hostname output above is not overwritten
python dothings 1>>"$JOBDIR/stdout.log" 2>>"$JOBDIR/stderr.log"

# Clean up and return job output as a tar
echo "JOB $SLURM_JOBID - Finishing"
# -C stores paths relative to /scratch/$USER rather than absolute paths
tar -zcf /scratch/$USER/job-$SLURM_JOBID.tar.gz -C /scratch/$USER "$SLURM_JOBID"
rsync /scratch/$USER/job-$SLURM_JOBID.tar.gz /mt/batch/$USER/ >/dev/null
rm -rf "$JOBDIR"

# Remember to exit cleanly
exit 0;
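
The rm -rf step is where an empty variable can do the most damage, since a missing job ID would widen the path being deleted. As one possible extra check, the sketch below refuses to delete anything unless the path has the shape <scratch root>/<user>/<numeric job ID>. remove_job_dir is a hypothetical helper name, and the directory layout is assumed from the script above.

```shell
#!/bin/bash
# Sketch: only delete a job directory when the path looks like
# <scratch_root>/<user>/<jobid>. remove_job_dir is a hypothetical helper.
remove_job_dir() {
    local scratch_root="$1" user="$2" jobid="$3"
    # Refuse empty components, which would widen what rm -rf removes.
    if [ -z "$scratch_root" ] || [ -z "$user" ] || [ -z "$jobid" ]; then
        echo "remove_job_dir: refusing to delete, empty path component" >&2
        return 1
    fi
    # The username must be a single path component.
    case "$user" in
        */*|.|..) echo "remove_job_dir: invalid user '$user'" >&2; return 1 ;;
    esac
    # The job ID must be purely numeric.
    case "$jobid" in
        *[!0-9]*) echo "remove_job_dir: invalid job ID '$jobid'" >&2; return 1 ;;
    esac
    local target="$scratch_root/$user/$jobid"
    [ -d "$target" ] || return 0    # nothing to clean up
    rm -rf -- "$target"
}

# Example usage at the end of a job script:
# remove_job_dir /scratch "$USER" "$SLURM_JOBID"
```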