Batch – Guide on Slurm Submission
Our Slurm batch system allows us to share our GPU, CPU, WS and Desktop compute fairly among all IPPP members.
Available Resources
For information about the systems available, please see the following pages:
Simple SBATCH Submission Example Script
Save the following example as mfj.sh, changing $USER to your username.
#!/bin/bash
# Name: My First Job
# These are SBATCH directives (sbatch does not expand shell variables in them)
#SBATCH --job-name="MyFirstJob"   # Job name
#SBATCH --get-user-env            # User environment
#SBATCH --error="job-%j.err"      # Redirect STDERR (error output) to this file; %j is a variable for the JobID
#SBATCH --output="job-%j.out"     # Redirect STDOUT (normal output) to this file; %j is a variable for the JobID
#SBATCH --mem=2G                  # Requested memory for the job (default is 2G)
#SBATCH --export=ALL              # Export current environment variables (default ALL)
#SBATCH -D /mt/batch/$USER        # Put all output on batch storage (--chdir is bugged)

# The rest is similar to a standard shell script
echo "My First Job is Running"
hostname

# Remember to exit cleanly
exit 0
Lines beginning with #SBATCH are directives interpreted by sbatch and are NOT comments. All other lines starting with # are ordinary comments.
We then make the file executable and submit it like so:
user@desktop~: chmod +x mfj.sh
user@desktop~: sbatch mfj.sh
Submitted batch job 1234
The above example will output the text “My First Job is Running” and the hostname of the node the job ran on. This will appear in your batch working directory (/mt/batch/$USER, as set with -D above) in a file named similar to job-1234.out.
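Once the job has finished, you can read the output file directly; for example, assuming the job ID was 1234 and the paths from the script above:
user@desktop~: cat /mt/batch/$USER/job-1234.out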
For larger, more IO-intensive jobs, please modify your script to write its output to /scratch/$USER and copy any files to and from the local scratch partition, then copy the results back when the job finishes. This helps eliminate network bottlenecks and aggregate IO slowdown on the disk servers.
For example, redirecting high-IO output:
myHighOutputJob >/scratch/$USER/std.out 2>/scratch/$USER/std.err
mv /scratch/$USER/std.{out,err} /mt/batch/$USER/
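Note that this assumes /scratch/$USER already exists on the node your job lands on; if that isn't guaranteed, create it first in your script:
mkdir -p /scratch/$USER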
Checking your Jobs
Check your currently queued jobs
squeue -u $USER
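If you want to keep an eye on the queue without retyping the command, one option is to refresh it periodically with watch:
watch -n 60 squeue -u $USER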
Check a particular job
scontrol show job <jobid>
for example
scontrol show job 1234
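The output of scontrol is fairly verbose; to pull out just the state of a job, for example, you can filter it:
scontrol show job 1234 | grep -i JobState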
Cancel a job
scancel <jobid>
for example
scancel 1234
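You can also cancel all of your own queued and running jobs in one go:
scancel -u $USER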
Check the systems available
If your jobs don’t seem to be starting, you can check what systems are available/offline/draining
sinfo -a
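For a per-node view of state and reason information, sinfo can also be run in its long, node-oriented form:
sinfo -N -l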
Check the current queue
squeue -a
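For pending jobs that have already been scheduled, Slurm can also report an estimated start time:
squeue -u $USER --start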
Advanced SBATCH Submission Example Script
Save the following example as maj.sh, changing $USER to your username. This script is not tested, and you should add extra checks to ensure you don't delete anything important if parts fail; a sketch of one possible safeguard follows the script.
#!/bin/bash
# Name: My Advanced Job
# These are SBATCH directives (sbatch does not expand shell variables in them)
#SBATCH --job-name="MyAdvancedJob"  # Job name
#SBATCH --get-user-env              # User environment
#SBATCH --error="job-%j.err"        # Redirect STDERR (error output) to this file; %j is a variable for the JobID
#SBATCH --output="job-%j.out"       # Redirect STDOUT (normal output) to this file; %j is a variable for the JobID
#SBATCH --mem=2G                    # Requested memory for the job (default is 2G)
#SBATCH --export=ALL                # Export current environment variables (default ALL)
#SBATCH -D /mt/batch/$USER          # Put all output on batch storage by default (--chdir is bugged)

# The rest is similar to a standard shell script

# Build our job environment
echo "JOB $SLURM_JOBID - Building Environment"
mkdir -p /scratch/$USER/$SLURM_JOBID >/dev/null
scp -r /mt/batch/$USER/my_payload /scratch/$USER/$SLURM_JOBID/ >/dev/null

# We move here so anything run with a relative path will save locally rather than on batch storage
cd /scratch/$USER/$SLURM_JOBID

echo "JOB $SLURM_JOBID - Running"
hostname 1>/scratch/$USER/$SLURM_JOBID/stdout.log 2>/scratch/$USER/$SLURM_JOBID/stderr.log
# Append (>>) so the hostname output above is not overwritten
python dothings 1>>/scratch/$USER/$SLURM_JOBID/stdout.log 2>>/scratch/$USER/$SLURM_JOBID/stderr.log

# Clean up and return the job output as a tar archive
echo "JOB $SLURM_JOBID - Finishing"
tar -zcf /scratch/$USER/job-$SLURM_JOBID.tar.gz /scratch/$USER/$SLURM_JOBID
rsync -r /scratch/$USER/job-$SLURM_JOBID.tar.gz /mt/batch/$USER/ >/dev/null
rm -rf /scratch/$USER/$SLURM_JOBID >/dev/null

# Remember to exit cleanly
exit 0
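As the warning above notes, the advanced script removes its scratch directory unconditionally, even if an earlier step failed. The following is a minimal sketch of one possible safeguard, not a tested recipe: it assumes you want the job to stop at the first error and to archive whatever exists before deleting anything (my_payload and dothings are the same placeholders as in the script above).
#!/bin/bash
#SBATCH --job-name="MySaferJob"
#SBATCH -D /mt/batch/$USER

set -euo pipefail                     # Abort the job at the first failing command

WORKDIR=/scratch/$USER/$SLURM_JOBID
mkdir -p "$WORKDIR"

cleanup() {
    # Archive whatever was produced, and only delete scratch if the copy back succeeded
    tar -zcf /scratch/$USER/job-$SLURM_JOBID.tar.gz -C /scratch/$USER $SLURM_JOBID
    rsync /scratch/$USER/job-$SLURM_JOBID.tar.gz /mt/batch/$USER/ && rm -rf "$WORKDIR"
}
trap cleanup EXIT                     # Runs on normal exit and on failure

cd "$WORKDIR"
cp -r /mt/batch/$USER/my_payload .    # Stage the payload onto local scratch
python dothings 1>stdout.log 2>stderr.log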