Using sbatch / SLURM on Anvil
Interactive jobs on Anvil, such as Jupyter Notebooks and VS Code, are usually launched via notebook.anvilcloud.rcac.purdue.edu or ondemand.anvil.rcac.purdue.edu. These interactive jobs let you edit and execute code cells in real time and watch the results. However, they terminate either after a pre-designated number of hours (OnDemand) or after the user has been idle for 20-30 minutes (notebook.anvilcloud). When working on a large project, such as a Corporate Partner project in The Data Mine, you may need more than the 16-CPU-core / 32 GB RAM limit imposed by these interactive methods, or you may need to run one or more jobs that take many hours to complete. To do so, you must use SLURM.
What is SLURM?
SLURM is the name of the scheduler used on Anvil that allows users to submit one or more long-running jobs to the Anvil cluster. The SLURM scheduler controls which of the 1000 nodes on Anvil will be used to run the various jobs people submit. It assigns whole or partial nodes depending on how many CPU cores and how much memory are requested.
What is sbatch?
sbatch is the name of the program on Anvil used to submit jobs to the SLURM scheduler. It supports many arguments, such as how long the job will run, how many CPU cores it requires, and so on. You can specify these arguments either on the command line when you invoke sbatch or, more conveniently, by embedding them directly in the sbatch script you create as specially crafted comments. An example sbatch script may look like this:
#!/bin/bash -l
#SBATCH -N 1 # Number of nodes. ALWAYS set to 1
#SBATCH -n 1 # Number of tasks. ALWAYS set to 1
#SBATCH -c 1 # Number of CPU cores. Can go as high as 128
# Each additional CPU core adds around 1.9GB of RAM so
# to get more memory, add more CPU cores.
#SBATCH -t 1:0:0 # Number of hours to run (H:M:S). Change as needed.
#SBATCH -A cis220051 # The TDM account to charge for this. Don't change.
#SBATCH -p shared # Partition to use. Rarely change
# These three lines "load" the TDM python. Almost always keep them.
module use /anvil/projects/tdm/opt/core
module load tdm
module load python/seminar r/seminar
# This is the python program we will run
python3 myprogram.py
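Since memory on the shared partition scales with CPU cores (about 1.9 GB per core, per the comments in the script above), you can work out how many cores to request for a target amount of RAM. A minimal sketch in shell arithmetic, assuming that 1.9 GB/core figure:

```shell
# Roughly 1.9 GB of RAM comes with each CPU core, so to reach a target
# amount of memory, request ceil(target / 1.9) cores.
TARGET_GB=32
# Work in tenths of a GB to avoid floating point: 1.9 GB -> 19
CORES=$(( (TARGET_GB * 10 + 19 - 1) / 19 ))  # ceiling division
echo "Request at least $CORES cores for ${TARGET_GB} GB"
```

For example, 32 GB works out to 17 cores (17 x 1.9 GB is about 32.3 GB), which you would request with #SBATCH -c 17.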
If the above script were saved to a file named run1.sh, you would submit it to SLURM using the command:
sbatch run1.sh
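When sbatch accepts a job, it prints a confirmation line containing the assigned job ID. A sketch of extracting that ID for later use (the line below is hard-coded for illustration, with 1234567 as a made-up job number; in a real script you would capture it with OUTPUT=$(sbatch run1.sh)):

```shell
# sbatch prints a confirmation like the line below; it is hard-coded here
# so the sketch runs anywhere. In a real script you would instead use:
#   OUTPUT=$(sbatch run1.sh)
OUTPUT="Submitted batch job 1234567"

# The job ID is the fourth word of that line
JOBID=$(echo "$OUTPUT" | awk '{print $4}')
echo "Job ID: $JOBID"
```

The saved JOBID can then be passed to commands like scancel.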
A few more things to know
It’s best to place the sbatch script in the same directory as the code you want to run.
You can check the status of your job using the squeue --me command.
You can cancel a job using the scancel JOBID command, where JOBID is the job number assigned to your job when you submitted it. You can find this JOBID using the squeue --me command.
You can have more than one job running at a time, but don't go crazy and run dozens of jobs at once. If you feel you need to submit many jobs at once, please contact someone on the Data Mine staff first.
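For instance, a small loop can submit a few scripts in one go. The sketch below is a dry run that only prints the commands it would issue (the run1.sh through run3.sh filenames are illustrative); remove the leading echo to actually submit:

```shell
# Dry run: print the sbatch command that would be issued for each script.
# Filenames are illustrative; remove "echo" to actually submit the jobs.
for script in run1.sh run2.sh run3.sh; do
    echo sbatch "$script"
done
```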