Using GPUs on Anvil
A great many computational tasks on Anvil can be accomplished using CPU cores only, and this is the bulk of what most students do on Anvil. However, certain tasks, particularly those involving machine learning and deep learning, can be greatly accelerated by using GPUs (Graphics Processing Units). Machine learning and deep learning programs are generally written in Python to use CPU cores at first, and then modified to use GPUs once the code is mature. This is often as simple as changing a single line of code, but the details depend on the specific machine learning library being used.
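For example, in PyTorch the change usually amounts to selecting a device and moving the model and data onto it. The sketch below is only an illustration and assumes PyTorch is installed; the model and tensor are hypothetical placeholders.

import torch

# Fall back to the CPU when no GPU is present, so the same script
# works during CPU-only development and in a GPU batch job.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical model and input. The only GPU-specific change is
# moving them to the chosen device.
model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)
output = model(x)
print(f"Forward pass ran on: {device}")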
Jupyter Notebooks are often used on Anvil to develop and test machine learning code initially. However, on Anvil, GPU access is charged to our allocation for each minute a GPU is dedicated to a task. That means if we were to assign a GPU to a Jupyter Notebook session, we would be charged for the entire time that session is open, even while we are merely editing code cells and not actively using the GPU. This consumes our GPU allocation at a rapid rate. Therefore, we recommend developing and testing code in Jupyter Notebooks with CPUs, transferring all of the Python code in the notebook to a Python file, and then running that Python file in batch mode using SLURM and sbatch when we are ready for the full training run. This ensures that we are only charged for GPU usage while the code is actually running, not while we are editing and debugging.
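One convenient way to move notebook code into a plain Python file is Jupyter's nbconvert tool; the notebook name below is a hypothetical example, and any cleanup of notebook-specific code (such as cell magics) is still up to you.

# Convert the notebook's code cells into train_model.py
jupyter nbconvert --to script train_model.ipynb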
GPU resources are scarce, and we must be good stewards of our allocation. Please do not use GPUs for tasks that can be easily accomplished using CPU cores only.
Scheduling a GPU SLURM job using sbatch
Before proceeding with SLURM and sbatch for GPU work on Anvil, you must first read the sbatch guide and watch the corresponding video. Together they are an essential prerequisite for understanding the rest of this guide.
Now that you have read the SLURM / sbatch guide and watched the corresponding video, we can proceed with using SLURM to run a job that uses GPUs.
An example sbatch script that uses GPUs might look like this:
#!/bin/bash -l
#SBATCH -N 1 # Number of nodes. ALWAYS set to 1
#SBATCH -n 1 # Number of tasks. ALWAYS set to 1
#SBATCH -c 32 # Number of CPU cores. Each GPU gets 32 CPU cores
# so we should always ask for 32!
#SBATCH -t 1:0:0 # Run for 1 hour. Change as needed
#SBATCH -A cis220051-gpu # the TDM account to charge for this
#SBATCH -p gpu # Must use the "gpu" or "gpu-debug" partition
# The wait for "gpu-debug" is shorter, but there is a
# maximum runtime of just 15 minutes!
#SBATCH --gpus-per-node=1 # Must use just one GPU. Do not change!
# These three lines load the TDM Python environment
module use /anvil/projects/tdm/opt/core
module load tdm
module load python/seminar r/seminar
# This is where you specify the program you want to run!
python3 havegpu.py
If the above script were saved to a file named rungpu1.sh, you would submit it to SLURM using the command:
sbatch rungpu1.sh
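The contents of havegpu.py are whatever Python program you want to run. As a first test, a short script along the lines of the sketch below can confirm that the job actually received a GPU; this assumes PyTorch is available in the loaded environment and is only an illustration, not the actual havegpu.py.

import torch

# Report whether the batch job can actually see a GPU before
# committing to a long training run.
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; check the partition and --gpus-per-node settings.")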
A few more things to know
It’s best to place the sbatch script in the same directory as the code you want to run.
You can check the status of your job using the squeue --me command.
You can cancel a job using the scancel JOBID command, where JOBID is the job number assigned to your job when you submitted it. You can find the JOBID using the squeue --me command (see the example after this list).
Our GPU allocation isn’t infinite, so please be considerate of others when using these resources. You can have more than one job running at a time, but don’t go crazy and submit dozens of jobs at once. If you feel you need to submit many jobs at once, please contact someone on the Data Mine staff first to discuss.
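For reference, a typical monitor-and-cancel sequence looks like the following; the job ID shown is an illustrative placeholder, not a real job.

# List your own pending and running jobs; the JOBID is in the first column
squeue --me

# Cancel a specific job by its JOBID (1234567 is a made-up example)
scancel 1234567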