TDM 30200: Project 5 — 2023
Motivation: In this project we will slowly get familiar with SLURM, the job scheduler installed on Anvil.
Context: This is the first in a series of 3 projects focused on parallel computing using SLURM and Python.
Scope: SLURM, unix, Python
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/coco/unlabeled2017/*.jpg
Questions
Interested in being a TA? Please apply: purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE
Question 1
This project (and the next) will have different types of deliverables. Each question will result in an entry in a Jupyter notebook, and/or 1 or more additional Python and/or Bash scripts. To properly save screenshots in your Jupyter notebook, follow the guidelines here. Images that don’t appear in your notebook in Gradescope will not get credit.
BEFORE you start your JupyterLab session this week, please change "Processor cores requested" from 1 to 4. We will use 4 processing cores this week.
Most of the supercomputers here at Purdue, including Anvil, have one or more frontends. Users log in to a frontend and submit jobs to run on one or more backends. To submit those jobs, users use SLURM.
SLURM is a job scheduler found on about 60% of the top 500 supercomputers.[1] In this project (and the next) we will learn about ways to schedule jobs on SLURM, and learn the tools used.
Let’s get started by using a program called salloc. A brief explanation is that salloc gets some resources (think memory and cpus), and runs the commands specified by the user. If the user doesn’t specify any commands, it will open the user’s default shell (bash, zsh, fish, etc.) on the allocated resources.
Open a terminal and give it a try.
salloc -A cis220051 -p shared -n 3 -c 1 -t 00:05:00 --mem-per-cpu=1918
After some output, you should notice that your shell changed. Type hostname followed by enter to see that your host has changed from loginXX.anvil.rcac.purdue.edu to aXXX.anvil.rcac.purdue.edu. You are in a different system! Very cool!
To find out what the other options are, read slurm.schedmd.com/salloc.html
- The -A cis220051 option could have also been written --account=cis220051. This indicates which account to use when allocating the resources (memory and cpus). You can also think of this as a "queue" or "the datamine queue". Jobs submitted using this option will use the resources we pay for. Only users with permissions can submit to our queue.
- The -n 3 option could have also been written --ntasks=3. This indicates how many tasks we may need for the job.
- The -c 1 option could have also been written --cpus-per-task=1. This indicates the number of cores per task.
- The -t 00:05:00 option could have also been written --time=00:05:00. This indicates how long the job may run for. If the job exceeds the time limit, it is killed.
- The --mem-per-cpu=1918 option indicates how much memory (in MB) we may need for each cpu in the job.
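Putting the long-form options together, the salloc command above could equivalently be written as follows (the same options are also accepted by srun and sbatch, which we will use shortly).

salloc --account=cis220051 --partition=shared --ntasks=3 --cpus-per-task=1 --time=00:05:00 --mem-per-cpu=1918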
To confirm, use the following script to see how much memory and how many cpus are available to us in this salloc session. Copy and paste the contents of this script into a file called get_info.py in your $HOME directory. Once saved, make sure it is executable by running the following command.
chmod +x $HOME/get_info.py
#!/usr/bin/env python3

import socket
import os
from pathlib import Path
from datetime import datetime
import time


def main():
    time.sleep(5)

    print(f'Hostname: {socket.gethostname()}')

    # Find this process's cpuset and memory cgroup paths
    with open("/proc/self/cgroup") as file:
        for line in file:
            if 'cpuset' in line:
                cpu_loc = "cpuset" + line.split(":")[2].strip()
            if 'memory' in line:
                mem_loc = "memory" + line.split(":")[2].strip()

    base_loc = Path("/sys/fs/cgroup/")

    # Count the CPUs assigned to this job, e.g. "0-2,5" -> 4 CPUs
    with open(base_loc / cpu_loc / "cpuset.cpus") as file:
        num_cpu_sets = file.read().strip().split(",")

    num_cpus = 0
    for s in num_cpu_sets:
        if len(s.split("-")) == 1:
            num_cpus += 1
        else:
            num_cpus += (int(s.split("-")[1]) - int(s.split("-")[0]) + 1)

    print(f"CPUs: {num_cpus}")

    # Read the memory limit (in bytes) imposed on this job
    with open(base_loc / mem_loc / "memory.limit_in_bytes") as file:
        mem_in_bytes = int(file.read().strip())

    print(f"Memory: {mem_in_bytes/1024**2} MB")


if __name__ == "__main__":
    print(f'started at: {datetime.now()}')
    main()
    print(f'finished at: {datetime.now()}')
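If you would like to double check that the executable bit was set by the earlier chmod command, something like the following should show an x in the permissions string (the owner, group, size, and timestamp in the output will differ for you).

ls -l $HOME/get_info.py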
To use it, run the following.
~/get_info.py
For this question, add a screenshot of running hostname in the salloc session, as well as running ~/get_info.py, to your notebook.
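When you are done taking your screenshots, you can leave the salloc session and release the allocated resources by exiting the shell.

exit

Running hostname again afterwards should show that you are back on a loginXX.anvil.rcac.purdue.edu frontend.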
- Code used to solve this problem.
- Output from running the code.
Question 2
salloc can be useful, but most of the time we want to run a job.
Before we get started, read the answer to this stackoverflow post. In many instances, it is easiest to use 1 cpu per task and let SLURM distribute those tasks across the available resources. In this course, we will use this simplified model.
So what is the difference between srun and sbatch? This stackoverflow post does a pretty great job explaining. You can think of sbatch as the tool for submitting a job script to be queued and executed later, and srun as the tool for running a job (or a job step within an existing job) right away. We will test out both!
In the previous question, we used salloc to get the resources, hop onto the system, and run hostname along with our get_info.py script.
Use srun to run our get_info.py script, to better understand how the various options work. Try to guess the results of the script for each configuration.
When inside a SLURM job, a variety of environment variables are set that alter how srun behaves. If you open a terminal from within Jupyter Lab and run the following, you will see them.
env | grep -i slurm
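The exact set of variables depends on how your session was requested, but you will likely see names such as the following (values omitted here).

SLURM_CLUSTER_NAME
SLURM_JOB_ID
SLURM_NTASKS
SLURM_CPUS_PER_TASK
SLURM_MEM_PER_CPU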
These variables alter the behavior of srun. We can, however, unset them, and the behavior will revert to the default. In your terminal, run the following.
for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done;
Confirm that the environment variables are unset by running the following.
env | grep -i slurm
You must repeat this process for each new terminal you’d like to use within Jupyter Lab. This means that if you work on this project for a while and reopen it the next day, you will need to rerun the bash command to remove the SLURM environment variables.
Great! Now, we can work in our nice Jupyter Lab environment without any concern that SLURM environment variables are changing any behaviors. Let’s test it out with something actually predictable.
srun -A cis220051 -p shared -n 2 -c 1 -t 00:00:05 $HOME/get_info.py
srun -A cis220051 -p shared -n 1 -c 2 -t 00:00:05 $HOME/get_info.py
srun -A cis220051 -p shared -n 1 -c 2 --mem=1918 -t 00:00:05 $HOME/get_info.py
srun -A cis220051 -p shared -n 1 -c 2 --mem-per-cpu=1918 -t 00:00:05 $HOME/get_info.py
srun -A cis220051 -p shared -n 2 -c 1 --mem-per-cpu=1918 -t 00:00:05 $HOME/get_info.py
srun -A cis220051 -p shared -n 1 -c 2 --mem-per-cpu=1918 -t 00:00:05 $HOME/get_info.py
srun -A cis220051 -p shared -n 1 -c 2 --mem-per-cpu=1919 -t 00:00:05 $HOME/get_info.py
Here, take careful note that when we increase our memory per cpu from 1918 to 1919, something important happens: we are granted double the CPUs we requested! This is because SLURM on Anvil is configured to give us at most 1918 MB of memory per CPU. If you request more memory per CPU, you will be granted additional CPUs instead. This is why ondemand.anvil.rcac.purdue.edu was recently configured so that you request only the number of cores you want. If you requested 1 core but 4 GB of memory, you would get 3 cores but only 4 GB of memory, when you could have received 1918*3 = 5754 MB of memory instead of just 4 GB.
srun -A cis220051 -p shared -n 3 -c 1 -t 00:00:05 $HOME/get_info.py > $SCRATCH/get_info.out
Reading the explanation from SLURM’s website is likely not enough to understand; running the configurations will help your understanding. If you have simple, parallel processes that don’t need any shared state, you can use a single srun per task, each with --mem-per-cpu (so memory availability is more predictable), -n 1, and -c 1, followed by & (just a reminder that & at the end of a bash command puts the process in the background). A minimal sketch of this pattern is shown below.
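For example, assuming the SLURM environment variables have been unset as described above, something like the following runs two independent copies of the script in the background and waits for both to finish (the one.out and two.out file names are just illustrative).

srun -A cis220051 -p shared -n 1 -c 1 --mem-per-cpu=1918 -t 00:00:05 $HOME/get_info.py > one.out &
srun -A cis220051 -p shared -n 1 -c 1 --mem-per-cpu=1918 -t 00:00:05 $HOME/get_info.py > two.out &
wait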
Finally, take note of the last configuration. What is the $SCRATCH environment variable?
For the answer to this question:
- Add a screenshot of the results of some (not all) of your runs of the get_info.py script via the srun commands.
- Write 1-2 sentences about any observations.
- Include what the $SCRATCH environment variable is.
- Code used to solve this problem.
- Output from running the code.
Question 3
The following is a solid template for a job script.
#!/bin/bash
#SBATCH --account=cis220051            (1)
#SBATCH --partition=shared             (2)
#SBATCH --job-name=serial_job_test     (3)
#SBATCH --mail-type=END,FAIL           # Mail events (NONE, BEGIN, END, FAIL, ALL) (4)
#SBATCH [email protected]  # Where to send mail (5)
#SBATCH --ntasks=3                     # Number of tasks (total) (6)
#SBATCH --cpus-per-task=1              # Number of CPUs per task (7)
#SBATCH -o /dev/null                   # Output to dev null (8)
#SBATCH -e /dev/null                   # Error to dev null (9)

srun -n 1 -c 1 --mem-per-cpu=1918 --exact -t 00:00:05 $HOME/get_info.py > first.out &   (10)
srun -n 1 -c 2 --mem-per-cpu=1918 --exact -t 00:00:05 $HOME/get_info.py > second.out &  (11)
srun -n 1 -c 3 --mem-per-cpu=1918 --exact -t 00:00:05 $HOME/get_info.py > third.out &   (12)

wait  (13)
(1) Sets the account to use for billing; in this case, our account is cis220051.
(2) Sets the partition to use; in this case, we are using the shared partition.
(3) Gives your job a unique name so you can identify it in the queue.
(4) Specifies when you want to receive emails about your job. We have it set to notify us when the job ends or fails.
(5) Specifies the email address to send the emails to.
(6) Specifies the number of tasks, in total, to run within this job.
(7) Specifies the number of cores to use for each task.
(8) Redirects the standard output of the job to /dev/null, a special file that discards everything written to it. You could change this to $HOME/output.txt and the output would be written to that file.
(9) Redirects the standard error of the job to /dev/null. You could change this to $HOME/error.txt and the error output would be written to that file.
(10) The first step of the job. This step contains a single task that uses a single core.
(11) The second step of the job. This step contains a single task that uses two cores.
(12) The third step of the job. This step contains a single task that uses three cores.
(13) Wait for all steps to complete. It is very important to include this.
Update the template to give your job a unique name and to set the email to your Purdue email address, then save it in a file called my_job.sh.
To submit the job, run the following.
sbatch my_job.sh
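If the submission succeeds, sbatch prints the id of the newly queued job. Once the job finishes, the first.out, second.out, and third.out files will appear in the directory you submitted from (a job's working directory defaults to the submission directory), and you can inspect them with, for example:

cat first.out second.out third.out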
Run the following experiments by tweaking my_job.sh, submitting the job using sbatch, and then checking the output of first.out, second.out, and third.out.
- Run the original job script and note the time each of the steps finished relative to the other steps.
- Change the job script --cpus-per-task from 1 to 2. What happens to the finish times?
- Remove --exact from each of the job steps. What happens to the finish times?
In addition, please feel free to experiment with the various values and see how they affect the finish times and/or the output of our get_info.py script. Can you determine how things work? Write 1-2 sentences about your observations. Please do take the time to iterate on this question over and over until you get a good feel for how things work.
- Code used to solve this problem.
- Output from running the code.
Question 4
Make your job script run for at least 20 seconds. You can do this by adding more steps, reducing cpus, or modifying the time.sleep call in the get_info.py script. Submit the job using sbatch. Immediately after submitting the job, use the built-in squeue command, in combination with grep, to find the job id of your job.
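For example, assuming the job name you chose in your script was my_unique_name (a placeholder; substitute the name you actually used), something like the following should show your job while it is pending or running. The first column of squeue's output is the job id.

squeue -u $USER | grep my_unique_name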
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.