TDM 30200: Project 7 — 2023

Motivation: In this project we will gradually become familiar with SLURM, the job scheduler installed on Anvil.

Context: This is the third in a series of 4 projects focused on parallel computing using SLURM and Python.

Scope: SLURM, UNIX, Python

Learning Objectives
  • Use basic SLURM commands to submit jobs, check job status, kill jobs, and more.

  • Understand the differences between the srun and sbatch commands (see the quick reference after this list).

  • Predict the resources (CPUs and memory) an srun job will use based on its arguments and context.

  • Write and use a job script to solve a problem faster than you would be able to without a high performance computing (HPC) system.
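
As a quick reference for these objectives, the sketch below shows the basic SLURM commands used throughout this project. The script name `my_job.sh`, the job ID `12345678`, and the resource values are placeholders, not values specific to Anvil or to your allocation.

```bash
# Submit a batch job script; sbatch queues it and returns immediately,
# printing the assigned job ID.
sbatch my_job.sh

# Check the status of your queued and running jobs.
squeue -u $USER

# Kill a job by its job ID.
scancel 12345678

# srun, by contrast, launches the tasks and blocks until they finish.
# Here: 4 tasks, 1 CPU per task, 1 GB of memory per CPU.
srun --ntasks=4 --cpus-per-task=1 --mem-per-cpu=1G hostname
```

The key difference to internalize: sbatch hands a script to the scheduler and returns immediately, while srun runs the step itself and waits for it to complete.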

Make sure to read about, and use, the template found here, and the important information about project submissions here.

Dataset(s)

You are free to use any dataset you would like for this project, even if the data is created or collected by you.

Questions

Interested in being a TA? Please apply: purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE

Question 1

You’ve taken in a lot about SLURM in a short period of time. For this last question, we are going to let you work at your own pace.

Think of a problem that you want to solve that may benefit from parallel computing and SLURM. It could be anything: processing many images in some way (counting pixels, applying filters, detecting certain objects, etc.), running many simulations and plotting the results, bootstrapping a model to get point estimates for uncertainty quantification, calculating summary statistics for a large dataset, trying to brute-force a 6-character password, calculating the Hamming distance between all of the 123k images in the coco/hashed02 dataset, etc.

Solve your problem, or make progress towards solving it. The following are the loose requirements; as long as you meet them, you will receive full credit. The idea is to get some experience, and have some fun.

Requirements:

  1. You must have an introductory paragraph clearly explaining your problem, and how you think using a cluster and SLURM can help you solve it.

  2. You must submit any and all code you wrote. It can be in any language you want; just put it in a code block in a Markdown cell.

  3. You must write and submit a job script to be submitted using sbatch on SLURM (a sketch of such a script appears after this list). This could be copied and pasted into a code block in a Markdown cell.

  4. You must measure the time it takes to run your code on a sample of your data, and make a prediction for how long the full run will take using SLURM, based on the resources you requested in your job script (a worked example of this arithmetic appears below). Write 1-2 sentences explaining how long the sample took and the math you used to predict how long you think SLURM will take.

  5. You must write 1-2 sentences explaining how close or far away your prediction was from the actual run time.
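
To make requirement 3 concrete, here is a minimal sketch of a job script, not a definitive recipe. Every value in it is a placeholder: the job name, the account name `myallocation`, the resource requests, and the worker script `process_chunk.py` are hypothetical, and should be replaced with values appropriate for your problem and your Anvil allocation.

```bash
#!/bin/bash
#SBATCH --job-name=my-parallel-job   # placeholder job name
#SBATCH -A myallocation              # placeholder account/allocation name
#SBATCH --output=slurm-%j.out        # %j expands to the job ID
#SBATCH --ntasks=8                   # number of parallel tasks
#SBATCH --cpus-per-task=1            # CPUs per task
#SBATCH --mem-per-cpu=1G             # memory per CPU
#SBATCH --time=00:30:00              # wall-clock limit (hh:mm:ss)

# Launch one copy of the (hypothetical) worker script per task.
srun python process_chunk.py
```

Submit it with `sbatch my_job.sh`, then monitor it with `squeue -u $USER`.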

The above requirements should all be kept in a single Jupyter notebook. The notebook should take advantage of Markdown formatting and narrate a clear story, with a clear objective and explanations of any struggles or issues you ran into along the way.
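
As an illustration of requirements 4 and 5, suppose (these numbers are invented) a serial run over a 1,000-item sample takes 60 seconds and your full dataset has 120,000 items. Scaling linearly, the full serial run would take about 120 × 60 = 7,200 seconds, so with the 8 tasks requested in the sketch above, an optimistic prediction is 7,200 / 8 = 900 seconds, plus scheduler and startup overhead. A hedged sketch of the measurement step:

```bash
# Time a serial run over a small sample. process_chunk.py and its
# --sample flag are hypothetical stand-ins for your own code.
time python process_chunk.py --sample 1000

# Back-of-the-envelope prediction, assuming linear scaling:
#   full serial runtime ≈ (total items / sample items) × sample time
#   parallel runtime    ≈ full serial runtime / number of tasks
```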

To avoid hammering our resources too much, please don’t request more than 32 cores, and if you use more than 10 cores, please make sure your jobs don’t take an excessive amount of time.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.