# STAT 39000: Project 10 — Spring 2022

Motivation: In this project, we will utilize SLURM for a couple of purposes. The first is to have the chance to utilize a GPU on the cluster for some `pytorch` work, and the second is to use resampling to get point estimates. We can then use those point estimates to make a confidence interval and gain a better understand of the variability of our model.

Context: This is the fourth of a series of 4 projects focused on using SLURM. This project is also an interlude to a series of projects on `pytorch` and `JAX`. We will use `pytorch` for our calculations.

Scope: SLURM, unix, bash, `pytorch`, Python

Learning Objectives
• Demystify a "tensor".

• Utilize the pytorch API to create, modify, and operate on tensors.

• Use simple, simulated data to create a multiple linear regression model using closed form solutions.

• Use pytorch to calculate a popular uncertainty quantification, the 95% confidence interval.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

## Dataset(s)

The following questions will use the following dataset(s):

• `/depot/datamine/data/sim/train.csv`

• `/depot/datamine/data/sim/test.csv`

• `/depot/datamine/data/sim/train100k.csv`

• `/depot/datamine/data/sim/train10m.csv`

## Questions

 You do not want to wait until the end of the week to do part 1 of this project. Part 1 is pretty straightforward, and basically just requires running code that you’ve already written a variety of times. There is limited GPU access, so this is the constraint and reason you should attempt to run through part 1 earlier, rather than later.
 This project is broken into two parts. In part 1, we will use `pytorch` and build our model using cpus and gpus, and draw comparisons. Models will be built using datasets of differing sizes. The goal of part 1 is to see how a GPU can make a large impact on training time. Note that these datasets are synthetic data and don’t really represent a realistic scenario, but they do work well to illustrate how powerful GPUs are. Part 2 is a continuation from the previous project. In the previous project, you used `pytorch` to perform a gradient descent and build a model for our small, simulated dataset. While it is certainly possible to use other methods to get some form of uncertainty quantification (in our case, we are specifically looking at a 95% confidence interval for our predictions), it is not always easy to do so, or possible. One of the most common methods to calculate these things, in these difficult situations is bootstrapping. In fact, Dr. Andrew Gelman, a world-class statistician, had this as his second item in his list of the top 50 influential statistical ideas in the past 50 years. We will use SLURM to perform this computationally intensive, but relatively simple method.

### Part 1

 This question should be completed on Scholar, as Scholar has a GPU that you can use. Start by navigating to gateway.scholar.rcac.purdue.edu, and launching a terminal. In the terminal, run the following. ``````mkdir -p ~/.local/share/jupyter/kernels/tdm-f2021-s2022 cp /class/datamine/apps/jupyter/kernels/f2021-s2022/kernel.json ~/.local/share/jupyter/kernels/tdm-f2021-s2022/kernel.json`````` This will give you access to a `f2021-s2022` kernel that is different than our normal kernel on Brown, but has the necessary packages to run the code we need to run. Next, delete your Jupyter instance and re-launch a fresh Jupyter Lab instance, and confirm you have access to the GPU. To launch the Jupyter Lab instance, click on "Jupyter Notebook" under the GUI section (do not use the "Jupyter Lab" in the "Datamine" section) and use the following options: Queue: gpu (Max 4.0 hours) Number of hours: 0.5 Use Jupyter Lab instead of Jupyter Notebook (checked) To confirm you have access to the GPU you can use the following code. Note that you only really need one of these, but I am showing them all because they may be interesting to you. ``````import torch # see if cuda is available torch.cuda.is_available() # see the current device torch.cuda.current_device() # see the number of devices available torch.cuda.device_count() # get the name of a device torch.cuda.get_device_name(0)``````

For this question you will use `pytorch` with cpus (like in the previous project) to build a model for `train.csv`, `train100k.csv`, and `train10m.csv`. Use the `%%time` Jupyter magic to time the calculation for each dataset.

 The following is the code from the previous project that you can use to get started. ``````import torch import pandas as pd dat = pd.read_csv("/depot/datamine/data/sim/train.csv") x_train = torch.tensor(dat['x'].to_numpy()) y_train = torch.tensor(dat['y'].to_numpy()) beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float) beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float) beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float) learning_rate = .0003 num_epochs = 10000 optimizer = torch.optim.SGD([beta0, beta1, beta2], lr=learning_rate) mseloss = torch.nn.MSELoss(reduction='mean') for idx in range(num_epochs): # calculate the predictions / forward pass y_predictions = beta0 + beta1*x_train + beta2*x_train**2 # calculate the MSE mse = mseloss(y_train, y_predictions) if idx % 100 == 0: print(f"MSE: {mse}") # calculate the partial derivatives / backwards step mse.backward() # update our parameters optimizer.step() # zero out the gradients optimizer.zero_grad() print(f"beta0: {beta0}") print(f"beta1: {beta1}") print(f"beta2: {beta2}")``````
 For `train10m.csv`, instead of running the entire 10k epochs, just perform 100 epochs, and estimate the amount of time it would take to complete 10k epochs. We try not to be that mean, although, if you do want to wait and see, that is perfectly fine.

Modify your code to use a gpu instead of cpus, and time the time it takes to train the model using `train.csv`, `train100k.csv`, and `train10m.csv`. What percentage faster is the GPU calculations for each dataset?

Items to submit
• Code used to solve this problem.

• Output from running the code.

• Time it took to build the model for the `train.csv` and `train100k.csv` using cpus. In addition, the estimated time it would take to build the model for `train10m.csv`, again, using cpus.

• Time it took to build the model for the `train.csv`, `train100k.csv`, and `train10m.csv`, using gpus.

• What percentage faster (or slower) the GPU version is vs the CPU version for each dataset.

### Part 2

 You can now save your notebook, and switch back to using Brown. Navigate to ondemand.brown.rcac.purdue.edu/ and launch a Jupyter Lab instance the way you normally would, and fill in your notebook with you solutions to part 2. Be careful not to overwrite your output from part 1. You will want to copy your notebook to Brown, first. To do so from Scholar, open a terminal and copy the notebook as follows. ``scp /home/purduealias/my_notebook.ipynb brown.rcac.purdue.edu:/home/purduealias/`` Or to copy from Brown. ``scp scholar.rcac.purdue.edu:/home/purduealias/my_notebook.ipynb /home/purduealias/``

We’ve provided you with a Python script called `bootstrap_samples.py` that accepts a single value, for example 10, and runs the code you wrote in the previous project 10 times. This code should have a few modifications. One major, but simple modification is that rather than using our training data to build the model, instead, sample the same number of values in our `x_train` tensor from our `x_train` tensor, with replacement. What this means is if our `x_train` contained 1,2,3, we could produce any of the following samples 1,2,3 or 1,1,2 or 1,2,2 or 3,3,3 etc. We called these resampled values `xr_train`. Then proceed as normal, building your model using `xr_train` instead of `x_train`.

In addition at the end of the script, we used your model to get predictions for all of the values in `x_test`. Save these predictions to a parquet file, for example, `0cd68e5e-134d-4575-a31d-2060644f4caa.parquet`, in a safe location, for example `\$CLUSTER_SCRATCH/p10output/`. Each file will each contain a single set of point estimates for our predictions.

bootstrap_samples.py
``````#!/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python

import sys
import argparse
import pandas as pd
import random
import torch
from pathlib import Path
import uuid

class Regression(torch.nn.Module):
def __init__(self):
super().__init__()

def forward(self, x):
return self.beta0 + self.beta1*x + self.beta2*x**2

def get_point_estimates(x_train, y_train, x_test):

model = Regression()
learning_rate = .0003

num_epochs = 10000
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mseloss = torch.nn.MSELoss(reduction='mean')

# resample data
resampled_idxs = random.choices(range(75), k=75)

for _ in range(num_epochs):
# set to training mode -- note this does not _train_ anything
model.train()

# calculate the predictions / forward pass
y_predictions = model(xr_train)

# calculate the MSE
mse = mseloss(y_train[resampled_idxs], y_predictions)

# calculate the partial derivatives / backwards step
mse.backward()

# update our parameters
optimizer.step()

# get predictions
predictions = pd.DataFrame(data={"predictions": model(x_test).detach().numpy()})

return(predictions)

def main():
parser = argparse.ArgumentParser()
bootstrap_parser.add_argument("n", type=int, help="number of set of point estimates for predictions to output")
bootstrap_parser.add_argument("-o", "--output", help="directory to output file(s) to")

if len(sys.argv) == 1:
parser.print_help()
sys.exit(1)

args = parser.parse_args()

if args.command == "bootstrap":

x_train = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)
y_train = torch.tensor(dat['y'].to_numpy(), dtype=torch.float)

x_test = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)

for _ in range(args.n):
estimates = get_point_estimates(x_train, y_train, x_test)
estimates.to_parquet(f"{Path(args.output) / str(uuid.uuid4())}.parquet")

if __name__ == "__main__":
main()``````
 Make sure your `p10output` directory exists!
 You can use the script like `./my_script.py bootstrap 10 --output /scratch/brown/purduealias/p10output/` to create 10 sets of point estimates. Make sure the `p10output` directory exists first!

Okay, there are a couple of other different modifications in the script. Carefully read through the code, and give you best explaination of the changes in 2-3 sentences. Add another 1-2 sentences with your opinion of the changes.

Next, create your job script. Let’s call this `p10_job.sh`. You can use the following code as a starting point for your script (from a previous project). We would highly recommend using 10 cores to generate a total of 2000 sets of point estimates. The total runtime will vary but should be anywhere from 5 to 15 minutes.

p10_job.sh
``````#!/bin/bash
#SBATCH --account=datamine              # Queue
#SBATCH --job-name=kevinsjob          # Job name
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH [email protected]     # Where to send mail
#SBATCH --time=00:30:00
#SBATCH -o /dev/null                  # Output to dev null
#SBATCH -e /dev/null                  # Error to dev null

arr=(/depot/datamine/data/coco/unlabeled2017/*)

for((i=0; i < \${#arr[@]}; i+=12500))
do
part=( "\${arr[@]:i:12500}" )
srun -A datamine --exclusive -n 1 --mem-per-cpu=200 module use /scratch/brown/kamstut/tdm/opt/modulefiles; module load libffi/3.4.2; \$HOME/hash1.py hash --output \$CLUSTER_SCRATCH/p4output/ \${part[*]} &
done

wait``````
 You won’t need any of that array stuff anymore since we don’t have to keep track of the files we’re working with.
 Make sure both `bootstrap_samples.py` and `p10_job.sh` have execute permissions. ``````chmod +x /path/to/bootstrap_samples.py chmod +x /path/to/p10_job.sh``````
 Make sure you keep the `module use` and `module load` lines in your job script — libffi is required for your code to run.

Submit your job using `sbatch p10_job.sh`.

 Make sure to clear out the SLURM environment variables if you choose to run the `sbatch` command from within a bash cell in your notebook. ``for i in \$(env | awk -F= '/SLURM/ {print \$1}'); do unset \$i; done;``

Great! Now you have a directory `\$CLUSTER_SCRATCH/p10output/` that contains 2000 sets of point estimates. Your job is now to process this data to create a graphic showign:

1. The actual `y_test` values (in blue) as a set of points (using `plt.scatter`).

2. The predictions as a line.

3. The confidence intervals as a shaded region. (You can use `plt.fill_between`).

The 95% confidence interval is simply the 97.5th percentile of each prediction’s point estimates (upper) and the 2.5th percentile of each prediction’s point estimates (lower).

 You will need to run the algorithm to get your predictions using the non-resampled training data — otherwise you won’t have the predictions to plot!
 You will notice that some of your point estimates will be NaN. Resampling can cause your model to no longer converge unless you change the learning rate. Remove the NaN values, you should be left with around 1500 sets of point estimates that you can use.
 You can loop through the output files by doing something like: ``````from pathlib import Path for file in Path("/scratch/brown/purduealias/p10output/").glob("*.parquet"): pass``````
Items to submit
• Code used to solve this problem.

• Output from running the code.

• 2-3 sentences explaining the "other" changes in the provided script.

• 1-2 sentences describing your opinion of the changes.

• `p10_job.sh`.

• Your resulting graphic — make sure it renders properly when viewed in Gradescope.

 Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project.