STAT 39000: Project 10 — Spring 2022
Motivation: In this project, we will use SLURM for two purposes. The first is to get the chance to use a GPU on the cluster for some pytorch work, and the second is to use resampling to get point estimates. We can then use those point estimates to build a confidence interval and gain a better understanding of the variability of our model.
Context: This is the fourth of a series of 4 projects focused on using SLURM. This project is also an interlude to a series of projects on pytorch and JAX. We will use pytorch for our calculations.
Scope: SLURM, unix, bash, pytorch, Python
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/sim/train.csv
- /depot/datamine/data/sim/test.csv
- /depot/datamine/data/sim/train100k.csv
- /depot/datamine/data/sim/train10m.csv
Questions
Do not wait until the end of the week to do part 1 of this project. Part 1 is pretty straightforward, and basically just requires running code that you’ve already written a variety of times. GPU access is limited, which is why you should attempt to run through part 1 earlier rather than later.
This project is broken into two parts. In part 1, we will use Part 2 is a continuation from the previous project. In the previous project, you used |
Part 1
This question should be completed on Scholar, as Scholar has a GPU that you can use. Start by navigating to gateway.scholar.rcac.purdue.edu, and launching a terminal. In the terminal, run the following.
This will give you access to a Next, delete your Jupyter instance and re-launch a fresh Jupyter Lab instance, and confirm you have access to the GPU. To launch the Jupyter Lab instance, click on "Jupyter Notebook" under the GUI section (do not use the "Jupyter Lab" in the "Datamine" section) and use the following options:
To confirm you have access to the GPU you can use the following code. Note that you only really need one of these, but I am showing them all because they may be interesting to you.
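A minimal sketch of such a check, assuming PyTorch is available in the active kernel. The helper name `gpu_report` is illustrative and not part of the course material; the import is guarded so the check reports a missing install instead of crashing.

```python
import importlib.util


def gpu_report():
    """Summarize GPU visibility as a plain dict (hypothetical helper)."""
    report = {"torch_installed": False, "cuda_available": False, "devices": []}
    # Guard the import so the check never fails on a kernel without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return report
    import torch

    report["torch_installed"] = True
    # True only when a CUDA-capable GPU is visible to this process.
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # One device name per visible GPU.
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


report = gpu_report()
print(report)
```

If `cuda_available` comes back `False`, the session was most likely launched without a GPU allocation.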
For this question you will use pytorch with cpus (like in the previous project) to build a model for train.csv, train100k.csv, and train10m.csv. Use the %%time Jupyter magic to time the calculation for each dataset.
The following is the code from the previous project that you can use to get started.
For |
Modify your code to use a gpu instead of cpus, and measure how long it takes to train the model using train.csv, train100k.csv, and train10m.csv. What percentage faster are the GPU calculations for each dataset?
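"Percentage faster" can be read in more than one way. The sketch below assumes it means the reduction in wall time relative to the CPU run, and also reports the speedup ratio. The timings are made-up placeholders, not measurements.

```python
def percent_faster(cpu_seconds, gpu_seconds):
    """Percent reduction in wall time relative to the CPU run.

    Negative values mean the GPU run was slower.
    """
    return (cpu_seconds - gpu_seconds) / cpu_seconds * 100


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was (cpu time / gpu time)."""
    return cpu_seconds / gpu_seconds


# Placeholder timings in seconds (hypothetical, for illustration only).
cpu_time = 10.0
gpu_time = 2.5
print(f"{percent_faster(cpu_time, gpu_time):.1f}% faster")
print(f"{speedup(cpu_time, gpu_time):.1f}x speedup")
```

Whichever definition you use, state it alongside your numbers so the comparison is unambiguous.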
- Code used to solve this problem.
- Output from running the code.
- Time it took to build the model for train.csv and train100k.csv using cpus. In addition, the estimated time it would take to build the model for train10m.csv, again using cpus.
- Time it took to build the model for train.csv, train100k.csv, and train10m.csv using gpus.
- What percentage faster (or slower) the GPU version is vs the CPU version for each dataset.
Part 2
You can now save your notebook, and switch back to using Brown. Navigate to ondemand.brown.rcac.purdue.edu/ and launch a Jupyter Lab instance the way you normally would, and fill in your notebook with your solutions to part 2. Be careful not to overwrite your output from part 1. You will want to copy your notebook to Brown first. To do so from Scholar, open a terminal and copy the notebook as follows.
Or to copy from Brown.
We’ve provided you with a Python script called bootstrap_samples.py that accepts a single value, for example 10, and runs the code you wrote in the previous project 10 times. This code has a few modifications. One major, but simple, modification is that rather than using our training data directly to build the model, we instead sample the same number of values as are in our x_train tensor from our x_train tensor, with replacement. This means that if our x_train contained 1,2,3, we could produce any of the following samples: 1,2,3 or 1,1,2 or 1,2,2 or 3,3,3, etc. We called these resampled values xr_train. Then proceed as normal, building your model using xr_train instead of x_train.
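Sampling with replacement can be sketched with the standard library alone. This toy version uses a plain list in place of the x_train tensor, and the seeded generator is only there to make the illustration repeatable; neither is part of the provided script.

```python
import random


def resample_indices(n, rng):
    """Draw n indices from range(n) with replacement."""
    return rng.choices(range(n), k=n)


x_train = [1, 2, 3]
y_train = [10, 20, 30]

rng = random.Random(0)  # seeded only so this toy example is repeatable
idxs = resample_indices(len(x_train), rng)

# Apply the SAME indices to x and y so each (x, y) pair stays together.
xr_train = [x_train[i] for i in idxs]
yr_train = [y_train[i] for i in idxs]
print(idxs, xr_train, yr_train)
```

The resampled data has the same length as the original, but individual values may repeat or be missing entirely.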
In addition, at the end of the script, we used your model to get predictions for all of the values in x_test. Save these predictions to a parquet file, for example, 0cd68e5e-134d-4575-a31d-2060644f4caa.parquet, in a safe location, for example $CLUSTER_SCRATCH/p10output/. Each file will contain a single set of point estimates for our predictions.
#!/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python
import sys
import argparse
import pandas as pd
import random
import torch
from pathlib import Path
import uuid


class Regression(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.beta0 = torch.nn.Parameter(torch.tensor(5, requires_grad=True, dtype=torch.float))
        self.beta1 = torch.nn.Parameter(torch.tensor(4, requires_grad=True, dtype=torch.float))
        self.beta2 = torch.nn.Parameter(torch.tensor(3, requires_grad=True, dtype=torch.float))

    def forward(self, x):
        return self.beta0 + self.beta1*x + self.beta2*x**2


def get_point_estimates(x_train, y_train, x_test):
    model = Regression()
    learning_rate = .0003
    num_epochs = 10000
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    mseloss = torch.nn.MSELoss(reduction='mean')

    # resample data
    resampled_idxs = random.choices(range(75), k=75)
    xr_train = torch.tensor(x_train[resampled_idxs], requires_grad=True, dtype=torch.float).reshape(75)

    for _ in range(num_epochs):
        # set to training mode -- note this does not _train_ anything
        model.train()

        # calculate the predictions / forward pass
        y_predictions = model(xr_train)

        # calculate the MSE
        mse = mseloss(y_train[resampled_idxs], y_predictions)

        # calculate the partial derivatives / backwards step
        mse.backward()

        # update our parameters
        optimizer.step()

        # zero out the gradients
        optimizer.zero_grad()

    # get predictions
    predictions = pd.DataFrame(data={"predictions": model(x_test).detach().numpy()})
    return(predictions)


def main():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(help="possible commands", dest="command")
    bootstrap_parser = subparsers.add_parser("bootstrap", help="")
    bootstrap_parser.add_argument("n", type=int, help="number of set of point estimates for predictions to output")
    bootstrap_parser.add_argument("-o", "--output", help="directory to output file(s) to")

    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)

    args = parser.parse_args()

    if args.command == "bootstrap":
        dat = pd.read_csv("/depot/datamine/data/sim/train.csv")
        x_train = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)
        y_train = torch.tensor(dat['y'].to_numpy(), dtype=torch.float)

        dat = pd.read_csv("/depot/datamine/data/sim/test.csv")
        x_test = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)

        for _ in range(args.n):
            estimates = get_point_estimates(x_train, y_train, x_test)
            estimates.to_parquet(f"{Path(args.output) / str(uuid.uuid4())}.parquet")


if __name__ == "__main__":
    main()
Make sure your |
You can use the script like |
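The command-line interface defined in `main()` can be explored without running any training. The sketch below rebuilds only the argument parser from the script (the function name `build_parser` and the output directory are illustrative) and shows how a `bootstrap` invocation with a count and an `--output` directory is parsed.

```python
import argparse


def build_parser():
    """Rebuild the parser from bootstrap_samples.py (training code omitted)."""
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(help="possible commands", dest="command")
    bootstrap_parser = subparsers.add_parser("bootstrap", help="")
    bootstrap_parser.add_argument("n", type=int, help="number of sets of point estimates to output")
    bootstrap_parser.add_argument("-o", "--output", help="directory to output file(s) to")
    return parser


# Equivalent to running the script with: bootstrap 10 --output /tmp/p10output
# (/tmp/p10output is a placeholder directory for this illustration).
args = build_parser().parse_args(["bootstrap", "10", "--output", "/tmp/p10output"])
print(args.command, args.n, args.output)
```

Note that the script writes into the `--output` directory without creating it, so the directory presumably needs to exist before the job runs.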
Okay, there are a couple of other modifications in the script. Carefully read through the code, and give your best explanation of the changes in 2-3 sentences. Add another 1-2 sentences with your opinion of the changes.
Next, create your job script. Let’s call this p10_job.sh. You can use the following code as a starting point for your script (from a previous project). We would highly recommend using 10 cores to generate a total of 2000 sets of point estimates. The total runtime will vary but should be anywhere from 5 to 15 minutes.
#!/bin/bash
#SBATCH --account=datamine # Queue
#SBATCH --job-name=kevinsjob # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=[email protected] # Where to send mail
#SBATCH --time=00:30:00
#SBATCH --ntasks=10 # Number of tasks (total)
#SBATCH -o /dev/null # Output to dev null
#SBATCH -e /dev/null # Error to dev null
arr=(/depot/datamine/data/coco/unlabeled2017/*)
for((i=0; i < ${#arr[@]}; i+=12500))
do
    part=( "${arr[@]:i:12500}" )
    srun -A datamine --exclusive -n 1 --mem-per-cpu=200 module use /scratch/brown/kamstut/tdm/opt/modulefiles; module load libffi/3.4.2; $HOME/hash1.py hash --output $CLUSTER_SCRATCH/p4output/ ${part[*]} &
done
wait
You won’t need any of that array stuff anymore since we don’t have to keep track of the files we’re working with.
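With the array bookkeeping gone, the remaining arithmetic is how many sets of point estimates each task should produce. A small sketch, assuming the work is divided as evenly as possible across tasks (the helper name `split_evenly` is illustrative):

```python
def split_evenly(total, tasks):
    """Divide `total` units of work across `tasks`, as evenly as possible."""
    base, extra = divmod(total, tasks)
    # The first `extra` tasks each take one additional unit.
    return [base + 1 if i < extra else base for i in range(tasks)]


per_task = split_evenly(2000, 10)
print(per_task)
```

Each entry is the value one task would pass as `n` to the `bootstrap` sub-command.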
Make sure both
|
Make sure you keep the |
Submit your job using sbatch p10_job.sh.
Make sure to clear out the SLURM environment variables if you choose to run the
|
Great! Now you have a directory $CLUSTER_SCRATCH/p10output/ that contains 2000 sets of point estimates. Your job is now to process this data to create a graphic showing:
- The actual y_test values (in blue) as a set of points (using plt.scatter).
- The predictions as a line.
- The confidence intervals as a shaded region. (You can use plt.fill_between).
The 95% confidence interval is simply the 97.5th percentile of each prediction’s point estimates (upper) and the 2.5th percentile of each prediction’s point estimates (lower).
You will need to run the algorithm to get your predictions using the non-resampled training data. Otherwise you won’t have the predictions to plot!
You will notice that some of your point estimates will be NaN. Resampling can cause your model to no longer converge unless you change the learning rate. Remove the NaN values; you should be left with around 1500 sets of point estimates that you can use.
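The NaN filtering and percentile steps can be sketched in plain Python. This toy version represents each set of point estimates as a list with one prediction per test point, and uses linear interpolation between order statistics for the percentile; both choices are assumptions for illustration, and the data is made up.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def confidence_band(estimate_sets, lower_q=2.5, upper_q=97.5):
    """Drop sets containing NaN, then take percentiles per test point."""
    usable = [s for s in estimate_sets if not any(math.isnan(v) for v in s)]
    # zip(*usable) regroups the values so each tuple holds every estimate
    # for a single test point.
    columns = list(zip(*usable))
    lower = [percentile(c, lower_q) for c in columns]
    upper = [percentile(c, upper_q) for c in columns]
    return lower, upper, len(usable)


# Toy data: six sets of estimates for two test points; one set diverged to NaN.
sets = [
    [1.0, 10.0],
    [2.0, 20.0],
    [float("nan"), float("nan")],
    [3.0, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
]
lower, upper, kept = confidence_band(sets)
print(kept, lower, upper)
```

The `lower` and `upper` lists line up with the test points, which is the shape a shaded-region plot expects for its two boundary curves.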
You can loop through the output files by doing something like:
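One way to do this, sketched with the standard library only. The helper name `list_estimate_files` is illustrative, and a temporary directory with empty placeholder files stands in for $CLUSTER_SCRATCH/p10output/ so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def list_estimate_files(output_dir):
    """Return every .parquet file in output_dir, sorted for a stable order."""
    return sorted(Path(output_dir).glob("*.parquet"))


# Stand-in for the real output directory: two placeholder outputs plus
# one unrelated file that should be ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.parquet", "b.parquet", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_estimate_files(tmp)]

for name in found:
    print(name)
```

In the real directory, each path could then be read with `pandas.read_parquet` and its "predictions" column collected into one table of point estimates.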
- Code used to solve this problem.
- Output from running the code.
- 2-3 sentences explaining the "other" changes in the provided script.
- 1-2 sentences describing your opinion of the changes.
- p10_job.sh.
- Your resulting graphic. Make sure it renders properly when viewed in Gradescope.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.