STAT 39000: Project 10 — Spring 2022

Motivation: In this project, we will utilize SLURM for a couple of purposes. The first is to have the chance to utilize a GPU on the cluster for some pytorch work, and the second is to use resampling to get point estimates. We can then use those point estimates to make a confidence interval and gain a better understand of the variability of our model.

Context: This is the fourth of a series of 4 projects focused on using SLURM. This project is also an interlude to a series of projects on pytorch and JAX. We will use pytorch for our calculations.

Scope: SLURM, unix, bash, pytorch, Python

Learning Objectives

Demystify a "tensor".
Utilize the pytorch API to create, modify, and operate on tensors.
Use simple, simulated data to create a multiple linear regression model using closed form solutions.
Use pytorch to calculate a popular uncertainty quantification, the 95% confidence interval.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/depot/datamine/data/sim/train.csv
/depot/datamine/data/sim/test.csv
/depot/datamine/data/sim/train100k.csv
/depot/datamine/data/sim/train10m.csv

Questions

You do not want to wait until the end of the week to do part 1 of this project. Part 1 is pretty straightforward, and basically just requires running code that you’ve already written a variety of times. There is limited GPU access, so this is the constraint and reason you should attempt to run through part 1 earlier, rather than later.

This project is broken into two parts. In part 1, we will use pytorch and build our model using cpus and gpus, and draw comparisons. Models will be built using datasets of differing sizes. The goal of part 1 is to see how a GPU can make a large impact on training time. Note that these datasets are synthetic data and don’t really represent a realistic scenario, but they do work well to illustrate how powerful GPUs are.

Part 2 is a continuation from the previous project. In the previous project, you used pytorch to perform a gradient descent and build a model for our small, simulated dataset. While it is certainly possible to use other methods to get some form of uncertainty quantification (in our case, we are specifically looking at a 95% confidence interval for our predictions), it is not always easy to do so, or possible. One of the most common methods to calculate these things, in these difficult situations is bootstrapping. In fact, Dr. Andrew Gelman, a world-class statistician, had this as his second item in his list of the top 50 influential statistical ideas in the past 50 years. We will use SLURM to perform this computationally intensive, but relatively simple method.

Part 1

This question should be completed on Scholar, as Scholar has a GPU that you can use.

Start by navigating to gateway.scholar.rcac.purdue.edu, and launching a terminal. In the terminal, run the following.

mkdir -p ~/.local/share/jupyter/kernels/tdm-f2021-s2022
cp /class/datamine/apps/jupyter/kernels/f2021-s2022/kernel.json ~/.local/share/jupyter/kernels/tdm-f2021-s2022/kernel.json

This will give you access to a f2021-s2022 kernel that is different than our normal kernel on Brown, but has the necessary packages to run the code we need to run.

Next, delete your Jupyter instance and re-launch a fresh Jupyter Lab instance, and confirm you have access to the GPU.

To launch the Jupyter Lab instance, click on "Jupyter Notebook" under the GUI section (do not use the "Jupyter Lab" in the "Datamine" section) and use the following options:

Queue: gpu (Max 4.0 hours)
Number of hours: 0.5
Use Jupyter Lab instead of Jupyter Notebook (checked)

To confirm you have access to the GPU you can use the following code. Note that you only really need one of these, but I am showing them all because they may be interesting to you.

import torch

# see if cuda is available
torch.cuda.is_available()

# see the current device
torch.cuda.current_device()

# see the number of devices available
torch.cuda.device_count()

# get the name of a device
torch.cuda.get_device_name(0)

For this question you will use pytorch with cpus (like in the previous project) to build a model for train.csv, train100k.csv, and train10m.csv. Use the %%time Jupyter magic to time the calculation for each dataset.

The following is the code from the previous project that you can use to get started.

import torch
import pandas as pd

dat = pd.read_csv("/depot/datamine/data/sim/train.csv")
x_train = torch.tensor(dat['x'].to_numpy())
y_train = torch.tensor(dat['y'].to_numpy())

beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float)
beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float)
beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float)
learning_rate = .0003

num_epochs = 10000
optimizer = torch.optim.SGD([beta0, beta1, beta2], lr=learning_rate)
mseloss = torch.nn.MSELoss(reduction='mean')

for idx in range(num_epochs):
    # calculate the predictions / forward pass
    y_predictions = beta0 + beta1*x_train + beta2*x_train**2

    # calculate the MSE
    mse = mseloss(y_train, y_predictions)

    if idx % 100 == 0:
        print(f"MSE: {mse}")

    # calculate the partial derivatives / backwards step
    mse.backward()

    # update our parameters
    optimizer.step()

    # zero out the gradients
    optimizer.zero_grad()

print(f"beta0: {beta0}")
print(f"beta1: {beta1}")
print(f"beta2: {beta2}")

For train10m.csv, instead of running the entire 10k epochs, just perform 100 epochs, and estimate the amount of time it would take to complete 10k epochs. We try not to be that mean, although, if you do want to wait and see, that is perfectly fine.

Modify your code to use a gpu instead of cpus, and time the time it takes to train the model using train.csv, train100k.csv, and train10m.csv. What percentage faster is the GPU calculations for each dataset?

Items to submit

Code used to solve this problem.
Output from running the code.
Time it took to build the model for the train.csv and train100k.csv using cpus. In addition, the estimated time it would take to build the model for train10m.csv, again, using cpus.
Time it took to build the model for the train.csv, train100k.csv, and train10m.csv, using gpus.
What percentage faster (or slower) the GPU version is vs the CPU version for each dataset.

Part 2

You can now save your notebook, and switch back to using Brown. Navigate to ondemand.brown.rcac.purdue.edu/ and launch a Jupyter Lab instance the way you normally would, and fill in your notebook with you solutions to part 2. Be careful not to overwrite your output from part 1.

You will want to copy your notebook to Brown, first. To do so from Scholar, open a terminal and copy the notebook as follows.

scp /home/purduealias/my_notebook.ipynb brown.rcac.purdue.edu:/home/purduealias/

Or to copy from Brown.

scp scholar.rcac.purdue.edu:/home/purduealias/my_notebook.ipynb /home/purduealias/

We’ve provided you with a Python script called bootstrap_samples.py that accepts a single value, for example 10, and runs the code you wrote in the previous project 10 times. This code should have a few modifications. One major, but simple modification is that rather than using our training data to build the model, instead, sample the same number of values in our x_train tensor from our x_train tensor, with replacement. What this means is if our x_train contained 1,2,3, we could produce any of the following samples 1,2,3 or 1,1,2 or 1,2,2 or 3,3,3 etc. We called these resampled values xr_train. Then proceed as normal, building your model using xr_train instead of x_train.

In addition at the end of the script, we used your model to get predictions for all of the values in x_test. Save these predictions to a parquet file, for example, 0cd68e5e-134d-4575-a31d-2060644f4caa.parquet, in a safe location, for example $CLUSTER_SCRATCH/p10output/. Each file will each contain a single set of point estimates for our predictions.

bootstrap_samples.py

#!/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python

import sys
import argparse
import pandas as pd
import random
import torch
from pathlib import Path
import uuid


class Regression(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.beta0 = torch.nn.Parameter(torch.tensor(5, requires_grad=True, dtype=torch.float))
        self.beta1 = torch.nn.Parameter(torch.tensor(4, requires_grad=True, dtype=torch.float))
        self.beta2 = torch.nn.Parameter(torch.tensor(3, requires_grad=True, dtype=torch.float))

    def forward(self, x):
        return self.beta0 + self.beta1*x + self.beta2*x**2


def get_point_estimates(x_train, y_train, x_test):

    model = Regression()
    learning_rate = .0003

    num_epochs = 10000
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    mseloss = torch.nn.MSELoss(reduction='mean')

    # resample data
    resampled_idxs = random.choices(range(75), k=75)
    xr_train = torch.tensor(x_train[resampled_idxs], requires_grad=True, dtype=torch.float).reshape(75)

    for _ in range(num_epochs):
        # set to training mode -- note this does not _train_ anything
        model.train()

        # calculate the predictions / forward pass
        y_predictions = model(xr_train)

        # calculate the MSE
        mse = mseloss(y_train[resampled_idxs], y_predictions)

        # calculate the partial derivatives / backwards step
        mse.backward()

        # update our parameters
        optimizer.step()

        # zero out the gradients
        optimizer.zero_grad()

    # get predictions
    predictions = pd.DataFrame(data={"predictions": model(x_test).detach().numpy()})

    return(predictions)


def main():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(help="possible commands", dest="command")
    bootstrap_parser = subparsers.add_parser("bootstrap", help="")
    bootstrap_parser.add_argument("n", type=int, help="number of set of point estimates for predictions to output")
    bootstrap_parser.add_argument("-o", "--output", help="directory to output file(s) to")

    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)

    args = parser.parse_args()

    if args.command == "bootstrap":

        dat = pd.read_csv("/depot/datamine/data/sim/train.csv")
        x_train = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)
        y_train = torch.tensor(dat['y'].to_numpy(), dtype=torch.float)

        dat = pd.read_csv("/depot/datamine/data/sim/test.csv")
        x_test = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)

        for _ in range(args.n):
            estimates = get_point_estimates(x_train, y_train, x_test)
            estimates.to_parquet(f"{Path(args.output) / str(uuid.uuid4())}.parquet")

if __name__ == "__main__":
	main()

Make sure your p10output directory exists!

You can use the script like ./my_script.py bootstrap 10 --output /scratch/brown/purduealias/p10output/ to create 10 sets of point estimates. Make sure the p10output directory exists first!

Okay, there are a couple of other different modifications in the script. Carefully read through the code, and give you best explaination of the changes in 2-3 sentences. Add another 1-2 sentences with your opinion of the changes.

Next, create your job script. Let’s call this p10_job.sh. You can use the following code as a starting point for your script (from a previous project). We would highly recommend using 10 cores to generate a total of 2000 sets of point estimates. The total runtime will vary but should be anywhere from 5 to 15 minutes.

p10_job.sh

#!/bin/bash
#SBATCH --account=datamine              # Queue
#SBATCH --job-name=kevinsjob          # Job name
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH [email protected]     # Where to send mail
#SBATCH --time=00:30:00
#SBATCH --ntasks=10                   # Number of tasks (total)
#SBATCH -o /dev/null                  # Output to dev null
#SBATCH -e /dev/null                  # Error to dev null

arr=(/depot/datamine/data/coco/unlabeled2017/*)

for((i=0; i < ${#arr[@]}; i+=12500))
do
    part=( "${arr[@]:i:12500}" )
    srun -A datamine --exclusive -n 1 --mem-per-cpu=200 module use /scratch/brown/kamstut/tdm/opt/modulefiles; module load libffi/3.4.2; $HOME/hash1.py hash --output $CLUSTER_SCRATCH/p4output/ ${part[*]} &
done

wait

You won’t need any of that array stuff anymore since we don’t have to keep track of the files we’re working with.

Make sure both bootstrap_samples.py and p10_job.sh have execute permissions.

chmod +x /path/to/bootstrap_samples.py
chmod +x /path/to/p10_job.sh

Make sure you keep the module use and module load lines in your job script — libffi is required for your code to run.

Submit your job using sbatch p10_job.sh.

Make sure to clear out the SLURM environment variables if you choose to run the sbatch command from within a bash cell in your notebook.

for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done;

Great! Now you have a directory $CLUSTER_SCRATCH/p10output/ that contains 2000 sets of point estimates. Your job is now to process this data to create a graphic showign:

The actual y_test values (in blue) as a set of points (using plt.scatter).
The predictions as a line.
The confidence intervals as a shaded region. (You can use plt.fill_between).

The 95% confidence interval is simply the 97.5th percentile of each prediction’s point estimates (upper) and the 2.5th percentile of each prediction’s point estimates (lower).

You will need to run the algorithm to get your predictions using the non-resampled training data — otherwise you won’t have the predictions to plot!

You will notice that some of your point estimates will be NaN. Resampling can cause your model to no longer converge unless you change the learning rate. Remove the NaN values, you should be left with around 1500 sets of point estimates that you can use.

You can loop through the output files by doing something like:

from pathlib import Path

for file in Path("/scratch/brown/purduealias/p10output/").glob("*.parquet"):
    pass

Items to submit

Code used to solve this problem.
Output from running the code.
2-3 sentences explaining the "other" changes in the provided script.
1-2 sentences describing your opinion of the changes.
p10_job.sh.
Your resulting graphic — make sure it renders properly when viewed in Gradescope.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.