TDM 30200: Project 10 — 2023

Motivation: In this project, we will utilize SLURM for a couple of purposes. The first is to have the chance to utilize a GPU on the cluster for some pytorch work, and the second is to use resampling to get point estimates. We can then use those point estimates to make a confidence interval and gain a better understand of the variability of our model.

Context: This is the fourth of a series of 4 projects focused on using SLURM. This project is also an interlude to a series of projects on pytorch and JAX. We will use pytorch for our calculations.

Scope: SLURM, unix, bash, pytorch, Python

Learning Objectives

Demystify a "tensor".
Utilize the pytorch API to create, modify, and operate on tensors.
Use simple, simulated data to create a multiple linear regression model using closed form solutions.
Use pytorch to calculate a popular uncertainty quantification, the 95% confidence interval.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/anvil/projects/tdm/data/sim/train.csv
/anvil/projects/tdm/data/sim/test.csv
/anvil/projects/tdm/data/sim/train100k.csv
/anvil/projects/tdm/data/sim/train10m.csv

Questions

You do not want to wait until the end of the week to do part 1 of this project. Part 1 is pretty straightforward, and basically just requires running code that you’ve already written a variety of times. There is limited GPU access, so this is the constraint and reason you should attempt to run through part 1 earlier, rather than later.

This project is broken into two parts. In part 1, we will use pytorch and build our model using cpus and gpus, and draw comparisons. Models will be built using datasets of differing sizes. The goal of part 1 is to see how a GPU can make a large impact on training time. Note that these datasets are synthetic data and don’t really represent a realistic scenario, but they do work well to illustrate how powerful GPUs are.

Part 2 is a continuation from the previous project. In the previous project, you used pytorch to perform a gradient descent and build a model for our small, simulated dataset. While it is certainly possible to use other methods to get some form of uncertainty quantification (in our case, we are specifically looking at a 95% confidence interval for our predictions), it is not always easy to do so, or possible. One of the most common methods to calculate these things, in these difficult situations is bootstrapping. In fact, Dr. Andrew Gelman, a world-class statistician, had this as his second item in his list of the top 50 influential statistical ideas in the past 50 years. We will use SLURM to perform this computationally intensive, but relatively simple method.

Part 1

You should all have been granted access to our GPU allocation. If you try to use the GPU allocation and run into issues, please create a post in Piazza and make sure you include your Anvil username. To find your Anvil username, you can run the following in a terminal inside your Jupyter Notebook:

echo $USER

This question should be completed our GPU allocation, since our regular allocation does not have access to GPUs.

To launch the Jupyter Lab instance using our GPU allocation, use the typical Jupyter Notebook option at ondemand.anvil.rcac.purdue.edu. However, instead of using the default options, use the following:

Allocation: cis220051-gpu
Queue: gpu
Time in Hours: 1
Cores: 4
Use Jupyter Lab instead of Jupyter Notebook (checked)

To confirm you have access to the GPU you can use the following code. Note that you only really need one of these, but I am showing them all because they may be interesting to you.

import torch

# see if cuda is available
torch.cuda.is_available()

# see the current device
torch.cuda.current_device()

# see the number of devices available
torch.cuda.device_count()

# get the name of a device
torch.cuda.get_device_name(0)

For this question you will use pytorch with cpus (like in the previous project) to build a model for train.csv, train100k.csv, and train10m.csv. Use the %%time Jupyter magic to time the calculation for each dataset.

The following is the code from the previous project that you can use to get started.

import torch
import pandas as pd

dat = pd.read_csv("/anvil/projects/tdm/data/sim/train.csv")
x_train = torch.tensor(dat['x'].to_numpy())
y_train = torch.tensor(dat['y'].to_numpy())

beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float)
beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float)
beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float)
learning_rate = .0003

num_epochs = 10000
optimizer = torch.optim.SGD([beta0, beta1, beta2], lr=learning_rate)
mseloss = torch.nn.MSELoss(reduction='mean')

for idx in range(num_epochs):
    # calculate the predictions / forward pass
    y_predictions = beta0 + beta1*x_train + beta2*x_train**2

    # calculate the MSE
    mse = mseloss(y_train, y_predictions)

    if idx % 100 == 0:
        print(f"MSE: {mse}")

    # calculate the partial derivatives / backwards step
    mse.backward()

    # update our parameters
    optimizer.step()

    # zero out the gradients
    optimizer.zero_grad()

print(f"beta0: {beta0}")
print(f"beta1: {beta1}")
print(f"beta2: {beta2}")

For train10m.csv, instead of running the entire 10k epochs, just perform 100 epochs, and estimate the amount of time it would take to complete 10k epochs. We try not to be that mean, although, if you do want to wait and see, that is perfectly fine.

Modify your code to use a gpu instead of cpus, and time the time it takes to train the model using train.csv, train100k.csv, and train10m.csv. What percentage faster is the GPU calculations for each dataset?

Items to submit

Code used to solve this problem.
Output from running the code.
Time it took to build the model for the train.csv and train100k.csv using cpus. In addition, the estimated time it would take to build the model for train10m.csv, again, using cpus.
Time it took to build the model for the train.csv, train100k.csv, and train10m.csv, using gpus.
What percentage faster (or slower) the GPU version is vs the CPU version for each dataset.

Part 2

You can now save your notebook, and switch back to using the regular cis220051 allocation — don’t forget to also change the queue to "shared". Be careful not to overwrite your output from part 1.

We’ve provided you with a Python script called bootstrap_samples.py that accepts a single value, for example 10, and runs the code you wrote in the previous project 10 times. This code should have a few modifications. One major, but simple modification is that rather than using our training data to build the model, instead, sample the same number of values in our x_train tensor from our x_train tensor, with replacement. What this means is if our x_train contained 1,2,3, we could produce any of the following samples 1,2,3 or 1,1,2 or 1,2,2 or 3,3,3 etc. We called these resampled values xr_train. Then proceed as normal, building your model using xr_train instead of x_train.

In addition at the end of the script, we used your model to get predictions for all of the values in x_test. Save these predictions to a parquet file, for example, 0cd68e5e-134d-4575-a31d-2060644f4caa.parquet, in a safe location, for example $SCRATCH/p10output/. Each file will each contain a single set of point estimates for our predictions.

bootstrap_samples.py

import sys
import argparse
import pandas as pd
import random
import torch
from pathlib import Path
import uuid


class Regression(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.beta0 = torch.nn.Parameter(torch.tensor(5, requires_grad=True, dtype=torch.float))
        self.beta1 = torch.nn.Parameter(torch.tensor(4, requires_grad=True, dtype=torch.float))
        self.beta2 = torch.nn.Parameter(torch.tensor(3, requires_grad=True, dtype=torch.float))

    def forward(self, x):
        return self.beta0 + self.beta1*x + self.beta2*x**2


def get_point_estimates(x_train, y_train, x_test):

    model = Regression()
    learning_rate = .0003

    num_epochs = 10000
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    mseloss = torch.nn.MSELoss(reduction='mean')

    # resample data
    resampled_idxs = random.choices(range(75), k=75)
    xr_train = torch.tensor(x_train[resampled_idxs], requires_grad=True, dtype=torch.float).reshape(75)

    for _ in range(num_epochs):
        # set to training mode -- note this does not _train_ anything
        model.train()

        # calculate the predictions / forward pass
        y_predictions = model(xr_train)

        # calculate the MSE
        mse = mseloss(y_train[resampled_idxs], y_predictions)

        # calculate the partial derivatives / backwards step
        mse.backward()

        # update our parameters
        optimizer.step()

        # zero out the gradients
        optimizer.zero_grad()

    # get predictions
    predictions = pd.DataFrame(data={"predictions": model(x_test).detach().numpy()})

    return(predictions)


def main():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(help="possible commands", dest="command")
    bootstrap_parser = subparsers.add_parser("bootstrap", help="")
    bootstrap_parser.add_argument("n", type=int, help="number of set of point estimates for predictions to output")
    bootstrap_parser.add_argument("-o", "--output", help="directory to output file(s) to")

    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)

    args = parser.parse_args()

    if args.command == "bootstrap":

        dat = pd.read_csv("/anvil/projects/tdm/data/sim/train.csv")
        x_train = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)
        y_train = torch.tensor(dat['y'].to_numpy(), dtype=torch.float)

        dat = pd.read_csv("/anvil/projects/tdm/data/sim/test.csv")
        x_test = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)

        for _ in range(args.n):
            estimates = get_point_estimates(x_train, y_train, x_test)
            estimates.to_parquet(f"{Path(args.output) / str(uuid.uuid4())}.parquet")

if __name__ == "__main__":
	main()

Make sure your p10output directory exists!

mkdir -p $SCRATCH/p10output

You can use the script like the following, in order to create 10 sets of point estimates:

singularity exec /anvil/projects/tdm/apps/containers/images/python:f2022-s2023.sif python3 /path/to/bootstrap_samples.py bootstrap 10 --output /anvil/scratch/USERNAME/p10output/

Make sure the p10output directory exists first! Also, replace USERNAME with your Anvil username.

Next, create your job script. Let’s call this p10_job.sh. You can use the following code. We would highly recommend using 10 cores to generate a total of 2000 sets of point estimates. The total runtime will vary but should be anywhere from 5 to 15 minutes.

p10_job.sh

#!/bin/bash
#SBATCH --account=cis220051              # Queue
#SBATCH --partition=shared
#SBATCH --job-name=kevinsjob          # Job name
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH [email protected]     # Where to send mail
#SBATCH --time=00:30:00
#SBATCH --ntasks=10                   # Number of tasks (total)
#SBATCH -o /dev/null                  # Output to dev null
#SBATCH -e /dev/null                  # Error to dev null

for((i=0; i < 10; i+=1))
do
    srun -A cis220051 -p shared --exact -n 1 -c 1 singularity exec /anvil/projects/tdm/apps/containers/images/python:f2022-s2023.sif python3 $HOME/bootstrap_samples.py bootstrap 200 --output $SCRATCH/p10output/ &
done

wait

You won’t need any of that array stuff anymore since we don’t have to keep track of the files we’re working with.

Make sure both bootstrap_samples.py and p10_job.sh have execute permissions.

chmod +x /path/to/bootstrap_samples.py
chmod +x /path/to/p10_job.sh

Submit your job using sbatch p10_job.sh.

Make sure to clear out the SLURM environment variables if you choose to run the sbatch command from within a bash cell in your notebook.

for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done;

Great! Now you have a directory $SCRATCH/p10output/ that contains 2000 sets of point estimates. Your job is now to process this data to create a graphic showing:

The actual y_test values (in blue) as a set of points (using plt.scatter).
The predictions as a line.
The confidence intervals as a shaded region. (You can use plt.fill_between).

The 95% confidence interval is simply the 97.5th percentile of each prediction’s point estimates (upper) and the 2.5th percentile of each prediction’s point estimates (lower).

You can import via:

import matplotlib.pyplot as plt

You will need to run the algorithm to get your predictions using the non-resampled training data — otherwise you won’t have the predictions to plot!

You will notice that some of your point estimates will be NaN. Resampling can cause your model to no longer converge unless you change the learning rate. Remove the NaN values, you should be left with around 1500 sets of point estimates that you can use.

You can loop through the output files by doing something like:

from pathlib import Path

for file in Path("/anvil/scratch/USERNAME/p10output/").glob("*.parquet"):
    pass

Items to submit

Code used to solve this problem.
Output from running the code.
2-3 sentences explaining the "other" changes in the provided script.
1-2 sentences describing your opinion of the changes.
p10_job.sh.
Your resulting graphic — make sure it renders properly when viewed in Gradescope.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.