STAT 39000: Project 8 — Spring 2022

Motivation: Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the pytorch and tensorflow libraries — JAX is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold.

Context: This is the first in a series of 4-5 projects focused on pytorch (and perhaps JAX). The purpose of these projects is to give you exposure to these tools, some basic functionality, and to show why they are useful, without needing any special math or statistics background.

Scope: Python, pytorch

Learning Objectives
  • Demystify a "tensor".

  • Utilize the pytorch API to create, modify, and operate on tensors.

  • Use simple, simulated data to create a multiple linear regression model using closed form solutions.

  • Use pytorch to calculate a popular uncertainty quantification, the 95% confidence interval.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/sim/train.csv

  • /depot/datamine/data/sim/test.csv

Questions

Question 1

While data frames are a great way to work with data, they are not the only way. Many high performance parts of code are written using a library like numpy or pytorch. These libraries are optimized to be extremely efficient with computation. Multiplying, transposing, inversing matrices can take time, and these libraries can make you code run blazingly fast.

This project is the first project in a series of projects focused on the pytorch library. It is difficult to understand why a library like pytorch is useful without introducing some math. This series of projects will involve some math, however, only at a very high level. Some intuition will be presented as notes, but what is really needed is the ability to read some formulas, and perform the appropriate computations. Throughout this series of projects, we will do our best to ensure that math or statistics is not at all a barrier to completing these projects and getting familiar with pytorch. If it does end up an issue, please post in Piazza and we will do our best to address any issues as soon as possible.

This first project will start slowly, and only focus on the numpy -like functionality of pytorch. We’ve provided you with a set of 100 observations. 75 of the observations are in the train.csv file, 25 are in the test.csv file. We will build a regression model using the data in the train.csv file. In addition, we will calculate some other statistics. Finally, we will (optionally) test our model out on new data in the test.csv dataset.

Start by reading the train.csv file into a pytorch tensor.

import torch
import pandas as pd

dat = pd.read_csv('/depot/datamine/data/sim/train.csv')
x_train = torch.tensor(dat['x'].to_numpy())
y_train = torch.tensor(dat['y'].to_numpy())

A tensor is just a n-dimensional array.

Use matplotlib or plotly to plot the data on a scatterplot — x_train on the x-axis, and y_train on the y-axis. After talking to your colleague, you agreed that the data is clearly following a 2nd order polynomial. Something like:

$y = \beta_0 + \beta_1x + \beta_2x^2$

Our goal will be to estimate the values of $\beta_0$, $\beta_1$, and $\beta_2$ using the data in x_train and y_train. Then, we will have a model that could look something like:

$y = 1.2 + .4x + 2.2x^2$

Then, for any given value of x, we can use our model to predict the value of y.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

In order to build our model, we need to estimate our parameters: $\beta_0$, $\beta_1$, and $\beta_2$. Luckily, linear regression has a closed form solution, so we can calculate these values directly with the following equation.

$\hat{\beta} = (X^{T} X)^{-1} X^{T} y$

What do these symbols mean? X is a matrix (or tensor), where each column is a term in the polynomial, and each row is an observation. So, for our polynomial, if our X data was simply: 1, 2, 3, 4, the X matrix (or design matrix) would be the following:

X
1, 1, 1
1, 2, 4
1, 3, 9
1, 4, 16

Here, the first column is the constant term, the second column is the term of x, the third column is the term of $x^2$, and so on.

When we raise the matrix to the "T" this means to transpose the matrix. The transpose of X, for example, would look like:

X^T
1, 1, 1, 1
1, 2, 3, 4
1, 3, 9, 16

When we raise the matrix to the "-1" this means to invert the matrix.

Finally, placing these matrices next to each other means we need to perform matrix multiplication.

pytorch has built in functions to do all of these operations: torch.mm, mat.T, and torch.inverse.

Lastly, y is the tensor containing the observations in y_train.

Tensors must be the correct dimensions before they can be multiplied together using torch.mm. By default, x_train and y_train will be a single row and 75 columns. In order to change this to be a single column and 75 rows, we would need to use the reshape method: x_train.reshape(75,1).

When doing matrix multiplication, it is important that the tensors are aligned properly. A 4x1 matrix would be a matrix that has 4 rows and 1 column (the first number always represents the number of rows, the second always represents the number of columns).

In order to multiply 2 matrices together, the number of columns in the first matrix must equal the number of rows in the second matrix. The resulting matrix would then have the number of rows as the first matrix, and the number of columns of the second matrix. So, if we multiplied a 4x3 matrix with a 3x5 matrix, the result would be a 4x5 matrix.

These rules are important, because the tensors must be the correct shape (correct number of rows and columns) before we perform matrix multiplication, otherwise we will get an error.

The reshape method allows you to specify the number of rows and columns in the tensor, for example, x_train.reshape(75,1), would result in a matrix with 75 rows and a single column. You will need to be careful to make sure your tensors are the correct shape before multiplication.

Start by creating a new tensor called x_mat that is 75 rows and 3 columns. The first column should be filled with 1’s (using torch.ones(x_train.shape[0]).reshape(75,1)), the second column should be the values in x_train, the third column should be the values in x_train squared. Use torch.cat to combine the 75x1 tensors into a single 75x3 tensor (x_mat).

Make sure you reshape all of your tensors to be 75x1 before you use torch.cat to combine them into a 75x3 tensor.

Operations like addition and subtraction are vectorized. For example, the following would result in a 75x1 tensor of 2’s.

x = torch.ones(75,1)
x*2

The following would result in a 1x75 tensor of .5’s.

x = torch.ones(1,75)
x/2

Remember, in Python, you can use:

**

to raise a number to a power. For example $2^3$ would be

2**3

To get the transpose of a tensor 2 dimension tensor in pytorch you could use x_mat.T, or torch.transpose(x_mat, 0, 1), where 0 is the first dimension to transpose and 1 is the second dimension to transpose.

Calculate our estimates for $\beta_0$, $\beta_1$, and $\beta_2$, and save the values in a tensor called betas. The following should be the successful result.

results
tensor([[ 4.3677],
        [-1.7885],
        [ 0.4840]], dtype=torch.float64)

Now that you know the values for $\beta_0$, $\beta_1$, and $\beta_2$, what is our model (as an equation)? It should be:

$y = 4.3677-1.7885x+.4840x^2$

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

That is pretty cool, and very fast. Now, for any given value of x, we can predict a value of y. Of course, we could write a predict function that accepts a value x, and returns our prediction y, and apply that function to each of the x values in our x_train tensor, however, this can be accomplished even faster and more flexibly using matrix multiplication — simply use the following formula:

$\hat{y} = X\hat{\beta}$

Where X is the x_mat tensor from earlier, and $\hat{\beta}$ is the betas tensor from question (2). Use torch.mm to multiply the two matrices together. Save the resulting tensor to a variable called y_predictions. Finally, create side by side scatterplots. In the first scatterplot, put the values in x_train on the x-axis and the values of y_train on the y-axis. In the second scatterplot put the values of x_train on the x-axis, and your predictions (y_predictions) on the y-axis.

Very cool! Your model should be killing it (after all, we generated this data to follow a known distribution).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

To better understand our model, let us create one of the most common forms of uncertainty quantification, confidence intervals. Confidence intervals (95% confidence intervals) show you the range of values (for each x) where we are 95% confident that the average value y for a given x is within the range.

The formula is the following:

$\hat{y_h} \pm t_{(\alpha/2, n-p)} * \sqrt{MSE * diag(x_h(X^{T} X)^{-1} x_h^{T})}$

$MSE = \frac{1}{n-p}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$

Since we are calculating the 95% confidence interval for the values of x in our x_train tensor, we can simplify this to:

$\hat{Y} \pm 1.993464 * \sqrt{MSE * diag(X(X^{T} X)^{-1} X^{T})}$

$\frac{1}{72}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$

Where:

  • $\hat{Y}$ is our y_predictions tensor from question (3).

  • $Y_i$ is the value of y for the ith value of y_train.

  • $\hat{Y_i}$ is the value of y for the ith value of y_predictions.

    You could simply sum the results of subtracting the y_predictions tensor from the y_train tensor, squared. You don’t need any loop.

  • p is the number of parameters in our model (3, the constant, the x, and the $x^2$).

  • n is the number of observations in our data set (75).

The "diag" part of the formula indicates that we want the diagonal of the resulting matrix. The diagonal of a given nxn matrix is the value at location (1,1), (2,2), (3,3), …​, (n,n). So, for instance, the diagonal of the following matrix is: 1, 5, 9

matrix
1,2,3
4,5,6
7,8,9

In pytorch, you can get this using torch.diag(x), where x is the matrix you want the diagonal of.

test = torch.tensor([1,2,3,4,5,6,7,8,9]).reshape(3,3)
torch.diag(test)

You can use torch.sum to sum up the values in a tensor.

The value for MSE should be 135.5434.

The first 5 values of the upper confidence interval are:

upper
tensor([[171.3263],
        [ 91.9131],
        [ 83.3474],
        [ 63.8171],
        [ 63.0524]], dtype=torch.float64)

The first 5 values of the lower confidence interval are:

lower
tensor([[140.6660],
        [ 76.2350],
        [ 69.1461],
        [ 52.7601],
        [ 52.1101]], dtype=torch.float64)
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Create a scatterplot of x_train on the x-axis, and y_predictions on the y-axis. Add the confidence intervals to the plot.

Great! It is unsurprising that our model is a great fit.

See here for the documentation on fill_between. This function can be used to shade from the lower to upper confidence bounds. Use this function after you’ve plotted your values of x (x_mat[:, 1]) on the x-axis and values of y_predictions on your y-axis.

In this project, we explored a well known model using simulated data from a known distribution. It is pretty boring, but boring can also make things a bit easier to understand.

To give a bit of perspective, this project focused on tensor operations so you could get used to pytorch. The power of pytorch starts to really show itself when the problems do not have a closed form solution. In the next project, we will use an algorithm called gradient descent to estimate our parameters (instead of using the closed form solutions). Since gradient descent, and algorithms like it are used frequently, it will give you a good sense on why pytorch is useful. In addition, because we solved this problem using the closed form solutions, we will be able to easily verify that our work in the next project is working as intended!

Lastly, in more complex situations, you may not have formulas to calculate confidence intervals and other uncertaintly quantification measures. We will use SLURM in combination with pytorch to resample our data and calculate point estimates, which can then be used to understand the variability.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.