TDM 40200: Project 1 - Multilayer Perceptron

Welcome back to the TDM course! This semester, we will explore a range of topics that build on core machine learning concepts and extend toward practical implementation and evaluation. Our coverage will begin with neural network architectures, progress through MLOps and signal processing, and conclude with advanced modeling and validation techniques.

We have a few notes to help make this semester smoother for you:

  • If you do not already have an account on the Anvil supercomputer, follow the Examples Book setup instructions to set one up.

  • If you have not already, visit the Anvil notebook and log in using your ACCESS account credentials.

  • If you encounter any issues connecting to Anvil, please contact us promptly.

  • Please use the Anvil notebook and log in with your assigned ACCESS username (created when you set up your account) and the ACCESS password you selected. These credentials are different from your Purdue credentials and should not be confused with them.

  • For project-related questions, attend the in-person class on Mondays, post on Piazza, and come to office hours. For technical issues, please submit a ticket.

Project Objectives

This project will introduce you to perceptrons and the Multilayer Perceptron (MLP) architecture. You will first learn how perceptrons work and implement a simple perceptron in NumPy, then build a single layer of perceptrons and a Multilayer Perceptron in PyTorch. Finally, you will apply your MLP to a classification task using the MNIST dataset.

Learning Objectives
  • Understand perceptrons and their role in neural networks.

  • Implement a simple perceptron in NumPy and a single layer of perceptrons in PyTorch.

  • Implement a Multilayer Perceptron (MLP) using PyTorch.

  • Apply an MLP to a classification task using the MNIST dataset.

Make sure to read about and use the project template found here, and review the important information about project submissions here.

As in the Fall 2025 semester, you may post your questions on the course Piazza page. Although the links are labeled “Fall 2025,” the same links will continue to be used for Spring 2026. Below is the Piazza link for this lecture:

The projects are usually due on Wednesdays. You can see the schedule here: the-examples-book.com/projects/spring2026/20200/projects. Please do not wait until Wednesday to complete and submit your work!

We strongly recommend starting your projects early in the week to avoid any last-minute issues that could cause you to miss the deadline.

Dataset

  • /anvil/projects/tdm/data/mnist/mnist_train.csv

  • /anvil/projects/tdm/data/mnist/mnist_test.csv

If AI is used in any way, such as for debugging or research, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click “Create Link” and add the shareable link as part of your citation.

The project template in the Examples Book now has a “Link to AI Chat History” section; please include this section in all your projects. If you did not use any AI tools, you may write “None”.

We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copy-pasted into your projects. Please refer to the-examples-book.com/projects/spring2026/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.

Questions

Question 1 (2 points)

Before we can learn about Multilayer Perceptrons, we first need to understand the fundamental building blocks of neural networks: perceptrons. A perceptron is a simple neuron model that takes multiple inputs and produces a single output. Mathematically, a perceptron computes a weighted sum of its inputs and then applies an activation function to produce an output. These perceptrons can then be combined into larger, more complex structures.
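In symbols (this notation is ours, introduced for clarity), a perceptron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, bias $b$, and activation function $f$ computes:

$\hat{y} = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$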

Now, you may have the following question: What is an activation function? An activation function is a mathematical function applied to the perceptron’s weighted sum; it typically introduces non-linearity into the model, or otherwise shapes the perceptron’s output. This is important because it allows the model to learn more complex, non-linear patterns. There are many different activation functions; some of the most common are listed below:

  • Linear: the simplest activation function, which, as you may guess, keeps the output linear. Formula: $f(x) = x$

  • Step: outputs 1 if the input is greater than or equal to a threshold (commonly 0), otherwise outputs 0. Formula: $f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$

  • Sigmoid: converts the input to a value between 0 and 1; often used in binary classification tasks. Formula: $f(x) = \frac{1}{1 + e^{-x}}$

  • ReLU: Rectified Linear Unit; outputs the input if it is positive, otherwise outputs zero. Formula: $f(x) = \max(0, x)$

  • Tanh: the hyperbolic tangent function, which outputs values between -1 and 1. Formula: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

  • Softmax: converts a vector of input values into a probability distribution; often used in multi-class classification tasks. Formula: $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$ for each $i$

For this project, let’s use the Step, Sigmoid, and ReLU activation functions. We can use the following template (we use a threshold of 0 for the step function):

import numpy as np

def step(x):
    output = np.where(x >= 0, 1, 0)
    return output

def sigmoid(x):
    output = 1 / (1 + np.exp(-x))
    return output

def relu(x):
    output = np.maximum(0, x)
    return output
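As a quick sanity check (optional, and the sample values here are our own, not part of the project), you can call these functions on a small array and verify the outputs by hand:

x = np.array([-2.0, 0.0, 3.0])
print(step(x))     # [0 1 1] -- only values >= 0 map to 1
print(sigmoid(x))  # roughly [0.119 0.5 0.953] -- always strictly between 0 and 1
print(relu(x))     # [0. 0. 3.] -- negative values are clipped to 0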

Now that we have our activation functions, we can implement a simple perceptron model. A perceptron will have a set of weights and a bias term, which will be used to compute the weighted sum of inputs. This set of weights needs to be the same size as the number of inputs, and the bias term will be a single value. The perceptron computes its output (commonly called a prediction, or forward propagation in larger models) by taking the weighted sum of the inputs, adding a bias, and applying an activation function.

In order to train the perceptron, we will use a simple learning rule. The perceptron will predict the output for the given inputs, calculate the error between the expected output and predicted output, and then adjust the weights and bias based on this error, the learning rate, and the original input.
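Written out (again with notation introduced here for clarity), if $\eta$ is the learning rate, $y$ the expected output, and $\hat{y}$ the prediction, one training step updates each weight and the bias as:

$w_i \leftarrow w_i + \eta \, (y - \hat{y}) \, x_i$, and $b \leftarrow b + \eta \, (y - \hat{y})$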

The learning rate is a hyperparameter that controls how quickly a model learns, or more specifically, how quickly the weights and bias are updated. This value depends on the problem, model, and data, but is commonly set to a small value such as 0.1 or 0.01.

More complex models will have more advanced training algorithms and learning rules, including backpropagation, gradient descent, etc. However, the simple perceptron model does not require these techniques.

from typing import Callable

class Perceptron:
    def __init__(self, input_size: int, learning_rate: float = 0.1, activation_function: Callable = sigmoid):
        self.weights = np.random.rand(input_size)  # Initialize weights randomly
        self.bias = np.random.rand(1)  # Initialize bias randomly
        self.learning_rate = learning_rate
        self.activation_function = activation_function

    def predict(self, x):
        # Compute the weighted sum of inputs by applying a numpy dot product between x and the weights
        weighted_sum = # YOUR CODE HERE

        # Add the bias to the weighted sum

        weighted_sum += # YOUR CODE HERE

        # Apply the activation function to the weighted sum

        output = # YOUR CODE HERE

        return output

    def train(self, X, y, epochs: int = 100):
        for epoch in range(epochs):
            for xi, yi in zip(X, y):
                # predict the output for the given input
                prediction = # YOUR CODE HERE

                # calculate the error between expected and predicted output (expected - predicted)
                error = # YOUR CODE HERE

                # multiply the error by the learning rate to get our adjustment
                adjustment = # YOUR CODE HERE

                # update the weights by adding the adjustment multiplied by the original input
                self.weights += # YOUR CODE HERE

                # update the bias by adding the adjustment
                self.bias += # YOUR CODE HERE

Once your perceptron is implemented, you can test it with this very small sample code, given below:

Typically, you would want to use a much larger dataset for training, and split the data into training and testing sets so that you evaluate the model on unseen data and can detect overfitting. However, this is a very small example to demonstrate that the perceptron is functioning correctly.

# Simple test example. We have 3 inputs, and we want to predict 1 if there are at least two 1s in the inputs, otherwise 0.
np.random.seed(11)  # For reproducibility
X = np.array([ [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # Expected output

perceptron = Perceptron(input_size=3, learning_rate=0.1, activation_function=step)
perceptron.train(X, y, epochs=100)

# Test the perceptron with the training data
for i, xi in enumerate(X):
    print(f"Input: {xi}, Predicted Output: {perceptron.predict(xi), 'Expected Output:', y[i]}")
Deliverables

1.1. Implement the activation functions and perceptron model in Python.
1.2. Test the perceptron with the provided example.
1.3. Ensure the perceptron can predict the expected outputs for the given inputs.

Question 2 (2 points)

Now that we understand how perceptrons work, we can extend this idea to a Multilayer Perceptron (MLP). An MLP is a basic neural network architecture that consists of multiple layers of perceptrons that feed into each other. In the previous question, we implemented a single perceptron. For this question, we will learn how to stack multiple perceptrons together to form a layer.

Now that we are creating a larger structure, it is much easier to use a framework such as PyTorch to implement it. PyTorch provides high-level abstractions for building neural networks and other machine learning models. It allows us to define layers, activation functions, loss functions, and many other components of a neural network in a more intuitive way. Typically, building a model with PyTorch involves these steps:

  1. Define a class that inherits from torch.nn.Module. This class will represent the MLP model, and will commonly have a few key functions:

    • __init__: This function defines the layers of the MLP, including any necessary information such as the input size, output size, and number of hidden layers.

    • forward: This function will define how the input data flows through the MLP, applying the layers, any activation functions, and any other necessary operations in order to produce the output.

  2. Create an instance of the MLP class.

  3. Define a training function or loop for the model, that will take in the model, data, labels, and any other necessary parameters.

    • This function will need a loss function and an optimizer to update the model’s weights during training. Sometimes, these are defined outside of the function and passed in as parameters for a more modular design. However, sometimes you may also see the training function defined with specific loss functions and optimizers created inside the function, for a standalone design.

  4. Create a loss function and an optimizer. These will be used to calculate the loss of the model’s prediction in that epoch, and update the model’s weights based on the loss, respectively.

Unlike our single perceptron from question 1, models more often do not have a dedicated train function; instead, the training is done in a separate script or function that takes in the model, data, labels, and any other necessary parameters. This is a common practice that keeps the model definition separate from the training logic, making the code more modular and letting us easily test different models with the same training logic, or the same model with different training logic.
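For illustration, here is a minimal sketch of that pattern (the function name and signature are our own choices, not part of the template below); it accepts any PyTorch model, data tensors, loss function, and optimizer:

def train_model(model, X, y, loss_function, optimizer, epochs=100):
    for epoch in range(epochs):
        predictions = model(X)                # forward pass
        loss = loss_function(predictions, y)  # how far off are the predictions?
        optimizer.zero_grad()                 # clear gradients from the previous step
        loss.backward()                       # backpropagate to compute new gradients
        optimizer.step()                      # update the model's weights

The training loop in the template below follows these same steps, just written inline.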

The following code will help get you started with implementing a single layer of perceptrons in PyTorch. While most of the code is already provided, there are thorough comments to help you understand what each part does and where you need to fill in the blanks.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

torch.manual_seed(11)  # For reproducibility
np.random.seed(11)  # For reproducibility

class SingleLayerPerceptron(nn.Module):
    def __init__(self, input_size, output_size, activation_function=torch.sigmoid):
        super().__init__()
        # Define a single linear layer of perceptrons with the specified input and output sizes
        self.layer = nn.Linear(input_size, output_size)

        # Set the activation function. Torch comes with many built-in activation functions, such as torch.relu, torch.sigmoid, torch.tanh, etc. You can also define your own activation function if needed.
        self.activation_function = activation_function

    def forward(self, x):
        # We can apply a layer to the input, simply by calling it like a function
        x = # YOUR CODE HERE, apply the linear layer to the input x and store the result in x

        # Then, we can do the same with the activation function
        x = # YOUR CODE HERE, apply the activation function to x and store the result in x

        return x

# Example usage
X = np.array([ [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # Expected output

# We must first convert our numpy arrays to torch tensors, as PyTorch works with tensors (which are similar to numpy arrays, but have some nice features such as GPU support)
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).view(-1, 1) # .view(-1, 1) reshapes the tensor to have a single column, as opposed to a single row.

# Define the model
input_size = X_tensor.shape[1]  # Number of features in the input
output_size = y_tensor.shape[1]  # Number of outputs (in this case, 1 for binary classification)

# define our model with the input size, output size, and the sigmoid activation function
model = # YOUR CODE HERE

# Define a loss function and an optimizer
# Binary Cross Entropy Loss, commonly used for binary classification tasks. In this case, we are using it because our output is binary (0 or 1).
loss_function = nn.BCELoss()
# Stochastic Gradient Descent optimizer with a learning rate of 0.1. As you may notice, we pass in the model's parameters to the optimizer as a reference, allowing the optimizer to update the model's weights during training.
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
epochs = 100
for epoch in range(epochs):
    # Compute the model's predictions, simply by calling the model as a function, passing in the input tensor
    predictions = # YOUR CODE HERE, pass in the input tensor X_tensor to the model to get the predictions

    # Calculate the loss between the predictions and the expected output, by calling the loss function with the predictions and the expected output tensor
    loss = # YOUR CODE HERE, pass in the predictions and the expected output tensor y_tensor to the loss function

    # Zero the gradients before the backward pass by calling the optimizer's `zero_grad()` function
    # YOUR CODE HERE


    # Have the loss function calculate the backwards pass, which computes the gradients of loss with respect to the model's parameters. simply call the loss's `backward()` function
    # YOUR CODE HERE

    # Update the model's weights by calling the optimizer's `step()` function
    # YOUR CODE HERE

    # Print the loss every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Test the model with the training data
with torch.no_grad():
    for i, xi in enumerate(X_tensor):
        print(f"Input: {X[i]}, Predicted Output: {model(xi).item():.4f}, Expected Output: {y[i]}")
Deliverables

2.1. Implement the SingleLayerPerceptron class in PyTorch.
2.2. Test the SingleLayerPerceptron with the provided example.
2.3. Ensure the perceptron can predict the expected outputs for the given inputs.

Question 3 (2 points)

Now that we have a single layer of perceptrons, we can extend this to a Multilayer Perceptron (MLP). Having multiple layers allows the model to learn more complex patterns in the data, as each layer can learn different features of the input data and pass these features on to the next layer.

To implement an MLP in PyTorch, it will look almost identical to the SingleLayerPerceptron class, but with a few key differences:

  • Since we have multiple layers, we need to define our hidden layers as a parameter we pass into the __init__ function. This can be a list of integers, where each integer represents the number of perceptrons in that layer. We can dynamically create the layers based on the length of this list.

  • The __init__ function will define multiple layers of perceptrons, instead of just one. This can be done by defining multiple nn.Linear layers and storing them in a list, or by using nn.Sequential to create a sequential model.

  • The forward function will apply each layer to the input data in sequence, passing the output of one layer to the next layer. If you stored the layers in a list, you can use a loop to apply each layer to the next layer’s output. If you used nn.Sequential, you can simply call this layer as a function and it will apply all the layers in sequence. This also eliminates the need to manually apply each activation function after each layer, as the activation functions can be included in the sequential model.

  • We can use a different activation function for each layer, so instead of passing a single activation function to the init function, we can pass a list of activation functions, one for each layer.

You may notice in the code below that we are using nn.Sigmoid as the activation function, instead of torch.sigmoid. This is because torch.sigmoid is a function, whereas nn.Sigmoid is a class that must be instantiated. This lets us create a new instance of the activation function for each layer and chain these instances together in our nn.Sequential model in one seamless call, instead of needing to call each activation function separately.
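As a small illustration of the difference (using the torch and nn imports from above; the tensor values are made up for the example):

x = torch.tensor([0.0, 2.0])
torch.sigmoid(x)        # a plain function, applied directly to a tensor
act = nn.Sigmoid()      # a module that is instantiated first...
act(x)                  # ...and then called like a function
stack = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())  # module instances can be chained in nn.Sequential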

class MultilayerPerceptron(nn.Module):
    def __init__(self, input_size: int = 2, hidden_sizes: list = [], output_size: int = 1, activation_functions: list = [nn.Sigmoid]):
        super().__init__()

        # create a list to hold the layers
        layers = []

        # Let's start by creating our input layer. Create a linear layer with the input size of `input_size`, and the output size of either the first hidden layer size if it exists, or the output size if it does not.
        layers.append(nn.Linear(input_size, hidden_sizes[0] if hidden_sizes else output_size))

        # Then, we can add the activation function for the input layer. Use the first activation function from the `activation_functions` list.
        layers.append(activation_functions[0]())

        # If there are no hidden layers, we can stop the model here and combine our layers list into a sequential model
        if not hidden_sizes:
            self.model = nn.Sequential(*layers)
            return

        # Now that we know we have at least one hidden layer, we can loop through the hidden sizes and create a linear layer for each hidden layer size.
        # To make this easier, we will first append the output size to the end of the hidden sizes list, so we can loop through it and create the layers in one go.
        hidden_sizes.append(output_size)
        # Iterate through all hidden sizes except the last one.
        for i in range(len(hidden_sizes) - 1):
            # Create a linear layer with an input size of the current hidden layer size, and an output size of the next hidden layer size. Append this layer to the layers list.
            layers.append(nn.Linear(hidden_sizes[i], hidden_sizes[i + 1]))

            # Then, we can add the activation function for the current hidden layer. Use the next activation function from the `activation_functions` list (i.e., for i = 0, use activation_functions[1]).
            layers.append(activation_functions[i + 1]())

        # Now that we have created all the layers, we can combine them into a sequential model
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        # We can apply the model to the input, simply by calling it like a function
        return self.model(x)  # Apply our sequential model to the input x

# Example usage
X = np.array([ [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # Expected output

# We must first convert our numpy arrays to torch tensors, as PyTorch works with tensors (which are similar to numpy arrays, but have some nice features such as GPU support)
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).view(-1, 1) # .view(-1, 1) reshapes the tensor to have a single column, as opposed to a single row.

# Define the model
input_size = X_tensor.shape[1]  # Number of features in the input
output_size = y_tensor.shape[1]  # Number of outputs (in this case, 1 for binary classification)

# define our model with the input size, output size, hidden layer sizes of your choice, and activation functions (e.g., nn.Sigmoid)
model = # YOUR CODE HERE

# Define a loss function and an optimizer
# Binary Cross Entropy Loss, commonly used for binary classification tasks. In this case, we are using it because our output is binary (0 or 1).
loss_function = nn.BCELoss()
# Stochastic Gradient Descent optimizer with a learning rate of 0.1. As you may notice, we pass in the model's parameters to the optimizer as a reference, allowing the optimizer to update the model's weights during training.
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
epochs = 100
for epoch in range(epochs):
    # Compute the model's predictions, simply by calling the model as a function, passing in the input tensor
    predictions = # YOUR CODE HERE, pass in the input tensor X_tensor to the model to get the predictions

    # Calculate the loss between the predictions and the expected output, by calling the loss function with the predictions and the expected output tensor
    loss = # YOUR CODE HERE, pass in the predictions and the expected output tensor y_tensor to the loss function

    # Zero the gradients before the backward pass by calling the optimizer's `zero_grad()` function
    # YOUR CODE HERE


    # Have the loss function calculate the backwards pass, which computes the gradients of loss with respect to the model's parameters. simply call the loss's `backward()` function
    # YOUR CODE HERE

    # Update the model's weights by calling the optimizer's `step()` function
    # YOUR CODE HERE

    # Print the loss every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Test the model with the training data
with torch.no_grad():
    for i, xi in enumerate(X_tensor):
        print(f"Input: {X[i]}, Predicted Output: {model(xi).item():.4f}, Expected Output: {y[i]}")

Notice the rate at which the loss for the model decreases. How does this compare to the single layer perceptron from the previous question? Does the MLP learn faster, or slower? Why do you think this is?

Deliverables

3.1. Implement the MultilayerPerceptron class in PyTorch.
3.2. Test the MultilayerPerceptron with the provided example.
3.3. Ensure the MLP can predict the expected outputs for the given inputs.
3.4. Compare loss decrease rate of the MLP to the single layer perceptron, and give your thoughts as to why this is the case.

Question 4 (2 points)

The data we have been training on so far has been very simple and not realistic, so it is a good idea to evaluate the model using a real-world dataset. One of the most commonly used datasets for testing image classification models is the MNIST dataset, which consists of small images of handwritten digits.

To start, we must first load the MNIST dataset. We have already uploaded the dataset to Anvil, so you can load it using pandas. The dataset is split into two files: mnist_train.csv and mnist_test.csv.

In practice, you will often receive a dataset as a whole and need to split it into training and test sets yourself, according to predefined proportions. We will perform this procedure several times during the semester, including when we cover cross-validation.

Each 28x28 pixel image is represented as a row in the CSV file, with the first column being the label (the digit) and the remaining columns being the pixel values of the image. The pixel values are in the range of 0 to 255, representing the intensity of each pixel. Additionally, there are 60000 training images and 10000 test images.

Before we use the dataset, let’s look at the data to better understand it and how we can build our model. You can use the following code to load the dataset and display the first digit as an image using matplotlib.

import pandas as pd
train_data = pd.read_csv('/anvil/projects/tdm/data/mnist/mnist_train.csv')
test_data = pd.read_csv('/anvil/projects/tdm/data/mnist/mnist_test.csv')

# convert these rows into 2D image arrays to display, and a dataframe of labels
train_images = train_data.iloc[:, 1:].values.reshape(-1, 28, 28)
train_labels = train_data.iloc[:, 0].values

# we can use matplotlib to visualize the images
import matplotlib.pyplot as plt
plt.imshow(train_images[0], cmap='gray')
plt.title(f'Label: {train_labels[0]}')
plt.show()

Please also display some more images from the training dataset, along with their corresponding labels. This is to get you more familiar with the dataset and how it is structured. You can use a loop to display multiple images (one possible approach is sketched after the questions below), or simply display a few more images manually. Then, answer the following questions about the dataset and how you would approach building a model to classify these images.

  1. How many unique labels are there in the dataset?

  2. What is the shape of the images in the dataset? What will the input size of the model be?

  3. What is the output size of the model?

  4. What activation function do you think would work best for the output layer of the model? Why?
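If you would like a starting point for displaying several digits at once, here is one possible sketch using matplotlib subplots (it reuses the train_images and train_labels arrays defined above; showing five images is an arbitrary choice):

fig, axes = plt.subplots(1, 5, figsize=(10, 3))
for i, ax in enumerate(axes):
    ax.imshow(train_images[i], cmap='gray')     # display the i-th training image
    ax.set_title(f'Label: {train_labels[i]}')   # show its label above the image
    ax.axis('off')
plt.show()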

Deliverables

4.1. Load the MNIST dataset using pandas.
4.2. Display the first image in the training dataset using matplotlib.
4.3. Display the label of the first image in the training dataset.
4.4. Display a few more images from the training dataset, and their corresponding labels.
4.5. Answer the associated questions about the dataset and how it affects the model.

Question 5 (2 points)

Now that we have a good understanding of the dataset, we can build our MLP model to classify the images. We will use the MultilayerPerceptron class we implemented in question 3, and modify it to work with the MNIST dataset.

To start, let’s preprocess the data and get it into a good format for training.

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

train = pd.read_csv('/anvil/projects/tdm/data/mnist/mnist_train.csv')
test = pd.read_csv('/anvil/projects/tdm/data/mnist/mnist_test.csv')

X_train = torch.tensor(train.iloc[:, 1:].values, dtype=torch.float32) / 255.0  # Normalize pixel values to [0, 1] and convert to tensor
y_train = torch.tensor(train.iloc[:, 0].values, dtype=torch.long)  # Convert labels to tensor
X_test = torch.tensor(test.iloc[:, 1:].values, dtype=torch.float32) / 255.0  # Normalize pixel values to [0, 1]
y_test = torch.tensor(test.iloc[:, 0].values, dtype=torch.long)  # Convert labels to tensor

Now that we have our data in the correct format, we can define our MLP model using the MultilayerPerceptron class we implemented in question 3.

Let’s use 3 hidden layers, with sizes 128, 64, and 32, respectively. We will use the ReLU activation function for the hidden layers and the Softmax activation function for the output layer. Additionally, we will use CrossEntropyLoss as the loss function and Adam as the optimizer. Although we have not used these yet, they are commonly used for image classification tasks and will work well with our MLP model.

model = # YOUR CODE HERE

loss_function = nn.CrossEntropyLoss()  # CrossEntropyLoss is commonly used for multi-class classification tasks
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer with a learning rate of 0.001

Then, please create a training loop to train the model on the training data. You can follow the same structure as the training loop we used in question 3. Please train for 100 epochs, and print the loss every 10 epochs.

After you run the training loop, you can evaluate the model’s performance on the test data with the below code:

# Set the model to evaluation mode
model.eval()
# Disable gradient calculation for evaluation
with torch.no_grad():
    # Get the model's predictions on the test data
    test_predictions = model(X_test)
    # Calculate the loss on the test data
    test_loss = loss_function(test_predictions, y_test)
    # Get the predicted labels by taking the index of the maximum value in each row
    _, predicted_labels = torch.max(test_predictions, 1)
    # Calculate the accuracy by comparing the predicted labels to the true labels
    accuracy = (predicted_labels == y_test).float().mean()

In the torch.max line above, torch.max returns two values: the maximum values and their indices. The underscore (_) indicates that the first returned value (the maximum values) is intentionally ignored; only the indices (the predicted labels) are needed.
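For example, on a small made-up tensor of scores:

scores = torch.tensor([[0.1, 0.7, 0.2],
                       [0.9, 0.05, 0.05]])
values, indices = torch.max(scores, 1)
print(values)   # tensor([0.7000, 0.9000]) -- the largest score in each row
print(indices)  # tensor([1, 0]) -- the column (label) where that largest score occurs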

What accuracy did you achieve on the test data? Can you think of anything to try to improve the model’s performance? For example, you can change the number of hidden layers, the size or activation function of the hidden layers, the learning rate, or even the number of epochs. Please experiment with these parameters and see if you can improve the model’s performance. You can also try different optimizers or loss functions, such as SGD or MSELoss, to see how they affect the model’s performance.

Deliverables

5.1. Data loaded and preprocessed for training.
5.2. MLP model defined with 3 hidden layers, ReLU activation for hidden layers, and Softmax activation for the output layer.
5.3. Training loop implemented to train the model on the training data for 100 epochs.
5.4. Model evaluated on the test data, with accuracy reported.
5.5. Experiment with different parameters to improve model performance.

Question 6 (2 points)

Now that we have a working MLP model, let’s try to visualize how the model is looking at the data. One way to do this is to visualize the weights of the model. The weights represent how much each input feature contributes to the output, and can give us valuable insights into how the model makes predictions. In large-scale applications, this can help reduce how much data we collect, store, and process, as we can see which features are most important to the model and which are not.

Let’s get the weights of the first hidden layer, where each of these neurons takes in the 784 input pixels, allowing us to see how the model is looking at the input data. We can reshape the weights into a 28x28 image, and display a grayscale image of the weights to see which pixels are most important to the model.

import matplotlib.pyplot as plt
import numpy as np
# Our input layer takes in 784 inputs (28x28 pixels), and passes each of these 784 inputs to 128 neurons in the first hidden layer. This means we have 128 neurons, each with their own set of 784 weights.
# Get the weights of the first hidden layer.
input_weights = model.model[0].weight.data.numpy()

# Reshape the weights to 128 images, each 28x28 pixels
input_weights = input_weights.reshape(128, 28, 28)
# Display the first few neurons' weights as grayscale images
for i in range(3):  # Display the first 3 neurons' weights
    plt.figure(figsize=(4, 4))
    plt.imshow(input_weights[i], cmap='gray')
    plt.title(f'Weights of Neuron {i+1}')
    plt.axis('off')
    plt.show()

In the color map used above, the darker pixels represent lower weights, while the lighter pixels represent higher weights. This means that the model is more likely to pay attention to the lighter pixels when making predictions.

Now that we have visualized some of these weights, let’s think about how we can explore all of them at once in a meaningful way. A simple approach is to take the average of all weights for each pixel across all neurons and visualize that result as a single image. This will give us an idea of which pixels are the most important to the model as a whole, and whether any pixels are not important at all.

# Calculate the average weights across all neurons
average_weights = np.mean(input_weights, axis=0)
# Display the average weights as a grayscale image
plt.figure(figsize=(4, 4))
plt.imshow(average_weights, cmap='gray')
plt.title('Average Weights Across All Neurons')
plt.axis('off')
plt.show()

Now that we have the average weights, are there any patterns you can start to infer about how the model is looking at the data? Are there any regions of the image that are more important than others? Are there any regions that are not important at all? How do these patterns compare to the original images in the dataset?

Deliverables

6.1. Visualize the weights of some neurons in the first hidden layer as grayscale images.
6.2. Calculate and visualize the average weights across all neurons in the input layer.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project1.ipynb

You must double-check your .ipynb after submitting it to Gradescope. A very common mistake is to assume that your .ipynb file has rendered properly and contains your code, markdown, and code output when it may not. Please take the time to double-check your work. See here for instructions on how to do this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.