TDM 40200: Project 13 - Large Language Models (LLMs): Transformers and Mini-LLM

Project Objectives

This project introduces the concept of transformers and their critical role in the development of modern-day large language model (LLM) architectures such as GPT and BERT.

Learning Objectives
  • Understand the concept of transformers and their significance in LLMs.

  • Learn about the architectures of some of the first Large Language Models (LLMs) and how they are built on top of transformers.

  • Understand the concept of attention and how it is used in transformers.

  • Implement a simple transformer model with guidance.

Dataset

  • '/anvil/projects/tdm/data/amazon/music.txt'

Questions

Question 1 (2 points)

It is recommended to reserve 4 CPU cores for this project. It takes a good amount of memory and CPU time to run, so allocating 4 cores will be very beneficial.

In the last project, we implemented a simple RNN model and an LSTM model. From that project, you should have learned that these models capture long-term context from the input sequence by maintaining a hidden state (a hidden memory) that keeps track of information from previous time steps. However, this approach has several limitations. For example, the model has to process the input sequence in order, which can be very inefficient for long sequences (imagine you want the last word of a 300-word sequence). Additionally, the small hidden state can only capture so much information from the input sequence.

Because of these limitations, the transformer model was introduced. Transformers are designed to process sequential data (like text sequences) in parallel, allowing for much faster training and inference times. Additionally, they are designed to capture the context of the entire sequence, including how tokens in the sequence affect the meaning of other tokens. This is achieved through two primary mechanisms: positional encoding and self-attention.

Positional encoding, as you may be able to guess from the name, is a way to encode the positional information of a given token in the input sequence. This is important because the transformer processes the sequence in parallel, so it does not inherently know the order of tokens in the sequence. The positional encoding is added to the input embeddings to provide this crucial information.

Self-attention is a mechanism that allows the model to weigh tokens in the input sequence differently based on their relevance to other tokens in the sequence. This is done by creating a set of attention scores that determine how much, or how little, each token should pay attention to other tokens in the sequence. This allows the model to better capture relationships between tokens, regardless of their position in the sequence. Examples of this would be models being able to understand pronouns and what they refer to, or understanding the context of a word based on the words around it.

Another important difference between our previous models and the transformer model is the embedding layer. In our RNN and LSTM, we created KeyedVectors that converted words into 100-dimensional vectors. This works fine for understanding individual words, but it fails to capture the relationships between words. For example, 'free parking' and 'paid parking' mean very different things, but the KeyedVectors give 'parking' the same vector in both phrases, so the surrounding context cannot affect its meaning.

The transformer model uses a different (though still similar) form of embedding. First, the data is tokenized, either into distinct words, characters, or subwords (think prefixes, suffixes, etc.). Once the tokens are created, a predefined vocabulary is used to convert each token to a unique integer (think label encoding from last semester). Once all the tokens are in integer format, the transformer’s embedding layer creates a dense vector representation of each token that incorporates the context surrounding the token. This results in each token having a more unique vector representation that captures the relationships between tokens in the sequence. For example, with our KeyedVectors from the last project, in the sentence 'the first dog sat on the second dog and bit the third dog', the word 'dog' would have the same vector representation for all three instances. With the transformer model, the first 'dog' would have a different vector representation than the second and third 'dog', allowing the model to understand that they are different instances of the same word.
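As a small, hedged illustration of the tokenize-then-index step (the sentence and vocabulary below are made up for demonstration and are not taken from the dataset):

# Toy illustration: words -> unique integer ids (the embedding layer later turns ids into vectors)
sentence = "the first dog sat on the second dog"
tokens = sentence.split()          # simple word-level tokenization
vocab = {}                         # token -> integer id
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)
ids = [vocab[tok] for tok in tokens]
print(vocab)  # {'the': 0, 'first': 1, 'dog': 2, 'sat': 3, 'on': 4, 'second': 5}
print(ids)    # [0, 1, 2, 3, 4, 0, 5, 2]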

For this question, we will implement new data preparation functions: one that builds a vocabulary of the words in our reviews, and one that tokenizes the reviews into lists of tokens.

To start, go ahead and copy in our read_lines and clean_text functions from the last project. These functions will read in the data and clean it up for us.

def read_lines(file_path, n, start = 0):
    lines = []
    with open(file_path, 'r') as f: # open the file in read mode
        for i, line in enumerate(f):
            if i < start:
                continue
            lines.append(line.strip())
            if len(lines) == n:
                break
    return lines

import codecs
def clean_text(text):
    #modify the text here
    # 1) decode escape sequences
    # 2) replace any newlines with a space
    # 3) lowercase the text
    # 4) keep only letters and whitespace (this also removes the surrounding double quotes)
    # 5) return the cleaned text

    text = codecs.decode(text, 'unicode_escape')
    text = text.replace('\n', ' ')
    text = text.lower()
    text = ''.join(c for c in text if c.isalpha() or c.isspace())


    return text

Then, we can work on creating our vocabulary. This function will take in a list of reviews from the read_lines function, clean the text, split it into words, and build a vocabulary of unique words. The function will return a dictionary that maps each word to its index in the vocabulary, along with the vocabulary list itself (the tokens in index order). You can use the below code to get started.

def build_vocab(lines):
    token_to_idx = {} # dictionary to map tokens to indices
    idx_to_token = [] # list to store the tokens in order of their indices

    for line in lines:
        for token in clean_text(line).split():
            '''Add the token to token_to_idx if it is not already present, giving it the next index as its value'''
            # YOUR CODE HERE

            '''Append the token to idx_to_token if it is not already present'''
            # YOUR CODE HERE
    return token_to_idx, idx_to_token

Once that is done, we can create a function to tokenize the reviews and create our input/output pairs. This function will take in our list of reviews, clean the text, and convert the words to their corresponding indices in the vocabulary. It will then slide a window of a specified sequence length over each review to create input/output pairs, and return the input sequences and output tokens. For example, a sequence length of 10 means that the first 10 words of the review are the first input and the 11th word is its output; the second input is the 2nd through 11th words, and so on. If the number of words in a review is less than the sequence length + 1, the review is skipped. You can use the below code to get started (a tiny illustration of the sliding-window idea appears after the skeleton).

def generate_dataset(lines, token_to_idx, sequence_length):
    inputs = [] # list to store the input sequences
    outputs = [] # list to store the output sequences
    for line in lines:
        ''' Clean the text and split it into words'''
        split_text = clean_text(line).split()


        ''' Check if the number of words is less than sequence_length + 1, if so, skip the review'''
        # YOUR CODE HERE


        ''' Convert the list of words to the list of indices (store it in a variable called tokens, used below)'''
        # YOUR CODE HERE

        ''' Create the input/output pairs by iterating through the list of indices and creating sequences of length sequence_length'''
        for i in range(len(tokens) - sequence_length):
            ''' Append the input sequence (sequence_length indices starting at i) to inputs'''
            # YOUR CODE HERE

            ''' Append the output token (the index immediately after the input sequence) to outputs'''
            # YOUR CODE HERE

    return inputs, outputs
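As a quick toy illustration of the sliding-window pairing described above (hypothetical token indices, not assignment data):

# Toy sketch of the sliding-window pairing (hypothetical indices, illustration only)
tokens = [5, 1, 8, 2, 9]
sequence_length = 3
pairs = [(tokens[i:i + sequence_length], tokens[i + sequence_length])
         for i in range(len(tokens) - sequence_length)]
print(pairs)  # [([5, 1, 8], 2), ([1, 8, 2], 9)]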

To test these functions, you can run the below code.

sequence_len = 3

lines = read_lines('/anvil/projects/tdm/data/amazon/music.txt', 500)

token_to_idx, idx_to_token = build_vocab(lines)

inputs, outputs = generate_dataset(lines, token_to_idx, sequence_len)

print(f'Length of Inputs {len(inputs)}') # Should output 'Length of Inputs 11095'
print(f'CD Index: {token_to_idx.get("cd",None)}') # Should output 'CD Index: 3'
Deliverables
  • Implement the build_vocab function

  • Implement the generate_dataset function

  • Test the functions with the provided code

Question 2 (2 points)

Now, let’s talk about positional encoding. As mentioned before, positional encoding is a way to encode positional information for a given token in the input sequence. Rather than directly modifying the input embeddings, the positional encoding is added to the input embeddings during the forward pass. This makes creating positional encodings very simple and efficient. The positional encoding is a matrix of size (sequence_length, embedding_dim), i.e., each row of the matrix is the positional encoding for that row's index in the sequence. Each row is the same size as a token embedding, so the two can be added together.

Positional encodings are built from sinusoidal functions so that each position gets a unique encoding: the sine function is used for the even embedding indices, and the cosine function is used for the odd indices.

What do we take the sine and cosine of? This is another interesting formula, as defined below:

pos / ((10000) ** (2 * (i // 2) / d_model))

where pos is the position in the sequence, i is the index of the embedding dimension, and d_model is the size of the embedding dimension. This creates a unique encoding for each position in the sequence and allows the model to learn relationships between positions.
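As a small hedged check of this formula with d_model = 4 (the same embedding size as the test below), computing the encoding for position 1:

import math

d_model = 4
pos = 1
# embedding index i = 0 -> angle = pos / 10000^(2*(0//2)/4) = 1.0
print(math.sin(pos / 10000 ** (2 * (0 // 2) / d_model)))  # ~0.8415 (even index 0 uses sine)
print(math.cos(pos / 10000 ** (2 * (0 // 2) / d_model)))  # ~0.5403 (odd index 1 uses cosine)
# embedding index i = 2 -> angle = pos / 10000^(2*(2//2)/4) = 0.01
print(math.sin(pos / 10000 ** (2 * (2 // 2) / d_model)))  # ~0.0100 (even index 2)
print(math.cos(pos / 10000 ** (2 * (2 // 2) / d_model)))  # ~0.99995 (odd index 3)

These four values match the pe[0, 1, :] row in the test output further below.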

Some useful functions you should use for this:

torch.zeros - creates a tensor of zeros with the specified shape. This is useful for creating the positional encoding matrix. Similar to np.zeros in numpy.

torch.arange - creates a tensor of evenly spaced values within a specified range. This is useful for creating the position tensor. Similar to np.arange in numpy. For example, torch.arange(0, 10) would create a tensor with values from 0 to 9. torch.arange(0, 10, 2) would create a tensor with values from 0 to 9 with a step of 2 (ends up being [0, 2, 4, 6, 8]).

torch.exp - computes the exponential of each element in the input tensor. This is useful for computing the division term in the positional encoding formula. Similar to np.exp in numpy.

torch.sin - computes the sine of each element in the input tensor. This is useful for computing the even indices of the positional encoding matrix. Similar to np.sin in numpy.

torch.cos - computes the cosine of each element in the input tensor. This is useful for computing the odd indices of the positional encoding matrix. Similar to np.cos in numpy.

You can use the following code to get started:

import torch
import math
def positional_encoding(seq_len, model_dimensions):
    ''' Create a tensor of zeros with shape (seq_len, model_dimensions)'''
    pe = # YOUR CODE HERE

    ''' Create a tensor of positions from 0 to seq_len - 1, then unsqueeze it to shape (seq_len, 1)'''
    position = # YOUR CODE HERE

    ''' Create a tensor of the division term.
    Torch does not have a 10000^x function, so instead we can represent 10000^x as e^(log(10000) * x) using torch.exp and math.log.
    Similarly, we can use e^(-log(10000) * x) to get 1/(10000^x), allowing us to multiply the position by the division term later for faster computation.
    hint: use torch.arange(0, model_dimensions, 2) to get the even indices of the model dimensions'''
    div_term = # YOUR CODE HERE

    ''' Set the even indices of positional encodings by taking the sine of the position * div_term'''
    pe[:, 0::2] = # YOUR CODE HERE

    ''' Set the odd indices of positional encodings by taking the cosine of the position * div_term'''
    pe[:, 1::2] = # YOUR CODE HERE

    ''' Return the positional encodings'''
    return pe.unsqueeze(0)  # [1, seq_len, model_dimensions]

Once you have implemented the function, you can test it with the following code:

pe = positional_encoding(10, 4)
print(pe.shape) # Should output `torch.Size([1, 10, 4])`
print(pe[0, 0, :]) # Should output `tensor([0., 1., 0., 1.])`
print(pe[0, 1, :]) # Should output `tensor([0.8415, 0.5403, 0.0100, 0.9999])`
print(pe[0, 2, :]) # Should output `tensor([ 0.9093, -0.4161,  0.0200,  0.9998])`
print(pe[0, 3, :]) # Should output `tensor([ 0.1411, -0.9900,  0.0300,  0.9996])`
print(pe[0, 4, :]) # Should output `tensor([-0.7568, -0.6536,  0.0400,  0.9992])`
print(pe[0, 5, :]) # Should output `tensor([-0.9589,  0.2837,  0.0500,  0.9988])`
Deliverables
  • Implement the positional_encoding function

  • Test the function with the provided code

Question 3 (2 points)

Now, let’s learn about the attention mechanism. This mechanism weighs tokens in the input sequence differently based on their relevance to other tokens in the sequence. This is done by creating scores that determine how much each token should pay 'attention' to the other tokens. This is quite a complex process, so most of the code will be provided for you. Please read through the comments to understand what each part of the code is doing.

Essentially, we want to project the input into three different vectors: the query, key, and value vectors. The query vector represents what a token is looking for in the other tokens. The key vector represents what a token offers; queries are compared against keys to determine how much attention each token should pay to every other token. The value vector carries the information that actually gets passed along, weighted by those attention scores.

Additionally, we will use multiple attention heads to allow the model to learn different relationships between tokens in the sequence. This is done by splitting the input into multiple heads (the size of the model dimensions must be divisible by the number of heads). Each head has its own set of query, key, and value vectors. The outputs of all the heads are then concatenated together to form the final output.

The key components of this are as follows (a small toy sketch of the core attention computation appears after the list):

  1. Figure out d_k, the size of each head. This is done by dividing the model dimensions by the number of heads, ensuring that the model dimensions are evenly divisible by the number of heads. This is important because each head will have its own set of query, key, and value vectors, and we want to make sure that they are all the same size.

  2. Split the input into query, key, and value vectors. This is done by passing the input through a linear layer and reshaping the output to create three separate tensors. The input is of shape (batch_size, seq_len, model_dimensions), and the output is reshaped to (batch_size, seq_len, 3, num_heads, d_k) where d_k is the size of each head. Next, we permute the output to be of shape (3, batch_size, num_heads, seq_len, d_k). This allows us to easily unpack the query, key, and value vectors from the output.

  3. Compute the attention scores. This is done by taking the dot product of the query and key vectors and then dividing by the square root of d_k. The scaling keeps the inputs to the upcoming softmax from becoming too large, which would saturate the softmax and destabilize training.

  4. Apply the softmax function to these attention scores to get the attention weights. This is done by applying the softmax function across the last dimension of the attention scores tensor.

  5. Compute the output of the attention mechanism by taking the dot product of the attention weights and the value vectors.

  6. Pass the output through a linear layer to combine the output of all the heads. This is done by reshaping the output to be of shape (batch_size, seq_len, model_dimensions) and passing it through a linear layer.
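As a toy, hedged sketch of what steps 3-5 compute for a single head (random tensors, illustration only; the class below wires this computation into multiple heads):

import math
import torch
import torch.nn.functional as F

# Toy single-head attention on random tensors (illustration only)
T, d_k = 5, 8                                        # sequence length, head size
q = torch.randn(T, d_k)                              # queries
k = torch.randn(T, d_k)                              # keys
v = torch.randn(T, d_k)                              # values

scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)  # (T, T) scaled attention scores
attn = F.softmax(scores, dim=-1)                     # attention weights, each row sums to 1
out = attn @ v                                       # (T, d_k) weighted sum of the values

print(attn.sum(dim=-1))  # tensor of ones: every token's weights sum to 1
print(out.shape)         # torch.Size([5, 8])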

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, model_dimensions, num_heads):
        super().__init__()

        '''Make sure that the model dimensions are divisible by the number of heads'''
        # YOUR CODE HERE

        '''Calculate d_k, the size of each head'''
        self.d_k = # YOUR CODE HERE

        ''' Store the number of heads'''
        self.num_heads = # YOUR CODE HERE

        '''
        Create query, key, value linear layers
        The shape of this linear layer should be (model_dimensions, model_dimensions * 3) to create the query, key, and value vectors
        '''
        self.qkv = # YOUR CODE HERE

        '''
        Create Linear layer to combine the output of all heads
        The shape of this linear layer should be (model_dimensions, model_dimensions) to combine the output of all heads
        '''
        self.fc_out = torch.nn.Linear(model_dimensions, model_dimensions)

    def forward(self, x):
        '''x is the input tensor of shape (batch_size, seq_len, model_dimensions)'''

        '''
        B = batch size, T = sequence length, C = model dimensions
        Unpack these dimensions from the input tensor
        '''
        B, T, C = x.size()

        '''
        pass the input through the linear layer to get the query, key, and value vectors. Then, reshape the output to shape (B, T, 3, num_heads, d_k) and permute it to shape (3, B, num_heads, T, d_k)
        '''
        qkv = self.qkv(x).reshape(B, T, 3, self.num_heads, self.d_k).permute(2, 0, 3, 1, 4)

        '''
        we can then unpack the query, key, and value vectors from the qkv tensor simply by indexing the 0th, 1st, and 2nd elements of the tensor respectively
        '''
        q = # YOUR CODE HERE
        k = # YOUR CODE HERE
        v = # YOUR CODE HERE

        '''
        q, k, v are now of shape (B, num_heads, T, d_k)
        This lets us compute the attention scores by taking the dot product of the query and key vectors, and then dividing by the square root of d_k to prevent large values in the softmax function
        '''
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_k)

        '''
        We can apply the softmax function to the scores to get the attention weights. This will give us a tensor of shape (B, num_heads, T, T) where each row contains the attention weights for one token in the sequence.
        Simply take the softmax of the scores across dim=-1 (the last dimension)
        '''
        attn = # YOUR CODE HERE

        '''Next, we compute the dot product of the attention weights and the value vectors to get the output of the attention mechanism, a tensor of shape (B, num_heads, T, d_k), which we then transpose and reshape to (B, T, C)
        '''
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)

        '''
        Finally, we pass the output through the linear layer to combine the output of all heads and return the final output
        This will give us a tensor of shape (B, T, model_dimensions)
        '''
        return self.fc_out(out)

To test this function, you can run the following code:

import numpy as np
import random

b = 2
t = 5
c = 128 # model dimensions
h = 4 # number of heads
mha = MultiHeadAttention(c,h)

torch.manual_seed(78)
np.random.seed(78)
random.seed(78)

x = np.random.rand(b,t,c).astype('f')

x = torch.from_numpy(x)
y = mha(x)
print(y.shape) # Should output `torch.Size([2, 5, 128])`

print(y[0][0][:10]) # Should output `tensor([ 0.0025, -0.0003,  0.0090,  0.0806, -0.0298,  0.0728, -0.3190, -0.2450, -0.0025, -0.1722], grad_fn=<SliceBackward0>)`

You may not get exactly the output above, but as long as the shape is correct, your values are floats, and the result is a tensor with grad_fn=<SliceBackward0>, you are good to go.

Deliverables
  • Implement the MultiHeadAttention class, including the init and forward methods

  • Test the class with the provided code

Question 4 (2 points)

Now that we have set up our positional encoding and attention mechanism, we can actually build our transformer and mini LLM.

First, let’s make a Transformer Block. There are many different types of transformer blocks, and many more 'modern' variants, but we will stick with something simple for now. The transformer block will consist of the following components (a short standalone sketch of the normalization and feed-forward layers appears after the list):

  1. Our attention mechanism. This will be the MultiHeadAttention class we created in the last question, however we will make it modular so that we can easily change what attention we want to use in the future.

  2. A normalization layer. This layer is used to normalize the output of the attention mechanism. This is done to prevent large values in the output, which can lead to exploding gradients.

  3. A feed-forward layer. This is a layer that sequentially applies a linear transformation, followed by a ReLU activation, followed by another linear transformation. This is done to allow the model to learn non-linear relationships between the input and output.

  4. A second normalization layer. This is used to normalize the output of the feed-forward layer.

Once we have set up these components, our forward pass will be very simple. Please fill in the TransformerBlock class below.

class TransformerBlock(torch.nn.Module):
    def __init__(self, model_dimensions, num_heads, attention_class=MultiHeadAttention):

        super().__init__()
        '''Create the attention mechanism, using whatever class is passed in as the attention_class, using the model dimensions and number of heads'''
        self.attn = # YOUR CODE HERE

        '''Create the first normalization layer, using the model dimensions with torch.nn.LayerNorm'''

        self.norm1 = # YOUR CODE HERE

        '''Create the feed-forward layer. This is done using torch.nn.Sequential, creating a Sequential model, that has a Linear layer with dimensions (model_dimensions, 4 * model_dimensions), followed by a ReLU activation, followed by another Linear layer with dimensions (4 * model_dimensions, model_dimensions)'''

        self.ff = # YOUR CODE HERE

        '''Create the second normalization layer, using the model dimensions'''
        self.norm2 = # YOUR CODE HERE

    def forward(self, x):
        '''Add the attention output to the input'''
        x = # YOUR CODE HERE

        '''Apply the first normalization layer to the x, saving it back into x'''
        x = # YOUR CODE HERE

        '''Add the feed-forward of x to x'''
        x = # YOUR CODE HERE

        '''Apply the second normalization layer to the x, saving it back into x'''
        x = # YOUR CODE HERE

        '''Return the final output'''
        return x

Once you have implemented our transformer block, we can now create our mini LLM. The mini LLM will consist of the following components:

  1. An embedding layer. This is a layer that converts the input tokens to their corresponding embeddings. This is done using the torch.nn.Embedding class, which takes in the size of the vocabulary and the size of the embedding dimension.

  2. A positional encoding layer. This is the positional_encoding function we created in Question 2, which adds positional information to the input embeddings.

  3. A sequential application of transformer blocks, for however many layers we want. This is done using the torch.nn.Sequential class, which takes in a list of layers and applies them sequentially. Within that, we use the TransformerBlock class we created above.

  4. A final linear layer. This is a layer that converts the output of the transformer blocks to the size of the vocabulary. This is done using the torch.nn.Linear class, which takes in the size of the input and the size of the output.

With this structure, our forward pass will be very simple. We embed the input and then add the positional encoding to it. Then, we pass this through the transformer blocks sequentially. Finally, we pass the output through the final layer to get the output.
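If torch.nn.Embedding is new to you, here is a small hedged sketch of what the embedding lookup does (toy vocabulary size and dimensions, not the assignment values):

import torch

embed = torch.nn.Embedding(num_embeddings=50, embedding_dim=8)  # toy vocab of 50 tokens
token_ids = torch.tensor([[3, 17, 3, 42]])                      # (batch=1, seq_len=4) integer ids
vectors = embed(token_ids)
print(vectors.shape)  # torch.Size([1, 4, 8]) -- one 8-dimensional vector per token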

You can use the following code to get started:

class MiniLLM(torch.nn.Module):
    def __init__(self, vocab_size, model_dimensions=128, num_heads=4, num_layers=2, seq_len=10, attention_class=MultiHeadAttention):
        super().__init__()

        '''Create the embedding layer, using the vocab size and model dimensions'''
        self.embed = # YOUR CODE HERE

        '''Generate the positional encodings using the positional_encoding function, using the sequence length and model dimensions'''
        self.pe = # YOUR CODE HERE

        '''Create the transformer blocks, using the TransformerBlock class, using the model dimensions and number of heads'''
        transformer_blocks = []
        for _ in range(num_layers):
            # append a new TransformerBlock to the transformer_blocks list
            # YOUR CODE HERE
        self.transformer_blocks = torch.nn.Sequential(*transformer_blocks) # create a Sequential model from the transformer_blocks list, unpacking it

        '''Create the final linear layer, using the model dimensions and vocab size'''

        self.fc = # YOUR CODE HERE

    def forward(self, x):
        '''Set x equal to the embedding of x + the positional encodings'''
        positional_encodings = self.pe[:, :x.size(1), :]
        x = # YOUR CODE HERE

        '''Pass x through the transformer blocks'''
        x = # YOUR CODE HERE

        '''Extract the last time step of the output sequence'''
        x = x[:, -1, :]

        '''Pass x through the final linear layer'''
        x = # YOUR CODE HERE

        return x

You can’t quite test this yet, as we need to create a training loop first. We will do that in the next question.

Deliverables
  • Implement the TransformerBlock class, including the init and forward methods

  • Implement the MiniLLM class, including the init and forward methods

Question 5 (2 points)

Now that we have our mini LLM, we can create a training loop to train it. The training loop will consist of the following components (a minimal generic sketch of the optimizer and loss pattern appears after the list):

  1. Create an Adam optimizer using the model parameters and specified learning rate

  2. Create a CrossEntropyLoss function.

  3. Iterate through each epoch [Steps 4-10]

  4. In each epoch, zip the input and output sequences together and iterate through the resulting pairs.

  5. Get the model predictions by passing the input sequences through the model.

  6. Calculate the loss by passing the model predictions and output sequences through the loss function.

  7. Zero the gradients by calling optimizer.zero_grad().

  8. Backpropagate the loss by calling loss.backward().

  9. Step the optimizer by calling optimizer.step().

  10. Update the loss for the epoch.

You can use the following code to get started:

from tqdm.auto import tqdm

def train_model(model, inputs, targets, epochs=5, learning_rate=1e-3):
    '''Create the Adam optimizer using the model parameters and learning rate'''
    optimizer = # YOUR CODE HERE

    '''Create the CrossEntropyLoss function'''
    loss_fn = # YOUR CODE HERE

    '''Iterate through each epoch'''
    for epoch in tqdm(range(epochs), desc='Training Epochs'):
        total_loss = 0
        '''In each epoch, zip the input and output sequences together and iterate through them'''
        for i, (x, y) in tqdm(enumerate(zip(inputs, targets)), desc=f'Training Epoch {epoch+1}', total=len(inputs), leave=False):
            '''Create a tensor for the input and output sequences, using [x] and [y] to create a batch of size 1'''
            x_tensor = # YOUR CODE HERE
            y_tensor = # YOUR CODE HERE

            '''Pass the input tensor through the model'''
            output = # YOUR CODE HERE

            '''Calculate the loss by passing the model predictions and output sequences through the loss function'''
            loss = # YOUR CODE HERE

            '''Zero the gradients by calling optimizer.zero_grad()'''
            # YOUR CODE HERE

            '''Backpropagate the loss by calling loss.backward()'''
            # YOUR CODE HERE

            '''Step the optimizer by calling optimizer.step()'''
            # YOUR CODE HERE

            ''' Add loss.item() to the total loss'''
            # YOUR CODE HERE

Now that we can train our model, we can start generating text with it. We will go ahead and give you the full code for using the model to generate text, but please read through the comments to understand what each part of the code is doing. The code is below:

import torch.nn.functional as F
def generate_text(model, start_sequence, token_to_idx, idx_to_token, length=20, context_length=10, temperature=1.0):
    '''Set the model to evaluation mode'''
    model.eval()

    '''Convert the start sequence to a list of indices using the token_to_idx dictionary'''
    current = [token_to_idx[token] for token in start_sequence.split()]

    '''For each token to generate, do the following'''
    for _ in range(length):
        '''Create a tensor from the last context_length tokens of the current sequence'''
        x = torch.tensor([current[-context_length:]], dtype=torch.long)

        '''Pass the tensor through the model'''
        with torch.no_grad():
            '''Get the model predictions (logits) for the next token'''
            logits = model(x)

            '''Get the next token by sampling from the model predictions using torch.multinomial'''
            next_token = torch.multinomial(F.softmax(logits / temperature, dim=-1).squeeze(), num_samples=1).item()

            '''Append the next token to the current sequence'''
            current.append(next_token)
    return ' '.join(idx_to_token[i] for i in current)

There are many different ways to generate text (as you should know from the NGram class you made). We can take the most likely output, the least likely, sample from the probabilities of all outputs, etc. In the above function, we use torch.multinomial to sample the next token from the probability distribution over all outputs. This lets us generate more interesting text than always taking the most likely output. The temperature parameter controls how 'random' the output is: a lower temperature concentrates probability on the most likely outputs, while a higher temperature flattens the distribution and produces more random outputs.
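As a small hedged illustration of how temperature reshapes the sampling distribution (the logits below are made up):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])   # made-up scores for three candidate tokens
print(F.softmax(logits / 0.5, dim=-1))   # low temperature  -> sharper, ~[0.86, 0.12, 0.02]
print(F.softmax(logits / 1.0, dim=-1))   # default          -> ~[0.66, 0.24, 0.10]
print(F.softmax(logits / 2.0, dim=-1))   # high temperature -> flatter, ~[0.50, 0.30, 0.19]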

Now, let’s put it all together to test our model. You can use the following code to test the model:

context_length = 10

lines = read_lines('/anvil/projects/tdm/data/amazon/music.txt', 500)

token_to_idx, idx_to_token = build_vocab(lines)

inputs, outputs = generate_dataset(lines, token_to_idx, context_length)

vocab_size = len(token_to_idx)

model = MiniLLM(vocab_size, model_dimensions=128, num_heads=4, num_layers=2, seq_len=context_length, attention_class=MultiHeadAttention)

print('Start training....')

train_model(model, inputs, outputs, epochs=10, learning_rate=0.001)

print("Finished training!\nGenerating text...")

prompt = "this cd is the best because "
output = generate_text(model, prompt, token_to_idx, idx_to_token, length=20, context_length=context_length, temperature=0.8) # generate 20 words of text

print(f"Prompt: {prompt}")
print(f"Output: {output}")

This should take a few minutes to train using 4 CPU cores. If yours is taking hours, please seek help from a TA.

Does it work? Your model should be able to generate text (though probably not very coherent text), and it should be much faster to train and generate text than the RNN and LSTM models from the last project.

Play around with different temperatures, context lengths, lengths (number of tokens to generate), and prompts, and see how they affect the output. Please describe what you tried and what you observed in a markdown cell below.

Feel free to also train on more reviews or for more epochs, and see whether that gives better results.

Deliverables
  • Implement the train_model function, including the training loop

  • Implement the generate_text function, including the text generation loop

Question 6 (2 points)

Currently, our MultiHeadAttention class is able to see both past and future tokens in the sequence. This is not ideal for a language model, as we want to predict the next token based on past tokens only. To achieve this, we need to implement a masked attention mechanism, which uses a mask to prevent the model from seeing future tokens in the sequence.

First, we will generate a causal mask, which prevents the model from seeing future tokens in the sequence. The mask is a tensor of shape (seq_len, seq_len) where element (i, j) is 1 if the token at position i is allowed to attend to the token at position j, and 0 if it is not. It is created by building a tensor of ones and then setting the upper triangle of the tensor to 0. The code for this is provided below:

def generate_causal_mask(size):
    return torch.tril(torch.ones(size, size)).unsqueeze(0).unsqueeze(0)  # [1, 1, size, size]
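As a small hedged illustration of what this mask looks like and how masked_fill is commonly used with it (toy size, illustration only):

import torch

mask = generate_causal_mask(4)
print(mask[0, 0])
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

# Typical use: positions where the mask is 0 get -inf, so they receive zero attention weight
toy_scores = torch.zeros(1, 1, 4, 4)
masked = toy_scores.masked_fill(mask == 0, float('-inf'))
print(torch.softmax(masked, dim=-1)[0, 0, 0])  # tensor([1., 0., 0., 0.]) -- token 0 only attends to itself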

Once we have a causal mask, we can create a new MaskedMultiHeadAttention class. This class should be almost identical to the MultiHeadAttention class, except that we apply the mask to the attention scores before the softmax, setting the scores for masked-out (future) positions to negative infinity. You can fill in the class below:

class MaskedMultiHeadAttention(torch.nn.Module):
    def __init__(self, model_dimensions, num_heads):
        super().__init__()

        '''Make sure that the model dimensions are divisible by the number of heads'''
        # YOUR CODE HERE

        '''Calculate d_k, the size of each head'''
        self.d_k = # YOUR CODE HERE

        ''' Store the number of heads'''
        self.num_heads = # YOUR CODE HERE

        '''
        Create query, key, value linear layers
        The shape of this linear layer should be (model_dimensions, model_dimensions * 3) to create the query, key, and value vectors
        '''
        self.qkv = # YOUR CODE HERE

        '''
        Create Linear layer to combine the output of all heads
        The shape of this linear layer should be (model_dimensions, model_dimensions) to combine the output of all heads
        '''
        self.fc_out = torch.nn.Linear(model_dimensions, model_dimensions)

    def forward(self, x):
        '''x is the input tensor of shape (batch_size, seq_len, model_dimensions)'''

        '''
        B = batch size, T = sequence length, C = model dimensions
        Unpack these dimensions from the input tensor
        '''
        B, T, C = x.size()

        '''
        pass the input through the linear layer to get the query, key, and value vectors. Then, reshape the output to shape (B, T, 3, num_heads, d_k) and permute it to shape (3, B, num_heads, T, d_k)
        '''
        qkv = self.qkv(x).reshape(B, T, 3, self.num_heads, self.d_k).permute(2, 0, 3, 1, 4)

        '''
        we can then unpack the query, key, and value vectors from the qkv tensor simply by indexing the 0th, 1st, and 2nd elements of the tensor respectively
        '''
        q = # YOUR CODE HERE
        k = # YOUR CODE HERE
        v = # YOUR CODE HERE

        '''
        q, k, v are now of shape (B, num_heads, T, d_k)
        This lets us compute the attention scores by taking the dot product of the query and key vectors, and then dividing by the square root of d_k to prevent large values in the softmax function
        '''
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_k)

        '''Create a mask of shape (1, 1, T, T) using the generate_causal_mask function'''
        mask = # YOUR CODE HERE

        '''
        Use the .masked_fill function to set the value in scores to `float('-inf')` if the mask in that index is 0.
        This will prevent the model from seeing future tokens in the sequence.
        '''
        scores = # YOUR CODE HERE

        '''
        We can apply the softmax function to the scores to get the attention weights. This will give us a tensor of shape (B, num_heads, T, T) where each row contains the attention weights for one token in the sequence.
        Simply take the softmax of the scores across dim=-1 (the last dimension)
        '''
        attn = # YOUR CODE HERE

        '''Next, we compute the dot product of the attention weights and the value vectors to get the output of the attention mechanism, a tensor of shape (B, num_heads, T, d_k), which we then transpose and reshape to (B, T, C)
        '''
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)

        '''
        Finally, we pass the output through the linear layer to combine the output of all heads and return the final output
        This will give us a tensor of shape (B, T, model_dimensions)
        '''
        return self.fc_out(out)

Please try swapping out the MultiHeadAttention class in the MiniLLM class with the new MaskedMultiHeadAttention class and see how it affects the output. You can use the same training loop as before.

Deliverables

  • Add the generate_causal_mask function

  • Implement the MaskedMultiHeadAttention class, including the init and forward methods

  • Describe how the output changes when using the MaskedMultiHeadAttention class instead of the MultiHeadAttention class. Does it give different results? How does it affect the output? Does it make sense conceptually?

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project13.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.