TDM 30200: Project 12 - Large Language Models (LLMs): Sequence Modeling

Project Objectives

This project combines what we learned in the last two projects into a more powerful model. We will use our word embedding generation code from Project 11 and the sequence modeling concepts from our n-gram model in Project 10 to create a machine learning-based model for predicting text.

Learning Objectives
  • Reinforce the concepts of text tokenization, word embeddings, n-gram models, and sequence modeling.

  • Reinforce machine learning concepts, including training and testing data, and model evaluation from last semester.

  • Learn how to make basic machine learning models using pytorch.

Dataset

  • '/anvil/projects/tdm/data/amazon/music.txt'

Questions

Question 1 (2 points)

In Project 10, we learned about n-grams and created a simple model that used them to predict the next word in a sequence. In the last project, we learned how to turn text data into numerical data using word embeddings. In this project, we will combine these two concepts, along with our knowledge of machine learning from last semester, to create a more powerful model for predicting text.

To make sure everyone is on the same page, the read_lines and clean_text functions are provided below. Go ahead and copy in your NGram class from Project 10 as well. Then, let’s create a new function that generates multiple words in a row. This function will use your existing n-gram model’s get_next_word function to generate the next word in the sequence and append it to the sequence. It will then use the updated sequence to generate the following word, and so on, n times. You can use the following code to get started:

def read_lines(file_path, n, start = 0):
    lines = []
    with open(file_path, 'r') as f: # open the file in read mode
        for i, line in enumerate(f):
            if i < start:
                continue
            lines.append(line.strip())
            if len(lines) == n:
                break
    return lines

import codecs
def clean_text(text):
    # clean the text here:
    # 1) decode escape sequences
    # 2) replace any newlines with a space
    # 3) lowercase the text
    # 4) keep only alphabetic characters and whitespace
    # 5) return the cleaned text

    text = codecs.decode(text, 'unicode_escape')
    text = text.replace('\n', ' ')
    text = text.lower()
    text = ''.join(c for c in text if c.isalpha() or c.isspace())


    return text
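
# For example (an illustrative check, assuming a review line that contains a literal \n escape):
#   clean_text('"Great sound!\\nFive stars."')  ->  'great sound five stars'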

def get_next_word_loop(start_string, n=10):
    # generate the next n words and output the final string
    for _ in range(n):
        # get the next word using the n-gram model
        # your code here

        # append the next word to the start_string, with a space in between
        # your code here

    # return the final string
    # your code here

Please use method="common" for the get_next_word function. This ensures that the model picks the most common next word observed for the current context, rather than a random one, so the generated sequence consists of words that are likely to occur together.

Now that you can quickly generate a sequence of words, try it out! Run the code below to set up your NGram object, then generate a sequence of 5 words after the starting string "this is a" and print the result. How well did it work? Try the same thing with a 20-word sequence. How well did that work?

import numpy as np
my_ngram = NGram(3, is_character_based=False)
my_ngram.set_data(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 50000))

print(get_next_word_loop("this is a",5))

print(get_next_word_loop("this is a",20))
Deliverables
  • Implement the get_next_word_loop function

  • Generate a sequence of 5 words and print the result

  • Explain how well the model worked for the 5 word sequence

  • Generate a sequence of 20 words and print the result

  • Explain how well the model worked for the 20 word sequence

Question 2 (2 points)

As you should have noticed, this model can generate a sequence of a few words that makes sense, but as the sequence gets longer it starts to lose meaning. That is because this model simply looks at the last n words in the sequence and tries to predict the next word. It does not take the entire sequence of words into account, which would give it more context and allow it to generate a more coherent sequence of words. Additionally, the model is still only using text-based inputs and has no sense of how different words relate to each other. To address these issues, we can use a Recurrent Neural Network (RNN) to generate the next word in the sequence while maintaining some context of the entire sequence. Additionally, we can train this model on our word embedding vectors, which will allow it to learn how different words relate to each other.

Although we used sklearn last semester for our machine learning models, sklearn does not have a good implementation of RNNs. Instead, we will be using pytorch to create our RNN model. Pytorch is a powerful machine learning library that allows us to create complex models with ease. It also has a great implementation of RNNs, which we will be using for this project.

The first thing we need to do is prepare our data for the RNN model. If you remember from last semester, to train a model we need to split our data into input and output data. In this case, each input will be a sequence of n-1 words, and the corresponding output will be the next word in the sequence (just like our n-gram model, where n-1 words of context predict the nth word). We will also need to convert this text data into numerical data, using our word embeddings.
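
To make the sliding-window idea concrete, here is a tiny illustration on plain tokens (a made-up example sentence, not code you need for the exercise); with n=3, each input is a window of 2 consecutive words and the output is the word that follows:

words = "this cd has a great sound".split()
n = 3

# slide a window of n-1 words across the sentence; the word after each window is the output
pairs = [(words[i:i+n-1], words[i+n-1]) for i in range(len(words) - (n-1))]
for inp, out in pairs:
    print(inp, '->', out)
# ['this', 'cd'] -> has
# ['cd', 'has'] -> a
# ['has', 'a'] -> great
# ['a', 'great'] -> sound

In create_dataset below, each of these inputs will become an array of word embeddings rather than raw words.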

As you saw in the last project, word embeddings require a lot of data to train on. For this project we are going to train FastText word embeddings on the first 5000 lines of the music data. You can run the code below to generate them:

from gensim.models import FastText, KeyedVectors
def generate_fasttext_vectors(lines, filename):
    # Clean the text using the clean_text function
    cleaned_lines = [clean_text(line) for line in lines]

    # Split each line into words using .split()
    tokenized_lines = [line.split() for line in cleaned_lines]

    # Create a FastText model using the tokenized lines
    model = FastText(tokenized_lines, vector_size=100, window=5, min_count=1, workers=1)

    # Save the model to disk
    model.save(f'{filename}.model')

    # Save the word vectors to disk
    word_vectors = model.wv
    word_vectors.save(f'{filename}.wordvectors')

    return word_vectors

keyedvectors = generate_fasttext_vectors(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 5000, 0), 'P12_fasttext')
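
If you restart your kernel later, you do not need to regenerate the vectors; the KeyedVectors class imported above can reload the saved file (this assumes you used the filename 'P12_fasttext' as in the call above):

# reload the previously saved word vectors instead of retraining FastText
keyedvectors = KeyedVectors.load('P12_fasttext.wordvectors')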

Then, let’s begin making a function to create our input and output data for the RNN model. This function will take in the output from the read_lines function, the word embeddings, and the number of words in each sequence to use as input (n, the same idea as in our n-gram model). The function will then create input and output data, where each input is a sequence of n-1 words and the output is the next word, converting the text into numerical data using the word embeddings. The function outline below also returns the words and embeddings lists from the last review processed, which we will use for testing in the next question. You can fill in the function outline below to get started.

The purpose of using an RNN is to give the model context from the entire sequence of words, rather than just the last n words. This is done by maintaining a hidden state that is updated with each word in the sequence, allowing it to carry (admittedly limited) context of the entire sequence. It is therefore important that we do not mix multiple different reviews into the same sequence, and we must reset the hidden state after each review while training. Accordingly, your final input_data should be a list of 3-dimensional numpy arrays, one array per review. For example, with n=3 a review containing 6 words produces 4 input sequences of 2 word embeddings each, i.e. an array of shape (4, 2, 100).

def create_dataset(data, word_embeddings, n=3):
    input_data = []
    output_data = []

    for line in data:
        # clean the line
        # your code here

        # split the line into words
        words = []
        # your code here

        # if the number of words is less than n, skip the line
        if len(words) < n:
            continue

        # convert the list of words into a list of word embeddings
        embeddings = []
        # your code here

        # group the embeddings into groups of n consecutive words
        embedding_groups = []
        # your code here

        r_i = []
        r_o = []
        # append a numpy array of the sequence word embeddings to the input data list and the next word embeddings to the output data list
        for group in embedding_groups:
            # your code here


        # append a numpy array of r_i to input_data, and r_o to output_data
        # your code here

    # return the list of words, embeddings, input_data, and output_data
    return (words, embeddings, input_data, output_data)

To test your function, you can use the following code:

words, embeddings, input_dataset, output_data = create_dataset(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 1500), keyedvectors, n=3)

print(input_dataset[0].shape) # (4, 2, 100), we have 4 sequences of 2 words, each with 100 dimensions
print(input_dataset[1].shape) # (2, 2, 100), we have 2 sequences of 2 words, each with 100 dimensions
print(input_dataset[2].shape) # (39, 2, 100), we have 39 sequences of 2 words, each with 100 dimensions
print(len(input_dataset)) # 1360, there are 1360 reviews that were long enough to be used
print(len(output_data)) # 1360, there are 1360 reviews that were long enough to be used

print(output_data[0][0][:5]) # [ 0.88376546  0.7545694  -0.9747805  -0.74862236  0.05938531]

print(input_dataset[10][0][0][:5]) # [-0.1334532   0.10588685 -2.6614566  -1.1856614   0.5889898 ]

print(output_data[8][0][:5]) # [ 0.4304928   0.8276741  -0.91861534 -0.3446058   0.46729067]
Deliverables
  • Implement the create_dataset function

  • Generate the input and output data using the create_dataset function

  • Run and pass the test cases

Question 3 (2 points)

Now that we are able to generate our input and output datasets, that’s all we need to do to get our model set up, right? While it’s true that this is all the setup we need to start training and using our model, one problem you may have noticed is that all of this data is in the form of big numpy arrays. This is perfectly fine for the computer, but not so much for us. We need a way to convert these numpy arrays back into text so that we can understand what the model is outputting. To do this, we will create a function that takes in a numpy array of a word embedding, along with the keyedvectors object, and returns the word that corresponds to that embedding. How do we know which word corresponds to a given embedding? We can compare it against the embeddings of all the words in our KeyedVectors object and find the word whose embedding is closest, using cosine similarity. The closest known embedding then tells us which word the given embedding corresponds to. You can use the following code to get started:

def cosine_similarity_vec(v1, v2):
    # find the cosine similarity between two vectors
    dot_product = np.dot(v1, v2)
    magnitude_v1 = np.linalg.norm(v1)
    magnitude_v2 = np.linalg.norm(v2)

    return dot_product / (magnitude_v1 * magnitude_v2)

def get_word_from_embedding(embedding, keyedvectors):
    # build a dictionary mapping each word in the keyedvectors object to its word embedding
    # your code here

    # for each word embedding in the keyedvectors object, calculate the cosine similarity to the given word embedding
    # your code here

    # find the word with the highest cosine similarity to the given word embedding
    # your code here

    # return the word with the highest cosine similarity
    # your code here

You can test your function using the following code:

print(words[0:10]) # ['great', 'video', 'you', 'are', 'taught', 'the', 'tricks', 'of', 'the', 'trade']
print([get_word_from_embedding(embeddings[i], keyedvectors) for i in range(10)])
# [('great', 0.9999999), ('video', 1.0), ('you', 1.0000001), ('are', 1.0000001), ('taught', 1.0), ('the', 1.0), ('tricks', 0.9999999), ('of', 1.0000001), ('the', 1.0), ('trade', 0.99999994)]

print(get_word_from_embedding(input_dataset[5][10][0], keyedvectors)) # should print ('still', 1.0)
print(get_word_from_embedding(input_dataset[5][11][0], keyedvectors)) # should print the word ('sounding', 0.99999994)
print(get_word_from_embedding(input_dataset[110][5][0], keyedvectors)) # should print the word ('rock', 1.0000001)
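
As an optional sanity check on your get_word_from_embedding function, gensim's KeyedVectors also has a built-in nearest-neighbor lookup by cosine similarity; its result should broadly agree with yours (this is only for checking your work, not a replacement for writing the function):

print(keyedvectors.similar_by_vector(embeddings[0], topn=1)) # should list 'great' as the closest word, matching your own output above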
Deliverables
  • Implement the get_word_from_embedding function

  • Run and pass the test cases

Question 4 (2 points)

Now that we have our dataset of word embeddings, and a way to convert them back into text, we can start training our model. Because we have not gone over pytorch in this course, most of the code for the model will be provided for you. We encourage you to read through the code and comments to understand what is going on, but you will not need to know how to code the entire model from scratch. Some important ideas to understand about pytorch are as follows:

  1. Models in pytorch are not created by instantiating a prebuilt class and calling a fit function, like in opencv or sklearn. Instead, we create our own class that inherits from the base torch.nn.Module class. This class has a few important functions that we need to implement:

    • __init__: This function is called when the model is created. We will use it to create our layers and set up our model.

    • forward: This function is called as we pass data through the model; it is the forward pass. It takes in the input data (and hidden state, if applicable) and returns the output data (and updated hidden state, if applicable). This is where the majority of the model’s work happens.

  2. The model is typically trained using a criterion (or loss function) and an optimizer. The criterion calculates the loss between the predicted output and the actual output, and the optimizer updates the model weights based on that loss. There are many different loss functions you can use, such as Cross Entropy, Mean Squared Error, etc., and multiple optimizers, such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam). For this project, we will be using Mean Squared Error (MSE) as our criterion and Adam as our optimizer, but you can experiment with different ones if you would like.

  3. The model is trained using a loop that goes through the training data multiple times (epochs). In each epoch, we pass the input data through the model, calculate the loss, and update the model weights. Training takes time for these types of models. Unlike a KNN model that "remembers" everything at once, this model needs to see the data multiple times to learn from it, similar to how we learn. The more epochs you run, the better the model will get (up to a point, and the improvement won’t necessarily be linear), but it will take longer to train. You can experiment with different numbers of epochs to see how it affects the model. A short, generic sketch of this training pattern is shown right after this list.
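
Before we look at the full model, here is a minimal, generic sketch of the criterion/optimizer/epoch pattern described above, using a toy linear model on random data (this is just an illustration, not part of the project code):

import torch
from torch import nn, optim

toy_model = nn.Linear(10, 1)                    # a tiny stand-in model
criterion = nn.MSELoss()                        # the loss function
optimizer = optim.Adam(toy_model.parameters(), lr=0.001)

x = torch.randn(32, 10)                         # random input data
y = torch.randn(32, 1)                          # random target data

for epoch in range(5):                          # each epoch sees the data once
    optimizer.zero_grad()                       # clear the old gradients
    prediction = toy_model(x)                   # forward pass
    loss = criterion(prediction, y)             # how far off were we?
    loss.backward()                             # backpropagate the loss
    optimizer.step()                            # update the model weights
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')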

Now that you have some basic understanding of how this RNN model in pytorch works, let’s go ahead and create our model. The model will be a simple RNN model that takes in a sequence of word embeddings and outputs the next word embedding. The model will have an input layer, a hidden layer, and an output layer. The input layer will take in the sequence of word embeddings, the hidden layer will be an RNN layer, and the output layer will be a linear layer that outputs the next word embedding. You can use the following code to get started:

from torch import nn
from torch import optim
import torch
import time
class RNNModel(nn.Module):
    def __init__(self, input_size=100, hidden_size=128, output_size=100, input_sequence_length = 2): # our word embeddings are 100 dimensions, and we will use a hidden size of 128.
        super().__init__() # initialize the base class
        self.input_sequence_length = input_sequence_length # this is our input_sequence_length. We won't use it in the model, but you will use it to write your predict function in the next question.
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True) # create the RNN layer, using the input size and a hidden size. The batch_first = True parameter means that the input data will be in the format (batch_size, sequence_length, input_size). This is important because we will be using batches of data to train the model. Batches are used to essentially group data together to speed up training, but in this case we can use it to group our sequences together with a batch size of 1.
        self.fc = nn.Linear(hidden_size, output_size) # add a linear layer to the model, which will take the hidden layer and output the next word embedding. The output size is the same as the input size, because we want to output a word embedding that is the same size as the input word embedding.

    def forward(self, x, hidden): # this will take in the current input data and the hidden state, and return the output data and the updated hidden state.
        out, hidden = self.rnn(x, hidden) # get the output of the RNN layer, and update the hidden state
        out = self.fc(out[:, -1, :]) # pass the output of the RNN layer through the linear layer. The out[:, -1, :] means that we are only taking the last output of the RNN layer. Although this is only the last word from the sequence, since it has been passed through the RNN layer along with the hidden state the entire sequence should be encoded here.
        return out, hidden

    def train(self, input_data, output_data, num_epochs=5, learning_rate=0.001): # this function will train the model using our input and output data from question 2. Additionally, we can specify how many epochs to run and the learning rate of the model.

        criterion = nn.MSELoss() # use Mean Squared Error as the criterion
        optimizer = optim.Adam(self.parameters(), lr=learning_rate) # use Adaptive Moment Estimation (Adam) as the optimizer.

        # for each epoch
        for epoch in range(num_epochs):
            # set the loss of the epoch to 0
            total_loss = 0

            # for each review (pair of input and output data arrays for the review)
            s = time.time()

            for i, (r_i, r_o) in enumerate(zip(input_data, output_data)): # iterate through each review we have

                if i % 1000 == 0: # occasionally print the progress of the training, along with the elapsed time
                    print(f'Processing review {i+1}/{len(input_data)}')
                    print(f'Elapsed time: {time.time() - s} seconds')
                    s = time.time()

                # set the `hidden` variable to None at the start of each review, as we want to reset the hidden state for each review
                '''YOUR CODE HERE'''

                # for each sequence and output pair in the review
                for seq, o in zip(r_i, r_o):

                    # convert the sequence and output to tensors to be passed to the model
                    input_tensor = torch.tensor(seq, dtype=torch.float32).unsqueeze(1) # dimensions [2, 1, 100]
                    output_tensor = torch.tensor(o[0], dtype=torch.float32) # dimensions [100]

                    # set the optimizer's gradients to zero using its `zero_grad` function
                    '''YOUR CODE HERE'''

                    # perform a forward pass of the model using the input tensor and hidden state, and get the output and updated hidden state
                    output, hidden = # your code here

                    output = output[-1, 0] # get the last output of the RNN layer, which is the output we want to use for the next word embedding

                    if hidden is not None:
                        hidden = hidden.detach() # detach the hidden state from the computation graph so that gradients don't propagate back through earlier steps

                    # calculate the loss using the criterion
                    loss = criterion(output, output_tensor)

                    # backpropagate the loss, using the `backward` function of the loss
                    '''YOUR CODE HERE'''

                    # step the optimizer, using the `step` function of the optimizer
                    '''YOUR CODE HERE'''

                    # add loss.item() to the total loss for the epoch
                    '''YOUR CODE HERE'''


            # print the loss for the epoch
            print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(input_data)}')

        return total_loss/len(input_data)

Training this model is not a fast process. Please give yourself ample time to train the model in case you encounter any challenges. The model will improve if you train it for more epochs and give it more reviews to train on (you likely want hundreds of thousands of reviews and hundreds of epochs). However, that would take an extremely long time to train. Testing on my local computer, it took ~3 hours to train for 5 epochs with 50,000 reviews. This is clearly not practical for the scope of the course, but we would love to see you try it out if you have the motivation and time. It is not required, though.

Please run the following code to test your model. Feel free to change the number of epochs or learning rate and see how it affects the model. You can also adjust the number of reviews in the previous step, but be careful as this will change how long each epoch takes to run. Additionally, you can try increasing how much data our word embeddings are generated with to see if that changes anything.

# create the model
hidden_size = 128 # size of the hidden layer
output_size = 100 # size of the output layer
input_size = 100
# input size is the size of each input data, which is n word embeddings
model = RNNModel(input_size, hidden_size, output_size)

model.train(input_dataset, output_data, num_epochs=5, learning_rate=0.001) # train the model on the input and output data

Below is an example of what your output may look like. Time and loss will vary based on your computer and the number of epochs you run.

Processing review 1/1360
Elapsed time: 0.0002453327178955078 seconds
Processing review 1001/1360
Elapsed time: 43.138824224472046 seconds
Epoch 1/5, Loss: 138.86604140063682
Processing review 1/1360
Elapsed time: 1.6927719116210938e-05 seconds
Processing review 1001/1360
Elapsed time: 44.072826862335205 seconds
Epoch 2/5, Loss: 137.2912354198065
Processing review 1/1360
Elapsed time: 8.58306884765625e-06 seconds
Processing review 1001/1360
Elapsed time: 44.183839559555054 seconds
Epoch 3/5, Loss: 136.87381747121415
Processing review 1/1360
Elapsed time: 8.58306884765625e-06 seconds
Processing review 1001/1360
Elapsed time: 44.282142162323 seconds
Epoch 4/5, Loss: 136.59735957769556
Processing review 1/1360
Elapsed time: 1.0013580322265625e-05 seconds
Processing review 1001/1360
Elapsed time: 44.27771878242493 seconds
Epoch 5/5, Loss: 135.70652526225334
135.70652526225334
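
Because training takes a while, you may want to save the trained weights once training finishes so you do not have to retrain after a kernel restart (a minimal sketch; the filename is just an example):

# save the trained weights to disk
torch.save(model.state_dict(), 'P12_rnn.pt')

# later, recreate the model and load the saved weights back in:
# model = RNNModel(input_size, hidden_size, output_size)
# model.load_state_dict(torch.load('P12_rnn.pt'))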
Deliverables
  • Implement the RNNModel class

  • Implement the train function

Question 5 (2 points)

Now that we have our model trained, we can use it to generate text. To do this, we will create a function that takes in a starting string, the trained model, and the keyedvectors object, and predicts the next word in the sequence (along with its similarity score). By calling this function in a loop and appending each predicted word to the string, we can generate a sequence of as many words as we like, just as we did with the n-gram model in Question 1. You can use the following code to get started:

def predict_from_string(input_string, model, keyedvectors):
    # clean the input string using the clean_text function
    '''YOUR CODE HERE'''

    # split the input string into words using .split()
    '''YOUR CODE HERE'''

    # create a list of word embeddings for the input words using the keyedvectors object
    input_embeddings = []
    '''YOUR CODE HERE'''

    # group the embeddings into groups of model.input_sequence_length consecutive words. n-grams
    embedding_groups = []
    '''YOUR CODE HERE'''

    hidden = None
    output = None

    # we need to start passing the entire sequence of word embeddings to the model to update the hidden state
    # once we have the hidden state for the entire sequence, then we can get the next word
    for seq in embedding_groups:

        input_tensor = torch.tensor(seq, dtype=torch.float32).unsqueeze(1)
        output, hidden = model(input_tensor, hidden)

        if hidden is not None:
            hidden = hidden.detach()

    # get the predicted word embedding
    predicted_embedding = output.detach().numpy()[1]

    # get the word with the highest cosine similarity to the predicted embedding, using your `get_word_from_embedding` function
    '''YOUR CODE HERE'''

    return predicted_word, similarity

To test your function, you can use the following code:

start_string = 'this is a wonderful cd and'
for i in range(7):
    newword = predict_from_string(start_string, model, keyedvectors) # predict the next word using the function you implemented above
    print(newword)
    start_string += " " + newword[0]
print(start_string)

You should notice that this model is terrible. In all fairness, we did not train our word embeddings on much data, nor did we train our model for very long. However, this is a good start, and it shows that the model is capable of generating some form of output from our input. Additionally, we can look at the similarity score printed in each iteration of the loop to see whether the model is maintaining context of the entire sequence, or only the last n words like our n-gram model does.

Deliverables
  • Implement the predict_from_string function

  • Run the test cases

  • Did the model work well?

  • Did the model maintain context longer than a few words? (hint: look at the similarity score of each predicted word, is it changing?)

Question 6 (2 points)

An RNN isn’t the only type of model that can be used to generate text. In fact, there are many, such as LSTMs, GRUs, and Transformers. To keep things similar, let’s take a look at a Long Short-Term Memory (LSTM) model, which is based on an RNN. LSTMs are a type of RNN designed to remember information for long periods of time. They do this by using a special type of cell called a memory cell, which can store information across many time steps, allowing the model to maintain context over longer sequences of words and typically generate more coherent text. LSTMs accomplish this with 'gates' that control the flow of information through the model by deciding what to keep and what to forget. This lets the model learn which information is important and helps it maintain context over longer sequences of words.

To make an LSTM, we can copy our entire RNN class and change only a few things:

  1. Change the name of the class from RNNModel to LSTMModel

  2. Change references of self.rnn to self.lstm, and nn.RNN to nn.LSTM

  3. The hidden state is now a tuple of two tensors (the hidden state and the cell state), so when we detach the hidden state we need to detach both. This is done by changing hidden = hidden.detach() to hidden = (hidden[0].detach(), hidden[1].detach()). You will need to make this change both in the model’s train function and in your predict_from_string function. A short standalone sketch illustrating this tuple follows this list.
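
To see why the detach line changes, here is a short standalone sketch (illustration only) showing that nn.LSTM returns its recurrent state as a tuple of two tensors, the hidden state and the cell state, whereas nn.RNN returns a single tensor:

import torch
from torch import nn

x = torch.randn(2, 1, 100)                 # (batch, sequence length, features), matching batch_first=True

rnn = nn.RNN(100, 128, batch_first=True)
out, hidden = rnn(x)                       # hidden is a single tensor
print(type(hidden))                        # <class 'torch.Tensor'>

lstm = nn.LSTM(100, 128, batch_first=True)
out, hidden = lstm(x)                      # hidden is a tuple: (hidden state, cell state)
print(type(hidden), hidden[0].shape, hidden[1].shape)
# <class 'tuple'> torch.Size([1, 2, 128]) torch.Size([1, 2, 128])

# so detaching the state becomes:
# hidden = (hidden[0].detach(), hidden[1].detach())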

Once you have made these changes, please run the following code to test your model:

# create the model
hidden_size = 128 # size of the hidden layer
output_size = 100 # size of the output layer
input_size = 100
# input size is the size of each input data, which is n word embeddings
modellstm = LSTMModel(input_size, hidden_size, output_size)


modellstm.train(input_dataset, output_data, num_epochs=5, learning_rate=0.001) # train the model on the input and output data


start_string = 'this is a wonderful cd and'
for i in range(15):
    newword = predict_from_string(start_string, modellstm, keyedvectors)
    print(newword)
    start_string += " " + newword[0]
print(start_string)
Deliverables
  • Implement the LSTMModel class

  • Test the model using the code provided

  • Did the model work well?

  • Did the model maintain context longer than a few words? (hint: look at the similarity score of each predicted word, is it changing?)

  • How does the LSTM model compare to the RNN model? Does it generate more coherent text?

  • How does the LSTM model compare to the RNN model in terms of training time? Does it take longer to train?

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project12.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.