TDM 40200: Project 10 - Large Language Models (LLMs): N-gram model
Project Objectives
In this project, we aim to understand the concept of n-grams, their uses, and creating a simple n-gram model. We will also explore the use of n-grams in natural language processing (NLP) tasks such as text generation and language modeling.
Questions
Question 1 (2 points)
Last project, you looked into basic text processing and some basic text statistics like word count, letter frequency, etc. In this project, we will look into n-grams, which are a more complex way of statistically analyzing text. N-grams are sequences of n items (words, letters, etc.) from a given text. For example, in the word "hello!", we could analyze the 2-grams (bigrams) of "he", "el", "ll", "lo", and "o!". We could similarly analyze 3-grams, 4-grams, etc. Alternatively, we could analyze the sentence "Large Language Models are great!" and analyze n-grams on a per-word basis. In this case, we have the following 2-grams: "Large Language", "Language Models", "Models are", "are great!".
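To make the idea concrete, here is a small, self-contained sketch (the helper simple_ngrams is purely illustrative and not part of the project's API) that slices n-grams out of either a string or a list of words:
# Illustrative only: slice out n-grams from a sequence.
# elements can be a string (per-character) or a list of words (per-word).
def simple_ngrams(elements, n):
    return [elements[i:i + n] for i in range(len(elements) - n + 1)]

print(simple_ngrams("hello!", 2))
# ['he', 'el', 'll', 'lo', 'o!']
print(simple_ngrams("Large Language Models are great!".split(), 2))
# [['Large', 'Language'], ['Language', 'Models'], ['Models', 'are'], ['are', 'great!']]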
To start, let’s create a generic n-gram class to store any data and functions we need. The code below serves as a good starting point. Throughout this project, we will fill in each function.
class NGram:
    def __init__(self, n, is_character_based=False):
        self.n = n
        self.data = None
        self.is_character_based = is_character_based
        self.ngrams = set()
        self.ngram_frequencies = {}
        self.ngram_probabilities = {}

    def set_data(self, data):
        pass

    def generate_ngrams(self):
        pass

    def generate_ngram_probabilities(self):
        pass

    def get_next_word(self, previous_words, method='random'):
        pass

    def get_perplexity(self, text):
        pass
Firstly, let’s figure out how to add some data to our NGram class. The set_data function should take in a list of elements that we can generate n-grams from. These elements could either be strings, if we want an n-gram on a per-character basis, or lists of strings, for making an n-gram on a per-word basis. At the very least, we should ensure that the data passed in is a list. If not, we should not do anything. For our data, we will continue to use the Amazon music dataset. Please copy your read_lines and clean_text functions from last project. Additionally, you will need to make a couple of modifications to your clean_text function. Firstly, please lowercase all letters. There are other techniques to ensure that the model understands that words with different capitalization are the same, but we will look into those in future projects. Then, add the following line of code to your clean_text function to ensure that only alphabetic characters (and spaces) are kept in the text. This will help us avoid any issues with punctuation and special characters.
text = ''.join(c for c in text if c.isalpha() or c.isspace())
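For reference, below is a minimal sketch of what a modified clean_text might look like. It assumes your version from last project takes a single string and returns a cleaned string; only the two changes described above (lowercasing and the alphabetic-only filter) are required, and your existing function may do more.
# Minimal sketch of a modified clean_text (your prior version may differ).
def clean_text(text):
    text = text.lower()  # new: lowercase everything
    # new: keep only alphabetic characters and spaces
    text = ''.join(c for c in text if c.isalpha() or c.isspace())
    return text.strip()

print(clean_text('"Hello, world!"'))  # hello world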
Now that we have our data helper functions, let’s implement the set_data function. This function should take in a list of strings and use the stored flag that indicates whether we want to generate n-grams on a per-character basis or a per-word basis. If the flag is set to True, we should generate n-grams on a per-character basis. If the flag is set to False, we should generate n-grams on a per-word basis. In total, your function should take this list of strings, clean it using the clean_text function, ensure that it is in the correct format for per-word or per-character n-grams, and then store it in the self.data variable.
In the case of generating on a per-word basis, simply split each string into a list of words using the split() method, which splits on whitespace.
To test that your set_data function is working, you can use the following code:
my_ngram = NGram(2, is_character_based=True)
my_ngram.set_data(["\"hello!\""])
print(my_ngram.data) # should return ['hello']
my_ngram = NGram(2, is_character_based=False)
my_ngram.set_data(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 2))
print(my_ngram.data) # should return a list of tuples of strings like below:
'''
[
('i', 'love', 'this', 'cd', 'so', 'inspiring'),
('love', 'it', 'great', 'seller')
]
'''
- Modified clean_text function
- Implemented set_data function
- Passes tests for set_data function
Question 2 (2 points)
Now that we can load data into our NGram class, we can begin generating n-grams and looking at the frequencies of each one. The generate_ngrams function should use the stored self.n and self.data to create each n-gram, along with counting how many times that n-gram appears in the data. This should be stored in a dictionary, where the key is the n-gram and the value is the frequency of that n-gram. For example, if we are analyzing 'hello yellow' on a per-character basis with n=2, we would have the following n-grams and frequencies:
{
'he': 1,
'el': 2,
'll': 2,
'lo': 2,
'o ': 1,
' y': 1,
'ye': 1,
'ow': 1
}
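As a quick sanity check on the counts above, here is a short illustrative snippet (not the generate_ngrams implementation itself) that tallies character bigrams with collections.Counter:
from collections import Counter

# Illustrative check of the 'hello yellow' bigram counts above; your
# generate_ngrams implementation will build the same kind of dictionary
# from self.data instead of a hard-coded string.
text = 'hello yellow'
bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
print(dict(Counter(bigrams)))
# {'he': 1, 'el': 2, 'll': 2, 'lo': 2, 'o ': 1, ' y': 1, 'ye': 1, 'ow': 1}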
The resulting dictionary should be stored in the self.ngram_frequencies variable AND returned by the function. If self.data is empty or is not a list, the function should return an empty dictionary. Please implement this functionality, then use the following code to test your generate_ngrams function:
my_ngram = NGram(2, is_character_based=True)
my_ngram.set_data(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 1, 14))
print(my_ngram.generate_ngrams())
# Should return the following dictionary:
'''
{'ha': 3, 'ad': 1, 'd ': 3, ' t': 3, 'th': 4, 'hi': 2, 'is': 2, 's ': 5, ' a': 4, 'as': 1, 'an': 1, 'n ': 2, 'al': 2, 'lb': 1, 'bu': 1, 'um': 1, 'm ': 1, ' b': 1, 'ba': 1, 'ac': 1, 'ck': 1, 'k ': 1, ' i': 1, 'in': 1, 'he': 1, 'e ': 4, ' d': 1, 'da': 1, 'ay': 2, 'y ': 2, ' ': 1, ' h': 2, 'av': 2, 've': 3, 'lw': 1, 'wa': 1, 'ys': 1, ' e': 2, 'en': 2, 'nj': 1, 'jo': 1, 'oy': 1, 'ye': 1, 'ed': 1, ' k': 1, 'ki': 1, 'ie': 1, 'et': 1, 'h ': 1, ' g': 1, 'gr': 1, 're': 1, 'ee': 1, 'ns': 1, ' m': 1, 'mu': 1, 'us': 1, 'si': 1, 'ic': 1, 'c ': 1, 'ev': 1, 'er': 1, 'ry': 1, ' o': 1, 'on': 1, 'ne': 1, ' s': 1, 'sh': 1, 'ho': 1, 'ou': 1, 'ul': 1, 'ld': 1, ' c': 1, 'cd': 1}
'''
my_ngram = NGram(2, is_character_based=False)
my_ngram.set_data(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 1, 14))
print(my_ngram.generate_ngrams())
- Implemented generate_ngrams function
- Passes tests for generate_ngrams function
Question 3 (2 points)
Now that we have our n-grams and their frequencies, let’s look at the probabilities of each n-gram. Our probability table will essentially be a nested dictionary, where the first key is the first n-1 part of the n-gram and its value is another dictionary. In this second dictionary, each key is the last part of the n-gram and the value is the conditional probability of that last part following the first n-1. For example, if we are analyzing 'hey help' on a per-character basis with n=3, we would have the following n-grams and probabilities:
{
'he': {'y': 0.5, 'l': 0.5},
'ey': {' ': 1.0},
'y ': {'h': 1.0},
' h': {'e': 1.0},
'el': {'p': 1.0},
}
In the above example, half of the time after 'he' we see 'y' and half of the time we see 'l'. This is a very simple example, but it shows how we can use n-grams to predict the next word in a sequence. The generate_ngram_probabilities function should generate the n-gram frequencies by calling the generate_ngrams function and use the returned value to calculate the probabilities of each n-gram. This should be stored in the self.ngram_probabilities variable AND returned by the function. If self.ngram_frequencies is empty or is not a dictionary, the function should return an empty dictionary.
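To see where the 0.5 and 1.0 values above come from, here is a small illustrative calculation (not the required implementation) that normalizes trigram counts for 'hey help' into conditional probabilities:
from collections import defaultdict

# Illustrative: turn trigram counts for 'hey help' into the nested
# probability table shown above.
text = 'hey help'
counts = defaultdict(lambda: defaultdict(int))
for i in range(len(text) - 2):
    prefix, last = text[i:i + 2], text[i + 2]
    counts[prefix][last] += 1

probabilities = {
    prefix: {last: c / sum(followers.values()) for last, c in followers.items()}
    for prefix, followers in counts.items()
}
print(probabilities)
# {'he': {'y': 0.5, 'l': 0.5}, 'ey': {' ': 1.0}, 'y ': {'h': 1.0}, ' h': {'e': 1.0}, 'el': {'p': 1.0}}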
The following code should be used to test your generate_ngram_probabilities function:
my_ngram = NGram(3, is_character_based=False)
my_ngram.set_data(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 5, 3))
my_ngram.generate_ngram_probabilities()
print(my_ngram.ngram_probabilities)
# Your output should be decently long. You should be able to find near the center the following:
# ('one', 'good'): {'album': 1.0}, ('good', 'album'): {'and': 0.5, 'because': 0.5}, ('album', 'and'): {'all': 1.0}
This example shows that from the 5 given reviews, after 'one good' we see 'album' 100% of the time, after 'album and' we see 'all' 100% of the time, and after 'good album' we see 'and' 50% of the time and 'because' 50% of the time. Although these numbers are very clean and there are a lot of 100% probabilities, that is simply because we do not have a lot of data yet. Recall from project 1 just how much data modern LLMs are trained on. The more data we have, the less clean these probabilities become; if you want to see a more realistic example, you can try reading in more reviews from the dataset.
- Implemented generate_ngram_probabilities function
- Passes tests for generate_ngram_probabilities function
Question 4 (2 points)
Now that we have our n-gram probabilities, we can finally try to use our n-gram model to generate some text. The get_next_word function should take in a string of previous words/letters and 'predict' the next word/letter in the sequence. This function can be used in 'random', 'common', or 'uncommon' mode. In random mode, the function should use the probabilities in conjunction with np.random.choice to randomly select the next word/letter. In common mode, the function will select the most common next word/letter. In uncommon mode, the function will select the least common next word/letter. The function should return the next word/letter as a string. If self.ngram_probabilities is empty or is not a dictionary, the function should return an empty string.
Please assume that the input string will always contain exactly n-1 words/letters, matching the prefix length of the n-gram model. For example, if we are using a 3-gram model, the input string should be 2 words long. If we are using a 4-gram model, the input string should be 3 words long. This is important because it will help us avoid any issues with the n-gram model not being able to find the correct n-gram in the dictionary. If the input string is not the correct length, the function should return an empty string. Note that the input is a single string rather than a list of strings, so for per-word n-grams you will need to split the input string by whitespace.
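As a hint (one possible approach, not a required implementation), turning the incoming string into the kind of key used in self.ngram_probabilities could look like the sketch below; make_key is a hypothetical helper name:
# Illustrative helper (hypothetical name): build a lookup key that matches
# the keys in self.ngram_probabilities. Per-word keys are tuples of words
# (as in Question 3's output); per-character keys are short strings.
def make_key(previous, is_character_based):
    if is_character_based:
        return previous               # e.g. 'he' stays 'he'
    return tuple(previous.split())    # e.g. 'is a' becomes ('is', 'a')

print(make_key('is a', False))  # ('is', 'a')
print(make_key('he', True))     # 'he'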
Something that may be useful is an example of using np.random.choice with probabilities. Typically, you would provide just a list of items to choose from, and the random choice would assume a uniform probability distribution. However, if you want to provide a custom probability distribution, you can use the p parameter, which is a list of probabilities for each item in the list (these probabilities must sum to 1). For example, if we have a list of items ['a', 'b', 'c'] and we want to select one of them with the following probabilities: {'a': 0.1, 'b': 0.3, 'c': 0.6}, we can use the following code:
import numpy as np
np.random.choice(
    ['a', 'b', 'c'],
    p=[0.1, 0.3, 0.6]
)
A keen eye may notice that the first parameter is simply our dictionary's keys and the p parameter is our dictionary's values (each converted to a list).
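Putting that together, drawing one item from an inner probability dictionary could look like the short sketch below (an illustration, not the full get_next_word):
import numpy as np

# Illustrative: sample a next item from one inner probability dictionary,
# such as the one stored under a single prefix in self.ngram_probabilities.
choices = {'a': 0.1, 'b': 0.3, 'c': 0.6}
next_item = np.random.choice(list(choices.keys()), p=list(choices.values()))
print(next_item)  # 'c' most of the time, 'b' sometimes, 'a' rarely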
Please implement the get_next_word function, then run the following test cases to ensure that it is working:
import numpy as np
my_ngram = NGram(3, is_character_based=False)
my_ngram.set_data(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 300, 50))
my_ngram.generate_ngram_probabilities()
np.random.seed(18)
print(my_ngram.get_next_word('is a', method='random')) # random word from the n-gram probabilities: truly
print(my_ngram.get_next_word('is a', method='random')) # random word from the n-gram probabilities: very
print(my_ngram.get_next_word('is a', method='random')) # random word from the n-gram probabilities: show
print(my_ngram.get_next_word('is a', method='common')) # most common word from the n-gram probabilities: great
print(my_ngram.get_next_word('is a', method='uncommon')) # least common word from the n-gram probabilities: masterpiece
- Implemented get_next_word function
- Passes tests for get_next_word function
Question 5 (2 points)
One metric that is commonly used to evaluate the performance of n-gram models is the concept of perplexity. Perplexity is a measure of how well a probability distribution predicts a sample. In the context of n-gram models, perplexity is a measure of how well the model predicts the next word in a sequence. The lower the perplexity, the better the model is at predicting the next word.
We can calculate the perplexity of our n-gram model using the following formula:
perplexity = 2^(-(1/N) * sum(ln(P(w_i | w_(i-n+1), ..., w_(i-1)))))
where:
- N is the number of words in the sequence
- P(w_i | w_(i-n+1), ..., w_(i-1)) is the probability of the i-th word given the previous n-1 words
- ln is the natural logarithm
Essentially, we find perplexity by giving the model some string of text. For each n-gram within the text, we calculate the probability of our model generating the n-th word based on the previous n-1 words. We then take the natural log of each of those probabilities, sum all of those values, divide the sum by the number of words in the text, multiply it by negative 1, and finally raise 2 to that power. This gives us a single number that represents how well our model is able to predict the next word in the sequence. These values range from 1 to infinity, with lower values indicating a better model. A value of 1 would indicate that the model is perfect and is able to predict the next word with 100% accuracy. A value approaching infinity would indicate that the model is unable to predict the next word at all.
If your model cannot find the probability of generating the n-th word based on the previous n-1 words, please use a probability of 0.00001. This will heavily penalize the model.
For example, let’s say we have a 3-gram model and input the following string:
'apples and bananas taste good'
We want to find the probability that our 3-gram model will predict bananas given the previous 2 words apples and, taste given the previous 2 words and bananas, and good given the previous 2 words bananas taste. Each of these probabilities is calculated using the probability table we generated in Question 3; then their natural log is taken, then they are multiplied by -1 and divided by the total number of words in our input string (5 words). Finally, we sum all of these values and raise 2 to the power of that sum. This gives us a single number that represents how well our model is able to predict the next word in the sequence.
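To make the arithmetic concrete, here is a tiny numeric sketch with made-up probabilities (the 0.4, 0.5, and 0.00001 values are purely hypothetical; the last one stands in for an n-gram the model has never seen):
import numpy as np

# Hypothetical probabilities for the three 3-grams in
# 'apples and bananas taste good' (N = 5 words total).
probs = [0.4, 0.5, 0.00001]   # last n-gram was never seen, so use the fallback value
N = 5
perplexity = 2 ** (-(1 / N) * sum(np.log(p) for p in probs))
print(perplexity)  # roughly 6.2; the unseen n-gram inflates the score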
To summarize what these values actually mean, a theoretical perfect model would have a perplexity of 1, where it has no doubts about what the next word in a sequence is. A higher perplexity indicates that the model is less certain about what the next word in a sequence is. This is not necessarily a bad thing, as it shows that the model is able to generate a wider variety of text.
Please implement the get_perplexity function, then run the following test cases to ensure that it is working as expected:
import numpy as np
my_ngram = NGram(3, is_character_based=False)
my_ngram.set_data(read_lines('/anvil/projects/tdm/data/amazon/music.txt', 50000))
my_ngram.generate_ngram_probabilities()
print(my_ngram.get_perplexity('is a great cd')) # 3.513998171326
print(my_ngram.get_perplexity('is a good cd')) # 5.461575334059693
print(my_ngram.get_perplexity('is a bad cd')) # 11.67642266413374
print(my_ngram.get_perplexity('this music is a wonderful experience and i love it')) # 27.903381350329948
- Implemented get_perplexity function
- Passes tests for get_perplexity function
Question 6 (2 points)
N-grams have many uses besides just text generation. Another popular use of n-grams is text classification. We could classify text based on its language (English, French, Spanish, etc.), its sentiment (positive, negative, neutral, etc.), or even its topic (politics, STEM, sports, etc.). In these next two projects, we will be adapting our n-gram model to classify text based on its language. Let’s go ahead and create a new class called LanguageClassifier that will use our NGram class to predict the language of a piece of text. The LanguageClassifier class should have the following functions:
- add_language_data: This function should take in a list of strings and a language label. It should then create an n-gram model for that language with the input data and store it in a dictionary, where the keys are the language labels and the values are the NGram objects.
- classify: This function should take in a string of text and return the language label that the text is most likely to be. This should be done by evaluating the input text against each of the n-gram models in the dictionary. The language whose model gives the lowest perplexity should be returned as the predicted language.
This should be done on a per-character basis. If you did this on a per-word basis, you could assume that it would almost always work, as the words would be completely different between languages, so the perplexity scores for the wrong language would be extremely high. However, we want to do this on a per-character basis, where each language shares the same characters, just to prove conceptually that this is possible.
Below is an outline of the LanguageClassifier class to get you started:
class LanguageClassifier():
    def __init__(self, n):
        self.n = n
        self.languages = dict()

    def add_language_data(self, data, language):
        pass

    def classify(self, text):
        pass
For this question, please implement the add_language_data function, as described above. You can test your implementation using the following code:
my_language_classifier = LanguageClassifier(2) # use 2-grams
my_language_classifier.add_language_data(
["Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. Mr. Dursley made drills. He was a big, beefy man with hardly any neck, although he did have a very large moustache. Mrs. Dursley was thin and blonde and had twice the usual amount of neck, which came in very useful as she spent so much of her time spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. Mrs Dursley had a sister called Lily Potter. She and her husband James Potter had a son called Harry Potter. They lived far from the Dursleys and did not speak to them much. They did not get along. One day, a man appeared outside of the Dursleys house. He was tall, thin, and very old, judging by the silver of his hair and beard, which were both long enough to tuck into his belt. He was wearing long robes, a purple cloak that swept the ground, and highheeled, buckled boots. His blue eyes were light, bright, and sparkling behind half-moon spectacles and his nose was very long and crooked, as though it had been broken at least twice. This man's name was Albus Dumbledore."], 'English')
my_language_classifier.add_language_data(
["Poco después de la guerra civil, un brote de cólera se había llevado a mi madre. La enterramos en Montju'ic el día de mi cuarto cumpleaños. Sólo recuerdo que llovió todo el día y toda la noche, y que cuando le pregunté a mi padre si el cielo lloraba le faltó la voz para responderme. Seis años después, la ausencia de mi madre era para mí todavía un espejismo, un silencio a gritos que aún no había aprendido a acallar con palabras. Mi padre y yo vivíamos en un pequeño piso de la calle Santa Ana, junto a la plaza de la iglesia. El piso estaba situado justo encima de la librería especializada en ediciones de coleccionista y libros usados heredada de mi abuelo, un bazar encantado que mi padre confiaba en que algún día pasaría a mis manos. Me crié entre libros, haciendo amigos invisibles en páginas que se deshacían en polvo y cuyo olor aún conservo en las manos. De niño aprendí a conciliar el sueño mientras le explicaba a mi madre en la penumbra de mi habitación las incidencias de la jornada, mis andanzas en el colegio, lo que había aprendido aquel día... No podía oír su voz o sentir su tacto, pero su luz y su calor ardían en cada rincón de aquella casa y yo, con la fe de los que todavía pueden contar sus años con los dedos de las manos, creía que si."], 'Spanish'
)
my_language_classifier.add_language_data(
["Quoique ce détail ne touche en aucune manière au fond même de ce que nous avons à raconter, il n'est peut-être pas inutile, ne fût-ce que pour être exact en tout, d'indiquer ici les bruits et les propos qui avaient couru sur son compte au moment où il était arrivé dans le diocèse. Vrai ou faux, ce qu'on dit des hommes tient souvent autant de place dans leur vie et surtout dans leur destinée que ce qu'ils font. M. Myriel était fils d'un conseiller au parlement d'Aix; noblesse de robe. On contait de lui que son père, le réservant pour hériter de sa charge, l'avait marié de fort bonne heure, à dix-huit ou vingt ans, suivant un usage assez répandu dans les familles parlementaires. Charles Myriel, nonobstant ce mariage, avait, disait-on, beaucoup fait parler de lui. Il était bien fait de sa personne, quoique d'assez petite taille, élégant, gracieux, spirituel; toute la première partie de sa vie avait été donnée au monde et aux galanteries. La révolution survint, les événements se précipitèrent, les familles parlementaires décimées, chassées, traquées, se dispersèrent. M. Charles Myriel, dès les premiers jours de la révolution, émigra en Italie. Sa femme y mourut d'une maladie de poitrine dont elle était atteinte depuis longtemps. Ils n'avaient point d'enfants. Que se passa-t-il ensuite dans la destinée de M. Myriel? L'écroulement de l'ancienne société française, la chute de sa propre famille, les tragiques spectacles de 93, plus effrayants encore peut-être pour les émigrés qui les voyaient de loin avec le grossissement de l'épouvante, firent-ils germer en lui des idées de renoncement et de solitude? Fut-il, au milieu d'une de ces distractions et de ces affections qui occupaient sa vie, subitement atteint d'un de ces coups mystérieux et terribles qui viennent quelquefois renverser, en le frappant au cœur, l'homme que les catastrophes publiques n'ébranleraient pas en le frappant dans son existence et dans sa fortune? Nul n'aurait pu le dire; tout ce qu'on savait, c'est que, lorsqu'il revint d'Italie, il était prêtre."], 'French'
)
print(my_language_classifier.languages)
Question 7 (2 points)
Now that we are able to add language data to our LanguageClassifier class, we can implement the classify function. This function should take in a string of text and return the language label that the text is most likely to be. There are many ways to do this, but the simplest is to compute the perplexity of the input text under each of the n-gram models in the dictionary. The language with the lowest perplexity should be returned as the predicted language, as its model was able to predict the input text most easily. Another way you could do this would be to create a new n-gram model from the input text and compare its probabilities against each language's n-gram model with an algorithm like cosine similarity or Euclidean distance. However, this is a bit more complex and not necessary for this project.
Cosine similarity is a measure of similarity between two vectors in an inner product space. We can take almost anything, "vectorize" it in n-dimensional space, and then apply cosine similarity to find how similar two vectors are. A value of 1 means they are similar, 0 means they are orthogonal (not similar), and -1 means they are opposite. This has very powerful applications in NLP, as we can vectorize words and phrases and compare their meanings to each other. For example, a model could use cosine similarity to compare "cat" and "kitten", resulting in a high similarity score; "cat" and "car", resulting in a low similarity score; or perhaps "rich" and "poor", resulting in a negative similarity score. This is a very powerful tool in NLP and is used in many modern LLMs. We will discuss this in more detail in future projects.
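As a quick illustration of the idea (not something you need for this project), cosine similarity between two small, made-up count vectors can be computed directly with numpy:
import numpy as np

# Illustrative: cosine similarity between two hypothetical count vectors.
def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0 (same direction)
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0 (orthogonal)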
Please implement the classify function, then run the following test cases to ensure that it is working as expected:
my_language_classifier = LanguageClassifier(2) # use 2-grams
my_language_classifier.add_language_data(
["Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. Mr. Dursley made drills. He was a big, beefy man with hardly any neck, although he did have a very large moustache. Mrs. Dursley was thin and blonde and had twice the usual amount of neck, which came in very useful as she spent so much of her time spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. Mrs Dursley had a sister called Lily Potter. She and her husband James Potter had a son called Harry Potter. They lived far from the Dursleys and did not speak to them much. They did not get along. One day, a man appeared outside of the Dursleys house. He was tall, thin, and very old, judging by the silver of his hair and beard, which were both long enough to tuck into his belt. He was wearing long robes, a purple cloak that swept the ground, and highheeled, buckled boots. His blue eyes were light, bright, and sparkling behind half-moon spectacles and his nose was very long and crooked, as though it had been broken at least twice. This man's name was Albus Dumbledore."], 'English')
my_language_classifier.add_language_data(
["Poco después de la guerra civil, un brote de cólera se había llevado a mi madre. La enterramos en Montju'ic el día de mi cuarto cumpleaños. Sólo recuerdo que llovió todo el día y toda la noche, y que cuando le pregunté a mi padre si el cielo lloraba le faltó la voz para responderme. Seis años después, la ausencia de mi madre era para mí todavía un espejismo, un silencio a gritos que aún no había aprendido a acallar con palabras. Mi padre y yo vivíamos en un pequeño piso de la calle Santa Ana, junto a la plaza de la iglesia. El piso estaba situado justo encima de la librería especializada en ediciones de coleccionista y libros usados heredada de mi abuelo, un bazar encantado que mi padre confiaba en que algún día pasaría a mis manos. Me crié entre libros, haciendo amigos invisibles en páginas que se deshacían en polvo y cuyo olor aún conservo en las manos. De niño aprendí a conciliar el sueño mientras le explicaba a mi madre en la penumbra de mi habitación las incidencias de la jornada, mis andanzas en el colegio, lo que había aprendido aquel día... No podía oír su voz o sentir su tacto, pero su luz y su calor ardían en cada rincón de aquella casa y yo, con la fe de los que todavía pueden contar sus años con los dedos de las manos, creía que si."], 'Spanish'
)
my_language_classifier.add_language_data(
["Quoique ce détail ne touche en aucune manière au fond même de ce que nous avons à raconter, il n'est peut-être pas inutile, ne fût-ce que pour être exact en tout, d'indiquer ici les bruits et les propos qui avaient couru sur son compte au moment où il était arrivé dans le diocèse. Vrai ou faux, ce qu'on dit des hommes tient souvent autant de place dans leur vie et surtout dans leur destinée que ce qu'ils font. M. Myriel était fils d'un conseiller au parlement d'Aix; noblesse de robe. On contait de lui que son père, le réservant pour hériter de sa charge, l'avait marié de fort bonne heure, à dix-huit ou vingt ans, suivant un usage assez répandu dans les familles parlementaires. Charles Myriel, nonobstant ce mariage, avait, disait-on, beaucoup fait parler de lui. Il était bien fait de sa personne, quoique d'assez petite taille, élégant, gracieux, spirituel; toute la première partie de sa vie avait été donnée au monde et aux galanteries. La révolution survint, les événements se précipitèrent, les familles parlementaires décimées, chassées, traquées, se dispersèrent. M. Charles Myriel, dès les premiers jours de la révolution, émigra en Italie. Sa femme y mourut d'une maladie de poitrine dont elle était atteinte depuis longtemps. Ils n'avaient point d'enfants. Que se passa-t-il ensuite dans la destinée de M. Myriel? L'écroulement de l'ancienne société française, la chute de sa propre famille, les tragiques spectacles de 93, plus effrayants encore peut-être pour les émigrés qui les voyaient de loin avec le grossissement de l'épouvante, firent-ils germer en lui des idées de renoncement et de solitude? Fut-il, au milieu d'une de ces distractions et de ces affections qui occupaient sa vie, subitement atteint d'un de ces coups mystérieux et terribles qui viennent quelquefois renverser, en le frappant au cœur, l'homme que les catastrophes publiques n'ébranleraient pas en le frappant dans son existence et dans sa fortune? Nul n'aurait pu le dire; tout ce qu'on savait, c'est que, lorsqu'il revint d'Italie, il était prêtre."], 'French'
)
print(my_language_classifier.classify("this is a good cd")) # English
print(my_language_classifier.classify('Este es un buen CD')) # Spanish
print(my_language_classifier.classify("c'est un bon cd")) # French
Depending on your implementation of the get_perplexity method, there may be a few things you need to change. Primarily, if the n-1 prefix of the n-gram is not in the dictionary, or the n-th letter/word is not in the inner dictionary, you may have simply added 0 to the logarithmic probability sum. However, you may get better results if you instead add a large positive number (e.g. 5, 6, etc.) to that sum. This ensures that when the model cannot find the correct n-th letter/word, it is heavily penalized.
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project10.ipynb
You must double check your notebook after submitting it to Gradescope. You will not receive full credit if your submitted notebook does not display your code and its output correctly.