TDM 40100: Project 02 - Intro to ML - Basic Concepts

Project Objectives

In this project, we will learn how to select an appropriate machine learning model. Understanding specifics of how the models work may help in this process, but other aspects can be investigated for this.

Learning Objectives

Learn the difference between classification and regression
Learn the difference between supervised and unsupervised learning
Learn how our dataset influences our model selection

Supplemental Reading and Resources

Datasets

/anvil/projects/tdm/data/iris/Iris.csv
/anvil/projects/tdm/data/boston_housing/boston.csv
/anvil/projects/tdm/data/forest/REF_SPECIES.csv

The Iris dataset is a classic dataset that is often used to introduce machine learning concepts. You can read more about it here. If you would like more information on the boston dataset, please read here.

Questions

Question 1 (2 points)

In this project, we will use the Iris dataset and the boston dataset as samples to learn about the various aspects that go into choosing a machine learning model. Let’s review last project by loading the Iris and boston datasets, then printing the first 5 rows of each dataset.

Deliverables

Output of running code to print the first 5 rows of both datasets.

Question 2 (2 points)

One of the most distinguishing features of machine learning is the difference between classification and regression.

Classification is the process of predicting a discrete class label. For example, predicting whether an email is spam or not spam, whether a patient has a disease or not, or if an animal is a dog or a cat.

Regression is the process of predicting a continuous quantity. For example, predicting the price of a house, the temperature tomorrow, or the weight of a person.

Some columns may be misleading. Just because a column is a number does not mean it is a regression problem. One-hot encoding is a technique used to convert categorical variables into numerical variables (we will cover this deeper in future projects). Therefore, it is important to try and understand what a column represents, as just seeing a number does not necessarily mean it corresponds to a continuous quantity.

Let’s look at the Species column of the Iris dataset, and the MEDV column of the boston dataset. Based on these columns, classify the type of machine learning problem that we would be solving with each dataset.

Here’s a trickier example: If we have an image of some handwritten text, and we want to predict what the text says, would we be solving a classification or regression problem? Why?

Deliverables

Would we likely be solving a classification or regression problem with the Species column of the Iris dataset? Why?
Would we likely be solving a classification or regression problem with the MEDV column of the boston dataset? Why?
Would we likely be solving a classification or regression problem with the handwritten text example? Why?

Question 3 (2 points)

Another important distinction in machine learning is the difference between supervised and unsupervised learning.

Supervised learning is the process of training a model on a labeled dataset. The model learns to map some input data to an output label based on examples in the training data. The Iris dataset is a great example of a supervised learning problem. Our dataset has columns such as SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm that contain information about the flower. Additionally, it has a column labeled Species that contains the label we want to predict. From these columns, the model can associate the features of the flower with the labeled species.

We can think of supervised learning as already knowing the answer to a problem, and working backwards to understand how we got there. For example, if we have a a banana, apple, and grape in front of us, we can look at each fruit and their properties (shape, size, color, etc.) to learn how to distinguish between them. We can then use this information to predict a fruit from just its properties in the future.

For example, given this table of data:

Color	Size	Label
Yellow	Small	A
Red	Medium	B
Red	Large	B
Yellow	Medium	A
Yellow	Large	B
Red	Small	B

Color

Size

Label

Yellow

Small

Red

Medium

Red

Large

Yellow

Medium

Yellow

Large

Red

Small

You should be able to describe a relationship between the color and size, and the resulting label. If you were told an object is yellow and extra large, what would you predict the label to be?

The projects in 30100 and 40100 will focus on supervised learning. From our dataset, there will be a single column we want to predict, and the rest will be used to train the model. The column we want to predict is called the label/target, while the remaining columns are called features.

Unsupervised learning is the process of training a model on an unlabeled dataset. As opposed to the model trying to predict an output variable, the model instead learns patterns in the data without any guidance. This is often used in clustering problems, eg. a store wants to group items based on how often they are purchased together. Examples of this can be seen commonly in recommendation systems (have you ever noticed how Amazon always seems to know what you want to buy?).

If we had a dataset of fruits that users commonly purchase together, we could use unsupervised learning to create groups of fruits to recommend to users. We don’t need to know the answer for what to recommend to the user beforehand; we are simply looking for patterns in the data.

For example, given the following dataset of shopping carts:

Item 1	Item 2	Item 3
Apple	Banana	Orange
Apple	Banana	Orange
Apple	Grape	Kiwi
Banana	Orange	Apple
Orange	Banana	Apple
Cantelope	Watermelon	Honeydew
Cantelope	Apple	Banana

Item 1

Item 2

Item 3

Apple

Banana

Orange

Apple

Banana

Orange

Apple

Grape

Kiwi

Banana

Orange

Apple

Orange

Banana

Apple

Cantelope

Watermelon

Honeydew

Cantelope

Apple

Banana

We could use unsupervised learning to recommend fruits to users right before they check out. If a user had an orange and banana in their cart, what fruit would we recommend to them?

Deliverables

Predicted label for an object that is yellow and extra large in the table above.
What fruit would we recommend to a user who has an orange and banana in their cart?
Should we use supervised or unsupervised learning if we want to predict the Species of some data using the Iris dataset? Why?

Question 4 (2 points)

Another important tradeoff in machine learning is the flexibility of the model versus the interpretability of the model.

A model’s flexibility is defined by its ability to capture complex relationships within the dataset. This can be anything from

Imagine a simple function f(x) = 2x. This function is very easy to interpret, it simply doubles x. However, it is not very flexible, as doubling the input is all it can do. A piecewise function like f(x) = { x < 5: 2x^2 + 3x + 4, x >= 5: 4x^2 - 7 } is considered more flexible, because it can model more complex relationships. However it, becomes much more difficult to understand the relationship between the input and output.

We can also see this complexity increase as we increase the number of variables. f(x) will typically be more interpretable than f(x,y), which will typically be more interpretable than f(x,y,z). When we get to a large number of variables, eg. f(a,b,c,…,x,y,z), it can become difficult to understand the impact of each variables on the output. However, a function that captures all of these variables can be very flexible.

Machine learning models can be imagined in the same way. Many factors, including the type of model and the number of features can impact the interpretability of the model. A function that can accurately capture the relationship between a large number of features and the target variable can be extremely flexible but not understandable to humans. A model that performs some simple function between the input and output may be very interpretable, but as the complexity of that function increases its interpretability decreases.

An important concept in this regard is the curse of dimensionality. The general idea is that as our number of features (dimensions) increases, the amount of data needed to get a good model exponentially increases. Therefore, it is impractical to have an extreme number of features in our model. Imagine given a 2d function y=f(x). Given some points that we plot, we probably pretty quickly find an approximation of f(x). However, imagine we are given y=f(x1,x2,x3,x4,x5). We would need a lot more points to find an approximation of f(x1,x2,x3,x4,x5), and understand the relationship between y and each of the variables. Just because we can have a lot of features in our model does not mean we should.

Black box is a term often used to describe models that are too complex for humans to easily interpret. Large neural networks can be considered black boxes. Other models, such as linear regression, are easier to interpret. Decision Trees are designed to be interpretable, as they have a very simple structure and you can easily follow along with how they operate and make decisions. These easy to interpret models are often called white box models.

Please print the number of columns in the Iris dataset and the boston dataset. Based purely on the number of columns, would you expect a machine learning model trained on the Iris dataset to be more or less interpretable than a model trained on the boston dataset? Why?

Deliverables

How many columns are in the Iris dataset?
How many columns are in the boston dataset?
Based purely on the number of features, would you expect a machine learning model trained on the Iris dataset to be more or less interpretable than a model trained on the boston dataset? Why?

Question 5 (2 points)

Parameterization is the idea of approximating a function or model using parameters. If we have some function f, and we have examples of f(x) for many different x, we can find an approximate function to represent f. To make this approximation, we will need to choose some function to represent f, along with the parameters of that function. For complex functions, this can be difficult, as we may not understand the relationship between x and f(x), or how many parameters are needed to represent this relationship.

A non-parametrized model does not necessarily mean that the model does not have parameters. However, it means that we don’t know how many of these parameters exist or how they are used before training. The model itself will work to figure out what parameters it needs while training on the dataset. This can be visualized with splines, which are a type of curve that can be used to approximate a function. There are also non-parametrized models such as K-Nearest Neighbors Regression, which do not have a fixed number of parameters, and instead learn the function from the data.

If we have 5 points (x, y) and want to find a function to fit these points, through parameterization we would have a single function with multiple parameters that need to be adjusted to give us the best fit. However, with splines (a form of non-parametrization), we could create a piecewise function, where each piece is a linear function between two points. This function has no parameters, and is created by the model solely based on the data. You can read more about splines here.

A commonly used non-paramtrized model is k-nearest neighbors, which classifies points by comparing them to existing points in the dataset. In this way, the model does not have any parameters, but instead only learns from the data.

Linear regression is a parametrized model, where a linear relationship between inputs and output(s) is assumed. The data is then used to identify the values of the parameters to best fit the data.

If we already have a good understanding of the data, (eg. we know it to be some linear function or second order polynomial), it is likely best to choose a parametrized model. However, if we don’t have an understanding of the data, a non-parametrized model that learns the function from the data may be a better fit.

To better understand the difference, please run the following code:

import matplotlib.pyplot as plt

a = [1, 3, 5, 7, 9, 11, 13]
b = [9, 6, 4, 7, 8, 15, 9]
x = [1, 2, 3, 4, 5, 6, 7]

plt.scatter(x, a, label='Function A')
plt.scatter(x, b, label='Function B')
plt.legend()
plt.xlabel('Feature X')
plt.ylabel('Label y')
plt.show()

Based on the plots shown, decide if each function would be better approximated by a parametrized or non-parametrized model.

Deliverables

Can you easily describe the relationship between Feature X and Label y for Function A? If so, what is the relationship? Would you use a parametrized or non-parametrized model to approximate this function?
Can you easily describe the relationship between Feature X and Label y for Function B? If so, what is the relationship? Would you use a parametrized or non-parametrized model to approximate this function?

Question 6 (2 points)

As a practical example, we will take a look at the forest species dataset to determine the different aspects that go into choosing our machine learning model.

Load the forest species dataset and print the shape of the dataset its first 10 rows.

Based on what you see from those outputs, please answer the following questions:

Deliverables

Could you solve a regression problem with this dataset? What about a classification problem? What column(s) would you use as the target variable in each case?
Could you use unsupervised learning on this dataset? Supervised learning? Please explain your answer for each.
Do you think a model trained on all columns of this dataset would be very interpretable?
Do you think a parametrized model would work well given the number of features?

Submitting your Work

Items to submit

firstname_lastname_project2.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.