Prediction vs. Inference

Introduction

Here we explore the two primary goals often sought from data: prediction and inference. We discuss each individually, how each is used, and how to tell when one is more appropriate than the other for an analysis.

Prediction

Prediction is fairly straightforward. We have some data, and we want to predict a pattern: what word comes next, what event will happen, when an event will happen, etc. If you recall the discussion on the meaning of $f(x)$ in data science, you will recall that our primary goal is to approximate $Y$. We don’t necessarily care what form $\hat{f}$ takes, as long as it predicts $Y$ as accurately as needed.
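
To make the "we don’t care what $\hat{f}$ is" point concrete, here is a minimal sketch that treats $\hat{f}$ as a black box and judges it only by held-out error. The synthetic data, the k-nearest-neighbors rule, and the names `knn_predict` and `mean_squared_error` are all assumptions chosen for illustration, not a method taken from the discussion above.

```python
import random

def knn_predict(train_x, train_y, x, k=3):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (an assumption for illustration): y = x^2 plus small noise.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 0.1) for x in xs]

# Hold out the last 50 points; they play the role of unseen data.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

preds = [knn_predict(train_x, train_y, x) for x in test_x]
test_mse = mean_squared_error(test_y, preds)

# Baseline: always predict the training mean, ignoring x entirely.
train_mean = sum(train_y) / len(train_y)
baseline_mse = mean_squared_error(test_y, [train_mean] * len(test_y))

print(f"kNN test MSE: {test_mse:.3f}")
print(f"mean-only baseline test MSE: {baseline_mse:.3f}")
```

Nothing here tells us how $Y$ depends on the input. The only question asked of $\hat{f}$ is whether its held-out error is low enough for the task.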

Inference

In inference, we want to understand, or infer, the nature of the relationship between $Y$ and the inputs $X_1, X_2, \dots, X_p$. Maybe we discover that some input variables don’t matter. Maybe we discover that there is a positive linear relationship between $X_4$ and $Y$; namely, as $X_4$ increases, so does $Y$. What does that tell us about our data? What can we infer about the world writ large given that relationship? Here, knowing why something happens is critical. We want to understand the relationship between all of our inputs and outputs.
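
As a rough sketch of what this looks like in practice, the snippet below fits a separate least-squares line of $Y$ on each of two simulated inputs and reports the slope with its standard error. The data-generating process (a true slope of 2 on `x4`, none on `x1`) and the helper `simple_ols` are assumptions made up for this example. Fitting each input on its own only measures marginal association, which is reasonable here because the simulated inputs are independent.

```python
import math
import random

def simple_ols(x, y):
    """Fit y = b0 + b1 * x by least squares.

    Returns (b0, b1, standard error of b1).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residual_ss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(residual_ss / (n - 2) / sxx)
    return b0, b1, se_b1

# Simulated data (an assumption): Y depends on x4 but not on x1.
rng = random.Random(1)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x4 = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * v + rng.gauss(0, 1) for v in x4]

fits = {}
for name, x in [("X_1", x1), ("X_4", x4)]:
    b0, b1, se = simple_ols(x, y)
    fits[name] = (b0, b1, se)
    print(f"{name}: slope = {b1:.3f}, std. error = {se:.3f}, t = {b1 / se:.1f}")
```

A slope that is large relative to its standard error (as we would expect for `X_4` here) suggests a real positive relationship. A slope near zero (as we would expect for `X_1`) suggests an input that may not matter.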

The Middle Road

Of course, analysis in practice draws on both worlds. Whether through testing, model building, EDA, or just sheer intuition, we often bounce back and forth between these two settings to get a handle on our problem. For instance, variable selection in a regression setting can amount to testing what happens when we remove one variable at a time ("I predict that if we remove variable $X_7$, we will get …"), even though regression is often used for inference. Conversely, after seeing the output of numerous market segmentation clustering algorithms, we’ve inferred that high-spending customers are typically also customers who buy on weekdays, even though we merely wanted to predict when they would buy and how much they would spend (for which our model usually returns Tuesdays/Wednesdays and a few hundred dollars).
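
The remove-one-variable-at-a-time idea can be sketched as follows: fit a regression on all inputs, then refit with each input dropped in turn and compare held-out error. Everything in this snippet is an assumption for illustration: the three simulated inputs (only the first two affect $Y$), the helper names, and the use of the normal equations solved by Gaussian elimination in place of a statistics library.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

def held_out_mse(train_rows, train_y, test_rows, test_y):
    """Fit on the training rows, then return mean squared error on the test rows."""
    coefs = fit_ols(train_rows, train_y)
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in test_rows]
    return sum((p - yi) ** 2 for p, yi in zip(preds, test_y)) / len(test_y)

def drop_column(rows, j):
    """Return a copy of rows with input j removed."""
    return [[v for k, v in enumerate(r) if k != j] for r in rows]

# Simulated data (an assumption): Y depends on inputs 0 and 1, not on input 2.
rng = random.Random(4)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(400)]
y = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0, 1) for r in rows]
train_rows, test_rows = rows[:300], rows[300:]
train_y, test_y = y[:300], y[300:]

full_mse = held_out_mse(train_rows, train_y, test_rows, test_y)
print(f"all inputs: test MSE = {full_mse:.3f}")

drop_mse = {}
for j in range(3):
    drop_mse[j] = held_out_mse(drop_column(train_rows, j), train_y,
                               drop_column(test_rows, j), test_y)
    print(f"without input {j}: test MSE = {drop_mse[j]:.3f}")
```

The comparison itself is a prediction exercise, but what we learn from it (which inputs the model can do without) is an inferential conclusion.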

To fancy yourself a purist in either of these camps is to ignore the complexity of data science and to be dishonest. With this in mind…

When to Focus on Inference or Prediction

When you want to answer the following questions, focus on inference:

Which input variables are associated with a specific response?
What is the relationship between the response and the input variable(s)?
Can linear equations represent the relationship between the output and the inputs? Or is a non-linear solution preferred?
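
One rough way to probe the last question is to fit a straight line and look at how much of the variation in the response it explains. The sketch below computes $R^2$ for a least-squares line on two simulated responses, one linear in the input and one quadratic. The data and the helper `linear_r_squared` are assumptions for illustration, and a low $R^2$ is only a hint that a line is a poor description, not a formal test.

```python
import random

def linear_r_squared(x, y):
    """R^2 of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Simulated data (an assumption): one linear response, one curved response.
rng = random.Random(2)
x = [rng.uniform(-3, 3) for _ in range(300)]
y_linear = [0.5 + 1.5 * xi + rng.gauss(0, 0.5) for xi in x]
y_curved = [xi ** 2 + rng.gauss(0, 0.5) for xi in x]

r2_linear = linear_r_squared(x, y_linear)
r2_curved = linear_r_squared(x, y_curved)

print(f"R^2 of a line on the linear response: {r2_linear:.3f}")
print(f"R^2 of a line on the curved response: {r2_curved:.3f}")
```

The curved response depends strongly on the input, yet a straight line explains very little of it. That gap is the signal that a non-linear form deserves a look.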

And for prediction, we are typically asking:

When will an event occur?
What comes next?
Is some data this or that? (Classification)
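
For the classification question, a minimal sketch is a nearest-centroid rule: label a new point with the class whose training average it is closest to, and judge the rule only by held-out accuracy. The two simulated clusters, the labels "this" and "that", and the helper names are assumptions invented for this example.

```python
import random

def fit_centroids(points, labels):
    """Return the mean (x, y) position of the training points in each class."""
    sums, counts = {}, {}
    for (px, py), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + px, sy + py)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def classify(centroids, point):
    """Assign the label of the closest class centroid (squared Euclidean distance)."""
    px, py = point
    return min(centroids,
               key=lambda label: (centroids[label][0] - px) ** 2
                                 + (centroids[label][1] - py) ** 2)

# Simulated data (an assumption): two well-separated Gaussian clusters.
rng = random.Random(3)

def sample(center, n):
    return [(rng.gauss(center[0], 1), rng.gauss(center[1], 1)) for _ in range(n)]

points = sample((0, 0), 100) + sample((4, 4), 100)
labels = ["this"] * 100 + ["that"] * 100

# Shuffle the indices, then hold out 50 points for evaluation.
order = list(range(len(points)))
rng.shuffle(order)
train_idx, test_idx = order[:150], order[150:]

centroids = fit_centroids([points[i] for i in train_idx],
                          [labels[i] for i in train_idx])
correct = sum(classify(centroids, points[i]) == labels[i] for i in test_idx)
accuracy = correct / len(test_idx)

print(f"held-out accuracy: {accuracy:.2f}")
```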