Cross Validation

Introduction

Here we overview two approaches to overcome some of the challenges associated with train, valid and test splits.

LOOCV (Leave-One-Out Cross Validation)

In LOOCV, a single observation is left out for the validation set. This leaves the remaining observations for train and/or test. We train the model like usual, and use the single validation observation as the test for that data point.

Then, we repeat that process across each individual observation, starting with ($x_1$,$y_1$), then ($x_2$,$y_2$), etc. Then, after getting the MSE (Mean Squared Error) for each LOOCV, take the average of these validation error estimates:

$ CV_n = \frac{1}{n}\sum_{i=1}^{n}MSE_i $

Where n is the number of observations.

Here’s a simple example of running through each validation set. Sometimes, the test split is called the validation split (meaning they only split the data into 2 splits, not 3), as you see in the animation below.

Strengths With LOOCV

Far less bias (see Bias-Variance Tradeoff) because the validation error rate is often not overestimated granted that far more of the data set is used for training in LOOCV
There is no randomness in the results, since each observation is used for validation exactly once
Very general and easy to implement with just about any predictive modeling

Challenges With LOOCV

Although the MSE is unbiased for each validation split since it was not used for training, it is a poor estimate because it’s highly variable
If n is large, each model can be very slow to fit

K-fold Cross Validation

K-fold CV is similar to LOOCV, except instead of 1 observation per validation split, we split the data into k number of "folds", or groups of approximately equal size. See the animation below for a visual explanation.

Again, this animation uses test to mean the validation split, because there isn’t a separate test split. First, k number of folds are chosen. In this animation, $k=3$. Then, the data is randomly shuffled. Then, each fold is given 4 observations, since 12 observations divided by 3 folds equals 4 observations per fold (sometimes you will have unequal folds; that’s OK, just try to make them as equal as possible).

This is how you compute the MSE mathematically using k-fold CV:

$ CV_k = \frac{1}{k}\sum_{i=1}^{k}MSE_k $

The Right k

How can we choose the right size of k? One of the strengths of k-folding is that it requires less computation than LOOCV, but as k increases, this becomes less and less the case. At the extreme, when k=n, that would be the same as LOOCV. On the other extreme, if k=2, then you are only taking 2 average MSE’s.

The rule of thumb is typically a lower number is preferred, usually around 5-10. But instead of trying just one number, just try many k values and see which gives the best accuracy, or whatever metric you want to measure by! Let the data drive your k decision. Often there is an approximately optimal k value, but it is highly contextual and depends on the problem.

Strengths With k-fold

If you recall from LOOCV, one of the challenges is that it has to be computed n times. And as you get more and more data, this can be computationally expensive. K-folds simplify this computation: you end up doing precisely k computations, compared to n computations in LOOCV.

Another advantage of k-folds is that they introduce an element of Bias-Variance Tradeoff. The details for this are outside of the scope of this article, but in general, an increasingly high k value will lead to higher variance, and vice versa. Again, there are empirical studies which show that k=5 to k=10 is usually the approximately optimal value of k. But why trust these empirical studies? See for yourself which k is best for your data!

Challenges With k-fold

Choosing the right value of k, and ensuring random sampling across the folds are the primary challenges with k-folds.

Our Sources

www.statlearning.com