Optimize Leave-one-out Cross-validation for Lasso Regression

Apply a modified version of the LARS algorithm to efficiently find the exact hyperparameter value that minimizes leave-one-out prediction error.

Given an n×p feature matrix, X, a target vector, y, and a regularization parameter λ ≥ 0, lasso regression estimates the parameters, β, of a linear model by solving the optimization problem

$$\min_{\beta} \; \tfrac{1}{2} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1.$$
Lasso regression is a popular method for estimating linear models as it performs both regularization and variable selection. But a natural question for users is, how do we choose λ?

Often this is done by estimating prediction error with k-fold cross-validation and applying an optimization algorithm to find a value of λ that approximately minimizes the cross-validation proxy for prediction error. Many software packages choose smaller values of k as that can be more computationally tractable. For example, sklearn’s LassoCV model defaults to 5-fold cross-validation. But small k can bias the estimation of prediction error, particularly in high-dimensional settings. More recently, leave-one-out cross-validation, with k = n, has emerged as a better alternative with lower bias [1].
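
For example, with sklearn’s LassoCV you can swap the default 5-fold split for a leave-one-out splitter, at the cost of fitting n models per candidate value; the sketch below uses the diabetes data for concreteness, and note that LassoCV only searches a fixed grid of candidate values rather than minimizing exactly:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

X, y = load_diabetes(return_X_y=True)

# Default: 5-fold cross-validation over an automatically chosen grid of alphas.
kfold_model = LassoCV().fit(X, y)

# Leave-one-out: n folds, so n model fits per candidate alpha -- much more expensive.
loo_model = LassoCV(cv=LeaveOneOut()).fit(X, y)

print(kfold_model.alpha_, loo_model.alpha_)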

Computed naively, leave-one-out cross-validation is expensive since it would require fitting lasso regression n times for each value of λ. Making use of the matrix inversion lemma, though, it is possible to compute an approximate form of leave-one-out cross-validation efficiently for GLMs [2, 3]. Going a step further, and making some adjustments to the LARS algorithm, it is actually possible to efficiently compute and optimize leave-one-out cross-validation exactly for the case of lasso regression.
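
To see why the naive approach is expensive, here is an illustrative brute-force computation of the leave-one-out error with sklearn’s Lasso; the loo_error helper is just for illustration, and sklearn’s alpha parameter is scaled differently from the λ above (roughly by a factor of n):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
n = len(y)

def loo_error(alpha):
    # Fit n lasso models, each with one observation held out, and
    # average the squared prediction errors on the held-out points.
    err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        model = Lasso(alpha=alpha, max_iter=10000).fit(X[mask], y[mask])
        err += (y[i] - model.predict(X[i:i+1])[0]) ** 2
    return err / n

# Every grid point costs n separate fits.
for alpha in [0.01, 0.1, 1.0]:
    print(alpha, loo_error(alpha))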

Before getting into details, here is a quick demo using the diabetes data set distributed with sklearn and the software package bbai:

from sklearn.datasets import load_diabetes 
from bbai.glm import Lasso

X, y = load_diabetes(return_X_y=True)
model = Lasso().fit(X, y)

In a fraction of a second, this bit of code will fit a lasso regression model with λ set to exactly minimize the leave-one-out prediction error. As an artifact of the leave-one-out LARS algorithm (LoLARS), bbai also produces a piecewise quadratic function that computes LOOCV error for any value of λ:

Leave-one-out cross-validation error as a function of the lasso hyperparameter λ. We can see that LOOCV error is minimized at λ=22.18. Dots represent validation checks using a brute-force approach.

Validating is quite easy since we can check the function against brute force computations, and the dots along the curve show such checks. You can view a notebook with the full example here and see additional validation in the test suite.

Sketch of LoLARS algorithm

The Karush-Kuhn-Tucker (KKT) optimality conditions tell us that if β is a solution to lasso regression, then it satisfies the conditions

$$x_j^\top (y - X\beta) = \lambda \operatorname{sign}(\beta_j) \;\; \text{if } \beta_j \neq 0, \qquad \bigl\lvert x_j^\top (y - X\beta) \bigr\rvert \le \lambda \;\; \text{if } \beta_j = 0.$$
It follows that a solution to lasso regression can be described as a piecewise linear function of λ where on each segment the active (i.e. non-zero) regressors are given by

$$\beta_A = \left(X_A^\top X_A\right)^{-1}\left(X_A^\top y - \lambda \, s_A\right),$$
where X_A denotes the active part of the design matrix X and s_A the signs of the active regressors.
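
As a quick sanity check of this closed form, the sketch below compares the active coefficients of an sklearn Lasso fit on the (centered) diabetes data against the formula; under sklearn’s 1/(2n) loss scaling, λ corresponds to n·alpha:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
y = y - y.mean()  # center the target so no intercept is needed
n, alpha = len(y), 1.0

beta = Lasso(alpha=alpha, fit_intercept=False, max_iter=100000).fit(X, y).coef_

A = np.flatnonzero(beta)  # active set
s_A = np.sign(beta[A])    # signs of the active regressors
X_A = X[:, A]

# Closed-form active coefficients; sklearn's alpha corresponds to lambda / n.
beta_A = np.linalg.solve(X_A.T @ X_A, X_A.T @ y - n * alpha * s_A)
print(np.max(np.abs(beta_A - beta[A])))  # ~0, up to solver tolerance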

LARS solves lasso regression by computing the piecewise linear segments of the β(λ) function. It starts at λ = ∞, where all regressors are zero, and works its way backwards.
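
For example, under sklearn’s scaling all regressors are zero once alpha reaches max_j |x_jᵀy| / n, which is where the path begins; a quick check:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
y = y - y.mean()
n = len(y)

# All coefficients are zero once sklearn's alpha reaches max_j |x_j^T y| / n.
alpha_max = np.max(np.abs(X.T @ y)) / n

print(Lasso(alpha=1.01 * alpha_max, fit_intercept=False).fit(X, y).coef_)  # all zeros
print(Lasso(alpha=0.99 * alpha_max, fit_intercept=False).fit(X, y).coef_)  # first regressor activates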

Consider, for example, the data set

Letting red, green, and blue denote the three regressors, LARS solves for the solution path

Solution path produced by the LARS algorithm. The graph represents the regressors, β, as a function of λ. Vertical lines delineate the piecewise linear segments of the solution path and are numbered in the order visited by LARS.

Dropping the coefficient values and keeping only the sets of active regressors, LARS produces the activation path

Ordered active sets of regressors for the LARS algorithm.

Now, let’s consider solving LARS for each leave-one-out subset. Each LARS solution produces a piecewise linear path β^(-i)(λ), fit with the i-th observation held out. Thus, the leave-one-out cross-validation error

$$\mathrm{LOO}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta^{(-i)}(\lambda)\right)^2$$
will be a piecewise quadratic function of λ. Running LARS independently for each subset would be expensive. The key to an efficient implementation is making use of the matrix inversion lemma:

$$\left(A - u u^\top\right)^{-1} = A^{-1} + \frac{A^{-1} u u^\top A^{-1}}{1 - u^\top A^{-1} u},$$

where A = X_Aᵀ X_A and u denotes the active part of the held-out row of X.
When the activation paths of the leave-one-out subsets overlap, applying the matrix inversion lemma significantly reduces the cost of solving each LARS solution path, so the cost of leave-one-out LARS is largely determined by the extent to which the leave-one-out activation paths diverge.
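
The following is a minimal numpy illustration of the rank-one update involved (not bbai’s implementation): given the inverse Gram matrix of the active columns, the inverse with one observation removed follows directly from the lemma without refactorizing:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))

A = X.T @ X
A_inv = np.linalg.inv(A)

# Leave out observation i: update (X^T X)^{-1} with a rank-one downdate
# instead of recomputing the inverse of the left-out Gram matrix from scratch.
i = 7
x = X[i]
u = A_inv @ x
A_loo_inv = A_inv + np.outer(u, u) / (1.0 - x @ u)

X_loo = np.delete(X, i, axis=0)
print(np.allclose(A_loo_inv, np.linalg.inv(X_loo.T @ X_loo)))  # True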

References

[1]: Kamiar Rahnama Rad, Wenda Zhou, Arian Maleki. Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions. https://arxiv.org/abs/2003.01770

[2]: Kamiar Rahnama Rad, Arian Maleki. A scalable estimate of the extra-sample prediction error via approximate leave-one-out. https://arxiv.org/abs/1801.10243

[3]: Shuaiwen Wang, Wenda Zhou, Haihao Lu, Arian Maleki, Vahab Mirrokni. Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions. https://arxiv.org/abs/1807.02694