In the regression setting, the standard linear model is commonly used to describe the relationship between a response \(Y\) and a set of variables \(X_1, \dots, X_p\).







The linear model has distinct advantages in terms of inference and is often surprisingly competitive for prediction. How can it be improved?




Alternatives to plain least squares can yield both better prediction accuracy and improved model interpretability:

1 Subset Selection

We consider methods for selecting subsets of predictors.

1.1 Best Subset

To perform best subset selection, we fit a separate least squares regression for each possible combination of the \(p\) predictors.

Algorithm:

1. Let \(\mathcal{M}_0\) denote the null model, which contains no predictors.
2. For \(k = 1, \dots, p\): fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors, and let \(\mathcal{M}_k\) denote the one with the smallest RSS.
3. Select a single best model from \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\).









We can perform something similar with logistic regression, using the deviance in place of RSS.
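The exhaustive search above can be sketched in a few lines. This is a minimal illustration on synthetic data, not an efficient implementation (real tools use branch-and-bound); the function names and toy data are my own.

```python
import numpy as np
from itertools import combinations

def rss(X, y):
    """Residual sum of squares of a least squares fit (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return float(r @ r)

def best_subset(X, y):
    """For each model size k, return (RSS, subset) for the best subset of size k."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        scored = [(rss(X[:, list(s)], y), s) for s in combinations(range(p), k)]
        best[k] = min(scored)  # smallest RSS among all size-k models
    return best

# Toy data: y depends only on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)
print(best_subset(X, y)[2][1])  # → (0, 2)
```

Note that RSS can only compare models of the *same* size; choosing among sizes requires a test-error estimate such as \(C_p\) or cross-validation (step 3 of the algorithm).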

1.2 Stepwise Selection

For computational reasons, best subset selection cannot be performed for very large \(p\): there are \(2^p\) models to consider, which already exceeds a billion when \(p = 30\).




Stepwise selection is a computationally efficient procedure that considers a much smaller subset of models.

Forward Stepwise Selection:

1. Let \(\mathcal{M}_0\) denote the null model, which contains no predictors.
2. For \(k = 0, \dots, p - 1\): consider all \(p - k\) models that augment \(\mathcal{M}_k\) with one additional predictor, and let \(\mathcal{M}_{k+1}\) denote the one with the smallest RSS.
3. Select a single best model from \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\).









Backward Stepwise Selection:

1. Let \(\mathcal{M}_p\) denote the full model, which contains all \(p\) predictors.
2. For \(k = p, p - 1, \dots, 1\): consider all \(k\) models that remove one predictor from \(\mathcal{M}_k\), and let \(\mathcal{M}_{k-1}\) denote the one with the smallest RSS.
3. Select a single best model from \(\mathcal{M}_0, \dots, \mathcal{M}_p\) as above.









Neither forward nor backward stepwise selection is guaranteed to find the best model containing a subset of the \(p\) predictors.
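The greedy forward search can be sketched as follows; again a minimal illustration with invented names and synthetic data, assuming the usual RSS criterion for choosing which predictor to add at each step.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least squares fit (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return float(r @ r)

def forward_stepwise(X, y):
    """Greedily add, at each step, the predictor that most reduces RSS."""
    p = X.shape[1]
    selected, remaining, path = [], list(range(p)), []
    while remaining:
        _, best_j = min((rss(X[:, selected + [j]], y), j) for j in remaining)
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(tuple(selected))
    return path  # nested models M_1 ⊂ M_2 ⊂ … ⊂ M_p

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = 4 * X[:, 1] + 2 * X[:, 3] + rng.normal(scale=0.5, size=80)
print(forward_stepwise(X, y)[:2])
```

Only \(1 + p(p+1)/2\) models are ever fit, versus \(2^p\) for best subset, and the returned models are nested, which is exactly why the greedy path can miss the overall best subset.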

1.3 Choosing the Optimal Model

The model containing all of the predictors will always have the smallest RSS and the largest \(R^2\), since these quantities measure training error. To choose among models of different sizes, we need a criterion that estimates the test error.



\(C_p\)

\[
C_p = \frac{1}{n}\left(\text{RSS} + 2 d \hat{\sigma}^2\right),
\]

where \(d\) is the number of predictors in the model and \(\hat{\sigma}^2\) is an estimate of the variance of the error \(\epsilon\). A small value of \(C_p\) indicates a low estimated test error.





AIC & BIC

\[
\text{AIC} = \frac{1}{n \hat{\sigma}^2}\left(\text{RSS} + 2 d \hat{\sigma}^2\right), \qquad \text{BIC} = \frac{1}{n}\left(\text{RSS} + \log(n)\, d \hat{\sigma}^2\right).
\]

For least squares models, \(C_p\) and AIC are proportional to each other. Since \(\log n > 2\) for any \(n > 7\), BIC places a heavier penalty on models with many variables and tends to select smaller models.





Adjusted \(R^2\)

\[
\text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n - d - 1)}{\text{TSS}/(n - 1)},
\]

where \(\text{TSS} = \sum_i (y_i - \bar{y})^2\). Unlike the criteria above, a *large* value of adjusted \(R^2\) indicates a better model.





Validation and Cross-Validation: directly estimate the test error using a held-out validation set or cross-validation. This makes fewer assumptions about the model and does not require an estimate of \(\sigma^2\).
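The indirect criteria above are easy to compute from a single fit. Here is a minimal sketch (names and toy data are my own) that returns \(C_p\), BIC, and adjusted \(R^2\), estimating \(\hat{\sigma}^2\) from the full model as is common; AIC is omitted since it is proportional to \(C_p\) for least squares.

```python
import numpy as np

def fit_rss(X, y):
    """RSS of a least squares fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return float(r @ r)

def selection_criteria(X_sub, y, sigma2):
    """(Cp, BIC, adjusted R^2) for a least squares model on X_sub.

    sigma2: estimate of Var(eps), e.g. from the full model;
    d: number of predictors in the submodel."""
    n, d = X_sub.shape
    rss = fit_rss(X_sub, y)
    tss = float(((y - y.mean()) ** 2).sum())
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=100)
sigma2 = fit_rss(X, y) / (100 - 5 - 1)      # sigma^2 from the full model
good = selection_criteria(X[:, [0, 1]], y, sigma2)   # the true predictors
bad = selection_criteria(X[:, [0, 2]], y, sigma2)    # misses X_1
print(good[0] < bad[0], good[2] > bad[2])
```

All three criteria prefer the model containing the truly relevant predictors: lower \(C_p\) and BIC, higher adjusted \(R^2\).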

2 Shrinkage Methods

The subset selection methods involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model with all \(p\) predictors using a technique that constrains (regularizes) the estimates.





Shrinking the coefficient estimates can significantly reduce their variance!

2.1 Ridge Regression

Recall that the least squares fitting procedure estimates \(\beta_0, \beta_1, \dots, \beta_p\) using the values that minimize

\[
\text{RSS} = \sum_{i = 1}^n \left( y_i - \beta_0 - \sum_{j = 1}^p \beta_j x_{ij} \right)^2.
\]
Ridge regression is similar to least squares, except that the coefficients are estimated by minimizing

\[
\sum_{i = 1}^n \left( y_i - \beta_0 - \sum_{j = 1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j = 1}^p \beta_j^2 = \text{RSS} + \lambda \sum_{j = 1}^p \beta_j^2,
\]

where \(\lambda \ge 0\) is a tuning parameter.
The tuning parameter \(\lambda\) controls the relative impact of the shrinkage penalty on the regression coefficient estimates: when \(\lambda = 0\), ridge regression reproduces the least squares estimates, and as \(\lambda \to \infty\), the estimates shrink towards zero.

The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant \(c\) simply scales the coefficient estimate by a factor of \(1/c\), so \(X_j \hat{\beta}_j\) remains unchanged.




In contrast, the ridge regression coefficients \(\hat{\beta}^R_\lambda\) can change substantially when multiplying a given predictor by a constant.










Therefore, it is best to apply ridge regression after standardizing the predictors so that they are all on the same scale:

\[
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i = 1}^n (x_{ij} - \bar{x}_j)^2}}.
\]
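Because the penalized objective is quadratic, ridge regression has a closed-form solution, which makes the shrinkage easy to see. A minimal sketch, assuming standardized predictors and an unpenalized intercept (here simply the mean of \(y\)); the function name and toy data are my own.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients via the closed form (X'X + lam*I)^{-1} X'y.

    Predictors are standardized and the response is centered, so the
    intercept (the mean of y) is left unpenalized."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=60)

for lam in (0.0, 10.0, 1000.0):
    print(lam, np.round(ridge(X, y, lam), 3))
# The coefficient vector shrinks towards zero as lambda grows;
# at lam = 0 the least squares estimates are recovered.
```

Note that the \(\ell_2\) norm of the solution decreases monotonically in \(\lambda\), but no coefficient is ever set exactly to zero.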

Why does ridge regression work? As \(\lambda\) increases, the flexibility of the fit decreases, which reduces the variance of the estimates at the cost of a small increase in bias. When the least squares estimates have high variance (for instance, when \(p\) is close to \(n\)), this trade-off can substantially reduce test error.

2.2 The Lasso

Ridge regression does have one obvious disadvantage: unlike subset selection, the penalty shrinks all of the coefficients towards zero but does not set any of them exactly to zero (unless \(\lambda = \infty\)), so the final model always includes all \(p\) predictors.







This may not be a problem for prediction accuracy, but it could be a challenge for model interpretation when \(p\) is very large.


The lasso is an alternative that overcomes this disadvantage. The lasso coefficients \(\hat{\beta}_\lambda^L\) minimize

\[
\sum_{i = 1}^n \left( y_i - \beta_0 - \sum_{j = 1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j = 1}^p |\beta_j| = \text{RSS} + \lambda \sum_{j = 1}^p |\beta_j|.
\]
As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, the \(\ell_1\) penalty forces some of the coefficient estimates to be exactly zero when \(\lambda\) is sufficiently large, so the lasso performs variable selection.







As a result, lasso models are generally easier to interpret.

Why does the lasso result in estimates that are exactly equal to zero while ridge regression does not? One can show that the lasso and ridge regression coefficient estimates solve the following problems:

\[
\text{lasso:} \quad \min_\beta \, \sum_{i = 1}^n \left( y_i - \beta_0 - \sum_{j = 1}^p \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j = 1}^p |\beta_j| \le s,
\]

\[
\text{ridge:} \quad \min_\beta \, \sum_{i = 1}^n \left( y_i - \beta_0 - \sum_{j = 1}^p \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j = 1}^p \beta_j^2 \le s.
\]

In other words, when we perform the lasso we are trying to find the set of coefficient estimates that lead to the smallest RSS, subject to the constraint that there is a budget \(s\) for how large \(\sum_{j = 1}^p |\beta_j|\) can be.
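The \(\ell_1\) objective has no closed form, but cyclic coordinate descent with soft-thresholding solves it; the soft-threshold operator is what produces exact zeros. A minimal sketch (assuming standardized predictors, fixed iteration count rather than a convergence check, and my own naming; production code should use a tested solver):

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0): exact zero when |z| <= g."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on standardized predictors.

    Minimizes (1/2n) * RSS + lam * sum|beta_j|; the intercept is the
    mean of y and is not penalized."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave coordinate j out of the fit.
            r_j = yc - Xs @ beta + Xs[:, j] * beta[j]
            rho = Xs[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam)  # x_j'x_j / n = 1 here
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(scale=0.3, size=100)
beta = lasso_cd(X, y, lam=0.5)
print(np.round(beta, 2))  # only positions 0 and 4 are clearly nonzero
```

Unlike the ridge example, the irrelevant coefficients come out *exactly* zero: whenever a predictor's covariance with the residual falls inside the budget \(\lambda\), soft-thresholding snaps it to zero rather than merely shrinking it.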

2.3 Tuning

We still need a mechanism by which we can determine which of the models under consideration is “best”.


For both the lasso and ridge regression, we need to select \(\lambda\) (or the budget \(s\)).

How? Choose a grid of \(\lambda\) values, compute the cross-validation error for each, select the \(\lambda\) for which that error is smallest, and then refit the model on all of the observations using the selected \(\lambda\).
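A minimal sketch of this recipe for ridge regression, using the closed-form fit from before; the grid, fold count, and names are illustrative assumptions. Note that centering/scaling statistics are computed on each training fold only, so no information leaks from the held-out fold.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on pre-centered data (intercept handled outside)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """k-fold cross-validation MSE for ridge at a given lambda."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        # Center using training-fold statistics only.
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        beta = ridge_fit(X[train] - mu_x, y[train] - mu_y, lam)
        pred = (X[fold] - mu_x) @ beta + mu_y
        errs.append(float(((y[fold] - pred) ** 2).mean()))
    return float(np.mean(errs))

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 20))
y = X[:, 0] + rng.normal(scale=1.0, size=50)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lams, key=lambda l: cv_error(X, y, l))
print("selected lambda:", best)
```

The same loop works for the lasso by swapping in a lasso fitting routine; in practice one would use a finer (logarithmic) grid.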

3 Dimension Reduction Methods

So far we have controlled variance in two ways: by using a subset of the original variables, and by shrinking the coefficient estimates towards zero. Both approaches work with the original predictors \(X_1, \dots, X_p\).










We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.







We refer to these techniques as dimension reduction methods.

The term dimension reduction comes from the fact that this approach reduces the problem of estimating \(p + 1\) coefficients to the problem of estimating \(M + 1\) coefficients, where \(M < p\). Let \(Z_1, \dots, Z_M\) represent \(M\) linear combinations of the original predictors,

\[
Z_m = \sum_{j = 1}^p \phi_{jm} X_j,
\]

for constants \(\phi_{1m}, \dots, \phi_{pm}\). We then fit the linear regression model

\[
y_i = \theta_0 + \sum_{m = 1}^M \theta_m z_{im} + \epsilon_i, \quad i = 1, \dots, n,
\]

using least squares.







Dimension reduction serves to constrain the \(\beta_j\), since now they must take the particular form

\[
\beta_j = \sum_{m = 1}^M \theta_m \phi_{jm}.
\]







All dimension reduction methods work in two steps: first, the transformed predictors \(Z_1, \dots, Z_M\) are obtained; second, the model is fit by least squares using these \(M\) predictors.

3.1 Principal Components Regression

Principal Components Analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables.


The first principal component direction of the data is that along which the observations vary the most.












We can construct up to \(p\) principal components, where the second principal component is the linear combination of the variables that is uncorrelated with the first principal component and has the largest variance subject to this constraint.



The principal components regression (PCR) approach involves constructing the first \(M\) principal components \(Z_1, \dots, Z_M\) and then using these components as the predictors in a linear regression model that is fit using least squares.







Key idea: often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response.



In other words, we assume that the directions in which \(X_1, \dots, X_p\) show the most variation are the directions that are associated with \(Y\).







How to choose \(M\), the number of components? Typically by cross-validation.







Note: PCR is not feature selection! Each of the \(M\) components is a linear combination of all \(p\) original features, so PCR, like ridge regression, does not produce a sparse model.
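PCR can be sketched with an SVD-based PCA followed by ordinary least squares on the component scores. A minimal illustration on synthetic correlated predictors (names and data are my own; in practice \(M\) would be chosen by cross-validation):

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression: regress y on the first M PC scores.

    PCA is computed via the SVD of the standardized predictor matrix."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    Z = Xs @ Vt[:M].T                         # n x M matrix of component scores
    Zd = np.column_stack([np.ones(len(y)), Z])
    theta, *_ = np.linalg.lstsq(Zd, y, rcond=None)
    beta = Vt[:M].T @ theta[1:]               # implied coefficients on the X scale
    return theta, beta

rng = np.random.default_rng(6)
# Highly correlated predictors: all load on one latent factor.
f = rng.normal(size=(200, 1))
X = f + 0.1 * rng.normal(size=(200, 5))
y = f[:, 0] + 0.2 * rng.normal(size=200)

theta, beta = pcr(X, y, M=1)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
fitted = Xs @ beta + theta[0]
r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(round(r2, 2))  # a single component captures most of the signal
```

A single component fits well here precisely because the first principal component direction coincides with the direction associated with \(Y\); the implied `beta` still has an entry for every one of the five predictors, which is why PCR is not feature selection.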

3.2 Partial Least Squares

The PCR approach involved identifying linear combinations that best represent the predictors \(X_1, \dots, X_p\).


Consequently, PCR suffers from a drawback: the directions are identified in an unsupervised way, since the response \(Y\) is not used to determine them, so there is no guarantee that the directions that best explain the predictors are also the best directions for predicting the response.


Alternatively, partial least squares (PLS) is a supervised version of this idea.






Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.

The first PLS direction is computed by setting each \(\phi_{j1}\) equal to the coefficient from the simple linear regression of \(Y\) onto \(X_j\). This places the highest weight on the variables that are most strongly related to the response.







To identify the second PLS direction, we first adjust each of the variables for \(Z_1\) by regressing each variable on \(Z_1\) and taking the residuals, and then compute \(Z_2\) using this orthogonalized data in exactly the same fashion as \(Z_1\).






As with PCR, the number \(M\) of partial least squares directions is a tuning parameter, typically chosen by cross-validation.
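The two steps above can be sketched as a bare-bones PLS for a single response: each direction weights the (current, deflated) predictors by their covariance with the response, then the predictors are orthogonalized against that direction. This is a simplified sketch with my own naming, not a full NIPALS implementation.

```python
import numpy as np

def pls(X, y, M):
    """Bare-bones PLS1: directions weight predictors by covariance with y,
    deflating X after each direction so that components are orthogonal."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    Xk = Xs.copy()
    scores = []
    for _ in range(M):
        phi = Xk.T @ yc                # weights: covariance with the response
        z = Xk @ phi                   # component scores for this direction
        scores.append(z)
        # Deflate: remove from each predictor its projection onto z.
        Xk = Xk - np.outer(z, z @ Xk) / (z @ z)
    Z = np.column_stack(scores)
    Zd = np.column_stack([np.ones(len(y)), Z])
    theta, *_ = np.linalg.lstsq(Zd, y, rcond=None)
    return Z, theta

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=100)
Z, theta = pls(X, y, M=2)
fitted = np.column_stack([np.ones(100), Z]) @ theta
r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(round(r2, 2))
```

Because the weights are driven by the response, the first PLS direction already concentrates on \(X_1\) here, whereas PCA would ignore \(y\) when choosing directions.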

4 Considerations in High Dimensions

Most traditional statistical techniques for regression and classification are intended for the low-dimensional setting, in which \(n\), the number of observations, is much larger than \(p\), the number of features.







In the past 25 years, new technologies have changed the way that data are collected in many fields. It is now commonplace to collect an almost unlimited number of feature measurements.

















Data sets containing more features than observations are often referred to as high-dimensional.

What can go wrong in high dimensions? When \(p \ge n\), least squares yields residuals of exactly zero and a perfect fit to the training data, regardless of whether the features are truly associated with the response, so the resulting model is badly overfit and training error is useless as a measure of quality.
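This failure mode is easy to demonstrate: fit least squares to a response of pure noise while increasing the number of (entirely irrelevant) predictors. The toy data and seed are my own.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20
y = rng.normal(size=n)              # pure noise: no feature is related to y

r2s = []
for p in (2, 10, 19):
    X = rng.normal(size=(n, p))     # p irrelevant predictors
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(((y - Xd @ beta) ** 2).sum())
    r2 = 1 - rss / float(((y - y.mean()) ** 2).sum())
    r2s.append(r2)
    print(p, round(r2, 3))
# As p approaches n - 1, the training R^2 climbs to 1 even though
# every predictor is pure noise.
```

With \(p = n - 1\) predictors plus an intercept, the design matrix is square and (generically) invertible, so the fit interpolates the training data exactly; the test error of such a model would of course be terrible.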









Many of the methods that we've seen for fitting less flexible models, such as forward stepwise selection, ridge regression, the lasso, and principal components regression, work well in the high-dimensional setting.














When we perform the lasso, ridge regression, or other regression procedures in the high-dimensional setting, we must be careful how we report our results: training-set RSS, \(p\)-values, and \(R^2\) can be misleadingly good, so performance should instead be reported on an independent test set or via cross-validation.