Module 12

STAT 6021

Taylor R. Brown, PhD

Correlated Predictors

Back in module 5 we discussed multicollinearity and how it can lead to coefficient estimates with high variance. This is because

\[ \text{V}\left[\boldsymbol{\hat{\beta}} \right]= \sigma^2 (\mathbf{X}^\intercal \mathbf{X})^{-1} \]

Big Idea

Recall from module 7:

\[ MSE(\hat{\boldsymbol{\beta}}_p) = V[\hat{\boldsymbol{\beta}}_p] + \text{Bias}[\hat{\boldsymbol{\beta}}_p]\text{Bias}[\hat{\boldsymbol{\beta}}_p]^\intercal \]

If shrinking the estimator introduces a little bias but reduces the variance by more, the overall MSE can go down. That trade-off is the idea behind this module.

Ridge Regression


Add a penalty to the loss function:

\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \boldsymbol{\beta}^\intercal \boldsymbol{\beta} \]

where \(k \ge 0\) is the biasing parameter. Calling it \(\lambda\) is popular in the literature, too.

Simplifying yields

\[ \mathbf{y}^\intercal \mathbf{y} - 2\mathbf{y}^\intercal \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^\intercal\left[ k \mathbf{I} + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\beta} \]

NB: for the moment, we are assuming that we don’t have an intercept. More on this later.

Ridge Regression

\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \boldsymbol{\beta}^\intercal \boldsymbol{\beta} \]

This sort of loss function can be motivated in a number of ways. One way is to view it as the Lagrangian of the following constrained optimization problem:

\[\begin{gather} \min_{\boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ \text{subject to } \boldsymbol{\beta}^\intercal \boldsymbol{\beta} \le s \end{gather}\]

The Lagrangian is \[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \left( \boldsymbol{\beta}^\intercal \boldsymbol{\beta} - s \right), \] but the constant \(ks\) does not depend on \(\boldsymbol{\beta}\), so it does not affect the minimizer.

Ridge Regression

source: ISLR

Ridge Regression

\[ \mathbf{y}^\intercal \mathbf{y} - 2\mathbf{y}^\intercal \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^\intercal\left[ k \mathbf{I} + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\beta} \]

Taking derivatives and setting the resulting expression to \(\mathbf{0}\) yields different normal equations:

\[ \left[k \mathbf{I} + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\hat{\beta}}_R = \mathbf{X}^\intercal \mathbf{y} \]

Ridge Regression

Solving

\[ \left[ k \mathbf{I} + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\hat{\beta}}_R = \mathbf{X}^\intercal \mathbf{y} \]

yields \[ \boldsymbol{\hat{\beta}}_R = (\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1} \mathbf{X}^\intercal \mathbf{y} \]

cf. regular OLS:

\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y} \]
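To make the comparison concrete, here is a minimal sketch (not from the book) that computes both estimates directly from their normal equations, using simulated, centered predictors and no intercept; the variable names are mine.

# simulate two highly correlated, centered predictors (no intercept)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = .1)                    # nearly collinear with x1
X  <- scale(cbind(x1, x2), scale = FALSE)       # center the columns
y  <- X %*% c(1, 2) + rnorm(n)
k  <- 1                                         # biasing parameter
betaOLS   <- solve(t(X) %*% X, t(X) %*% y)
betaRidge <- solve(t(X) %*% X + k * diag(2), t(X) %*% y)
cbind(betaOLS, betaRidge)                       # the ridge estimate has the smaller norm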

Ridge Regression

The Ridge Regression estimate is a linear transformation of the OLS estimate

\[ \boldsymbol{\hat{\beta}}_R = (\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1} \mathbf{X}^\intercal \mathbf{y} = (\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1} (\mathbf{X}^\intercal \mathbf{X}) \boldsymbol{\hat{\beta}} = \mathbf{Z}_k \boldsymbol{\hat{\beta}} \]

Ridge Regression’s Bias

\[ E\left[ \boldsymbol{\hat{\beta}}_R \right] = \mathbf{Z}_kE\left[ \boldsymbol{\hat{\beta}} \right] = \mathbf{Z}_k \boldsymbol{\beta} \]

Ridge Regression’s Variance

\[ \text{V}\left[ \boldsymbol{\hat{\beta}}_R \right] = \mathbf{Z}_k \text{V}\left[ \boldsymbol{\hat{\beta}} \right]\mathbf{Z}_k^\intercal = \sigma^2 \mathbf{Z}_k \left(\mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{Z}_k^\intercal = \sigma^2 (\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1}\left(\mathbf{X}^\intercal \mathbf{X} \right)(\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1} \]

The book takes the trace of this matrix, which gives the total variance (ignoring the covariances).

The Spectral Representation Theorem

A special case of the spectral representation theorem (for matrices) states that any real, symmetric matrix is orthogonally diagonalizable. For us this means that \[ \mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D}\mathbf{U}^\intercal \] where \(\mathbf{U}\) is an orthogonal matrix (\(\mathbf{U}^\intercal\mathbf{U} = \mathbf{U}\mathbf{U}^\intercal = \mathbf{I}\)) whose columns \(\mathbf{U}_1, \ldots, \mathbf{U}_{p-1}\) are eigenvectors of \(\mathbf{X}^\intercal \mathbf{X}\), and \(\mathbf{D} = \text{diag}(\lambda_1, \ldots, \lambda_{p-1})\) contains the corresponding eigenvalues.

Using The Spectral Representation Theorem

\[ \mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D}\mathbf{U}^\intercal \]

This is helpful for us here because

\[ [\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I}]^{-1}[\mathbf{X}^\intercal \mathbf{X}][\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I}]^{-1} = \mathbf{U} \begin{bmatrix} \frac{\lambda_1}{(\lambda_1 + k)^2 } & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{\lambda_{p-1}}{(\lambda_{p-1} + k)^2} \end{bmatrix} \mathbf{U}^\intercal \]

Taking the trace on both sides gives us \(\text{tr}\left(\text{V}\left[ \boldsymbol{\hat{\beta}}_R \right] \right) = \sigma^2 \sum_{i=1}^{p-1} \frac{\lambda_i}{(\lambda_i + k)^2}\)
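As a quick numerical check of this identity, here is a sketch on small simulated data (not from the book):

# verify that tr(V[beta_R]) / sigma^2 equals sum(lambda_i / (lambda_i + k)^2)
set.seed(2)
X   <- scale(matrix(rnorm(30 * 3), ncol = 3))   # centered, scaled predictors
XtX <- t(X) %*% X
k   <- 2
Vk  <- solve(XtX + k * diag(3)) %*% XtX %*% solve(XtX + k * diag(3))
sum(diag(Vk))                                   # trace form (sigma^2 factored out)
lam <- eigen(XtX)$values
sum(lam / (lam + k)^2)                          # eigenvalue form -- same number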

Using The Spectral Representation Theorem

We can also see why they call \(k\) the biasing parameter:

\[ E\left[ \boldsymbol{\hat{\beta}}_R - \boldsymbol{\beta} \right] = \left(\mathbf{Z}_k - \mathbf{I}\right)\boldsymbol{\beta} = \left( \mathbf{U} \begin{bmatrix} \frac{\lambda_1}{\lambda_1 + k } & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{\lambda_{p-1}}{\lambda_{p-1} + k} \end{bmatrix} \mathbf{U}^\intercal - \mathbf{U}\mathbf{U}^\intercal \right)\boldsymbol{\beta} = \mathbf{U} \begin{bmatrix} \frac{-k}{\lambda_1 + k } & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{-k}{\lambda_{p-1} + k} \end{bmatrix} \mathbf{U}^\intercal \boldsymbol{\beta} \]

More realistic

In practice usually the intercept is not penalized. Instead we fix \(k\) and choose \(\hat{\boldsymbol{\beta}}_R\) to minimize

\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \sum_{j=1}^{p-1} \beta_j^2, \] where the penalty term doesn’t include \(\beta_0\).

Taking the derivatives and setting all of them equal to \(0\) yields new normal equations:

\[ \left[k \text{ diag}(0, 1, \ldots, 1) + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\hat{\beta}}_R = \mathbf{X}^\intercal \mathbf{y} \]

More realistic

\[ \left[k \text{ diag}(0, 1, \ldots, 1) + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\hat{\beta}}_R = \mathbf{X}^\intercal \mathbf{y} \]

Notice that the first row is the same as in the OLS normal equations: \[ \sum_{i=1}^n y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^n x_{i1} + \cdots + \hat{\beta}_{p-1} \sum_{i=1}^n x_{i,p-1} \]

Read the documentation, because every function is different!
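For concreteness, here is a sketch that solves these modified normal equations by hand on the longley data. The predictors are not standardized here, so the penalty is applied unevenly; the point is only to illustrate the linear algebra, not to produce a good fit.

# solve [k * diag(0, 1, ..., 1) + X'X] beta_R = X'y without any package
X <- model.matrix(GNP.deflator ~ ., data = longley)   # first column is the intercept
y <- longley$GNP.deflator
k <- 5
P <- diag(c(0, rep(1, ncol(X) - 1)))                  # leaves the intercept unpenalized
betaR <- solve(t(X) %*% X + k * P, t(X) %*% y)
round(betaR, 4)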

More realistic

Another thing the book fails to mention is that scaling the predictors is often important.

The reason is that scaling makes the penalty apply evenly to all of the predictors:

\[ \tilde{x}_{i,j} = \frac{x_{i,j} - \bar{x}_j}{s_j} \]

where \(\bar{x}_j = n^{-1} \sum_{i=1}^n x_{i,j}\) and \(s_j^2 = (n-1)^{-1} \sum_{i=1}^n (x_{i,j} - \bar{x}_j)^2\).
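R’s scale() function performs exactly this centering and scaling. For example:

# (x - mean(x)) / sd(x), column by column
xtilde <- scale(longley$GNP)
c(mean = mean(xtilde), sd = sd(xtilde))   # approximately 0 and exactly 1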

Selecting \(k\)

Increasing \(k\) increases the bias of \(\hat{\boldsymbol{\beta}}_R\) but decreases its variance.

How do we pick the “best” \(k\)?

The book recommends using the ridge trace, which is a plot of the ridge coefficient estimates versus \(k\).

Selecting \(k\)

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
head(longley)
##      GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
## 1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
## 1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
## 1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
## 1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
## 1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
## 1952         98.1 346.999      193.2        359.4    113.270 1952   63.639
X <- model.matrix(GNP.deflator ~ ., data = longley)
possible_k <- seq(0, 10, .05)
mod <- glmnet(x = X,
              y = as.matrix(longley[,1]),
              family = "gaussian",
              alpha = 0, # 0 for ridge regression
              lambda = possible_k,
              standardize = T, # default is T 
              intercept = T)
plot(mod, xvar = "lambda", label=T)

Selecting \(k\)

glmnet also provides a method that can select \(k\) using cross-validation: cv.glmnet()

mod2 <- cv.glmnet(x = X,
                  y = as.matrix(longley[,1]),
                  family = "gaussian",
                  alpha = 0, # 0 for ridge regression
                  standardize = T, # default is T 
                  intercept = T,
                  nfolds = 3)
plot(mod2)
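For reference, the cross-validated choice of \(k\) (glmnet calls it lambda) and the coefficients at that value can be pulled out of the fitted object:

mod2$lambda.min                  # the value of k that minimizes the CV error
coef(mod2, s = "lambda.min")     # ridge coefficients at that k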

The Lasso

Ridge regression chooses \(\hat{\boldsymbol{\beta}}_R\), for a fixed \(k\), to minimize

\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \sum_{j=1}^{p-1} \beta_j^2, \] while the Lasso minimizes

\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \sum_{j=1}^{p-1} |\beta_j|. \]

This small change leads to very different behavior!

The Lasso

For any \(0 < k < \infty\), ridge regression will always have all predictors in the model. It does not perform variable selection.

The Lasso does, though! Increasing \(k\) will eliminate more and more predictors from the model, until only the intercept is left. Very useful if you want to figure out which predictors are the most useful.

To get a sense for why this is true, we can rewrite the minimization problem as follows:

\[ \min_{\beta}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \hspace{5mm} \text{subject to } \sum_{j=1}^{p-1}|\beta_j| \le s \]
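Before the classic geometric picture, here is a quick empirical illustration of the variable-selection behavior on the longley data (a sketch; glmnet reports the number of nonzero coefficients in its df component):

# as k grows, the lasso sets more and more coefficients exactly to zero
lassoPath <- glmnet(x = model.matrix(GNP.deflator ~ ., data = longley),
                    y = longley$GNP.deflator,
                    alpha = 1)                        # lasso
plot(lassoPath$lambda, lassoPath$df, type = "s",
     xlab = "k", ylab = "number of nonzero coefficients")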

The Lasso

This plot is a classic:

source: ISLR

The Lasso

Most of the time we have to follow a two-step procedure:

The Lasso

Step 1/2: perform cross-validation to choose \(k\)

mod3 <- cv.glmnet(x = X,
                  y = as.matrix(longley[,1]),
                  family = "gaussian",
                  alpha = 1, # 0 for ridge regression, 1 for lasso
                  standardize = T, # default is T 
                  intercept = T,
                  nfolds = 3)
bestK <- mod3$lambda.min
cat("best k is ", bestK, "\n")
## best k is  0.3314758
plot(mod3)

The Lasso

Step 2/2: Use the chosen \(k\) to estimate the coefficients

mod4 <- glmnet(x = X,
              y = as.matrix(longley[,1]),
              family = "gaussian",
              alpha = 1, # lasso=1, ridge=0 (see dox)
              lambda = bestK, # calculated in the last slide
              standardize = T, 
              intercept = T)
predict(mod4, newx = X[c(3,5),])
##            s0
## 1949 88.68671
## 1951 95.04757

The Lasso

Instead of picking a single \(k\), we can plot the coefficient estimates versus many different values of \(k\)…

# plot.glmnet is different than plot.cv.glmnet
# we let glmnet calculate several k values instead of supplying our own
plot(glmnet(x = X,
              y = as.matrix(longley[,1]),
              family = "gaussian",
              alpha = 1, # lasso=1, ridge=0 (see dox)
              standardize = T, 
              intercept = T),
     xvar = "lambda")


Lasso versus Ridge

Lasso beats Ridge at performing variable selection, but which model predicts the best?

Very generally, the lasso tends to predict better when there are many candidate predictors and only a few of them have coefficients that are far from \(0\).

Ridge regression tends to predict better when many coefficients are far from \(0\) and they are all roughly the same size.

Unfortunately, these are rarely known before you look at the data, so cross-validation is your friend!
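As a sketch of what that looks like in practice, we can compare the two on identical cross-validation folds (the foldid argument of cv.glmnet keeps the folds fixed). With only 16 observations this is an illustration, not a verdict.

set.seed(42)
Xcv   <- model.matrix(GNP.deflator ~ ., data = longley)
folds <- sample(rep(1:3, length.out = nrow(Xcv)))      # three folds, held fixed
cvRidge <- cv.glmnet(x = Xcv, y = longley$GNP.deflator, alpha = 0, foldid = folds)
cvLasso <- cv.glmnet(x = Xcv, y = longley$GNP.deflator, alpha = 1, foldid = folds)
c(ridge = min(cvRidge$cvm), lasso = min(cvLasso$cvm))  # estimated prediction MSEs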

PCA regression

As long as we’re talking about variable selection, we should mention principal components analysis.

Principal components can be used in a few ways, but here we use them to choose which (combinations of) predictors to include in a regression model.

This is a very useful tool when you have many columns and a lot of multicollinearity. It is a principled way to combine and remove columns.

PCA regression

Let’s plot our two predictor columns:

What do we do about this correlation?

PCA regression

PCA will help us reduce the number of columns. In this case we reduce two columns into \(1\). However, there are a few pre-processing steps: drop the intercept column, center each predictor so that it has mean \(0\), and (usually) scale each predictor to have unit variance.

When we write \(\mathbf{X}\), we are assuming all these things have already been done.

PCA regression

The first principal component is a linear combination of your \(p-1\), zero-mean predictors:

\[\begin{align*} \mathbf{z}_1 &= \phi_{11} \mathbf{X}_1 + \phi_{21} \mathbf{X}_2 + \cdots + \phi_{p-1,1} \mathbf{X}_{p-1} \\ &= \mathbf{X}\boldsymbol{\phi}_1 \end{align*}\]

that maximizes the sample variance of the elements of \(\mathbf{z}_1\), restricting \(\boldsymbol{\phi}_1^\intercal\boldsymbol{\phi}_1 = \sum_{j=1}^{p-1} \phi_{j1}^2 = 1\).

PCA regression

So the first principal component \(\mathbf{z}_1 = \mathbf{X}\boldsymbol{\phi}_1\) solves the following constrained optimization problem:

\[ \max \frac{1}{n-1}\mathbf{z}_1^\intercal \mathbf{z}_1 \hspace{5mm} \text{subject to } \hspace{5mm} \boldsymbol{\phi}_1^\intercal\boldsymbol{\phi}_1 = 1 \]

PCA regression

To maximize \(\frac{1}{n-1}\mathbf{z}_1^\intercal \mathbf{z}_1\), it is handy to use the Spectral Representation Theorem \(\mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{U}^\intercal\).

\[\begin{align*} \frac{1}{n-1}\mathbf{z}_1^\intercal \mathbf{z}_1 &= \frac{1}{n-1}\left(\mathbf{X}\boldsymbol{\phi}_1\right)^\intercal\left(\mathbf{X}\boldsymbol{\phi}_1 \right) \tag{defn. of $\mathbf{z}_1$} \\ &= \frac{1}{n-1}\boldsymbol{\phi}_1^\intercal\left(\mathbf{X}^\intercal \mathbf{X} \right)\boldsymbol{\phi}_1 \tag{property of transpose}\\ &= \frac{1}{n-1}\boldsymbol{\phi}_1^\intercal\left(\mathbf{U} \mathbf{D} \mathbf{U}^\intercal \right)\boldsymbol{\phi}_1 \tag{SRT} \\ &= \frac{1}{n-1}\boldsymbol{\phi}_1^\intercal\left(\sum_{i=1}^{p-1}\lambda_i \mathbf{U}_i \mathbf{U}_i^\intercal \right)\boldsymbol{\phi}_1 \tag{matrix mult.} \\ &= \frac{1}{n-1}\sum_{i=1}^{p-1} \lambda_i \left(\boldsymbol{\phi}_1^\intercal \mathbf{U}_i \mathbf{U}_i^\intercal \boldsymbol{\phi}_1\right) \end{align*}\]

So… what’s \(\boldsymbol{\phi}_1\)?

PCA regression

Since \(\sum_{i=1}^{p-1} \left(\boldsymbol{\phi}_1^\intercal \mathbf{U}_i\right)^2 = \boldsymbol{\phi}_1^\intercal \mathbf{U}\mathbf{U}^\intercal \boldsymbol{\phi}_1 = 1\), this objective is a weighted average of the eigenvalues, and it is maximized by putting all of the weight on the largest one. So, the weight vector of the first principal component is the eigenvector with the largest eigenvalue!

WLOG assume \(\lambda_1 > \lambda_2 > \cdots > \lambda_{p-1}\) when you write \(\mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{U}^\intercal\).

If we do that, we set \(\boldsymbol{\phi}_1 = \mathbf{U}_1\). In this case, the first principal component is defined by the first eigenvector.

PCA Regression

head(myDf) # all the data (make sure it's mean 0!)
##            y         x1         x2
## 1  0.2167549  1.6565861  1.7625717
## 2 -0.5424926 -1.3533537 -1.6398911
## 3  0.8911446 -0.5982720 -0.5874262
## 4  0.5959806  0.7255946  0.9545904
## 5  1.6356180 -0.5693906 -0.7319355
## 6  0.6892754 -0.5064352  0.1437340
X <- as.matrix(myDf[,-1]) # ignore y for the time being
eigenStuff <- eigen(t(X) %*% X) 
eigenStuff$values # make sure first is the biggest 
## [1] 257.250792   4.589225
(firstEigenVector <- eigenStuff$vectors[,1]) # get the first eigenvector
## [1] -0.7121932 -0.7019835
plot(myDf$x1, myDf$x2) # vis
arrows(0,0, firstEigenVector[1], firstEigenVector[2], col = "red")

# finally, combine the two columns into the one, first principal component
(firstPrinComp <- X %*% firstEigenVector)
##                [,1]
##   [1,] -2.417105579
##   [2,]  2.115025805
##   [3,]  0.838448742
##   [4,] -1.186870251
##   [5,]  0.919322749
##   [6,]  0.259780799
##   [7,]  1.098530219
##   [8,] -1.696482864
##   [9,] -0.006692646
##  [10,]  1.989452984
##  [11,] -0.257988036
##  [12,] -0.657930906
##  [13,] -0.780110578
##  [14,] -3.019223659
##  [15,] -1.868162855
##  [16,]  0.786262582
##  [17,]  1.026773229
##  [18,]  2.466224733
##  [19,] -2.370901657
##  [20,]  1.913865641
##  [21,]  0.537242106
##  [22,]  0.425435830
##  [23,] -2.212999057
##  [24,]  0.144775738
##  [25,] -2.660526482
##  [26,]  0.571112610
##  [27,]  0.392206813
##  [28,]  0.190059404
##  [29,] -0.675164896
##  [30,]  3.161861977
##  [31,]  0.038269319
##  [32,] -1.687904584
##  [33,]  1.283782081
##  [34,]  1.716871331
##  [35,]  1.936274675
##  [36,] -0.780147633
##  [37,] -0.646177979
##  [38,]  0.547281377
##  [39,]  2.592285247
##  [40,] -2.047714067
##  [41,] -0.023097495
##  [42,] -0.887397720
##  [43,]  2.200450344
##  [44,] -1.409116946
##  [45,]  1.977940134
##  [46,] -0.919403667
##  [47,]  1.445216021
##  [48,] -0.696725437
##  [49,]  0.061424947
##  [50,] -1.852274650
##  [51,] -1.620187849
##  [52,]  2.211386744
##  [53,] -1.100297853
##  [54,]  0.721282275
##  [55,]  0.462965401
##  [56,] -2.311143422
##  [57,]  2.488446851
##  [58,]  0.343610944
##  [59,]  1.887324944
##  [60,] -1.423166064
##  [61,]  1.140247675
##  [62,]  0.016028170
##  [63,] -0.487439654
##  [64,]  0.808513151
##  [65,] -1.859746495
##  [66,] -1.045632916
##  [67,]  0.335704695
##  [68,]  0.733749416
##  [69,] -0.738275351
##  [70,]  0.504983337
##  [71,] -3.569345315
##  [72,] -0.049218259
##  [73,]  1.050060406
##  [74,]  3.089302821
##  [75,]  1.005604968
##  [76,]  0.672428284
##  [77,]  0.520423194
##  [78,] -2.357425973
##  [79,]  1.907982321
##  [80,]  3.349689317
##  [81,]  1.592016215
##  [82,] -0.079164622
##  [83,]  1.393872742
##  [84,]  4.416673205
##  [85,] -1.344943674
##  [86,] -0.684727489
##  [87,]  1.232862825
##  [88,] -0.973344836
##  [89,] -0.478852686
##  [90,] -0.597563667
##  [91,]  0.012231655
##  [92,] -3.201081919
##  [93,] -1.105158190
##  [94,]  2.383871050
##  [95,]  0.360882537
##  [96,]  2.338604019
##  [97,]  1.667914261
##  [98,] -0.930633762
##  [99,]  1.683427512
## [100,] -0.934267606
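For reference, the built-in prcomp() function computes the same quantities; here is a quick check against the by-hand calculation above (the signs of the loadings may be flipped, which is immaterial):

# prcomp() on the same two predictor columns from myDf
pcaFit <- prcomp(myDf[, -1], center = TRUE, scale. = FALSE)
pcaFit$rotation[, 1]       # same direction as firstEigenVector (possibly times -1)
head(pcaFit$x[, 1])        # same values as head(firstPrinComp) (possibly times -1)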

PCA regression

The second principal component \(\mathbf{z}_2 = \mathbf{X}\mathbf{U}_2\) maximizes the variance under the same restriction, plus an extra one: \[ \mathbf{z}_1 \text{ must be uncorrelated with } \mathbf{z}_2. \]

Because the columns of \(\mathbf{X}\) (and hence \(\mathbf{z}_1\) and \(\mathbf{z}_2\)) have mean zero, this happens iff

\[ \mathbf{z}_1^\intercal \mathbf{z}_2 = 0. \]

PCA regression

We need the Spectral Representation Theorem again: \(\mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{U}^\intercal\)

\[ \mathbf{z}_1^\intercal \mathbf{z}_2 = 0 \]

iff

\[ \left(\mathbf{X}\boldsymbol{\phi}_1\right)^\intercal \left(\mathbf{X}\boldsymbol{\phi}_2\right) = 0 \]

iff

\[ \sum_{i=1}^{p-1} \lambda_i \left(\boldsymbol{\phi}_1^\intercal \mathbf{U}_i \mathbf{U}_i^\intercal \boldsymbol{\phi}_2\right) = 0 \]

PCA regression

So, given \(\boldsymbol{\phi}_1\), if we wanted to pick a unit-length vector \(\boldsymbol{\phi}_2\) to maximize

\[ \frac{1}{n-1}\mathbf{z}_2^\intercal \mathbf{z}_2 = \frac{1}{n-1}\sum_{i=1}^{p-1} \lambda_i \left(\boldsymbol{\phi}_2^\intercal \mathbf{U}_i \mathbf{U}_i^\intercal \boldsymbol{\phi}_2\right) \]

subject to

\[ \frac{1}{n-1}\sum_{i=1}^{p-1} \lambda_i \left(\boldsymbol{\phi}_1^\intercal \mathbf{U}_i \mathbf{U}_i^\intercal \boldsymbol{\phi}_2\right) = 0 \]

what would it be?

PCA regression

In general, the weights are the eigenvectors of \(\mathbf{X}^\intercal \mathbf{X}\): \(\boldsymbol{\phi}_i = \mathbf{U}_i\), ordered by decreasing eigenvalue.

So the principal components themselves are \(\mathbf{z}_i = \mathbf{X}\mathbf{U}_i\), for \(i = 1, \ldots, p-1\).

PCA regression

With two columns of data, we have at most two principal components

PCA regression

Compare the plot of \(\mathbf{X}\) with the plot of \(\mathbf{X} \begin{bmatrix} \mathbf{U}_1 & \mathbf{U}_2 \end{bmatrix}\)
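Here is a sketch of that comparison, reusing X and eigenStuff from the earlier chunks:

par(mfrow = c(1, 2))
plot(X[, 1], X[, 2], xlab = "x1", ylab = "x2", main = "original predictors")
Z <- X %*% eigenStuff$vectors                     # both principal components at once
plot(Z[, 1], Z[, 2], xlab = "z1", ylab = "z2", main = "principal components")
# the right panel is an orthogonal rotation of the left: uncorrelated axes,
# with almost all of the variability along z1
par(mfrow = c(1, 1))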

PCA regression

In general, the sample variance of \(\mathbf{z}_i\) is \(\lambda_i/(n-1)\) because \[ \frac{1}{n-1}\mathbf{z}_i^\intercal\mathbf{z}_i = \frac{1}{n-1}\mathbf{U}_i^\intercal \mathbf{X}^\intercal \mathbf{X} \mathbf{U}_i = \frac{1}{n-1}\mathbf{U}_i^\intercal \mathbf{U} \mathbf{D} \mathbf{U}^\intercal\mathbf{U}_i = \frac{\lambda_i}{n-1} \]

When we have more than two columns, we have to pick how many principal component predictors to ignore. We ignore principal components that have a small variance. A plot of the eigenvalues \(\lambda_i\) (the variances, up to the \(1/(n-1)\) factor) is called a scree plot.

PCA regression

X <- model.matrix(GNP.deflator ~ . + 0, data = longley)
head(X)
##          GNP Unemployed Armed.Forces Population Year Employed
## 1947 234.289      235.6        159.0    107.608 1947   60.323
## 1948 259.426      232.5        145.6    108.632 1948   61.122
## 1949 258.054      368.2        161.6    109.773 1949   60.171
## 1950 284.599      335.1        165.0    110.929 1950   61.187
## 1951 328.975      209.9        309.9    112.075 1951   63.221
## 1952 346.999      193.2        359.4    113.270 1952   63.639
X <- scale(x = X, center = TRUE, scale = TRUE) 
# demeaning is necessary; scaling is highly-advisable
head(X)
##             GNP Unemployed Armed.Forces Population       Year   Employed
## 1947 -1.5434331 -0.8960348   -1.4609267 -1.4111352 -1.5753151 -1.4219946
## 1948 -1.2905329 -0.9292089   -1.6534776 -1.2639263 -1.3652731 -1.1944868
## 1949 -1.3043364  0.5229601   -1.4235660 -1.0998977 -1.1552311 -1.4652752
## 1950 -1.0372705  0.1687464   -1.3747098 -0.9337126 -0.9451891 -1.1759787
## 1951 -0.5908091 -1.1710587    0.7074273 -0.7689652 -0.7351470 -0.5968163
## 1952 -0.4094719 -1.3497707    1.4187163 -0.5971736 -0.5251050 -0.4777947
eigenStuff <- eigen(t(X) %*% X)
plot(eigenStuff$values, 
     ylab = "lambda_i", xlab = "i",
     type = "b", main = "Scree plot")

We probably only need the first two principal components. So we just reduced six columns into \(2\)!

PCA regression

What is the interpretation of the first principal component?

eigenStuff$vectors[,1]  # U1
## [1] -0.4664796 -0.3120084 -0.2046347 -0.4659838 -0.4686311 -0.4543304
head(X)
##             GNP Unemployed Armed.Forces Population       Year   Employed
## 1947 -1.5434331 -0.8960348   -1.4609267 -1.4111352 -1.5753151 -1.4219946
## 1948 -1.2905329 -0.9292089   -1.6534776 -1.2639263 -1.3652731 -1.1944868
## 1949 -1.3043364  0.5229601   -1.4235660 -1.0998977 -1.1552311 -1.4652752
## 1950 -1.0372705  0.1687464   -1.3747098 -0.9337126 -0.9451891 -1.1759787
## 1951 -0.5908091 -1.1710587    0.7074273 -0.7689652 -0.7351470 -0.5968163
## 1952 -0.4094719 -1.3497707    1.4187163 -0.5971736 -0.5251050 -0.4777947

PCA regression

What is the interpretation of the first principal component?

eigenStuff$vectors[,1]  # U1
## [1] -0.4664796 -0.3120084 -0.2046347 -0.4659838 -0.4686311 -0.4543304
plot.ts(X)

PCA regression

What is the interpretation of the second principal component?

eigenStuff$vectors[,2]  # U2
## [1]  0.03972008 -0.61091602  0.78171111 -0.05946171 -0.01389938  0.10199406

PCA regression

What is the interpretation of the second principal component?

par(mfrow=c(1,2))
plot.ts(X[,3] - X[,2], ylab = "armed forces - unemployed")
z2 <- (X %*% eigenStuff$vectors[,2]) # second princomp
plot.ts(z2, ylab = "second prinComp")

par(mfrow=c(1,1))

PCA regression

On this “longley” data set, the first principal component is (up to sign) a roughly equal-weighted average of all six standardized predictors, and the second is essentially the contrast between Armed.Forces and Unemployed.

Now it’s time to run the regression!

PCA regression

pcaDf <- data.frame(longley$GNP.deflator, 
                    (X %*% eigenStuff$vectors)[,1:2])
colnames(pcaDf) <- c("y", "z1", "z2")
pcaReg <- lm(y ~ z1 + z2, data = pcaDf)
summary(pcaReg)
## 
## Call:
## lm(formula = y ~ z1 + z2, data = pcaDf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8933 -1.0642 -0.4644  1.2777  2.2403 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 101.6813     0.3661 277.773  < 2e-16 ***
## z1           -5.0143     0.1773 -28.285 4.62e-13 ***
## z2            0.4600     0.3472   1.325    0.208    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.464 on 13 degrees of freedom
## Multiple R-squared:  0.984,  Adjusted R-squared:  0.9816 
## F-statistic: 400.9 on 2 and 13 DF,  p-value: 2.084e-12

PCA regression

Here’s the regression using all of the original predictors:

summary(lm(GNP.deflator ~ .,
           data = longley))
## 
## Call:
## lm(formula = GNP.deflator ~ ., data = longley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0086 -0.5147  0.1127  0.4227  1.5503 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  2946.85636 5647.97658   0.522   0.6144  
## GNP             0.26353    0.10815   2.437   0.0376 *
## Unemployed      0.03648    0.03024   1.206   0.2585  
## Armed.Forces    0.01116    0.01545   0.722   0.4885  
## Population     -1.73703    0.67382  -2.578   0.0298 *
## Year           -1.41880    2.94460  -0.482   0.6414  
## Employed        0.23129    1.30394   0.177   0.8631  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 9 degrees of freedom
## Multiple R-squared:  0.9926, Adjusted R-squared:  0.9877 
## F-statistic: 202.5 on 6 and 9 DF,  p-value: 4.426e-09