Taylor R. Brown, PhD
Back in module 5 we discussed multicollinearity and how it can lead to coefficient estimates with high variance. This was because
\[ \text{V}\left[\boldsymbol{\hat{\beta}} \right]= \sigma^2 (\mathbf{X}^\intercal \mathbf{X})^{-1} \]
Recall from module 7:
\[ MSE(\hat{\boldsymbol{\beta}}_p) = V[\hat{\boldsymbol{\beta}}_p] + \text{Bias}[\hat{\boldsymbol{\beta}}_p]\text{Bias}[\hat{\boldsymbol{\beta}}_p]^\intercal \]
Add a penalty to the loss function:
\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \boldsymbol{\beta}^\intercal \boldsymbol{\beta} \]
where \(k \ge 0\) is the biasing parameter. Calling it \(\lambda\) is popular in the literature, too.
Simplifying yields
\[ \mathbf{y}^\intercal \mathbf{y} - 2\mathbf{y}^\intercal \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^\intercal\left[ k \mathbf{I} + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\beta} \]
NB: for the moment, we are assuming that we don’t have an intercept. More on this later.
\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \boldsymbol{\beta}^\intercal \boldsymbol{\beta} \]
This sort of loss function can be motivated in a number of ways. One way is to view it as the Lagrangian of this constrained optimization problem:
\[\begin{gather} \min_{\boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ \text{subject to } \boldsymbol{\beta}^\intercal \boldsymbol{\beta} \le s \end{gather}\]
The Lagrangian is \[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \left( \boldsymbol{\beta}^\intercal \boldsymbol{\beta} - s \right) \] but the \(ks\) term is a constant that does not affect the minimizer, so we can drop it.
\[ \mathbf{y}^\intercal \mathbf{y} - 2\mathbf{y}^\intercal \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^\intercal\left[ k \mathbf{I} + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\beta} \]
Taking derivatives and setting the resulting expression to \(\mathbf{0}\) yields different normal equations:
\[ \left[k \mathbf{I} + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\hat{\beta}}_R = \mathbf{X}^\intercal \mathbf{y} \]
Solving
\[ \left[ k \mathbf{I} + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\hat{\beta}}_R = \mathbf{X}^\intercal \mathbf{y} \]
yields \[ \boldsymbol{\hat{\beta}}_R = (\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1} \mathbf{X}^\intercal \mathbf{y} \]
Compare with regular OLS:
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y} \]
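As a quick numerical check of these two formulas, here is a minimal R sketch; the design matrix, response, and value of \(k\) below are all made up for illustration:

set.seed(1)
n <- 50
p <- 3
X <- matrix(rnorm(n * p), n, p)            # hypothetical no-intercept design matrix
y <- X %*% c(1, 2, 3) + rnorm(n)           # hypothetical response
k <- 2                                     # biasing parameter
betaOLS <- solve(t(X) %*% X, t(X) %*% y)
betaRidge <- solve(t(X) %*% X + k * diag(p), t(X) %*% y)
cbind(betaOLS, betaRidge)                  # the ridge estimates are shrunk toward 0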
The ridge regression estimate is a linear transformation of the OLS estimate:
\[ \boldsymbol{\hat{\beta}}_R = (\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1} \mathbf{X}^\intercal \mathbf{y} = (\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1} (\mathbf{X}^\intercal \mathbf{X}) \boldsymbol{\hat{\beta}} = \mathbf{Z}_k \boldsymbol{\hat{\beta}} \]
\[ E\left[ \boldsymbol{\hat{\beta}}_R \right] = \mathbf{Z}_kE\left[ \boldsymbol{\hat{\beta}} \right] = \mathbf{Z}_k \boldsymbol{\beta} \]
\[ \text{V}\left[ \boldsymbol{\hat{\beta}}_R \right] = \mathbf{Z}_k \text{V}\left[ \boldsymbol{\hat{\beta}} \right]\mathbf{Z}_k^\intercal = \sigma^2 \mathbf{Z}_k \left(\mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{Z}_k^\intercal = \sigma^2 (\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1}\left(\mathbf{X}^\intercal \mathbf{X} \right)(\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I})^{-1} \]
The book takes the trace of this variance matrix, which gives us the total variance (ignoring the covariances).
A special case of the spectral representation theorem (for matrices) states that any real, symmetric matrix is orthogonally diagonalizable. For us this means that \[ \mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D}\mathbf{U}^\intercal \] where \(\mathbf{U}\) is an orthogonal matrix whose columns \(\mathbf{U}_1, \ldots, \mathbf{U}_{p-1}\) are eigenvectors of \(\mathbf{X}^\intercal \mathbf{X}\), and \(\mathbf{D} = \text{diag}(\lambda_1, \ldots, \lambda_{p-1})\) holds the corresponding eigenvalues.
This is helpful for us here because
\[ [\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I}]^{-1}[\mathbf{X}^\intercal \mathbf{X}][\mathbf{X}^\intercal \mathbf{X} + k\mathbf{I}]^{-1} = \mathbf{U} \begin{bmatrix} \frac{\lambda_1}{(\lambda_1 + k)^2 } & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{\lambda_{p-1}}{(\lambda_{p-1} + k)^2} \end{bmatrix} \mathbf{U}^\intercal \]
Taking the trace on both sides gives us \(\text{tr}\left(\text{V}\left[ \boldsymbol{\hat{\beta}}_R \right] \right) = \sigma^2 \sum_{i=1}^{p-1} \frac{\lambda_i}{(\lambda_i + k)^2}\)
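To see how this total variance behaves, here is a small sketch (made-up design matrix, \(\sigma^2 = 1\)) that evaluates \(\sum_i \lambda_i / (\lambda_i + k)^2\) over a grid of \(k\) values; the total variance shrinks as \(k\) grows:

X <- matrix(rnorm(50 * 3), 50, 3)          # hypothetical design matrix
lambda <- eigen(t(X) %*% X)$values         # eigenvalues of X'X
kGrid <- seq(0, 10, by = 0.5)
totalVar <- sapply(kGrid, function(k) sum(lambda / (lambda + k)^2))
plot(kGrid, totalVar, type = "b",
     xlab = "k", ylab = "total variance (sigma^2 = 1)")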
We can also see why they call \(k\) the biasing parameter:
\[ E\left[ \boldsymbol{\hat{\beta}}_R - \boldsymbol{\beta} \right] = \left( \mathbf{Z}_k - \mathbf{I} \right)\boldsymbol{\beta} = \left( \mathbf{U} \begin{bmatrix} \frac{\lambda_1}{\lambda_1 + k } & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{\lambda_{p-1}}{\lambda_{p-1} + k} \end{bmatrix} \mathbf{U}^\intercal - \mathbf{U}\mathbf{U}^\intercal \right)\boldsymbol{\beta} = \mathbf{U} \begin{bmatrix} \frac{-k}{\lambda_1 + k } & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{-k}{\lambda_{p-1} + k} \end{bmatrix} \mathbf{U}^\intercal \boldsymbol{\beta} \]
In practice the intercept is usually not penalized. Instead, we fix \(k\) and choose \(\hat{\boldsymbol{\beta}}_R\) to minimize
\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \sum_{j=1}^{p-1} \beta_j^2, \] where the penalty term doesn’t include \(\beta_0\).
Taking the derivatives and setting all of them equal to \(0\) yields new normal equations:
\[ \left[k \text{ diag}(0, 1, \ldots, 1) + \mathbf{X}^\intercal \mathbf{X} \right] \boldsymbol{\hat{\beta}}_R = \mathbf{X}^\intercal \mathbf{y} \]
Notice the first row is the same as the OLS: \[ \sum_{i=1}^n y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^n x_{i1} + \cdots + \hat{\beta}_{p-1} \sum_{i=1}^n x_{i,p-1} \]
Software packages handle the ridge penalty (and the intercept) differently, so read the documentation, because every function is different!
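For example (an illustration, not something from the book): MASS::lm.ridge and glmnet both fit ridge regressions, but they parameterize the penalty differently, so the same numeric value of \(k\) is not comparable across the two. A sketch with made-up data:

library(MASS)
set.seed(3)
df <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
df$y <- 1 + df$x1 - df$x2 + rnorm(30)
# lm.ridge centers/scales the predictors internally and leaves the intercept unpenalized
ridgeFits <- lm.ridge(y ~ x1 + x2, data = df, lambda = seq(0, 10, by = 0.1))
plot(ridgeFits) # a ridge trace: one curve of coefficient estimates per predictor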
Another thing that this book fails to mention is that scaling predictors is often important.
The reason this is done is so that the penalty is applied evenly to all predictors. A common choice is to standardize each predictor column:
\[ \tilde{x}_{i,j} = \frac{x_{i,j} - \bar{x}_j}{s_j} \]
where \(\bar{x}_j = n^{-1} \sum_{i=1}^n x_{i,j}\) and \(s_j^2 = (n-1)^{-1} \sum_{i=1}^n (x_{i,j} - \bar{x}_j)^2\).
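In R this is exactly what scale() does; a quick sketch on a made-up predictor matrix:

Xraw <- matrix(rnorm(20 * 2, mean = 5, sd = 3), 20, 2) # hypothetical raw predictors
Xtilde <- scale(Xraw, center = TRUE, scale = TRUE)     # subtract column means, divide by column sds
colMeans(Xtilde)                                       # essentially 0
apply(Xtilde, 2, sd)                                   # exactly 1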
Increasing \(k\) shrinks the coefficient estimates toward \(\mathbf{0}\): the total variance \(\sigma^2 \sum_i \lambda_i/(\lambda_i + k)^2\) decreases, while the bias grows.
How do we pick the “best” \(k\)?
The book recommends using the ridge trace, which is a plot of the ridge coefficient estimates versus \(k\).
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
## GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
## 1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
## 1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
## 1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
## 1950 89.5 284.599 335.1 165.0 110.929 1950 61.187
## 1951 96.2 328.975 209.9 309.9 112.075 1951 63.221
## 1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
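The code that builds the design matrix and draws the ridge trace is not shown above; here is a sketch of how it might look with glmnet, assuming the response is GNP.deflator (the first column of longley) and using mod1 as a placeholder name. This also defines the X used below.

X <- as.matrix(longley[,-1])   # every predictor; drop the response GNP.deflator
mod1 <- glmnet(x = X,
               y = as.matrix(longley[,1]),
               family = "gaussian",
               alpha = 0,      # 0 for ridge regression
               standardize = TRUE,
               intercept = TRUE)
plot(mod1, xvar = "lambda")    # ridge trace: coefficient estimates versus log of the penalty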
The glmnet package also provides a function that can select \(k\) using cross-validation: cv.glmnet()
mod2 <- cv.glmnet(x = X,
y = as.matrix(longley[,1]),
family = "gaussian",
alpha = 0, # 0 for ridge regression
standardize = T, # default is T
intercept = T,
nfolds = 3)
plot(mod2)
mod2$lambda.min
is the \(k\) that minimizes the estimated out-of-sample MSE (the first vertical line in the plot), while mod2$lambda.1se adds one standard error to be conservative (the second vertical line).

Ridge regression chooses \(\hat{\boldsymbol{\beta}}_R\), for a fixed \(k\), to minimize
\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \sum_{j=1}^{p-1} \beta_j^2, \] while the Lasso minimizes
\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + k \sum_{j=1}^{p-1} |\beta_j|. \]
This small change leads to very different behavior!
For any \(0 < k < \infty\), ridge regression will always have all predictors in the model. It does not perform variable selection.
The Lasso does, though! Increasing \(k\) will eliminate more and more predictors from the model, until only the intercept is left. Very useful if you want to figure out which predictors are the most useful.
To get a sense for why this is true, we can rewrite the minimization problem as follows:
\[ \min_{\beta}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\intercal(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \hspace{5mm} \text{subject to} \sum_{j=1}^{p-1}|\beta_j| \le s \]
Most of the time we have to do the following two-step procedure (cross-validate with cv.glmnet, then refit with glmnet).

Step 1/2: perform cross-validation to choose \(k\)
mod3 <- cv.glmnet(x = X,
y = as.matrix(longley[,1]),
family = "gaussian",
alpha = 1, # 0 for ridge regression, 1 for lasso
standardize = T, # default is T
intercept = T,
nfolds = 3)
bestK <- mod3$lambda.min
cat("best k is ", bestK, "\n")
## best k is 0.3314758
Step 2/2: Use the chosen \(k\) to estimate the coefficients
mod4 <- glmnet(x = X,
y = as.matrix(longley[,1]),
family = "gaussian",
alpha = 1, # lasso=1, ridge=0 (see the docs)
lambda = bestK, # calculated in the last slide
standardize = T,
intercept = T)
predict(mod4, newx = X[c(3,5),])
## s0
## 1949 88.68671
## 1951 95.04757
Instead of picking a single \(k\), we can plot the coefficient estimates versus many different values of \(k\)…
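A sketch of how that plot might be made with glmnet (mod5 is a placeholder name; X and the response are the same longley objects used above):

mod5 <- glmnet(x = X,
               y = as.matrix(longley[,1]),
               family = "gaussian",
               alpha = 1)                  # 1 for the lasso
plot(mod5, xvar = "lambda", label = TRUE)  # each coefficient's path drops to 0 as the penalty grows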
Lasso beats Ridge at performing variable selection, but which model predicts the best?
Very generally, Lasso predicts better when you have too many possible predictors, and very few of them actually have coefficients that are far away from \(0\).
Ridge regression performs better when many coefficients are far away from \(0\), and they’re all roughly the same size.
Unfortunately, these are rarely known before you look at the data, so cross-validation is your friend!
As long as we’re talking about variable selection, we should mention principal components analysis.
Principal components can be used in a few ways, but here we use them to choose which predictors to include in a regression model.
This is a very useful tool when you have many columns and a lot of multicollinearity. It is a principled way to combine and remove columns.
Let’s plot our two predictor columns:
What do we do about this correlation?
PCA will help us reduce the number of columns. In this case we reduce two columns into one. However, there are a few pre-processing steps: each predictor column must be de-meaned (centered at zero), and it is usually a good idea to scale each column to unit variance as well.

When we write \(\mathbf{X}\), we are assuming all these things have already been done.
The first principal component is a linear combination of your \(p-1\) zero-mean predictors:
\[\begin{align*} \mathbf{z}_1 &= \phi_{11} \mathbf{X}_1 + \phi_{21} \mathbf{X}_2 + \cdots + \phi_{p-1,1} \mathbf{X}_{p-1} \\ &= \mathbf{X}\boldsymbol{\phi}_1 \end{align*}\]
that maximizes the sample variance of the elements of \(\mathbf{z}_1\), restricting \(\boldsymbol{\phi}_1^\intercal\boldsymbol{\phi}_1 = \sum_{j=1}^{p-1} \phi_{j1}^2 = 1\).
So the first principal component \(\mathbf{z}_1 = \mathbf{X}\boldsymbol{\phi}_1\) solves the following constrained optimization problem:
\[ \max \frac{1}{n-1}\mathbf{z}_1^\intercal \mathbf{z}_1 \hspace{5mm} \text{subject to } \hspace{5mm} \boldsymbol{\phi}_1^\intercal\boldsymbol{\phi}_1 = 1 \]
To maximize \(\frac{1}{n-1}\mathbf{z}_1^\intercal \mathbf{z}_1\), it is handy to use the Spectral Representation Theorem \(\mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{U}^\intercal\).
\[\begin{align*} \frac{1}{n-1}\mathbf{z}_1^\intercal \mathbf{z}_1 &= \frac{1}{n-1}\left(\mathbf{X}\boldsymbol{\phi}_1\right)^\intercal\left(\mathbf{X}\boldsymbol{\phi}_1 \right) \tag{defn. of $\mathbf{z}_1$} \\ &= \frac{1}{n-1}\boldsymbol{\phi}_1^\intercal\left(\mathbf{X}^\intercal \mathbf{X} \right)\boldsymbol{\phi}_1 \tag{property of transpose}\\ &= \frac{1}{n-1}\boldsymbol{\phi}_1^\intercal\left(\mathbf{U} \mathbf{D} \mathbf{U}^\intercal \right)\boldsymbol{\phi}_1 \tag{SRT} \\ &= \frac{1}{n-1}\boldsymbol{\phi}_1^\intercal\left(\sum_{i=1}^{p-1}\lambda_i \mathbf{U}_i \mathbf{U}_i^\intercal \right)\boldsymbol{\phi}_1 \tag{matrix mult.} \\ &= \frac{1}{n-1}\sum_{i=1}^{p-1} \lambda_i \left(\boldsymbol{\phi}_1^\intercal \mathbf{U}_i \mathbf{U}_i^\intercal \boldsymbol{\phi}_1\right) \end{align*}\]
So… what’s \(\boldsymbol{\phi}_1\)?

Because \(\mathbf{U}\) is orthogonal, \(\sum_{i=1}^{p-1} \left(\boldsymbol{\phi}_1^\intercal \mathbf{U}_i\right)^2 = \boldsymbol{\phi}_1^\intercal \boldsymbol{\phi}_1 = 1\) for any unit-length \(\boldsymbol{\phi}_1\), so the quantity above is \(\tfrac{1}{n-1}\) times a weighted average of the eigenvalues, and it is maximized by putting all of the weight on the largest eigenvalue.

So, the weight vector of the first principal component is the eigenvector with the largest eigenvalue!
WLOG assume \(\lambda_1 > \lambda_2 > \cdots > \lambda_{p-1}\) when you write \(\mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{U}^\intercal\).
If we do that, we set \(\boldsymbol{\phi}_1 = \mathbf{U}_1\). In this case, the first principal component is defined by the first eigenvector.
## y x1 x2
## 1 0.2167549 1.6565861 1.7625717
## 2 -0.5424926 -1.3533537 -1.6398911
## 3 0.8911446 -0.5982720 -0.5874262
## 4 0.5959806 0.7255946 0.9545904
## 5 1.6356180 -0.5693906 -0.7319355
## 6 0.6892754 -0.5064352 0.1437340
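The data frame myDf is not constructed in the slides shown here; as a rough sketch, two highly correlated predictors like these could be simulated as follows (the coefficients are made up, so this will not reproduce the printout above exactly):

set.seed(2)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2) # x2 is nearly a copy of x1, so the two are highly correlated
y <- 0.5 * x1 + rnorm(n)
myDf <- data.frame(y = y, x1 = x1, x2 = x2)
head(myDf)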
X <- as.matrix(myDf[,-1]) # ignore y for the time being
eigenStuff <- eigen(t(X) %*% X)
eigenStuff$values # make sure first is the biggest
## [1] 257.250792 4.589225
firstEigenVector <- eigenStuff$vectors[,1] # the eigenvector paired with the largest eigenvalue
firstEigenVector
## [1] -0.7121932 -0.7019835
# finally, combine the two columns into the one, first principal component
(firstPrinComp <- X %*% firstEigenVector)
## [,1]
## [1,] -2.417105579
## [2,] 2.115025805
## [3,] 0.838448742
## [4,] -1.186870251
## [5,] 0.919322749
## [6,] 0.259780799
## [7,] 1.098530219
## [8,] -1.696482864
## [9,] -0.006692646
## [10,] 1.989452984
## [11,] -0.257988036
## [12,] -0.657930906
## [13,] -0.780110578
## [14,] -3.019223659
## [15,] -1.868162855
## [16,] 0.786262582
## [17,] 1.026773229
## [18,] 2.466224733
## [19,] -2.370901657
## [20,] 1.913865641
## [21,] 0.537242106
## [22,] 0.425435830
## [23,] -2.212999057
## [24,] 0.144775738
## [25,] -2.660526482
## [26,] 0.571112610
## [27,] 0.392206813
## [28,] 0.190059404
## [29,] -0.675164896
## [30,] 3.161861977
## [31,] 0.038269319
## [32,] -1.687904584
## [33,] 1.283782081
## [34,] 1.716871331
## [35,] 1.936274675
## [36,] -0.780147633
## [37,] -0.646177979
## [38,] 0.547281377
## [39,] 2.592285247
## [40,] -2.047714067
## [41,] -0.023097495
## [42,] -0.887397720
## [43,] 2.200450344
## [44,] -1.409116946
## [45,] 1.977940134
## [46,] -0.919403667
## [47,] 1.445216021
## [48,] -0.696725437
## [49,] 0.061424947
## [50,] -1.852274650
## [51,] -1.620187849
## [52,] 2.211386744
## [53,] -1.100297853
## [54,] 0.721282275
## [55,] 0.462965401
## [56,] -2.311143422
## [57,] 2.488446851
## [58,] 0.343610944
## [59,] 1.887324944
## [60,] -1.423166064
## [61,] 1.140247675
## [62,] 0.016028170
## [63,] -0.487439654
## [64,] 0.808513151
## [65,] -1.859746495
## [66,] -1.045632916
## [67,] 0.335704695
## [68,] 0.733749416
## [69,] -0.738275351
## [70,] 0.504983337
## [71,] -3.569345315
## [72,] -0.049218259
## [73,] 1.050060406
## [74,] 3.089302821
## [75,] 1.005604968
## [76,] 0.672428284
## [77,] 0.520423194
## [78,] -2.357425973
## [79,] 1.907982321
## [80,] 3.349689317
## [81,] 1.592016215
## [82,] -0.079164622
## [83,] 1.393872742
## [84,] 4.416673205
## [85,] -1.344943674
## [86,] -0.684727489
## [87,] 1.232862825
## [88,] -0.973344836
## [89,] -0.478852686
## [90,] -0.597563667
## [91,] 0.012231655
## [92,] -3.201081919
## [93,] -1.105158190
## [94,] 2.383871050
## [95,] 0.360882537
## [96,] 2.338604019
## [97,] 1.667914261
## [98,] -0.930633762
## [99,] 1.683427512
## [100,] -0.934267606
The second principal component \(\mathbf{z}_2 = \mathbf{X}\mathbf{U}_2\) maximizes the variance under the same restriction, plus an extra one: \[ \mathbf{z}_1 \text{ must be uncorrelated with } \mathbf{z}_2. \]
Because the columns of \(\mathbf{X}\) have mean zero, so do \(\mathbf{z}_1\) and \(\mathbf{z}_2\), and this happens iff

\[ \mathbf{z}_1^\intercal \mathbf{z}_2 = 0. \]
We need the Spectral Representation Theorem again: \(\mathbf{X}^\intercal \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{U}^\intercal\)
\[ \mathbf{z}_1^\intercal \mathbf{z}_2 = 0 \]
iff
\[ \left(\mathbf{X}\boldsymbol{\phi}_1\right)^\intercal \left(\mathbf{X}\boldsymbol{\phi}_2\right) = 0 \]
iff
\[ \sum_{i=1}^{p-1} \lambda_i \left(\boldsymbol{\phi}_1^\intercal \mathbf{U}_i \mathbf{U}_i^\intercal \boldsymbol{\phi}_2\right) = 0 \]
So, given \(\boldsymbol{\phi}_1\), if we wanted to pick a unit-length vector \(\boldsymbol{\phi}_2\) to maximize
\[ \frac{1}{n-1}\mathbf{z}_2^\intercal \mathbf{z}_2 = \frac{1}{n-1}\sum_{i=1}^{p-1} \lambda_i \left(\boldsymbol{\phi}_2^\intercal \mathbf{U}_i \mathbf{U}_i^\intercal \boldsymbol{\phi}_2\right) \]
subject to
\[ \frac{1}{n-1}\sum_{i=1}^{p-1} \lambda_i \left(\boldsymbol{\phi}_1^\intercal \mathbf{U}_i \mathbf{U}_i^\intercal \boldsymbol{\phi}_2\right) = 0 \]
what would it be?
It would be \(\boldsymbol{\phi}_2 = \mathbf{U}_2\), the eigenvector with the second-largest eigenvalue. In general, the weight vectors are the eigenvectors \(\boldsymbol{\phi}_j = \mathbf{U}_j\), ordered by decreasing eigenvalue.

So the principal components themselves are \(\mathbf{z}_j = \mathbf{X}\mathbf{U}_j\) for \(j = 1, \ldots, p-1\).
With two columns of data, we have at most two principal components
Compare the plot of \(\mathbf{X}\) with the plot of \(\mathbf{X} \begin{bmatrix} \mathbf{U}_1 & \mathbf{U}_2 \end{bmatrix}\)
In general, the sample variance of \(\mathbf{z}_i\) is \(\lambda_i/(n-1)\), i.e. proportional to \(\lambda_i\), because \[ \frac{1}{n-1}\mathbf{z}_i^\intercal\mathbf{z}_i = \frac{1}{n-1}\mathbf{U}_i^\intercal \mathbf{U} \mathbf{D} \mathbf{U}^\intercal\mathbf{U}_i = \frac{\lambda_i}{n-1} \]
When we have more than two columns, we have to pick how many principal component predictors to ignore. We ignore principal components that have a small variance. A plot of the variances is called a scree plot.
X <- as.matrix(longley[,-1]) # every column except the response, GNP.deflator
head(X)
## GNP Unemployed Armed.Forces Population Year Employed
## 1947 234.289 235.6 159.0 107.608 1947 60.323
## 1948 259.426 232.5 145.6 108.632 1948 61.122
## 1949 258.054 368.2 161.6 109.773 1949 60.171
## 1950 284.599 335.1 165.0 110.929 1950 61.187
## 1951 328.975 209.9 309.9 112.075 1951 63.221
## 1952 346.999 193.2 359.4 113.270 1952 63.639
X <- scale(x = X, center = TRUE, scale = TRUE)
# demeaning is necessary; scaling is highly-advisable
head(X)
## GNP Unemployed Armed.Forces Population Year Employed
## 1947 -1.5434331 -0.8960348 -1.4609267 -1.4111352 -1.5753151 -1.4219946
## 1948 -1.2905329 -0.9292089 -1.6534776 -1.2639263 -1.3652731 -1.1944868
## 1949 -1.3043364 0.5229601 -1.4235660 -1.0998977 -1.1552311 -1.4652752
## 1950 -1.0372705 0.1687464 -1.3747098 -0.9337126 -0.9451891 -1.1759787
## 1951 -0.5908091 -1.1710587 0.7074273 -0.7689652 -0.7351470 -0.5968163
## 1952 -0.4094719 -1.3497707 1.4187163 -0.5971736 -0.5251050 -0.4777947
eigenStuff <- eigen(t(X) %*% X)
plot(eigenStuff$values,
ylab = "lambda_i", xlab = "i",
type = "b", main = "Scree plot")
We probably only need the first two principal components. So we just reduced six columns into \(2\)!
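As a cross-check (not something the book does), R's built-in prcomp() recovers the same directions, and its variances are the eigenvalues divided by \(n-1\):

pcs <- prcomp(X, center = FALSE, scale. = FALSE) # X was already centered and scaled above
pcs$rotation                                     # columns match eigenStuff$vectors, up to sign
pcs$sdev^2                                       # equals eigenStuff$values / (nrow(X) - 1)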
What is the interpretation of the first principal component?
eigenStuff$vectors[,1]
## [1] -0.4664796 -0.3120084 -0.2046347 -0.4659838 -0.4686311 -0.4543304
What is the interpretation of the second principal component?
eigenStuff$vectors[,2]
## [1] 0.03972008 -0.61091602 0.78171111 -0.05946171 -0.01389938 0.10199406
On this “longley” data set: the first principal component puts roughly equal, same-sign weight on all six predictors, so it acts like an overall average of them, while the second principal component is essentially a contrast between Armed.Forces and Unemployed.
Now it’s time to run the regression!
pcaDf <- data.frame(longley$GNP.deflator,
(X %*% eigenStuff$vectors)[,1:2])
colnames(pcaDf) <- c("y", "z1", "z2")
pcaReg <- lm(y ~ z1 + z2, data = pcaDf)
summary(pcaReg)
##
## Call:
## lm(formula = y ~ z1 + z2, data = pcaDf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8933 -1.0642 -0.4644 1.2777 2.2403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.6813 0.3661 277.773 < 2e-16 ***
## z1 -5.0143 0.1773 -28.285 4.62e-13 ***
## z2 0.4600 0.3472 1.325 0.208
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.464 on 13 degrees of freedom
## Multiple R-squared: 0.984, Adjusted R-squared: 0.9816
## F-statistic: 400.9 on 2 and 13 DF, p-value: 2.084e-12
Here’s the regression on all of the original predictors, for comparison:
summary(lm(GNP.deflator ~ ., data = longley))
##
## Call:
## lm(formula = GNP.deflator ~ ., data = longley)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0086 -0.5147 0.1127 0.4227 1.5503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2946.85636 5647.97658 0.522 0.6144
## GNP 0.26353 0.10815 2.437 0.0376 *
## Unemployed 0.03648 0.03024 1.206 0.2585
## Armed.Forces 0.01116 0.01545 0.722 0.4885
## Population -1.73703 0.67382 -2.578 0.0298 *
## Year -1.41880 2.94460 -0.482 0.6414
## Employed 0.23129 1.30394 0.177 0.8631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.195 on 9 degrees of freedom
## Multiple R-squared: 0.9926, Adjusted R-squared: 0.9877
## F-statistic: 202.5 on 6 and 9 DF, p-value: 4.426e-09