Module 7

STAT 6021

Taylor R. Brown, PhD

Motivation

Now we turn to variable selection, which is the same thing as model selection as long as we restrict ourselves to regression models.

So far we’ve assumed we more-or-less know which predictors we’ll use.

We fit the full model, do an overall F test, maybe run one or two t-tests, check the residuals, maybe transform predictors and/or \(y\), and then finish off with another check of all our assumptions.

What if you have no scientific knowledge of the problem, just a .csv file?

Motivation

Q: If you have \(k\) predictors, how many possible models are there?

Motivation

A: \(2^k\), because each of the \(k\) predictors is either in or out of the model.
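A quick sanity check in R (a sketch; the value of k is arbitrary): counting the subsets of each size and summing gives the same answer.

k <- 6                   # number of candidate predictors
sum(choose(k, 0:k))      # number of models of each size, summed: 64
2^k                      # same: each predictor is in or out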

Multicollinearity tends to be lurking close by in this type of situation. Removing variables can help with this.

We also talked about how residuals can signify variable omissions.

Questions you should be asking

  1. Should I err on the side of a small model (underfitting), or should I aim for a model with a large number of predictors (overfitting)?

  2. Is my goal parameter inference, or prediction?

Bias and Variances and Covariances

The bias of an estimate \(\hat{\boldsymbol{\beta}}\) for a parameter vector \(\boldsymbol{\beta}\) is \[ E[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}]. \]

The covariance matrix is important as well: \[ V[\hat{\boldsymbol{\beta}}] = E\left\{\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^\intercal \right\}. \]

Which error measure?

Q: Bias is bad, and variance is bad. But which one do we look at?

Mean Square Error (MSE)

A: We look at a combination: mean squared error.

\[\begin{align*} MSE(\hat{\boldsymbol{\beta}}_p) &= E[(\hat{\boldsymbol{\beta}}_p - \boldsymbol{\beta}_p)(\hat{\boldsymbol{\beta}}_p - \boldsymbol{\beta}_p)^\intercal] \tag{defn.}\\ &= E[(\hat{\boldsymbol{\beta}}_p \pm E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)(\hat{\boldsymbol{\beta}}_p \pm E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] \\ &= E[(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])^\intercal] \\ &\hspace{10mm} + E[(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] \\ &\hspace{5mm} + E[(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] + E[(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])^\intercal] \\ &= V[\hat{\boldsymbol{\beta}}_p] + \text{Bias}[\hat{\boldsymbol{\beta}}_p]\text{Bias}[\hat{\boldsymbol{\beta}}_p]^\intercal + 0 \tag{fundamental!} \end{align*}\]
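A small Monte Carlo sketch of the scalar version of this identity, \(\text{MSE} = \text{Var} + \text{Bias}^2\) (all numbers below are made up for illustration):

set.seed(1)
mu  <- 5                                                  # true mean
est <- replicate(1e4, 0.9 * mean(rnorm(20, mean = mu)))   # a deliberately biased estimator
mean((est - mu)^2)                                        # Monte Carlo MSE
var(est) + (mean(est) - mu)^2                             # Var + Bias^2: matches up to simulation error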

Positive Semi-Definiteness

What does it mean when we say one matrix is bigger than or equal to another?

\[ \mathbf{M}_1 \ge \mathbf{M}_2 \]

Positive Semi-Definiteness

This

\[ \mathbf{M}_1 \ge \mathbf{M}_2 \]

is shorthand for this:

\[ \mathbf{M}_1 - \mathbf{M}_2 \text{ is positive semi-definite}. \]

A matrix \(\mathbf{A}\) is positive semi-definite if for all possible vectors \(\mathbf{c}\) (such that not all the elements are \(0\)) we have \[ \mathbf{c}^\intercal \mathbf{A} \mathbf{c} \ge 0. \]
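Numerically, a symmetric matrix is positive semi-definite iff all of its eigenvalues are nonnegative. A sketch in R with a made-up matrix:

A <- matrix(c(2, 1, 1, 2), nrow = 2)   # symmetric example matrix
eigen(A, symmetric = TRUE)$values      # 3 and 1: both >= 0, so A is PSD
v <- c(1, -1)
t(v) %*% A %*% v                       # the quadratic form is 2 >= 0, as the definition requires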

Positive Semi-Definiteness

Suppose that for every \(\mathbf{c}\) (not all elements \(0\))

\[ V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] \ge V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}_p]. \] This is true iff \[ \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} - \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}_p]\mathbf{c} \ge 0, \] iff \[ \mathbf{c}^\intercal \left( V[\hat{\boldsymbol{\beta}}^*_p] - V[\hat{\boldsymbol{\beta}}_p] \right) \mathbf{c} \ge 0, \]

which is exactly the definition of \(V[\hat{\boldsymbol{\beta}}^*_p] - V[\hat{\boldsymbol{\beta}}_p]\) being positive semi-definite.

Positive Semi-Definiteness

Note \[\begin{align*} \text{MSE}[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] &= V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] + \left(\mathbf{c}^\intercal E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\} \right)^2 \\ &= \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} + \mathbf{c}^\intercal E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\}E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\}^\intercal\mathbf{c} \\ &= \mathbf{c}^\intercal\text{MSE}[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} \end{align*}\]

So it’s all about positive semi-definiteness for MSE matrices as well.

Now back to regression models…

Analyzing Misspecification

Let’s assume the full model is

\[\begin{align*} \mathbf{y} &= \overbrace{\mathbf{X}}^{n \times K} \boldsymbol{\beta} + \boldsymbol{\epsilon}\\ &= \underbrace{\mathbf{X}_p}_{n \times p} \boldsymbol{\beta}_p + \underbrace{\mathbf{X}_r}_{n \times r} \boldsymbol{\beta}_r + \boldsymbol{\epsilon}\\ \end{align*}\]

where \(K = p + r\), and the smaller model is \[ \mathbf{y} = \mathbf{X}_p \boldsymbol{\beta}_p + \widetilde{\boldsymbol{\epsilon}} \]

Analyzing Misspecification

Estimates for the full model: \[ \hat{\boldsymbol{\beta}}^* = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}, \qquad \hat{\boldsymbol{\beta}}^*_p := \left[\hat{\boldsymbol{\beta}}^*\right]_p \text{ (the first } p \text{ elements)} \]

Estimates for the small model: \[ \hat{\boldsymbol{\beta}}_p = (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1}\mathbf{X}_p^\intercal\mathbf{y} \]

When we’re overfitting

\[ E[\hat{\boldsymbol{\beta}}^*_p] = \left[(\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal\right]_p \mathbf{X}_p \boldsymbol{\beta}_p \] When we’re underfitting

\[ E[\hat{\boldsymbol{\beta}}_p] = \boldsymbol{\beta}_p + (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal \mathbf{X}_r \boldsymbol{\beta}_r = \boldsymbol{\beta}_p + \mathbf{A}\boldsymbol{\beta}_r \]
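A simulation sketch of the underfitting case (hypothetical coefficients; the full model generates the data, but we fit only \(\mathbf{x}_1\)):

set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)            # kept and dropped predictors are correlated
y  <- 2 + 1 * x1 + 3 * x2 + rnorm(n) # full model is true
coef(lm(y ~ x1))                     # one realization: biased away from (2, 1)
Xp <- cbind(1, x1); Xr <- cbind(x2)
A  <- solve(t(Xp) %*% Xp) %*% t(Xp) %*% Xr
c(2, 1) + A %*% 3                    # beta_p + A beta_r: the expected value of the fit above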

Analyzing Misspecification

NB: \(\mathbf{X}_p^\intercal \mathbf{X}_r = \mathbf{0}\) means the removed columns are orthogonal to the kept columns.

\[ \mathbf{X}^\intercal\mathbf{X} = \begin{bmatrix} \mathbf{X}_p^\intercal\mathbf{X}_p &\mathbf{X}_p^\intercal\mathbf{X}_r \\ \mathbf{X}_r^\intercal\mathbf{X}_p & \mathbf{X}_r^\intercal\mathbf{X}_r \end{bmatrix} = \begin{bmatrix} \mathbf{X}_p^\intercal\mathbf{X}_p &\mathbf{0} \\ \mathbf{0} & \mathbf{X}_r^\intercal\mathbf{X}_r \end{bmatrix} \] which means \[ (\mathbf{X}^\intercal\mathbf{X})^{-1} = \begin{bmatrix} (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1} &\mathbf{0} \\ \mathbf{0} & (\mathbf{X}_r^\intercal\mathbf{X}_r)^{-1} \end{bmatrix} \] and the two estimators are equivalent! This is a special case of the general formula for the inverse of a block matrix that we saw earlier.
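A sketch of the orthogonal case: if the dropped column is made exactly orthogonal to the kept column, the small-model estimate is identical to the full-model estimate of the same coefficient.

set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- residuals(lm(rnorm(n) ~ x1 - 1))  # force x2 to be orthogonal to x1
crossprod(x1, x2)                       # essentially 0
y  <- 1 * x1 + 2 * x2 + rnorm(n)
coef(lm(y ~ x1 - 1))                    # small model
coef(lm(y ~ x1 + x2 - 1))["x1"]         # full model: same x1 estimate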

Analyzing Misspecification

Estimates for the full model

Estimates for the small model

When we’re overfitting

\[ V[\hat{\boldsymbol{\beta}}^*_p] = \sigma^2 \left[(\mathbf{X}^\intercal\mathbf{X})^{-1}\right]_p \] When we’re underfitting

\[ V[\hat{\boldsymbol{\beta}}_p] = \sigma^2 (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1} \]

Deliberately Underfitting?

Assume the full model is true.

Wrong, smaller model: \[ MSE[\hat{\boldsymbol{\beta}}_p] = \sigma^2 (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1}+ \mathbf{A}\boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \mathbf{A}^\intercal \]

Right, full/bigger model: \[ MSE[\hat{\boldsymbol{\beta}}^*_p] = \sigma^2 \left[(\mathbf{X}^\intercal\mathbf{X})^{-1}\right]_p + 0 \]

Is it possible we can have smaller-MSE estimates with the restricted model?

Deliberately Underfitting?

Yes!

\(MSE[\hat{\boldsymbol{\beta}}^*_p] - MSE[\hat{\boldsymbol{\beta}}_p]\) is positive semi-definite if and only if \[ V[\hat{\boldsymbol{\beta}}^*_r] - \boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \] is positive semi-definite. The proof involves using the formula for inverses of block matrices.

Intuitively, we should leave out columns with coefficients that are “near” zero (relative to their standard errors).

Overfitting/Underfitting and the Mean Response

For the full model, we predict at \(\mathbf{x}^\intercal = \begin{bmatrix} \mathbf{x}_p^\intercal & \mathbf{x}_r^\intercal \end{bmatrix}\): \[ \hat{y}^* = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y} \] For the smaller model we have \[ \hat{y} = \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal\mathbf{y} \]

Overfitting/Underfitting and the Mean Response

If we’re overfitting: \[ E[\hat{y}^*] = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{X}_p \boldsymbol{\beta}_p \] and if we’re underfitting \[\begin{align*} E[\hat{y}] &= \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal(\mathbf{X}_p \boldsymbol{\beta}_p + \mathbf{X}_r \boldsymbol{\beta}_r)\\ &= \mathbf{x}_p^\intercal\boldsymbol{\beta}_p + \mathbf{x}_p^\intercal \mathbf{A} \boldsymbol{\beta}_r \end{align*}\]

Overfitting/Underfitting and the Mean Response

If we’re overfitting: \[ V[\hat{y}^*] = \sigma^2 \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{x} \] and if we’re underfitting \[ V[\hat{y}] = \sigma^2 \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{x}_p \]
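A numeric sketch of the variance comparison (the \(\sigma^2\) factor is common, so comparing the quadratic forms suffices; the data and model choices are just for illustration — the full model’s prediction variance at a point is never smaller than the smaller model’s):

carsdf <- mtcars[, 1:7]                              # same data used in the R examples below
X  <- model.matrix(mpg ~ cyl + hp + wt + disp, data = carsdf)
Xp <- X[, 1:4]                                       # drop disp
x  <- X[1, ]                                         # predict at the first observation
drop(t(x) %*% solve(crossprod(X)) %*% x)             # full model: larger
drop(t(x[1:4]) %*% solve(crossprod(Xp)) %*% x[1:4])  # smaller model: smaller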

Decomposing MSPE

If your observations are independent, there is a similar decomposition for mean square prediction error (proof: add and subtract expectations)

\[ \text{MSPE}[\hat{y}] := E\left(y' - \hat{y}\right)^2 = E\left[\left(y' - E[y']\right)^2\right] + E\left[(\hat{y}- E[\hat{y}])^2\right] + \left(E[\hat{y}] - E[y'] \right)^2 \]

Forecast error is affected by all of the following: the irreducible variance of the new observation, the variance of the prediction, and the squared bias.

Deliberately Underfitting?

Assume the full model is true.

Notation: \(\hat{y}^*\) is the prediction from the full model and \(\hat{y}\) is the prediction from the smaller model, both at the same point \(\mathbf{x}\).

Question: when does deliberately underfitting give a smaller MSPE?

Deliberately Underfitting

\[ \text{MSPE}[\hat{y}^*] \ge \text{MSPE}[\hat{y}] \] if and only if \[ V[\hat{\boldsymbol{\beta}}^*_r] - \boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \] is positive semi-definite.

Same condition as before!

Model Selection Criteria

Most (all?) model selection criteria are based on the idea of minimizing out-of-sample error/loss/badness.

Know the difference between the data used to estimate your model, \(\{y_i, \mathbf{x}_i \}_{i=1}^n\), and a future/unseen response \(y'_i\) observed at the same predictor values \(\mathbf{x}_i\).

Once again, the predictor data is assumed nonrandom, or if it isn’t, then everything is conditional on it.

Mallows’ Cp

If you use Mallows’ Cp to pick a model, you are trying to pick the model with the lowest out-of-sample mean squared prediction error:

\[ E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] \]


Mallows’ Cp

In-sample MSPE: \[\begin{align*} E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] &= V[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] - 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \end{align*}\]

Out-of-sample MSPE: \[\begin{align*} E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] &= V[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i'] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] - 2 \text{Cov}\left(y_i', \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i'] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \tag{indep.} \\ &= V[y_i] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \tag{ident. dist.} \\ &= E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] + 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \tag{above.} \end{align*}\]

Mallows’ Cp

So \[ E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] = E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] + 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \]

which means

\[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] + 2 \frac{1}{n}\sum_{i=1}^n \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \]

which is the same as

\[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] + 2 \frac{\sigma^2}{n}p \]

because \(\text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) = \sigma^2 \mathbf{H}_{ii}\) and because \(\text{tr}(\mathbf{H}) = p\), so the covariances sum to \(\sigma^2 p\).

Mallows’ Cp

Based on \[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = \underbrace{E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right]}_{\text{in-sample average error}} + \underbrace{ 2 \frac{\sigma^2}{n}p}_{\text{penalty}} \]

Mallows’ Cp for a particular model is \[ C_p := \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2 + 2\frac{\hat{\sigma}^2_{\text{full}}}{n}p \]

where \(\hat{\sigma}^2_{\text{full}}\) is from the largest possible model (it’s unbiased assuming the true model’s predictors are contained in the full model). So, pick the model with the smallest \(C_p\)!
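A sketch computing this version of \(C_p\) by hand for two candidate models (using the first seven columns of mtcars, which is the carsdf data used below):

carsdf <- mtcars[, 1:7]
n <- nrow(carsdf)
sigma2full <- summary(lm(mpg ~ ., data = carsdf))$sigma^2   # sigma-hat^2 from the full model
cp <- function(fit) mean(residuals(fit)^2) + 2 * sigma2full * length(coef(fit)) / n
cp(lm(mpg ~ wt, data = carsdf))                             # larger
cp(lm(mpg ~ cyl + hp + wt, data = carsdf))                  # smaller is better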

Mallows’ Cp

Alternatively, sometimes people divide through by \(\hat{\sigma}^2_{\text{full}}/n\) and then subtract \(n\): \[ \frac{1}{\hat{\sigma}^2_{\text{full}}}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2 + 2p - n = \frac{SS_{Res}(p) }{\hat{\sigma}^2_{\text{full}}} + 2p - n \]

This has expectation roughly \(p\), so people like to plot it versus \(p\). However, the goal is still to pick the model with the smallest value.

AIC

Akaike’s Information Criterion (AIC) suggests picking the model that minimizes out-of-sample cross entropy (a different loss than squared error):

\[ E\left[-2 \sum_{i=1}^n \log \hat{p}(y_i') \right] \]

Entropy is just the average “surprise,” \(-\log \hat{p}(y')\), of your prediction.

AIC

For a particular model with \(p\) columns in the design matrix \[\begin{align*} \text{AIC} &:= - 2 \log(L) + 2p \\ &= n \log\left(\frac{SS_{Res}}{n} \right) + 2p \end{align*}\]

where \[\begin{align*} \log L &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat{\sigma}^2 ) - \frac{1}{2 \hat{\sigma}^2}\sum_{i=1}^n \left( y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right)^2 \\ &= \underbrace{-\frac{n}{2}(\log(2\pi) +1)}_{\text{often ignored}}- \frac{n}{2}\log\left(\frac{SS_{Res}}{n} \right) \end{align*}\] is the normal log-likelihood with the maximum likelihood estimates plugged in. Recall that \(\hat{\sigma}^2 = SS_{Res}/n\), the MLE, differs from \(MS_{Res} = SS_{Res}/(n-p)\) because of the denominator.
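A sketch checking the \(n \log(SS_{Res}/n) + 2p\) form against R: for lm objects, extractAIC() uses exactly this scale (stats::AIC() keeps the constants and counts \(\hat{\sigma}^2\) as a parameter, so its value differs).

fit   <- lm(mpg ~ cyl + hp + wt, data = mtcars)
n     <- nobs(fit)
p     <- length(coef(fit))
SSres <- sum(residuals(fit)^2)
n * log(SSres / n) + 2 * p  # by-hand AIC, constants dropped
extractAIC(fit)             # edf and the same value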

Honorable mentions

using R

library(leaps)
# carsdf is assumed to be the first seven columns of mtcars (inferred from the output below)
carsdf <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
head(carsdf)
##                    mpg cyl disp  hp drat    wt  qsec
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02
## Valiant           18.1   6  225 105 2.76 3.460 20.22
allMods <- regsubsets(mpg ~ ., data = carsdf, nbest = 10, nvmax = NULL, method = "exhaustive")
#summary(allMods)

using R

Pick the top row!

#?plot.regsubsets
plot(allMods, scale = "Cp")

using R

library(car)
## Loading required package: carData
subsets(allMods, statistic="cp", legend = FALSE, min.size = 2, main = "Mallow Cp")
##      Abbreviation
## cyl             c
## disp           ds
## hp              h
## drat           dr
## wt              w
## qsec            q
abline(a = 1, b = 1, lty = 2)

using R

The leaps package doesn’t calculate AIC, so we have to use bestglm()

library(bestglm)
## y var has to come last and be named y
carsdf.for.bestglm <- within(carsdf, 
{
    y    <- mpg         # copy mpg into y
    mpg  <- NULL        # Delete old mpg
})
carsdf.for.bestglm <-
    carsdf.for.bestglm[, c("cyl", "disp", "hp", "drat", "wt", "qsec", "y")]
allModsWithAIC <-
    bestglm(Xy = carsdf.for.bestglm,
            family = gaussian, 
            IC = "AIC",           
            method = "exhaustive")
allModsWithAIC$BestModels
##     cyl  disp    hp  drat   wt  qsec Criterion
## 1  TRUE FALSE  TRUE FALSE TRUE FALSE  60.66456
## 2  TRUE FALSE FALSE FALSE TRUE FALSE  61.19800
## 3  TRUE FALSE FALSE FALSE TRUE  TRUE  61.37837
## 4  TRUE  TRUE  TRUE FALSE TRUE FALSE  61.52554
## 5 FALSE FALSE  TRUE FALSE TRUE FALSE  61.84027

Picks the same model as Mallows’ Cp!

Stepwise Model Selection

Recall that the number of models grows exponentially in the number of possible predictors.

Sometimes you’ll have too many columns, and doing an exhaustive search will be computationally prohibitive.

We’ll mention three greedy strategies: forward selection, backward elimination, and stepwise selection.

Forward Selection

For each absent predictor, fit the model with that predictor added. Add the predictor with the lowest p-value, but only if that p-value is below a pre-chosen threshold; repeat until no candidate qualifies.

Forward Selection

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_forward_p(fullMod, penter = .3, details = T)
## Forward Selection Method    
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. cyl 
## 2. disp 
## 3. hp 
## 4. drat 
## 5. wt 
## 6. qsec 
## 
## We are selecting variables based on p value...
## 
## 
## Forward Selection: Step 1 
## 
## - wt 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.868       RMSE                3.046 
## R-Squared               0.753       Coef. Var          15.161 
## Adj. R-Squared          0.745       MSE                 9.277 
## Pred R-Squared          0.709       MAE                 2.341 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     847.725         1        847.725    91.375    0.0000 
## Residual       278.322        30          9.277                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    37.285         1.878                 19.858    0.000    33.450    41.120 
##          wt    -5.344         0.559       -0.868    -9.559    0.000    -6.486    -4.203 
## ----------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 2 
## 
## - cyl 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.911       RMSE                2.568 
## R-Squared               0.830       Coef. Var          12.780 
## Adj. R-Squared          0.819       MSE                 6.592 
## Pred R-Squared          0.790       MAE                 1.921 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     934.875         2        467.438    70.908    0.0000 
## Residual       191.172        29          6.592                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    39.686         1.715                 23.141    0.000    36.179    43.194 
##          wt    -3.191         0.757       -0.518    -4.216    0.000    -4.739    -1.643 
##         cyl    -1.508         0.415       -0.447    -3.636    0.001    -2.356    -0.660 
## ----------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 3 
## 
## - hp 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.918       RMSE                2.512 
## R-Squared               0.843       Coef. Var          12.501 
## Adj. R-Squared          0.826       MSE                 6.308 
## Pred R-Squared          0.796       MAE                 1.845 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     949.427         3        316.476    50.171    0.0000 
## Residual       176.621        28          6.308                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    38.752         1.787                 21.687    0.000    35.092    42.412 
##          wt    -3.167         0.741       -0.514    -4.276    0.000    -4.684    -1.650 
##         cyl    -0.942         0.551       -0.279    -1.709    0.098    -2.070     0.187 
##          hp    -0.018         0.012       -0.205    -1.519    0.140    -0.042     0.006 
## ----------------------------------------------------------------------------------------
## 
## 
## 
## No more variables to be added.
## 
## Variables Entered: 
## 
## + wt 
## + cyl 
## + hp 
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.918       RMSE                2.512 
## R-Squared               0.843       Coef. Var          12.501 
## Adj. R-Squared          0.826       MSE                 6.308 
## Pred R-Squared          0.796       MAE                 1.845 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     949.427         3        316.476    50.171    0.0000 
## Residual       176.621        28          6.308                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    38.752         1.787                 21.687    0.000    35.092    42.412 
##          wt    -3.167         0.741       -0.514    -4.276    0.000    -4.684    -1.650 
##         cyl    -0.942         0.551       -0.279    -1.709    0.098    -2.070     0.187 
##          hp    -0.018         0.012       -0.205    -1.519    0.140    -0.042     0.006 
## ----------------------------------------------------------------------------------------
## 
##                             Selection Summary                             
## -------------------------------------------------------------------------
##         Variable                  Adj.                                       
## Step    Entered     R-Square    R-Square     C(p)        AIC        RMSE     
## -------------------------------------------------------------------------
##    1    wt            0.7528      0.7446    14.5629    166.0294    3.0459    
##    2    cyl           0.8302      0.8185     3.2353    156.0101    2.5675    
##    3    hp            0.8431      0.8263     3.0100    155.4766    2.5115    
## -------------------------------------------------------------------------

Backward Selection

Backward selection starts with the full model, but it uses the same partial F tests as forward selection.

Test the current model against all submodels that have one fewer predictor. The null hypothesis of each test is that the dropped coefficient is zero (i.e., the smaller model is adequate). If the largest p-value is above the threshold, remove that predictor and repeat.

Backward Selection

Backward selection picks the same model as well.

library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_backward_p(fullMod, prem = .3, details = T)
## Backward Elimination Method 
## ---------------------------
## 
## Candidate Terms: 
## 
## 1 . cyl 
## 2 . disp 
## 3 . hp 
## 4 . drat 
## 5 . wt 
## 6 . qsec 
## 
## We are eliminating variables based on p value...
## 
## - qsec 
## 
## Backward Elimination: Step 1 
## 
##  Variable qsec Removed 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.923       RMSE                2.538 
## R-Squared               0.851       Coef. Var          12.631 
## Adj. R-Squared          0.823       MSE                 6.439 
## Pred R-Squared          0.780       MAE                 1.795 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     958.621         5        191.724    29.773    0.0000 
## Residual       167.426        26          6.439                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    36.008         7.571                  4.756    0.000    20.445    51.572 
##         cyl    -1.107         0.716       -0.328    -1.547    0.134    -2.579     0.364 
##        disp     0.012         0.012        0.254     1.039    0.308    -0.012     0.037 
##          hp    -0.024         0.013       -0.273    -1.809    0.082    -0.051     0.003 
##        drat     0.952         1.391        0.084     0.685    0.500    -1.907     3.811 
##          wt    -3.673         1.059       -0.596    -3.469    0.002    -5.850    -1.496 
## ----------------------------------------------------------------------------------------
## 
## 
## - drat 
## 
## Backward Elimination: Step 2 
## 
##  Variable drat Removed 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.921       RMSE                2.513 
## R-Squared               0.849       Coef. Var          12.506 
## Adj. R-Squared          0.826       MSE                 6.313 
## Pred R-Squared          0.791       MAE                 1.771 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     955.603         4        238.901    37.844    0.0000 
## Residual       170.444        27          6.313                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    40.829         2.757                 14.807    0.000    35.171    46.486 
##         cyl    -1.293         0.656       -0.383    -1.972    0.059    -2.639     0.052 
##        disp     0.012         0.012        0.239     0.989    0.331    -0.012     0.036 
##          hp    -0.021         0.012       -0.234    -1.691    0.102    -0.045     0.004 
##          wt    -3.854         1.015       -0.626    -3.795    0.001    -5.937    -1.770 
## ----------------------------------------------------------------------------------------
## 
## 
## - disp 
## 
## Backward Elimination: Step 3 
## 
##  Variable disp Removed 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.918       RMSE                2.512 
## R-Squared               0.843       Coef. Var          12.501 
## Adj. R-Squared          0.826       MSE                 6.308 
## Pred R-Squared          0.796       MAE                 1.845 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     949.427         3        316.476    50.171    0.0000 
## Residual       176.621        28          6.308                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    38.752         1.787                 21.687    0.000    35.092    42.412 
##         cyl    -0.942         0.551       -0.279    -1.709    0.098    -2.070     0.187 
##          hp    -0.018         0.012       -0.205    -1.519    0.140    -0.042     0.006 
##          wt    -3.167         0.741       -0.514    -4.276    0.000    -4.684    -1.650 
## ----------------------------------------------------------------------------------------
## 
## 
## 
## No more variables satisfy the condition of p value = 0.3
## 
## 
## Variables Removed: 
## 
## - qsec 
## - drat 
## - disp 
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.918       RMSE                2.512 
## R-Squared               0.843       Coef. Var          12.501 
## Adj. R-Squared          0.826       MSE                 6.308 
## Pred R-Squared          0.796       MAE                 1.845 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     949.427         3        316.476    50.171    0.0000 
## Residual       176.621        28          6.308                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    38.752         1.787                 21.687    0.000    35.092    42.412 
##         cyl    -0.942         0.551       -0.279    -1.709    0.098    -2.070     0.187 
##          hp    -0.018         0.012       -0.205    -1.519    0.140    -0.042     0.006 
##          wt    -3.167         0.741       -0.514    -4.276    0.000    -4.684    -1.650 
## ----------------------------------------------------------------------------------------
## 
## 
##                           Elimination Summary                            
## ------------------------------------------------------------------------
##         Variable                  Adj.                                      
## Step    Removed     R-Square    R-Square     C(p)       AIC        RMSE     
## ------------------------------------------------------------------------
##    1    qsec          0.8513      0.8227    5.6040    157.7659    2.5376    
##    2    drat          0.8486      0.8262    4.0655    156.3376    2.5125    
##    3    disp          0.8431      0.8263    3.0100    155.4766    2.5115    
## ------------------------------------------------------------------------

Stepwise Regression

If you use stepwise regression, you must specify two thresholds: one for entering variables into the model, and one for removing variables.

Stepwise is like forward selection with an added backward selection part. Once a variable has been added in a forward step, you re-check all the variables currently in the model and remove any whose p-value now exceeds the removal threshold. This second check is identical to what is done in the backward selection procedure.

Stepwise Regression

library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_both_p(fullMod, pent = .1, prem = .3, details = T)
## Stepwise Selection Method   
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. cyl 
## 2. disp 
## 3. hp 
## 4. drat 
## 5. wt 
## 6. qsec 
## 
## We are selecting variables based on p value...
## 
## 
## Stepwise Selection: Step 1 
## 
## - wt added 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.868       RMSE                3.046 
## R-Squared               0.753       Coef. Var          15.161 
## Adj. R-Squared          0.745       MSE                 9.277 
## Pred R-Squared          0.709       MAE                 2.341 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     847.725         1        847.725    91.375    0.0000 
## Residual       278.322        30          9.277                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    37.285         1.878                 19.858    0.000    33.450    41.120 
##          wt    -5.344         0.559       -0.868    -9.559    0.000    -6.486    -4.203 
## ----------------------------------------------------------------------------------------
## 
## 
## 
## Stepwise Selection: Step 2 
## 
## - cyl added 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.911       RMSE                2.568 
## R-Squared               0.830       Coef. Var          12.780 
## Adj. R-Squared          0.819       MSE                 6.592 
## Pred R-Squared          0.790       MAE                 1.921 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     934.875         2        467.438    70.908    0.0000 
## Residual       191.172        29          6.592                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    39.686         1.715                 23.141    0.000    36.179    43.194 
##          wt    -3.191         0.757       -0.518    -4.216    0.000    -4.739    -1.643 
##         cyl    -1.508         0.415       -0.447    -3.636    0.001    -2.356    -0.660 
## ----------------------------------------------------------------------------------------
## 
## 
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.911       RMSE                2.568 
## R-Squared               0.830       Coef. Var          12.780 
## Adj. R-Squared          0.819       MSE                 6.592 
## Pred R-Squared          0.790       MAE                 1.921 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     934.875         2        467.438    70.908    0.0000 
## Residual       191.172        29          6.592                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    39.686         1.715                 23.141    0.000    36.179    43.194 
##          wt    -3.191         0.757       -0.518    -4.216    0.000    -4.739    -1.643 
##         cyl    -1.508         0.415       -0.447    -3.636    0.001    -2.356    -0.660 
## ----------------------------------------------------------------------------------------
## 
## 
## 
## No more variables to be added/removed.
## 
## 
## Final Model Output 
## ------------------
## 
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.911       RMSE                2.568 
## R-Squared               0.830       Coef. Var          12.780 
## Adj. R-Squared          0.819       MSE                 6.592 
## Pred R-Squared          0.790       MAE                 1.921 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                 Sum of                                              
##                Squares        DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression     934.875         2        467.438    70.908    0.0000 
## Residual       191.172        29          6.592                     
## Total         1126.047        31                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    39.686         1.715                 23.141    0.000    36.179    43.194 
##          wt    -3.191         0.757       -0.518    -4.216    0.000    -4.739    -1.643 
##         cyl    -1.508         0.415       -0.447    -3.636    0.001    -2.356    -0.660 
## ----------------------------------------------------------------------------------------
## 
##                              Stepwise Selection Summary                               
## -------------------------------------------------------------------------------------
##                      Added/                   Adj.                                       
## Step    Variable    Removed     R-Square    R-Square     C(p)        AIC        RMSE     
## -------------------------------------------------------------------------------------
##    1       wt       addition       0.753       0.745    14.5630    166.0294    3.0459    
##    2      cyl       addition       0.830       0.819     3.2350    156.0101    2.5675    
## -------------------------------------------------------------------------------------

Comments

Model Adequacy Checking versus Model Validation

The text distinguishes between model adequacy checking and model validation.

Model adequacy checking includes residual analysis, testing for lack of fit, searching for high leverage or overly influential observations, and other internal analyses that investigate the fit of the regression model to the available data.

Model validation, however, is directed toward determining if the model will function successfully in its intended operating environment.

Why Model Validation?

  1. The data generating process may change.

  2. The model user may ignore or forget the model developer’s decisions.

How to do Model Validation?

Depends on what your intention is:

  1. check the coefficients’ signs/magnitudes (are VIFs > 5?)
  2. check predictions (split the data, or on fresh data)

The text mentions cross-validation as a way to validate a model, but it is also a commonly-used model selection tool.
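A minimal data-splitting sketch for option 2 (the 70/30 split and the model are arbitrary choices):

set.seed(1)
carsdf    <- mtcars[, 1:7]
trainRows <- sample(nrow(carsdf), size = round(.7 * nrow(carsdf)))
fit       <- lm(mpg ~ cyl + hp + wt, data = carsdf[trainRows, ])
preds     <- predict(fit, newdata = carsdf[-trainRows, ])
sqrt(mean((carsdf$mpg[-trainRows] - preds)^2))  # hold-out RMSE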

Cross-Validation

Here the data is repeatedly partitioned into different training-set/test-set pairs (aka folds) for each model we want to evaluate.

  1. The partitions are nonrandom, test sets are disjoint
  2. for each split/estimation/prediction, we never use a data point twice
  3. for each split/estimation/prediction, we lose parameter estimation accuracy because each training set is smaller than the full set
  4. however, we get to average over many prediction scores, which reduces variance
  5. there is still a bias (but it’s usually smaller than AIC’s)
  6. it can be computationally brutal to calculate for some models (but it’s not that bad for linear regression models!)

LOO Cross-Validation and the PRESS statistic

In leave-one-out (LOO) cross validation, each test set is of size \(1\). That means there are \(n\) folds.

To predict \(y_i\), we use \(\mathbf{X}_{(i)}\) and \(\mathbf{y}_{(i)}\), the design matrix and response with row \(i\) deleted. Comparing \(y_i\) with its prediction \(\hat{y}_{(i)}\), we get a squared prediction error: \((y_i - \hat{y}_{(i)})^2\). Then we add them up to get the predicted residual error sum of squares (PRESS) \[ \text{PRESS}_p = \sum_{i=1}^n(y_i - \hat{y}_{(i)})^2. \]

The insanely cool thing about regression models is that for each model you’re cross-validating, you don’t need to re-fit the model on the \(n\) folds!

\[ \text{PRESS}_p = \sum_{i=1}^n \left(\frac{e_i}{1 - h_{ii}} \right)^2. \]
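A sketch verifying the shortcut: the hat-value formula agrees with an explicit leave-one-out loop (the model choice is just for illustration).

fit <- lm(mpg ~ cyl + hp + wt, data = mtcars)
sum((residuals(fit) / (1 - hatvalues(fit)))^2)  # PRESS via the shortcut
looErrs <- sapply(seq_len(nrow(mtcars)), function(i) {
  refit <- lm(mpg ~ cyl + hp + wt, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(refit, newdata = mtcars[i, ])
})
sum(looErrs^2)                                  # identical, up to rounding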

PRESS proof

This is too cool not to prove:

\[ \hat{y}_{(i)} = \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_{(i)} = \mathbf{x}_{i}^\intercal (\mathbf{X}_{(i)}^\intercal\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^\intercal \mathbf{y}_{(i)} \]

The key is the Sherman-Morrison formula. \[ (\mathbf{X}_{(i)}^\intercal\mathbf{X}_{(i)})^{-1} = \left(\mathbf{X}^\intercal \mathbf{X} - \mathbf{x}_i\mathbf{x}_i^\intercal \right)^{-1} = (\mathbf{X}^\intercal \mathbf{X})^{-1} + \frac{ (\mathbf{X}^\intercal \mathbf{X})^{-1}\mathbf{x}_i \mathbf{x}_i^\intercal(\mathbf{X}^\intercal \mathbf{X})^{-1} }{1 - \mathbf{H}_{ii}} \]

So, because \(\mathbf{X}^\intercal_{(i)}\mathbf{y}_{(i)} = \mathbf{X}^\intercal\mathbf{y} - \overbrace{\mathbf{x}_i}^{\text{row i}} y_i\), \[ \hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - \frac{e_i (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{x}_i }{1-\mathbf{H}_{ii}} \] leading us to get \[ y_i - \hat{y}_{(i)} = y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}} + e_i \mathbf{x}_i^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1}\mathbf{x}_i/(1-\mathbf{H}_{ii}) =\left(\frac{e_i}{1 - \mathbf{H}_{ii} } \right) \] \(\square\)

PRESS statistic

Minimizing \(\text{PRESS}\) is equivalent to maximizing \(R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{ SS_T }\)

library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
allMods <- ols_step_all_possible(fullMod)
allMods$predictors[allMods$predrsq == max(allMods$predrsq)]
## [1] "cyl hp wt"

Same model :)