Taylor R. Brown, PhD
Now we turn to variable selection, which coincides with model selection as long as we restrict ourselves to regression models.
So far we’ve assumed we more-or-less know which predictors we’ll use.
We fit the full model, do an overall F test, maybe run one or two t-tests, check the residuals, maybe transform predictors and/or \(y\), and then finish off with another check of all our assumptions.
What if you have no scientific knowledge of the problem, just a .csv file?
Q: If you have \(k\) predictors, how many possible models are there?
A:
If you know there’s an intercept: \(2^{k}\)
If you don’t know about the intercept: \(2^{k+1}\)
If you start making more predictors by using interactions and/or powers …
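To see how quickly this blows up, here is a tiny R sketch (the predictor counts and names are hypothetical):

k <- 6                                      # e.g., six candidate predictors
2^k                                         # 64 models if the intercept is always kept
2^(k + 1)                                   # 128 if the intercept is optional too
# enumerate the subsets explicitly for a small k
incl <- expand.grid(rep(list(c(FALSE, TRUE)), 3))
nrow(incl)                                  # 2^3 = 8 candidate models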
Multicollinearity tends to be lurking close by in this type of situation. Removing variables can help with this.
We also talked about how residuals can signify variable omissions.
Should I err on the side of a small model (underfitting), or should I aim for a model with a large number of predictors (overfitting)?
Is my goal parameter inference, or prediction?
The bias of an estimate \(\hat{\boldsymbol{\beta}}\) for a parameter vector \(\boldsymbol{\beta}\) is \[ E[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}]. \]
The covariance matrix is important as well: \[ V[\hat{\boldsymbol{\beta}}] = E\left\{\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^\intercal \right\}. \]
Q: Bias is bad, and variance is bad. But which one do we look at?
A: We look at a combination: mean squared error.
\[\begin{align*} MSE(\hat{\boldsymbol{\beta}}_p) &= E[(\hat{\boldsymbol{\beta}}_p - \boldsymbol{\beta}_p)(\hat{\boldsymbol{\beta}}_p - \boldsymbol{\beta}_p)^\intercal] \tag{defn.}\\ &= E[(\hat{\boldsymbol{\beta}}_p \pm E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)(\hat{\boldsymbol{\beta}}_p \pm E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] \\ &= E[(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])^\intercal] \\ &\hspace{10mm} + E[(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] \\ &\hspace{5mm} + 2E[(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] \\ &= V[\hat{\boldsymbol{\beta}}_p] + \text{Bias}[\hat{\boldsymbol{\beta}}_p]\text{Bias}[\hat{\boldsymbol{\beta}}_p]^\intercal + 0 \tag{fundamental!} \end{align*}\]
What does it mean when we say one matrix is bigger than or equal to another?
\[ \mathbf{M}_1 \ge \mathbf{M}_2 \]
is shorthand for
\[ \mathbf{M}_1 - \mathbf{M}_2 \text{ is positive semi-definite}. \]
A matrix \(\mathbf{A}\) is positive semi-definite if for all possible vectors \(\mathbf{c}\) (such that not all the elements are \(0\)) we have \[ \mathbf{c}^\intercal \mathbf{A} \mathbf{c} \ge 0. \]
Suppose that for every \(\mathbf{c}\) (with not all elements equal to \(0\))
\[ V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] \ge V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}_p] \] this is true iff \[ \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} - \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}_p]\mathbf{c} \ge 0 \] iff \[ \mathbf{c}^\intercal \left( V[\hat{\boldsymbol{\beta}}^*_p] - V[\hat{\boldsymbol{\beta}}_p] \right) \mathbf{c} \ge 0 \]
this is the definition of \(V[\hat{\boldsymbol{\beta}}^*_p] - V[\hat{\boldsymbol{\beta}}_p]\) being positive semi-definite.
Note \[\begin{align*} \text{MSE}[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] &= V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] + \left(\mathbf{c}^\intercal E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\} \right)^2 \\ &= \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} + \mathbf{c}^\intercal E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\}E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\}^\intercal\mathbf{c} \\ &= \mathbf{c}^\intercal\text{MSE}[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} \end{align*}\]
So it’s all about positive semi-definiteness for MSE matrices as well.
Now back to regression models…
Let’s assume the full model is
\[\begin{align*} \mathbf{y} &= \overbrace{\mathbf{X}}^{n \times K} \boldsymbol{\beta} + \boldsymbol{\epsilon}\\ &= \underbrace{\mathbf{X}_p}_{n \times p} \boldsymbol{\beta}_p + \underbrace{\mathbf{X}_r}_{n \times r} \boldsymbol{\beta}_r + \boldsymbol{\epsilon}\\ \end{align*}\]
and the smaller model is \[ \mathbf{y} = \mathbf{X}_p \boldsymbol{\beta}_p + \widetilde{\boldsymbol{\epsilon}} \]
Estimates for the full model: \(\hat{\boldsymbol{\beta}}^* = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}\); its first \(p\) elements are \(\hat{\boldsymbol{\beta}}^*_p\).
Estimates for the small model: \(\hat{\boldsymbol{\beta}}_p = (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1}\mathbf{X}_p^\intercal\mathbf{y}\).
When we’re overfitting (fitting the full model when the smaller model is true, i.e. \(\boldsymbol{\beta}_r = \mathbf{0}\)):
\[ E[\hat{\boldsymbol{\beta}}^*_p] = \left[(\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal\right]_p \mathbf{X}_p \boldsymbol{\beta}_p \] When we’re underfitting (fitting the smaller model when the full model is true):
\[ E[\hat{\boldsymbol{\beta}}_p] = \boldsymbol{\beta}_p + (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal \mathbf{X}_r \boldsymbol{\beta}_r = \boldsymbol{\beta}_p + \mathbf{A}\boldsymbol{\beta}_r \]
NB: \(\mathbf{X}_p^\intercal \mathbf{X}_r = 0\) means removed columns are orthogonal to kept columns.
\[ \mathbf{X}^\intercal\mathbf{X} = \begin{bmatrix} \mathbf{X}_p^\intercal\mathbf{X}_p &\mathbf{X}_p^\intercal\mathbf{X}_r \\ \mathbf{X}_r^\intercal\mathbf{X}_p & \mathbf{X}_r^\intercal\mathbf{X}_r \end{bmatrix} = \begin{bmatrix} \mathbf{X}_p^\intercal\mathbf{X}_p &\mathbf{0} \\ \mathbf{0} & \mathbf{X}_r^\intercal\mathbf{X}_r \end{bmatrix} \] which means \[ (\mathbf{X}^\intercal\mathbf{X})^{-1} = \begin{bmatrix} (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1} &\mathbf{0} \\ \mathbf{0} & (\mathbf{X}_r^\intercal\mathbf{X}_r)^{-1} \end{bmatrix} \] and the two estimators are equivalent! This is a special case of the general formula for the inverse of a block matrix that we saw earlier.
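A quick numerical sanity check of the orthogonality remark (a sketch with simulated data; poly() returns columns that are orthogonal to each other and to the intercept):

set.seed(1)
x  <- runif(100)
Xo <- poly(x, 2)           # orthogonal polynomial columns
y  <- 1 + 2 * Xo[, 1] + 0.5 * Xo[, 2] + rnorm(100, sd = 0.3)
coef(lm(y ~ Xo))           # full model
coef(lm(y ~ Xo[, 1]))      # drop the second column: intercept and first slope are unchanged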
Estimates for the full model
Estimates for the small model
When we’re overfitting
\[ V[\hat{\boldsymbol{\beta}}^*_p] = \sigma^2 \left[(\mathbf{X}^\intercal\mathbf{X})^{-1}\right]_p \] When we’re underfitting
\[ V[\hat{\boldsymbol{\beta}}_p] = \sigma^2 (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1} \]
Assume the full model is true.
Wrong, smaller model: \[ MSE[\hat{\boldsymbol{\beta}}_p] = \sigma^2 (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1}+ \mathbf{A}\boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \mathbf{A}^\intercal \]
Right, full/bigger model: \[ MSE[\hat{\boldsymbol{\beta}}^*_p] = \sigma^2 \left[(\mathbf{X}^\intercal\mathbf{X})^{-1}\right]_p + 0 \]
Is it possible we can have smaller-MSE estimates with the restricted model?
Yes!
\(MSE[\hat{\boldsymbol{\beta}}^*_p] - MSE[\hat{\boldsymbol{\beta}}_p]\) is positive semi-definite if and only if \[ V[\hat{\boldsymbol{\beta}}^*_r] - \boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \] is positive semi-definite. The proof involves using the formula for inverses of block matrices.
Intuitively, we should leave out columns with coefficients that are “near” zero (relative to their standard errors).
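To make this concrete, here is a small simulation sketch (the data-generating process is hypothetical, not from the text): x2 is highly correlated with x1 and its coefficient is small relative to its standard error, so dropping x2 lowers the MSE of the estimated x1 coefficient even though the smaller model is biased.

set.seed(42)
n  <- 30
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)              # nearly collinear with x1
beta1 <- 1; beta2 <- 0.1                   # beta2 is "near" zero
reps <- 5000
err_full <- err_small <- numeric(reps)
for (r in 1:reps) {
  y <- beta1 * x1 + beta2 * x2 + rnorm(n)
  err_full[r]  <- coef(lm(y ~ x1 + x2))["x1"] - beta1
  err_small[r] <- coef(lm(y ~ x1))["x1"] - beta1
}
c(MSE_full = mean(err_full^2), MSE_small = mean(err_small^2))
# the smaller model's estimate is biased but has much lower variance,
# consistent with the V[beta_r*] - beta_r beta_r^T condition above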
For the full model, we predict at \(\mathbf{x}^\intercal = \begin{bmatrix} \mathbf{x}_p^\intercal & \mathbf{x}_r^\intercal \end{bmatrix}\): \[ \hat{y}^* = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y} \] For the smaller model we have \[ \hat{y} = \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal\mathbf{y} \]
\(\hat{y}^* = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y}\)
\(\hat{y} = \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal\mathbf{y}\)
If we’re overfitting: \[ E[\hat{y}^*] = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{X}_p \boldsymbol{\beta}_p \] and if we’re underfitting \[\begin{align*} E[\hat{y}] &= \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal(\mathbf{X}_p \boldsymbol{\beta}_p + \mathbf{X}_r \boldsymbol{\beta}_r)\\ &= \mathbf{x}_p^\intercal\boldsymbol{\beta}_p + \mathbf{x}_p^\intercal \mathbf{A} \boldsymbol{\beta}_r \end{align*}\]
\(\hat{y}^* = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y}\)
\(\hat{y} = \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal\mathbf{y}\)
If we’re overfitting: \[ V[\hat{y}^*] = \sigma^2 \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{x} \] and if we’re underfitting \[ V[\hat{y}] = \sigma^2 \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{x}_p \]
If your observations are independent, there is a similar decomposition for mean square prediction error (proof: add and subtract expectations)
\[ \text{MSPE}[\hat{y}] := E\left(y' - \hat{y}\right)^2 = E\left[\left(y' - E[y']\right)^2\right] + E\left[(\hat{y}- E[\hat{y}])^2\right] + \left(E[\hat{y}] - E[y'] \right)^2 \]
the unseen data point is \(y'\)
the model’s prediction is \(\hat{y}\)
inherent/future noise: \(V[y'] = E\left[\left(y' - E[y']\right)^2\right] = \sigma^2\)
prediction/historical data noise: \(V[\hat{y}]\)
miscalibration/bias/model-risk: \((E[\hat{y}] - E[y'])^2\)
Assume the full model is true.
the unseen data point is \(y'\)
the full model prediction is \(\hat{y}^*\)
the reduced model prediction is \(\hat{y}\)
\[ \text{MSPE}[\hat{y}^*] \ge \text{MSPE}[\hat{y}] \] if and only if \[ V[\hat{\boldsymbol{\beta}}^*_r] - \boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \] is positive semi-definite.
Same condition as before!
Most (all?) model selection criteria are based on the idea of minimizing out-of-sample error/loss/badness.
Know the difference between the data used to estimate your model, \(\{(y_i, \mathbf{x}_i) \}_{i=1}^n\), and a future/unseen data point \((y_i', \mathbf{x}_i)\) observed at the same predictor values.
Once again, the predictor data is assumed nonrandom, or if it isn’t, then everything is conditional on it.
If you use Mallows’ Cp to pick a model, you are trying to pick the model with the lowest out-of-sample mean squared prediction error:
\[ E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] \]
This is very different from in-sample mean square prediction error!
\(\hat{\boldsymbol{\beta}}_p\) is from a particular model using \(\mathbf{X}_p\) and \(\mathbf{y}\)
the expectation is taken with respect to everything that’s random: \(\{y_i\}\) and \(y_i'\)
In-sample MSPE: \[\begin{align*} E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] &= V[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] - 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \end{align*}\]
Out-of-sample MSPE: \[\begin{align*} E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] &= V[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i'] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] - 2 \text{Cov}\left(y_i', \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i'] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \tag{indep.} \\ &= V[y_i] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \tag{identicalness.} \\ &= E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] + 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \tag{above.} \end{align*}\]
So \[ E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] = E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] + 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \]
which means
\[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] + 2 \frac{1}{n}\sum_{i=1}^n \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \]
which is the same as
\[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] + 2 \frac{\sigma^2}{n}p \]
because \(\text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) = \sigma^2 \mathbf{H}_{ii}\) and because \(\text{tr}(\mathbf{H}) = p\)
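A quick numerical check of \(\text{tr}(\mathbf{H}) = p\), using R’s built-in mtcars data as a stand-in:

fit <- lm(mpg ~ cyl + hp + wt, data = mtcars)
length(coef(fit))       # p = 4: the intercept plus three predictors
sum(hatvalues(fit))     # trace of the hat matrix -- also 4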
Based on \[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = \underbrace{E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right]}_{\text{in-sample average error}} + \underbrace{ 2 \frac{\sigma^2}{n}p}_{\text{penalty}} \]
Mallows’ Cp for a particular model is \[ C_p := \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2 + 2\frac{\hat{\sigma}^2_{\text{full}}}{n}p \]
where \(\hat{\sigma}^2_{\text{full}}\) is from the largest possible model (it’s unbiased assuming the true model’s predictors are contained in the full model). So, pick the model with the smallest \(C_p\)!
Alternatively, sometimes people divide through by \(\hat{\sigma}^2_{\text{full}}/n\) and then subtract \(n\): \[ \frac{1}{\hat{\sigma}^2_{\text{full}}}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2 + 2p - n = \frac{SS_{Res}(p) }{\hat{\sigma}^2_{\text{full}}} + 2p - n \]
This has expectation roughly \(p\), so people like to plot it against \(p\). However, the goal is still to pick the model with the smallest value.
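As a sketch (assuming the mtcars-based data used in the examples below), the scaled version of \(C_p\) can be computed by hand; it should agree with the C(p) column reported in the olsrr selection summaries later on:

full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
sub  <- lm(mpg ~ cyl + hp + wt, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(sub))                        # 4, counting the intercept
sigma2_full <- summary(full)$sigma^2          # estimate from the largest model
sum(resid(sub)^2) / sigma2_full + 2 * p - n   # scaled Mallows' Cp for the sub-model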
Akaike’s Information Criterion (AIC) suggests picking the model that minimizes out of sample cross entropy (a loss different than squared error)
\[ E\left[-2 \sum_{i=1}^n \log \hat{p}(y_i') \right] \]
Cross entropy is just the average “surprise” of your predictions:
\(\hat{p}( \cdot)\) is a probabilistic prediction model estimated on old data \(\{y_i\}\). WLOG let’s say it’s a density.
you want to plug new values \(y_i'\) into the density, and high densities are better (no surprises)
equivalently, \(-2\) times the log of the density should be low
the \(y_i'\)s you’re plugging in are random, so you take the average
sometimes people maximize \(E\left[\sum_{i=1}^n \log \hat{p}(y_i') \right]\)
For a particular model with \(p\) columns in the design matrix \[\begin{align*} \text{AIC} &:= - 2 \log(L) + 2p \\ &= n \log\left(\frac{SS_{Res}}{n} \right) + 2p \end{align*}\]
where \[\begin{align*} \log L &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat{\sigma}^2 ) - \frac{1}{2 \hat{\sigma}^2}\sum_{i=1}^n \left( y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right)^2 \\ &= \underbrace{-\frac{n}{2}(\log(2\pi) +1)}_{\text{often ignored}}- \frac{n}{2}\log\left(\frac{SS_{Res}}{n} \right) \end{align*}\] is the normal log-likelihood with the maximum likelihood estimates plugged in. Recall that \(\hat{\sigma}^2 = SS_{Res}/n\), the MLE, differs from \(MS_{Res} = SS_{Res}/(n-p)\) because of the denominator.
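A quick check of the formula in R (a sketch using mtcars); note that different functions keep or drop different constants, so only comparisons within one convention are meaningful:

fit <- lm(mpg ~ cyl + hp + wt, data = mtcars)
n   <- nrow(mtcars)
p   <- length(coef(fit))
rss <- sum(resid(fit)^2)
n * log(rss / n) + 2 * p   # the formula above, constants dropped
extractAIC(fit)            # stats::extractAIC uses this same convention
AIC(fit)                   # AIC() keeps the constants and counts sigma^2 as a parameter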
BIC, DIC, AICc, WAIC, etc.
\(R^2\) and adjusted \(R^2\) (mentioned in module 3)
cross validation, the PRESS statistic (coming soon!)
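The examples below use a data frame called carsdf that isn’t constructed anywhere in these notes; it looks like a subset of R’s built-in mtcars, so presumably something like:

carsdf <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
head(carsdf)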
## mpg cyl disp hp drat wt qsec
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
## Valiant 18.1 6 225 105 2.76 3.460 20.22
## Loading required package: carData
## Abbreviation
## cyl c
## disp ds
## hp h
## drat dr
## wt w
## qsec q
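The message and abbreviation table above are consistent with an exhaustive Mallows’ Cp search using the leaps package (with car’s subsets() producing the Cp plot and the abbreviations). The original code isn’t shown, but a minimal sketch of such a search is:

library(leaps)
allSubs <- regsubsets(mpg ~ ., data = carsdf, nvmax = 6)
sumSubs <- summary(allSubs)
sumSubs$which[which.min(sumSubs$cp), ]   # predictors in the lowest-Cp model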
leaps() doesn’t calculate AIC, so we have to use bestglm()
library(bestglm)
## y var has to come last and be named y
carsdf.for.bestglm <- within(carsdf,
{
y <- mpg # mpg into
mpg <- NULL # Delete old mpg
})
carsdf.for.bestglm <-
carsdf.for.bestglm[, c("cyl", "disp", "hp", "drat", "wt", "qsec", "y")]
allModsWithAIC <-
bestglm(Xy = carsdf.for.bestglm,
family = gaussian,
IC = "AIC",
method = "exhaustive")
allModsWithAIC$BestModels
## cyl disp hp drat wt qsec Criterion
## 1 TRUE FALSE TRUE FALSE TRUE FALSE 60.66456
## 2 TRUE FALSE FALSE FALSE TRUE FALSE 61.19800
## 3 TRUE FALSE FALSE FALSE TRUE TRUE 61.37837
## 4 TRUE TRUE TRUE FALSE TRUE FALSE 61.52554
## 5 FALSE FALSE TRUE FALSE TRUE FALSE 61.84027
Picks the same model as Mallows’ Cp!
Recall that the number of models grows exponentially in the number of possible predictors.
Sometimes you’ll have too many columns, and doing an exhaustive search will be computationally prohibitive.
We’ll mention
forward selection
backward elimination
stepwise (combination of the above)
For each absent predictor, estimate the model with this new predictor added. Add predictor with lowest p-value, only if it’s below a threshold.
start with smallest model: \(y = \beta_0 + \epsilon\)
p-values come from the \(t\)-test, which is the same as the partial \(F\)-test when the bigger model has only one more predictor.
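The forward-selection output that follows comes from olsrr; the call isn’t shown, but it was presumably along these lines (the entry-threshold argument is named penter in older olsrr versions and p_val in newer ones, so treat this as a sketch):

library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_forward_p(fullMod, penter = .3, details = TRUE)   # entry threshold assumed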
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
##
## We are selecting variables based on p value...
##
##
## Forward Selection: Step 1
##
## - wt
##
## Model Summary
## --------------------------------------------------------------
## R 0.868 RMSE 3.046
## R-Squared 0.753 Coef. Var 15.161
## Adj. R-Squared 0.745 MSE 9.277
## Pred R-Squared 0.709 MAE 2.341
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 847.725 1 847.725 91.375 0.0000
## Residual 278.322 30 9.277
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 37.285 1.878 19.858 0.000 33.450 41.120
## wt -5.344 0.559 -0.868 -9.559 0.000 -6.486 -4.203
## ----------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 2
##
## - cyl
##
## Model Summary
## --------------------------------------------------------------
## R 0.911 RMSE 2.568
## R-Squared 0.830 Coef. Var 12.780
## Adj. R-Squared 0.819 MSE 6.592
## Pred R-Squared 0.790 MAE 1.921
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 934.875 2 467.438 70.908 0.0000
## Residual 191.172 29 6.592
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 39.686 1.715 23.141 0.000 36.179 43.194
## wt -3.191 0.757 -0.518 -4.216 0.000 -4.739 -1.643
## cyl -1.508 0.415 -0.447 -3.636 0.001 -2.356 -0.660
## ----------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 3
##
## - hp
##
## Model Summary
## --------------------------------------------------------------
## R 0.918 RMSE 2.512
## R-Squared 0.843 Coef. Var 12.501
## Adj. R-Squared 0.826 MSE 6.308
## Pred R-Squared 0.796 MAE 1.845
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 949.427 3 316.476 50.171 0.0000
## Residual 176.621 28 6.308
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 38.752 1.787 21.687 0.000 35.092 42.412
## wt -3.167 0.741 -0.514 -4.276 0.000 -4.684 -1.650
## cyl -0.942 0.551 -0.279 -1.709 0.098 -2.070 0.187
## hp -0.018 0.012 -0.205 -1.519 0.140 -0.042 0.006
## ----------------------------------------------------------------------------------------
##
##
##
## No more variables to be added.
##
## Variables Entered:
##
## + wt
## + cyl
## + hp
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.918 RMSE 2.512
## R-Squared 0.843 Coef. Var 12.501
## Adj. R-Squared 0.826 MSE 6.308
## Pred R-Squared 0.796 MAE 1.845
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 949.427 3 316.476 50.171 0.0000
## Residual 176.621 28 6.308
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 38.752 1.787 21.687 0.000 35.092 42.412
## wt -3.167 0.741 -0.514 -4.276 0.000 -4.684 -1.650
## cyl -0.942 0.551 -0.279 -1.709 0.098 -2.070 0.187
## hp -0.018 0.012 -0.205 -1.519 0.140 -0.042 0.006
## ----------------------------------------------------------------------------------------
##
## Selection Summary
## -------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------
## 1 wt 0.7528 0.7446 14.5629 166.0294 3.0459
## 2 cyl 0.8302 0.8185 3.2353 156.0101 2.5675
## 3 hp 0.8431 0.8263 3.0100 155.4766 2.5115
## -------------------------------------------------------------------------
Backward selection starts with the full model, but it uses the same F tests as forward selection.
Test the current model against each submodel that has one fewer predictor; these are partial \(F\)-tests (equivalently \(t\)-tests), with the null hypothesis that the dropped coefficient is zero. If the largest of these p-values exceeds the removal threshold, move to the corresponding submodel (i.e. drop that predictor) and repeat; otherwise stop.
Backward selection picks the same model as well.
library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_backward_p(fullMod, prem = .3, details = T)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . cyl
## 2 . disp
## 3 . hp
## 4 . drat
## 5 . wt
## 6 . qsec
##
## We are eliminating variables based on p value...
##
## - qsec
##
## Backward Elimination: Step 1
##
## Variable qsec Removed
##
## Model Summary
## --------------------------------------------------------------
## R 0.923 RMSE 2.538
## R-Squared 0.851 Coef. Var 12.631
## Adj. R-Squared 0.823 MSE 6.439
## Pred R-Squared 0.780 MAE 1.795
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 958.621 5 191.724 29.773 0.0000
## Residual 167.426 26 6.439
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 36.008 7.571 4.756 0.000 20.445 51.572
## cyl -1.107 0.716 -0.328 -1.547 0.134 -2.579 0.364
## disp 0.012 0.012 0.254 1.039 0.308 -0.012 0.037
## hp -0.024 0.013 -0.273 -1.809 0.082 -0.051 0.003
## drat 0.952 1.391 0.084 0.685 0.500 -1.907 3.811
## wt -3.673 1.059 -0.596 -3.469 0.002 -5.850 -1.496
## ----------------------------------------------------------------------------------------
##
##
## - drat
##
## Backward Elimination: Step 2
##
## Variable drat Removed
##
## Model Summary
## --------------------------------------------------------------
## R 0.921 RMSE 2.513
## R-Squared 0.849 Coef. Var 12.506
## Adj. R-Squared 0.826 MSE 6.313
## Pred R-Squared 0.791 MAE 1.771
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 955.603 4 238.901 37.844 0.0000
## Residual 170.444 27 6.313
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 40.829 2.757 14.807 0.000 35.171 46.486
## cyl -1.293 0.656 -0.383 -1.972 0.059 -2.639 0.052
## disp 0.012 0.012 0.239 0.989 0.331 -0.012 0.036
## hp -0.021 0.012 -0.234 -1.691 0.102 -0.045 0.004
## wt -3.854 1.015 -0.626 -3.795 0.001 -5.937 -1.770
## ----------------------------------------------------------------------------------------
##
##
## - disp
##
## Backward Elimination: Step 3
##
## Variable disp Removed
##
## Model Summary
## --------------------------------------------------------------
## R 0.918 RMSE 2.512
## R-Squared 0.843 Coef. Var 12.501
## Adj. R-Squared 0.826 MSE 6.308
## Pred R-Squared 0.796 MAE 1.845
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 949.427 3 316.476 50.171 0.0000
## Residual 176.621 28 6.308
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 38.752 1.787 21.687 0.000 35.092 42.412
## cyl -0.942 0.551 -0.279 -1.709 0.098 -2.070 0.187
## hp -0.018 0.012 -0.205 -1.519 0.140 -0.042 0.006
## wt -3.167 0.741 -0.514 -4.276 0.000 -4.684 -1.650
## ----------------------------------------------------------------------------------------
##
##
##
## No more variables satisfy the condition of p value = 0.3
##
##
## Variables Removed:
##
## - qsec
## - drat
## - disp
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.918 RMSE 2.512
## R-Squared 0.843 Coef. Var 12.501
## Adj. R-Squared 0.826 MSE 6.308
## Pred R-Squared 0.796 MAE 1.845
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 949.427 3 316.476 50.171 0.0000
## Residual 176.621 28 6.308
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 38.752 1.787 21.687 0.000 35.092 42.412
## cyl -0.942 0.551 -0.279 -1.709 0.098 -2.070 0.187
## hp -0.018 0.012 -0.205 -1.519 0.140 -0.042 0.006
## wt -3.167 0.741 -0.514 -4.276 0.000 -4.684 -1.650
## ----------------------------------------------------------------------------------------
##
##
## Elimination Summary
## ------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------
## 1 qsec 0.8513 0.8227 5.6040 157.7659 2.5376
## 2 drat 0.8486 0.8262 4.0655 156.3376 2.5125
## 3 disp 0.8431 0.8263 3.0100 155.4766 2.5115
## ------------------------------------------------------------------------
If you use stepwise regression, you must specify two thresholds: one for entering variables into the model, and one for removing variables.
Stepwise is forward selection with an added backward-elimination step: each time a variable is added, you re-check all of the variables currently in the model, exactly as in the backward elimination procedure.
library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_both_p(fullMod, pent = .1, prem = .3, details = T)
## Stepwise Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
##
## We are selecting variables based on p value...
##
##
## Stepwise Selection: Step 1
##
## - wt added
##
## Model Summary
## --------------------------------------------------------------
## R 0.868 RMSE 3.046
## R-Squared 0.753 Coef. Var 15.161
## Adj. R-Squared 0.745 MSE 9.277
## Pred R-Squared 0.709 MAE 2.341
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 847.725 1 847.725 91.375 0.0000
## Residual 278.322 30 9.277
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 37.285 1.878 19.858 0.000 33.450 41.120
## wt -5.344 0.559 -0.868 -9.559 0.000 -6.486 -4.203
## ----------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 2
##
## - cyl added
##
## Model Summary
## --------------------------------------------------------------
## R 0.911 RMSE 2.568
## R-Squared 0.830 Coef. Var 12.780
## Adj. R-Squared 0.819 MSE 6.592
## Pred R-Squared 0.790 MAE 1.921
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 934.875 2 467.438 70.908 0.0000
## Residual 191.172 29 6.592
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 39.686 1.715 23.141 0.000 36.179 43.194
## wt -3.191 0.757 -0.518 -4.216 0.000 -4.739 -1.643
## cyl -1.508 0.415 -0.447 -3.636 0.001 -2.356 -0.660
## ----------------------------------------------------------------------------------------
##
##
##
## Model Summary
## --------------------------------------------------------------
## R 0.911 RMSE 2.568
## R-Squared 0.830 Coef. Var 12.780
## Adj. R-Squared 0.819 MSE 6.592
## Pred R-Squared 0.790 MAE 1.921
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 934.875 2 467.438 70.908 0.0000
## Residual 191.172 29 6.592
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 39.686 1.715 23.141 0.000 36.179 43.194
## wt -3.191 0.757 -0.518 -4.216 0.000 -4.739 -1.643
## cyl -1.508 0.415 -0.447 -3.636 0.001 -2.356 -0.660
## ----------------------------------------------------------------------------------------
##
##
##
## No more variables to be added/removed.
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.911 RMSE 2.568
## R-Squared 0.830 Coef. Var 12.780
## Adj. R-Squared 0.819 MSE 6.592
## Pred R-Squared 0.790 MAE 1.921
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 934.875 2 467.438 70.908 0.0000
## Residual 191.172 29 6.592
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 39.686 1.715 23.141 0.000 36.179 43.194
## wt -3.191 0.757 -0.518 -4.216 0.000 -4.739 -1.643
## cyl -1.508 0.415 -0.447 -3.636 0.001 -2.356 -0.660
## ----------------------------------------------------------------------------------------
##
## Stepwise Selection Summary
## -------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------------------
## 1 wt addition 0.753 0.745 14.5630 166.0294 3.0459
## 2 cyl addition 0.830 0.819 3.2350 156.0101 2.5675
## -------------------------------------------------------------------------------------
The text distinguishes between model adequacy checking and model validation.
Model adequacy checking includes residual analysis, testing for lack of fit, searching for high leverage or overly influential observations, and other internal analyses that investigate the fit of the regression model to the available data.
Model validation, however, is directed toward determining if the model will function successfully in its intended operating environment.
The data generating process may change:
Influential factors that were unknown during the model-building stage may significantly affect the new observations, rendering the predictions almost useless.
The correlative structure between the regressors may differ between the model-building data and the prediction data.
The model user may ignore or forget the model developer’s decisions:
start trying to interpret p-values (i.e. p-hacking)
use for prediction when it was intended for inference, or vice versa
Depends on what your intention is:
They mention cross-validation as a way to validate a model, but it is also a commonly-used model selection tool.
Here the data is repeatedly partitioned into different training-set-test-set pairs (aka folds) for each model we want to evaluate.
In leave-one-out (LOO) cross validation, each test set is of size \(1\). That means there are \(n\) folds.
To predict \(y_i\), we use \(\mathbf{X}_{(i)}\) and \(\mathbf{y}_{(i)}\) (the design matrix and response vector with row \(i\) removed), giving the prediction \(\hat{y}_{(i)}\). Comparing it with \(y_i\), we get a squared prediction error: \((y_i - \hat{y}_{(i)})^2\). Then we add them up to get the predicted residual error sum of squares (PRESS) \[ \text{PRESS}_p = \sum_{i=1}^n(y_i - \hat{y}_{(i)})^2. \]
The insanely-cool thing about regression models is that for each model you’re cross-validating, you don’t need to re-fit the model on the \(n\) folds!
\[ \text{PRESS}_p = \sum_{i=1}^n \left(\frac{e_i}{1 - h_{ii}} \right)^2. \]
This is too cool not to prove:
\[ \hat{y}_{(i)} = \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_{(i)} = \mathbf{x}_{i}^\intercal (\mathbf{X}_{(i)}^\intercal\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^\intercal \mathbf{y}_{(i)} \]
The key is the Sherman-Morrison formula. \[ (\mathbf{X}_{(i)}^\intercal\mathbf{X}_{(i)})^{-1} = \left(\mathbf{X}^\intercal \mathbf{X} - \mathbf{x}_i\mathbf{x}_i^\intercal \right)^{-1} = (\mathbf{X}^\intercal \mathbf{X})^{-1} + \frac{ (\mathbf{X}^\intercal \mathbf{X})^{-1}\mathbf{x}_i \mathbf{x}_i^\intercal(\mathbf{X}^\intercal \mathbf{X})^{-1} }{1 - \mathbf{H}_{ii}} \]
So, because \(\mathbf{X}^\intercal_{(i)}\mathbf{y}_{(i)} = \mathbf{X}^\intercal\mathbf{y} - \overbrace{\mathbf{x}_i}^{\text{row } i} y_i\), \[ \hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - \frac{e_i (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{x}_i }{1-\mathbf{H}_{ii}}, \] and since \(\mathbf{x}_i^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1}\mathbf{x}_i = \mathbf{H}_{ii}\), \[ y_i - \hat{y}_{(i)} = y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}} + \frac{e_i \mathbf{H}_{ii}}{1-\mathbf{H}_{ii}} = e_i + \frac{e_i \mathbf{H}_{ii}}{1-\mathbf{H}_{ii}} = \frac{e_i}{1 - \mathbf{H}_{ii} }. \] \(\square\)
Minimizing \(\text{PRESS}\) is equivalent to maximizing \(R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{ SS_T }\)
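Here’s a sketch verifying the PRESS shortcut and the predicted \(R^2\) on one candidate model (using mtcars directly; the result should agree with olsrr’s Pred R-Squared for the same model):

fit   <- lm(mpg ~ cyl + hp + wt, data = mtcars)
e     <- resid(fit)
h     <- hatvalues(fit)                      # diagonal of the hat matrix
press <- sum((e / (1 - h))^2)
# brute force: refit n times, leaving out one row each time
press_loo <- sum(sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ cyl + hp + wt, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}))
all.equal(press, press_loo)                  # TRUE: no refitting actually needed
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - press / sst                              # predicted R^2 for this model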
library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
allMods <- ols_step_all_possible(fullMod)
allMods$predictors[allMods$predrsq == max(allMods$predrsq)]
## [1] "cyl hp wt"
Same model :)
Comments
Forward/Backward/Stepwise have no guarantees of any kind of optimality
end of section 10.2 has some suggestions that may or may not work
do the exhaustive search if you can, although there are no guarantees here, either.
remember that Mallows’ \(C_p\) and AIC are very noisy
looking at hypothesis tests after doing model selection is misleading/worthless