Taylor R. Brown, PhD
Now we turn to variable selection, which coincides with model selection as long as we restrict ourselves to regression models.
So far we’ve assumed we more-or-less know which predictors we’ll use.
We fit the full model, do an overall F test, maybe run one or two t-tests, check the residuals, maybe transform predictors and/or \(y\), and then finish off with another check of all our assumptions.
What if you have no scientific knowledge of the problem, just a .csv file?
Q: If you have \(k\) predictors, how many possible models are there?
A:
If you know there’s an intercept: \(2^{k}\)
If you don’t know about the intercept: \(2^{k+1}\)
If you start making more predictors by using interactions and/or powers …
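To see how quickly this blows up, here is a tiny R sketch (the predictor counts and names are hypothetical):

k <- 6                                      # e.g., six candidate predictors
2^k                                         # 64 models if the intercept is always kept
2^(k + 1)                                   # 128 if the intercept is optional too
# enumerate the subsets explicitly for a small k
incl <- expand.grid(rep(list(c(FALSE, TRUE)), 3))
nrow(incl)                                  # 2^3 = 8 candidate models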
Multicollinearity tends to be lurking close by in this type of situation. Removing variables can help with this.
We also talked about how residuals can signify variable omissions.
Should I err on the side of a small model (underfitting), or should I aim for a model with a large number of predictors (overfitting)?
Is my goal parameter inference, or prediction?
The bias of an estimate \(\hat{\boldsymbol{\beta}}\) for a parameter vector \(\boldsymbol{\beta}\) is \[ E[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}]. \]
The covariance matrix is important as well: \[ V[\hat{\boldsymbol{\beta}}] = E\left\{\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^\intercal \right\}. \]
Q: Bias is bad, and variance is bad. But which one do we look at?
A: We look at a combination: mean squared error.
\[\begin{align*} MSE(\hat{\boldsymbol{\beta}}_p) &= E[(\hat{\boldsymbol{\beta}}_p - \boldsymbol{\beta}_p)(\hat{\boldsymbol{\beta}}_p - \boldsymbol{\beta}_p)^\intercal] \tag{defn.}\\ &= E[(\hat{\boldsymbol{\beta}}_p \pm E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)(\hat{\boldsymbol{\beta}}_p \pm E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] \\ &= E[(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])^\intercal] \\ &\hspace{10mm} + E[(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] \\ &\hspace{5mm} + 2E[(\hat{\boldsymbol{\beta}}_p - E[\hat{\boldsymbol{\beta}}_p])(E[\hat{\boldsymbol{\beta}}_p] - \boldsymbol{\beta}_p)^\intercal] \\ &= V[\hat{\boldsymbol{\beta}}_p] + \text{Bias}[\hat{\boldsymbol{\beta}}_p]\text{Bias}[\hat{\boldsymbol{\beta}}_p]^\intercal + 0 \tag{fundamental!} \end{align*}\]
What does it mean when we say one matrix is bigger than or equal to another?
\[ \mathbf{M}_1 \ge \mathbf{M}_2 \]
is shorthand for
\[ \mathbf{M}_1 - \mathbf{M}_2 \text{ is positive semi-definite}. \]
A matrix \(\mathbf{A}\) is positive semi-definite if for all possible vectors \(\mathbf{c}\) (such that not all the elements are \(0\)) we have \[ \mathbf{c}^\intercal \mathbf{A} \mathbf{c} \ge 0. \]
Suppose that for every \(\mathbf{c}\) (with not all elements equal to \(0\))
\[ V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] \ge V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}_p] \] this is true iff \[ \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} - \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}_p]\mathbf{c} \ge 0 \] iff \[ \mathbf{c}^\intercal \left( V[\hat{\boldsymbol{\beta}}^*_p] - V[\hat{\boldsymbol{\beta}}_p] \right) \mathbf{c} \ge 0 \]
this is the definition of \(V[\hat{\boldsymbol{\beta}}^*_p] - V[\hat{\boldsymbol{\beta}}_p]\) being positive semi-definite.
Note \[\begin{align*} \text{MSE}[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] &= V[\mathbf{c}^\intercal\hat{\boldsymbol{\beta}}^*_p] + \left(\mathbf{c}^\intercal E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\} \right)^2 \\ &= \mathbf{c}^\intercal V[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} + \mathbf{c}^\intercal E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\}E\{\hat{\boldsymbol{\beta}}^*_p - \boldsymbol{\beta}_p\}^\intercal\mathbf{c} \\ &= \mathbf{c}^\intercal\text{MSE}[\hat{\boldsymbol{\beta}}^*_p]\mathbf{c} \end{align*}\]
So it’s all about positive semi-definiteness for MSE matrices as well.
Now back to regression models…
Let’s assume the full model is
\[\begin{align*} \mathbf{y} &= \overbrace{\mathbf{X}}^{n \times K} \boldsymbol{\beta} + \boldsymbol{\epsilon}\\ &= \underbrace{\mathbf{X}_p}_{n \times p} \boldsymbol{\beta}_p + \underbrace{\mathbf{X}_r}_{n \times r} \boldsymbol{\beta}_r + \boldsymbol{\epsilon}\\ \end{align*}\]
and the smaller model is \[ \mathbf{y} = \mathbf{X}_p \boldsymbol{\beta}_p + \widetilde{\boldsymbol{\epsilon}} \]
Estimates for the full model: \(\hat{\boldsymbol{\beta}}^* = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}\); its first \(p\) elements are \(\hat{\boldsymbol{\beta}}^*_p\).
Estimates for the small model: \(\hat{\boldsymbol{\beta}}_p = (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1}\mathbf{X}_p^\intercal\mathbf{y}\).
When we’re overfitting (fitting the full model when the smaller model is true, i.e. \(\boldsymbol{\beta}_r = \mathbf{0}\)):
\[ E[\hat{\boldsymbol{\beta}}^*_p] = \left[(\mathbf{X}^\intercal\mathbf{X})^{-1} \mathbf{X}^\intercal\right]_p \mathbf{X}_p \boldsymbol{\beta}_p \] When we’re underfitting (fitting the smaller model when the full model is true):
\[ E[\hat{\boldsymbol{\beta}}_p] = \boldsymbol{\beta}_p + (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal \mathbf{X}_r \boldsymbol{\beta}_r = \boldsymbol{\beta}_p + \mathbf{A}\boldsymbol{\beta}_r \]
NB: \(\mathbf{X}_p^\intercal \mathbf{X}_r = 0\) means removed columns are orthogonal to kept columns.
\[ \mathbf{X}^\intercal\mathbf{X} = \begin{bmatrix} \mathbf{X}_p^\intercal\mathbf{X}_p &\mathbf{X}_p^\intercal\mathbf{X}_r \\ \mathbf{X}_r^\intercal\mathbf{X}_p & \mathbf{X}_r^\intercal\mathbf{X}_r \end{bmatrix} = \begin{bmatrix} \mathbf{X}_p^\intercal\mathbf{X}_p &\mathbf{0} \\ \mathbf{0} & \mathbf{X}_r^\intercal\mathbf{X}_r \end{bmatrix} \] which means \[ (\mathbf{X}^\intercal\mathbf{X})^{-1} = \begin{bmatrix} (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1} &\mathbf{0} \\ \mathbf{0} & (\mathbf{X}_r^\intercal\mathbf{X}_r)^{-1} \end{bmatrix} \] and the two estimators are equivalent! This is a special case of the general formula for the inverse of a block matrix that we saw earlier.
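A quick numerical sanity check of the orthogonality remark (a sketch with simulated data; poly() returns columns that are orthogonal to each other and to the intercept):

set.seed(1)
x  <- runif(100)
Xo <- poly(x, 2)           # orthogonal polynomial columns
y  <- 1 + 2 * Xo[, 1] + 0.5 * Xo[, 2] + rnorm(100, sd = 0.3)
coef(lm(y ~ Xo))           # full model
coef(lm(y ~ Xo[, 1]))      # drop the second column: intercept and first slope are unchanged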
Estimates for the full model
Estimates for the small model
When we’re overfitting
\[ V[\hat{\boldsymbol{\beta}}^*_p] = \sigma^2 \left[(\mathbf{X}^\intercal\mathbf{X})^{-1}\right]_p \] When we’re underfitting
\[ V[\hat{\boldsymbol{\beta}}_p] = \sigma^2 (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1} \]
Assume the full model is true.
Wrong, smaller model: \[ MSE[\hat{\boldsymbol{\beta}}_p] = \sigma^2 (\mathbf{X}_p^\intercal\mathbf{X}_p)^{-1}+ \mathbf{A}\boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \mathbf{A}^\intercal \]
Right, full/bigger model: \[ MSE[\hat{\boldsymbol{\beta}}^*_p] = \sigma^2 \left[(\mathbf{X}^\intercal\mathbf{X})^{-1}\right]_p + 0 \]
Is it possible we can have smaller-MSE estimates with the restricted model?
Yes!
\(MSE[\hat{\boldsymbol{\beta}}^*_p] - MSE[\hat{\boldsymbol{\beta}}_p]\) is positive semi-definite if and only if \[ V[\hat{\boldsymbol{\beta}}^*_r] - \boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \] is positive semi-definite. The proof involves using the formula for inverses of block matrices.
Intuitively, we should leave out columns with coefficients that are “near” zero (relative to their standard errors).
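To make this concrete, here is a small simulation sketch (the data-generating process is hypothetical, not from the text): x2 is highly correlated with x1 and its coefficient is small relative to its standard error, so dropping x2 lowers the MSE of the estimated x1 coefficient even though the smaller model is biased.

set.seed(42)
n  <- 30
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)              # nearly collinear with x1
beta1 <- 1; beta2 <- 0.1                   # beta2 is "near" zero
reps <- 5000
err_full <- err_small <- numeric(reps)
for (r in 1:reps) {
  y <- beta1 * x1 + beta2 * x2 + rnorm(n)
  err_full[r]  <- coef(lm(y ~ x1 + x2))["x1"] - beta1
  err_small[r] <- coef(lm(y ~ x1))["x1"] - beta1
}
c(MSE_full = mean(err_full^2), MSE_small = mean(err_small^2))
# the smaller model's estimate is biased but has much lower variance,
# consistent with the V[beta_r*] - beta_r beta_r^T condition above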
For the full model, we predict at \(\mathbf{x}^\intercal = \begin{bmatrix} \mathbf{x}_p^\intercal & \mathbf{x}_r^\intercal \end{bmatrix}\): \[ \hat{y}^* = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y} \] For the smaller model we have \[ \hat{y} = \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal\mathbf{y} \]
\(\hat{y}^* = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y}\)
\(\hat{y} = \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal\mathbf{y}\)
If we’re overfitting: \[ E[\hat{y}^*] = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{X}_p \boldsymbol{\beta}_p \] and if we’re underfitting \[\begin{align*} E[\hat{y}] &= \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal(\mathbf{X}_p \boldsymbol{\beta}_p + \mathbf{X}_r \boldsymbol{\beta}_r)\\ &= \mathbf{x}_p^\intercal\boldsymbol{\beta}_p + \mathbf{x}_p^\intercal \mathbf{A} \boldsymbol{\beta}_r \end{align*}\]
\(\hat{y}^* = \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y}\)
\(\hat{y} = \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{X}_p^\intercal\mathbf{y}\)
If we’re overfitting: \[ V[\hat{y}^*] = \sigma^2 \mathbf{x}^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{x} \] and if we’re underfitting \[ V[\hat{y}] = \sigma^2 \mathbf{x}_p^\intercal (\mathbf{X}_p^\intercal \mathbf{X}_p)^{-1} \mathbf{x}_p \]
If your observations are independent, there is a similar decomposition for mean square prediction error (proof: add and subtract expectations)
\[ \text{MSPE}[\hat{y}] := E\left(y' - \hat{y}\right)^2 = E\left[\left(y' - E[y']\right)^2\right] + E\left[(\hat{y}- E[\hat{y}])^2\right] + \left(E[\hat{y}] - E[y'] \right)^2 \]
the unseen data point is \(y'\)
the model’s prediction is \(\hat{y}\)
inherent/future noise: \(V[y'] = E\left[\left(y' - E[y']\right)^2\right] = \sigma^2\)
prediction/historical data noise: \(V[\hat{y}]\)
miscalibration/bias/model-risk: \((E[\hat{y}] - E[y'])^2\)
Assume the full model is true.
the unseen data point is \(y'\)
the full model prediction is \(\hat{y}^*\)
the reduced model prediction is \(\hat{y}\)
\[ \text{MSPE}[\hat{y}^*] \ge \text{MSPE}[\hat{y}] \] if and only if \[ V[\hat{\boldsymbol{\beta}}^*_r] - \boldsymbol{\beta}_r\boldsymbol{\beta}_r^\intercal \] is positive semi-definite.
Same condition as before!
Most (all?) model selection criteria are based on the idea of minimizing out-of-sample error/loss/badness.
Know the difference between the data used to estimate your model, \(\{(y_i, \mathbf{x}_i) \}_{i=1}^n\), and a future/unseen data point \((y_i', \mathbf{x}_i)\) observed at the same predictor values.
Once again, the predictor data is assumed nonrandom, or if it isn’t, then everything is conditional on it.
If you use Mallows’ Cp to pick a model, you are trying to pick the model with the lowest out-of-sample mean squared prediction error:
\[ E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] \]
This is very different from in-sample mean square prediction error!
\(\hat{\boldsymbol{\beta}}_p\) is from a particular model using \(\mathbf{X}_p\) and \(\mathbf{y}\)
the expectation is taken with respect to everything that’s random: \(\{y_i\}\) and \(y_i'\)
In-sample MSPE: \[\begin{align*} E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] &= V[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] - 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \end{align*}\]
Out-of-sample MSPE: \[\begin{align*} E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] &= V[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i'] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] - 2 \text{Cov}\left(y_i', \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \\ &= V[y_i'] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \tag{indep.} \\ &= V[y_i] + V[\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p] + \left(E[y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p ] \right)^2 \tag{identicalness.} \\ &= E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] + 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \tag{above.} \end{align*}\]
So \[ E[(y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] = E[(y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2] + 2 \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \]
which means
\[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] + 2 \frac{1}{n}\sum_{i=1}^n \text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) \]
which is the same as
\[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] + 2 \frac{\sigma^2}{n}p \]
because \(\text{Cov}\left(y_i, \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right) = \sigma^2 \mathbf{H}_{ii}\) and because \(\text{tr}(\mathbf{H}) = p\)
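A quick numerical check of \(\text{tr}(\mathbf{H}) = p\), using R’s built-in mtcars data as a stand-in:

fit <- lm(mpg ~ cyl + hp + wt, data = mtcars)
length(coef(fit))       # p = 4: the intercept plus three predictors
sum(hatvalues(fit))     # trace of the hat matrix -- also 4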
Based on \[ E \left[ \frac{1}{n}\sum_{i=1}^n (y_i' - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right] = \underbrace{E\left[\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2\right]}_{\text{in-sample average error}} + \underbrace{ 2 \frac{\sigma^2}{n}p}_{\text{penalty}} \]
Mallows’ Cp for a particular model is \[ C_p := \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2 + 2\frac{\hat{\sigma}^2_{\text{full}}}{n}p \]
where \(\hat{\sigma}^2_{\text{full}}\) is from the largest possible model (it’s unbiased assuming the true model’s predictors are contained in the full model). So, pick the model with the smallest \(C_p\)!
Alternatively, sometimes people divide through by \(\hat{\sigma}^2_{\text{full}}/n\) and then subtract \(n\): \[ \frac{1}{\hat{\sigma}^2_{\text{full}}}\sum_{i=1}^n (y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p )^2 + 2p - n = \frac{SS_{Res}(p) }{\hat{\sigma}^2_{\text{full}}} + 2p - n \]
This has expectation roughly \(p\), so people like to plot it against \(p\). However, the goal is still to pick the model with the smallest value.
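As a sketch (assuming the mtcars-based data used in the examples below), the scaled version of \(C_p\) can be computed by hand; it should agree with the C(p) column reported in the olsrr selection summaries later on:

full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
sub  <- lm(mpg ~ cyl + hp + wt, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(sub))                        # 4, counting the intercept
sigma2_full <- summary(full)$sigma^2          # estimate from the largest model
sum(resid(sub)^2) / sigma2_full + 2 * p - n   # scaled Mallows' Cp for the sub-model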
Akaike’s Information Criterion (AIC) suggests picking the model that minimizes out of sample cross entropy (a loss different than squared error)
\[ E\left[-2 \sum_{i=1}^n \log \hat{p}(y_i') \right] \]
Cross entropy is just the average “surprise” of your predictions:
\(\hat{p}( \cdot)\) is a probabilistic prediction model estimated on old data \(\{y_i\}\). WLOG let’s say it’s a density.
you want to plug new values \(y_i'\) into the density, and high densities are better (no surprises)
equivalently, \(-2\) times the log of the density should be low
the \(y_i'\)s you’re plugging in are random, so you take the average
sometimes people maximize \(E\left[\sum_{i=1}^n \log \hat{p}(y_i') \right]\)
For a particular model with \(p\) columns in the design matrix \[\begin{align*} \text{AIC} &:= - 2 \log(L) + 2p \\ &= n \log\left(\frac{SS_{Res}}{n} \right) + 2p \end{align*}\]
where \[\begin{align*} \log L &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat{\sigma}^2 ) - \frac{1}{2 \hat{\sigma}^2}\sum_{i=1}^n \left( y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_p \right)^2 \\ &= \underbrace{-\frac{n}{2}(\log(2\pi) +1)}_{\text{often ignored}}- \frac{n}{2}\log\left(\frac{SS_{Res}}{n} \right) \end{align*}\] is the normal log-likelihood with the maximum likelihood estimates plugged in. Recall that \(\hat{\sigma}^2 = SS_{Res}/n\), the MLE, differs from \(MS_{Res} = SS_{Res}/(n-p)\) because of the denominator.
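A quick check of the formula in R (a sketch using mtcars); note that different functions keep or drop different constants, so only comparisons within one convention are meaningful:

fit <- lm(mpg ~ cyl + hp + wt, data = mtcars)
n   <- nrow(mtcars)
p   <- length(coef(fit))
rss <- sum(resid(fit)^2)
n * log(rss / n) + 2 * p   # the formula above, constants dropped
extractAIC(fit)            # stats::extractAIC uses this same convention
AIC(fit)                   # AIC() keeps the constants and counts sigma^2 as a parameter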
BIC, DIC, AICc, WAIC, etc.
\(R^2\) and adjusted \(R^2\) (mentioned in module 3)
cross validation, the PRESS statistic (coming soon!)
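The examples below use a data frame called carsdf that isn’t constructed anywhere in these notes; it looks like a subset of R’s built-in mtcars, so presumably something like:

carsdf <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
head(carsdf)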
## mpg cyl disp hp drat wt qsec
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
## Valiant 18.1 6 225 105 2.76 3.460 20.22
## Loading required package: carData
## Abbreviation
## cyl c
## disp ds
## hp h
## drat dr
## wt w
## qsec q
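The message and abbreviation table above are consistent with an exhaustive Mallows’ Cp search using the leaps package (with car’s subsets() producing the Cp plot and the abbreviations). The original code isn’t shown, but a minimal sketch of such a search is:

library(leaps)
allSubs <- regsubsets(mpg ~ ., data = carsdf, nvmax = 6)
sumSubs <- summary(allSubs)
sumSubs$which[which.min(sumSubs$cp), ]   # predictors in the lowest-Cp model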
leaps() doesn’t calculate AIC, so we have to use bestglm()
library(bestglm)
## y var has to come last and be named y
carsdf.for.bestglm <- within(carsdf,
{
y <- mpg # mpg into
mpg <- NULL # Delete old mpg
})
carsdf.for.bestglm <-
carsdf.for.bestglm[, c("cyl", "disp", "hp", "drat", "wt", "qsec", "y")]
allModsWithAIC <-
bestglm(Xy = carsdf.for.bestglm,
family = gaussian,
IC = "AIC",
method = "exhaustive")
allModsWithAIC$BestModels
## cyl disp hp drat wt qsec Criterion
## 1 TRUE FALSE TRUE FALSE TRUE FALSE 60.66456
## 2 TRUE FALSE FALSE FALSE TRUE FALSE 61.19800
## 3 TRUE FALSE FALSE FALSE TRUE TRUE 61.37837
## 4 TRUE TRUE TRUE FALSE TRUE FALSE 61.52554
## 5 FALSE FALSE TRUE FALSE TRUE FALSE 61.84027
Picks the same model as Mallows’ Cp!
Recall that the number of models grows exponentially in the number of possible predictors.
Sometimes you’ll have too many columns, and doing an exhaustive search will be computationally prohibitive.
We’ll mention
forward selection
backward elimination
stepwise (combination of the above)
For each absent predictor, estimate the model with this new predictor added. Add predictor with lowest p-value, only if it’s below a threshold.
start with smallest model: \(y = \beta_0 + \epsilon\)
p-values come from the \(t\)-test, which is the same as the partial \(F\)-test when the bigger model has only one more predictor.
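The forward-selection output that follows comes from olsrr; the call isn’t shown, but it was presumably along these lines (the entry-threshold argument is named penter in older olsrr versions and p_val in newer ones, so treat this as a sketch):

library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_forward_p(fullMod, penter = .3, details = TRUE)   # entry threshold assumed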
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
##
## We are selecting variables based on p value...
##
##
## Forward Selection: Step 1
##
## - wt
##
## Model Summary
## --------------------------------------------------------------
## R 0.868 RMSE 3.046
## R-Squared 0.753 Coef. Var 15.161
## Adj. R-Squared 0.745 MSE 9.277
## Pred R-Squared 0.709 MAE 2.341
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 847.725 1 847.725 91.375 0.0000
## Residual 278.322 30 9.277
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 37.285 1.878 19.858 0.000 33.450 41.120
## wt -5.344 0.559 -0.868 -9.559 0.000 -6.486 -4.203
## ----------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 2
##
## - cyl
##
## Model Summary
## --------------------------------------------------------------
## R 0.911 RMSE 2.568
## R-Squared 0.830 Coef. Var 12.780
## Adj. R-Squared 0.819 MSE 6.592
## Pred R-Squared 0.790 MAE 1.921
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 934.875 2 467.438 70.908 0.0000
## Residual 191.172 29 6.592
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 39.686 1.715 23.141 0.000 36.179 43.194
## wt -3.191 0.757 -0.518 -4.216 0.000 -4.739 -1.643
## cyl -1.508 0.415 -0.447 -3.636 0.001 -2.356 -0.660
## ----------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 3
##
## - hp
##
## Model Summary
## --------------------------------------------------------------
## R 0.918 RMSE 2.512
## R-Squared 0.843 Coef. Var 12.501
## Adj. R-Squared 0.826 MSE 6.308
## Pred R-Squared 0.796 MAE 1.845
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 949.427 3 316.476 50.171 0.0000
## Residual 176.621 28 6.308
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 38.752 1.787 21.687 0.000 35.092 42.412
## wt -3.167 0.741 -0.514 -4.276 0.000 -4.684 -1.650
## cyl -0.942 0.551 -0.279 -1.709 0.098 -2.070 0.187
## hp -0.018 0.012 -0.205 -1.519 0.140 -0.042 0.006
## ----------------------------------------------------------------------------------------
##
##
##
## No more variables to be added.
##
## Variables Entered:
##
## + wt
## + cyl
## + hp
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.918 RMSE 2.512
## R-Squared 0.843 Coef. Var 12.501
## Adj. R-Squared 0.826 MSE 6.308
## Pred R-Squared 0.796 MAE 1.845
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 949.427 3 316.476 50.171 0.0000
## Residual 176.621 28 6.308
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 38.752 1.787 21.687 0.000 35.092 42.412
## wt -3.167 0.741 -0.514 -4.276 0.000 -4.684 -1.650
## cyl -0.942 0.551 -0.279 -1.709 0.098 -2.070 0.187
## hp -0.018 0.012 -0.205 -1.519 0.140 -0.042 0.006
## ----------------------------------------------------------------------------------------
##
## Selection Summary
## -------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------
## 1 wt 0.7528 0.7446 14.5629 166.0294 3.0459
## 2 cyl 0.8302 0.8185 3.2353 156.0101 2.5675
## 3 hp 0.8431 0.8263 3.0100 155.4766 2.5115
## -------------------------------------------------------------------------
Backward selection starts with the full model, but it uses the same F tests as forward selection.
Test the current model against each submodel that has one fewer predictor; these are partial \(F\)-tests (equivalently \(t\)-tests), with the null hypothesis that the dropped coefficient is zero. If the largest of these p-values exceeds the removal threshold, move to the corresponding submodel (i.e. drop that predictor) and repeat; otherwise stop.
Backward selection picks the same model as well.
library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_backward_p(fullMod, prem = .3, details = T)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . cyl
## 2 . disp
## 3 . hp
## 4 . drat
## 5 . wt
## 6 . qsec
##
## We are eliminating variables based on p value...
##
## - qsec
##
## Backward Elimination: Step 1
##
## Variable qsec Removed
##
## Model Summary
## --------------------------------------------------------------
## R 0.923 RMSE 2.538
## R-Squared 0.851 Coef. Var 12.631
## Adj. R-Squared 0.823 MSE 6.439
## Pred R-Squared 0.780 MAE 1.795
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 958.621 5 191.724 29.773 0.0000
## Residual 167.426 26 6.439
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 36.008 7.571 4.756 0.000 20.445 51.572
## cyl -1.107 0.716 -0.328 -1.547 0.134 -2.579 0.364
## disp 0.012 0.012 0.254 1.039 0.308 -0.012 0.037
## hp -0.024 0.013 -0.273 -1.809 0.082 -0.051 0.003
## drat 0.952 1.391 0.084 0.685 0.500 -1.907 3.811
## wt -3.673 1.059 -0.596 -3.469 0.002 -5.850 -1.496
## ----------------------------------------------------------------------------------------
##
##
## - drat
##
## Backward Elimination: Step 2
##
## Variable drat Removed
##
## Model Summary
## --------------------------------------------------------------
## R 0.921 RMSE 2.513
## R-Squared 0.849 Coef. Var 12.506
## Adj. R-Squared 0.826 MSE 6.313
## Pred R-Squared 0.791 MAE 1.771
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 955.603 4 238.901 37.844 0.0000
## Residual 170.444 27 6.313
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 40.829 2.757 14.807 0.000 35.171 46.486
## cyl -1.293 0.656 -0.383 -1.972 0.059 -2.639 0.052
## disp 0.012 0.012 0.239 0.989 0.331 -0.012 0.036
## hp -0.021 0.012 -0.234 -1.691 0.102 -0.045 0.004
## wt -3.854 1.015 -0.626 -3.795 0.001 -5.937 -1.770
## ----------------------------------------------------------------------------------------
##
##
## - disp
##
## Backward Elimination: Step 3
##
## Variable disp Removed
##
## Model Summary
## --------------------------------------------------------------
## R 0.918 RMSE 2.512
## R-Squared 0.843 Coef. Var 12.501
## Adj. R-Squared 0.826 MSE 6.308
## Pred R-Squared 0.796 MAE 1.845
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 949.427 3 316.476 50.171 0.0000
## Residual 176.621 28 6.308
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 38.752 1.787 21.687 0.000 35.092 42.412
## cyl -0.942 0.551 -0.279 -1.709 0.098 -2.070 0.187
## hp -0.018 0.012 -0.205 -1.519 0.140 -0.042 0.006
## wt -3.167 0.741 -0.514 -4.276 0.000 -4.684 -1.650
## ----------------------------------------------------------------------------------------
##
##
##
## No more variables satisfy the condition of p value = 0.3
##
##
## Variables Removed:
##
## - qsec
## - drat
## - disp
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.918 RMSE 2.512
## R-Squared 0.843 Coef. Var 12.501
## Adj. R-Squared 0.826 MSE 6.308
## Pred R-Squared 0.796 MAE 1.845
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 949.427 3 316.476 50.171 0.0000
## Residual 176.621 28 6.308
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 38.752 1.787 21.687 0.000 35.092 42.412
## cyl -0.942 0.551 -0.279 -1.709 0.098 -2.070 0.187
## hp -0.018 0.012 -0.205 -1.519 0.140 -0.042 0.006
## wt -3.167 0.741 -0.514 -4.276 0.000 -4.684 -1.650
## ----------------------------------------------------------------------------------------
##
##
## Elimination Summary
## ------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------
## 1 qsec 0.8513 0.8227 5.6040 157.7659 2.5376
## 2 drat 0.8486 0.8262 4.0655 156.3376 2.5125
## 3 disp 0.8431 0.8263 3.0100 155.4766 2.5115
## ------------------------------------------------------------------------
If you use stepwise regression, you must specify two thresholds: one for entering variables into the model, and one for removing variables.
Stepwise is forward selection with an added backward-elimination step: each time a variable is added, you re-check all of the variables currently in the model, exactly as in the backward elimination procedure.
library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
ols_step_both_p(fullMod, pent = .1, prem = .3, details = T)
## Stepwise Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. cyl
## 2. disp
## 3. hp
## 4. drat
## 5. wt
## 6. qsec
##
## We are selecting variables based on p value...
##
##
## Stepwise Selection: Step 1
##
## - wt added
##
## Model Summary
## --------------------------------------------------------------
## R 0.868 RMSE 3.046
## R-Squared 0.753 Coef. Var 15.161
## Adj. R-Squared 0.745 MSE 9.277
## Pred R-Squared 0.709 MAE 2.341
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 847.725 1 847.725 91.375 0.0000
## Residual 278.322 30 9.277
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 37.285 1.878 19.858 0.000 33.450 41.120
## wt -5.344 0.559 -0.868 -9.559 0.000 -6.486 -4.203
## ----------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 2
##
## - cyl added
##
## Model Summary
## --------------------------------------------------------------
## R 0.911 RMSE 2.568
## R-Squared 0.830 Coef. Var 12.780
## Adj. R-Squared 0.819 MSE 6.592
## Pred R-Squared 0.790 MAE 1.921
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 934.875 2 467.438 70.908 0.0000
## Residual 191.172 29 6.592
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 39.686 1.715 23.141 0.000 36.179 43.194
## wt -3.191 0.757 -0.518 -4.216 0.000 -4.739 -1.643
## cyl -1.508 0.415 -0.447 -3.636 0.001 -2.356 -0.660
## ----------------------------------------------------------------------------------------
##
##
##
## Model Summary
## --------------------------------------------------------------
## R 0.911 RMSE 2.568
## R-Squared 0.830 Coef. Var 12.780
## Adj. R-Squared 0.819 MSE 6.592
## Pred R-Squared 0.790 MAE 1.921
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 934.875 2 467.438 70.908 0.0000
## Residual 191.172 29 6.592
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 39.686 1.715 23.141 0.000 36.179 43.194
## wt -3.191 0.757 -0.518 -4.216 0.000 -4.739 -1.643
## cyl -1.508 0.415 -0.447 -3.636 0.001 -2.356 -0.660
## ----------------------------------------------------------------------------------------
##
##
##
## No more variables to be added/removed.
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.911 RMSE 2.568
## R-Squared 0.830 Coef. Var 12.780
## Adj. R-Squared 0.819 MSE 6.592
## Pred R-Squared 0.790 MAE 1.921
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 934.875 2 467.438 70.908 0.0000
## Residual 191.172 29 6.592
## Total 1126.047 31
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 39.686 1.715 23.141 0.000 36.179 43.194
## wt -3.191 0.757 -0.518 -4.216 0.000 -4.739 -1.643
## cyl -1.508 0.415 -0.447 -3.636 0.001 -2.356 -0.660
## ----------------------------------------------------------------------------------------
##
## Stepwise Selection Summary
## -------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------------------
## 1 wt addition 0.753 0.745 14.5630 166.0294 3.0459
## 2 cyl addition 0.830 0.819 3.2350 156.0101 2.5675
## -------------------------------------------------------------------------------------
The text distinguishes between model adequacy checking and model validation.
Model adequacy checking includes residual analysis, testing for lack of fit, searching for high leverage or overly influential observations, and other internal analyses that investigate the fit of the regression model to the available data.
Model validation, however, is directed toward determining if the model will function successfully in its intended operating environment.
The data generating process may change:
Influential factors that were unknown during the model-building stage may significantly affect the new observations, rendering the predictions almost useless.
The correlative structure between the regressors may differ between the model-building data and the prediction data.
The model user may ignore or forget the model developer’s decisions:
start trying to interpret p-values (i.e. p-hacking)
use for prediction when it was intended for inference, or vice versa
Depends on what your intention is:
They mention cross-validation as a way to validate a model, but it is also a commonly-used model selection tool.
Here the data is repeatedly partitioned into different training-set-test-set pairs (aka folds) for each model we want to evaluate.
In leave-one-out (LOO) cross validation, each test set is of size \(1\). That means there are \(n\) folds.
To predict \(y_i\), we use \(\mathbf{X}_{(i)}\) and \(\mathbf{y}_{(i)}\) (the design matrix and response vector with row \(i\) removed), giving the prediction \(\hat{y}_{(i)}\). Comparing it with \(y_i\), we get a squared prediction error: \((y_i - \hat{y}_{(i)})^2\). Then we add them up to get the predicted residual error sum of squares (PRESS) \[ \text{PRESS}_p = \sum_{i=1}^n(y_i - \hat{y}_{(i)})^2. \]
The insanely-cool thing about regression models is that for each model you’re cross-validating, you don’t need to re-fit the model on the \(n\) folds!
\[ \text{PRESS}_p = \sum_{i=1}^n \left(\frac{e_i}{1 - h_{ii}} \right)^2. \]
This is too cool not to prove:
\[ \hat{y}_{(i)} = \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}}_{(i)} = \mathbf{x}_{i}^\intercal (\mathbf{X}_{(i)}^\intercal\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^\intercal \mathbf{y}_{(i)} \]
The key is the Sherman-Morrison formula. \[ (\mathbf{X}_{(i)}^\intercal\mathbf{X}_{(i)})^{-1} = \left(\mathbf{X}^\intercal \mathbf{X} - \mathbf{x}_i\mathbf{x}_i^\intercal \right)^{-1} = (\mathbf{X}^\intercal \mathbf{X})^{-1} + \frac{ (\mathbf{X}^\intercal \mathbf{X})^{-1}\mathbf{x}_i \mathbf{x}_i^\intercal(\mathbf{X}^\intercal \mathbf{X})^{-1} }{1 - \mathbf{H}_{ii}} \]
So, because \(\mathbf{X}^\intercal_{(i)}\mathbf{y}_{(i)} = \mathbf{X}^\intercal\mathbf{y} - \overbrace{\mathbf{x}_i}^{\text{row } i} y_i\), \[ \hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - \frac{e_i (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{x}_i }{1-\mathbf{H}_{ii}}, \] and since \(\mathbf{x}_i^\intercal (\mathbf{X}^\intercal \mathbf{X})^{-1}\mathbf{x}_i = \mathbf{H}_{ii}\), \[ y_i - \hat{y}_{(i)} = y_i - \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}} + \frac{e_i \mathbf{H}_{ii}}{1-\mathbf{H}_{ii}} = e_i + \frac{e_i \mathbf{H}_{ii}}{1-\mathbf{H}_{ii}} = \frac{e_i}{1 - \mathbf{H}_{ii} }. \] \(\square\)
Minimizing \(\text{PRESS}\) is equivalent to maximizing \(R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{ SS_T }\)
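Here’s a sketch verifying the PRESS shortcut and the predicted \(R^2\) on one candidate model (using mtcars directly; the result should agree with olsrr’s Pred R-Squared for the same model):

fit   <- lm(mpg ~ cyl + hp + wt, data = mtcars)
e     <- resid(fit)
h     <- hatvalues(fit)                      # diagonal of the hat matrix
press <- sum((e / (1 - h))^2)
# brute force: refit n times, leaving out one row each time
press_loo <- sum(sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ cyl + hp + wt, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}))
all.equal(press, press_loo)                  # TRUE: no refitting actually needed
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - press / sst                              # predicted R^2 for this model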
library(olsrr)
fullMod <- lm(mpg ~ ., data = carsdf)
allMods <- ols_step_all_possible(fullMod)
allMods$predictors[allMods$predrsq == max(allMods$predrsq)]
## [1] "cyl hp wt"
Same model :)
Comments
Forward/Backward/Stepwise have no guarantees of any kind of optimality
end of section 10.2 has some suggestions that may or may not work
do the exhaustive search if you can, although there are no guarantees here, either.
remember that Mallows’ \(C_p\) and AIC are very noisy
looking at hypothesis tests after doing model selection is misleading/worthless