Module 4

STAT 6021

Taylor R. Brown, PhD

Multiple Linear Regression

Everything we’ve done so far rested on assumptions about the errors: the relationship is linear in the parameters, the \(\epsilon_i\) have mean zero and constant variance \(\sigma^2\), they are uncorrelated, and (for inference) they are normally distributed. This module is about checking those assumptions.

Residuals!

The residuals are our friend:

\[ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = [\mathbf{I}- \mathbf{H} ]\mathbf{y} \]

Residuals v1.0

Sometimes we look at the standardized residuals \(d_i\)

\[ d_i = \frac{e_i}{\sqrt{MS_{Res}}} \]

Idea: make \(V[d_i] \approx 1\)

Why? \[ V[e_i] \approx \sigma^2 \approx MS_{Res} \]

Residuals v2.0

We can do better:

mean \[ E[\mathbf{e}] = [\mathbf{I}- \mathbf{H} ]E[\mathbf{y} ]= [\mathbf{I}- \mathbf{H} ]\mathbf{X}\boldsymbol{\beta} = \mathbf{0} \]

covariance matrix \[ V[\mathbf{e}] = V[(\mathbf{I} - \mathbf{H})\mathbf{y}] = \sigma^2 (\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H})^{\intercal} = \sigma^2 (\mathbf{I} - \mathbf{H}) \] (the last equality holds because \(\mathbf{I} - \mathbf{H}\) is symmetric and idempotent)
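
To see the covariance formula in action, here is a minimal simulation sketch; the design matrix, coefficients, and \(\sigma\) below are made up for illustration, not taken from the course data.

# sketch: compare empirical variances of the e_i to the diagonal of sigma^2 (I - H)
set.seed(1)
n <- 30; sigma <- 2
X <- cbind(1, runif(n), runif(n))              # made-up design matrix (intercept + 2 predictors)
H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix
beta <- c(1, 2, -3)
e_mat <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  as.vector(y - H %*% y)                       # residuals e = (I - H) y
})
plot(sigma^2 * (1 - diag(H)), apply(e_mat, 1, var),
     xlab = "theoretical variance", ylab = "empirical variance")
abline(0, 1)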

Residuals v2.0

Because \[ e_i \sim \text{Normal}\left(0, \sigma^2(1 - h_{ii}) \right) \]

it’s better to look at the (internally) studentized residuals \[ r_i = \frac{e_i}{\sqrt{ MS_{Res} (1 - h_{ii}) }} \overset{\text{approx.}}{\sim} t_{df_{Res}} \]

…still correlated, not exactly \(t\)-distributed, not exactly unit variance

Residuals v3.0

The externally studentized residuals follow a t-distribution exactly (if \(\boldsymbol{\epsilon}\) is normal). These are what we will use to check diagnostics.

\[ t_i = \frac{e_i}{\sqrt{ MS_{(i),Res} (1 - h_{ii}) }} \sim t_{df_{Res}-1} \]

\(MS_{(i),Res}\) differs from \(MS_{Res}\) because it is based on all of the data except for the \(i\)th data point. This is also why we lose an extra degree of freedom!

using R

# install.packages("MASS")
library(MASS)
mod <- lm(y ~ ., data = my_data)
stdized_resids <- residuals(mod)/summary(mod)$sigma # summary(mod)$sigma is sqrt(MS_Res)
intern_s_resids <- stdres(mod)  # internally studentized
extern_s_resids <- studres(mod) # externally studentized
par(mfrow=c(1,3))
hist(stdized_resids)
hist(intern_s_resids)
hist(extern_s_resids)

par(mfrow=c(1,1))
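
Before leaning on `stdres()` and `studres()`, it helps to compute one externally studentized residual from the leave-one-out definition; this is a sketch reusing `mod` and `my_data` from above, with the observation index `i = 1` chosen arbitrarily.

# sketch: t_1 by hand from the leave-one-out mean square, compared to studres()
i <- 1
h_ii  <- hatvalues(mod)[i]
e_i   <- residuals(mod)[i]
mod_i <- lm(y ~ ., data = my_data[-i, ])              # refit without observation i
MS_i  <- sum(residuals(mod_i)^2) / df.residual(mod_i) # MS_{(i),Res}
c(by_hand = unname(e_i / sqrt(MS_i * (1 - h_ii))),
  studres = unname(studres(mod)[i]))                  # the two should agree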

using R

internally versus externally…

# install.packages("MASS")
intern_s_resids <- stdres(mod)
extern_s_resids <- studres(mod)
hist(intern_s_resids - extern_s_resids)

Plotting Residuals

Plots we’ll use: histograms and QQ plots of the \(t_i\) (to check normality), \(t_i\) versus the fitted values \(\hat{y}_i\) (to check constant variance and linearity), and added-variable plots (to decide whether a predictor belongs in the model).

Checking normality

If \(\boldsymbol{\epsilon}\) is normal, then each \(t_i \sim t_{df_{Res}-1}\) (marginally; the \(t_i\) are still not independent of one another)

x_grid <- seq(min(extern_s_resids)-.5, max(extern_s_resids)+.5, by = .01)
dfRes <- nrow(my_data) - 7 - 1 # 7 predictors and one intercept
hist(extern_s_resids, probability = TRUE, ylim = c(0,.5))
t_densities <- dt(x_grid, dfRes - 1) # t_i has dfRes - 1 degrees of freedom
points(x_grid, t_densities, type = "l", col = "red")

Checking normality

QQ plots are a little better for spotting heavy/thin tails or skewness…

qqnorm(extern_s_resids)
qqline(extern_s_resids)
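
Since the \(t_i\) are \(t\)-distributed rather than normal, you can also compare them against \(t\) quantiles directly; this sketch reuses `extern_s_resids` and `dfRes` from above.

# sketch: QQ plot against t quantiles instead of normal quantiles
n <- length(extern_s_resids)
qqplot(qt(ppoints(n), df = dfRes - 1), extern_s_resids,
       xlab = "theoretical t quantiles", ylab = "externally studentized residuals")
abline(0, 1)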

Checking Homoskedasticity and Linearity

Always plot \(t_i\) versus \(\hat{y}_i\) to check for heteroskedasticity and nonlinearities

Checking Homoskedasticity and Linearity

Our model’s plot:

plot(fitted.values(mod), extern_s_resids)

Missing predictors?

Another example:

bad_mod <- lm(y ~ x, data = my_df)
plot(my_df$x,my_df$y, xlab = "x", ylab = "y")
points(my_df$x, fitted.values(bad_mod), type = "l", col = "red")

plot(fitted.values(bad_mod), studres(bad_mod))

plot(my_df$x, studres(bad_mod))
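
If the curvature in these residual plots points to a missing quadratic term (an assumption about this particular `my_df`; the right fix depends on what the plots actually show), one option is to add the term and re-check:

# sketch: add a quadratic term and look at the residuals again (assumes curvature in my_df)
better_mod <- lm(y ~ x + I(x^2), data = my_df)
plot(fitted.values(better_mod), studres(better_mod))
abline(h = 0)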

Added-Variable Plots

A partial regression plot, aka an added-variable plot, plots two sets of residuals against each other: on the vertical axis, the residuals from regressing \(y\) on every predictor except \(\mathbf{x}_i\); on the horizontal axis, the residuals from regressing \(\mathbf{x}_i\) on those same predictors.

The plot has a roughly straight, nonzero slope if \(\mathbf{x}_i\) should be included in the model. It’s roughly horizontal if \(\mathbf{x}_i\) is redundant. And it’s curved if you might need to add a power of \(\mathbf{x}_i\) to the regression model.
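
You can build this plot by hand from the two auxiliary regressions; this is a sketch for \(x_1\), assuming the `my_data` data frame shown on the next slide.

# sketch: added-variable plot for x1, built by hand (assumes my_data from the next slide)
res_y  <- residuals(lm(y  ~ . - x1, data = my_data)) # y regressed on everything except x1
res_x1 <- residuals(lm(x1 ~ . - y,  data = my_data)) # x1 regressed on the other predictors
plot(res_x1, res_y)                                  # essentially what avPlots() draws for x1
abline(lm(res_y ~ res_x1))                           # slope equals x1's coefficient in the full model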

Added-Variable Plots

library(car)
## Loading required package: carData
head(my_data)
##       y   x1  x2    x3    x4      x5      x6   x7
## 1 36.98  5.1 400 51.37  4.24 1484.83 2227.25 2.06
## 2 13.74 26.4 400 72.33 30.87  289.94  434.90 1.33
## 3 10.08 23.8 400 71.44 33.01  320.79  481.19 0.97
## 4  8.53 46.4 400 79.15 44.61  164.76  247.14 0.62
## 5 36.42  7.0 450 80.47 33.84 1097.26 1645.89 0.22
## 6 26.59 12.6 450 89.90 41.26  605.06  907.59 0.76
avPlots(lm(y ~ ., data = my_data))

Added-Variable Plots

Some intuition:

Say we are interested in looking at \(x_1\). Pretend now that it doesn’t belong in the model.

The residuals from regressing \(y\) on everything except for \(x_1\) will roughly be \(\epsilon\). These are uncorrelated with everything, and in particular with the residuals of \(x_1\) regressed on the others, so the added-variable plot is a flat, patternless cloud.

Added-Variable Plots

Some intuition:

Say we are interested in looking at \(x_1\). Pretend that it does belong in the model.

Suppose, for concreteness, there are two predictors, with \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\) and \(x_1 = \alpha_0 + \alpha_1 x_2 + \tilde{\epsilon}\). Regressing \(y\) on everything except for \(x_1\) then roughly gives you \[ y = \underbrace{\beta_0 + \beta_1 \alpha_0}_{\text{intercept}} + \underbrace{(\beta_2 + \beta_1 \alpha_1)}_{\text{slope}}x_2 + \underbrace{\epsilon + \beta_1 \tilde{\epsilon}}_{\text{residuals}} \] So the residuals from regressing \(y\) on everything else (\(\epsilon + \beta_1 \tilde{\epsilon}\)) are a function of the residuals from regressing \(x_1\) on everything else (\(\tilde{\epsilon}\)), and the added-variable plot has slope roughly \(\beta_1\).

Variance-Stabilizing Transformations

Sometimes you can’t just add predictors until the problems go away. Sometimes you have to transform your \(y\) variable

Variance-Stabilizing Transformations

Example:

If \(Y \sim \text{Poisson}\left(\lambda_x\right)\), then \(E[Y] = V[Y] = \lambda_x\), so the variance grows with the mean (a violation of the constant-variance assumption!)

Solution: run a regression using \(\tilde{Y} = \sqrt{Y}\) \[ V[\sqrt{Y}] \approx \left\{\frac{1}{2} E[Y]^{-1/2}\right\}^2 E[Y] = \frac{1}{4} \]

now the variance doesn’t depend on the mean!
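
A quick simulation check of that \(1/4\) claim, with made-up Poisson means:

# sketch: var(sqrt(Y)) stays near 1/4 even as V[Y] ranges from 5 to 100
set.seed(1)
lambdas <- c(5, 20, 100)
sapply(lambdas, function(lam) var(sqrt(rpois(1e5, lam)))) # all three come out near 0.25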

Variance-Stabilizing Transformations

Why though?

It’s because of something called the Delta Method, which is a way to approximate first and second moments of transformed random variables.

If \(W\) has mean \(E[W]\) and variance \(V[W]\), and \(g\) is a smooth transformation, then under certain regularity conditions

\[ V[g(W)] \approx \{g'(E[W])\}^2 V[W] \]
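
Here is a quick numerical check of the formula with a different \(g\): a sketch using \(g(w) = \log w\) and a made-up gamma-distributed \(W\).

# sketch: delta method check with g(w) = log(w) and W ~ Gamma(shape = 50, rate = 2)
set.seed(1)
W <- rgamma(1e5, shape = 50, rate = 2)  # E[W] = 25, V[W] = 12.5
c(delta  = (1 / mean(W))^2 * var(W),    # {g'(E[W])}^2 V[W], since g'(w) = 1/w
  direct = var(log(W)))                 # the two should be close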

Example

You just ran a regression, and go to check your residuals, and you see this:

peak-hour energy demand (y) versus total monthly energy usage (x)

Example (continued)

After a square root transformation

The Box-Cox procedure

Q: how do I choose a power on my own for the response variable \(y^{\lambda}\)?

A: Make it a parameter of the model!

Maybe this:

\[ y^{\lambda} = \mathbf{x}_i^{\intercal}\boldsymbol{\beta} + \epsilon \]

The Box-Cox procedure

Q: what do we do about really small values of \(\lambda\) for \(y^{\lambda}\)?

Maybe this? \[ \frac{y^{\lambda}}{\lambda} = \mathbf{x}_i^{\intercal}\boldsymbol{\beta} + \epsilon \]

The Box-Cox procedure

Or maybe this: \[ \frac{y^{\lambda}-1}{\lambda} = \mathbf{x}_i^{\intercal}\boldsymbol{\beta} + \epsilon \]

Notice that, by l’Hôpital’s rule (differentiating the numerator and denominator with respect to \(\lambda\)), \[ \lim_{\lambda \to 0}\frac{y^{\lambda}-1}{\lambda} = \lim_{\lambda \to 0}\frac{\exp[\lambda \log y]-1}{\lambda} = \lim_{\lambda \to 0}\exp[\lambda \log y] \log y = \log y < \infty \]
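
A quick numerical check (taking, say, \(y = 3\)):

# sketch: (y^lambda - 1)/lambda approaches log(y) as lambda shrinks (here y = 3)
y <- 3
sapply(c(1, .1, .01, .001), function(lam) (y^lam - 1) / lam)
log(y)  # the values above approach this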

Guillaume de l’Hôpital (source: wikipedia)


The Box-Cox procedure

So we’ll go with this:

\[ \frac{y^{\lambda}-1}{\lambda} = \beta_0 + x_1\beta_1 + \cdots x_k\beta_k + \epsilon \]

Confusion arises because there is a difference between \[ f\left(\frac{y_1^{\lambda}-1}{\lambda}, \ldots, \frac{y_n^{\lambda}-1}{\lambda} \bigg\rvert \lambda, \boldsymbol{\beta}, \sigma^2\right) \] and

\[ f\left(y_1, \ldots, y_n \mid \lambda, \boldsymbol{\beta}, \sigma^2\right) \]

We maximize the second one! (The two differ by the Jacobian of the transformation, which contributes a \((\lambda - 1)\sum_{i} \log y_i\) term to the log-likelihood of the original data.)

The Box-Cox procedure

We have to solve these

\[ \frac{\partial \log f(y_1, \ldots, y_n \mid \lambda, \boldsymbol{\beta}, \sigma^2)}{\partial \boldsymbol{\beta}} \overset{\text{set}}{=} \mathbf{0} \] \[ \frac{\partial \log f(y_1, \ldots, y_n \mid \lambda, \boldsymbol{\beta}, \sigma^2)}{\partial \sigma^2} \overset{\text{set}}{=} 0 \]

\[ \frac{\partial \log f(y_1, \ldots, y_n \mid \lambda, \boldsymbol{\beta}, \sigma^2)}{\partial \lambda} \overset{\text{set}}{=} 0 \]

Solving the first two for \(\hat{\boldsymbol{\beta}}(\lambda)\) and \(\hat{\sigma}^2(\lambda)\) and plugging them back in gives the profile log-likelihood \(\ell(\lambda)\). That’s maximized over \(\lambda\) last.
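
For concreteness, here is a sketch of computing \(\ell(\lambda)\) by hand on a grid, assuming the `my_data` data frame from these slides; up to an additive constant, this is the curve that `MASS::boxcox()` plots on the next slide.

# sketch: hand-rolled profile log-likelihood for the Box-Cox model (up to a constant)
profile_ll <- function(lambda) {
  y <- my_data$y
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda  # Box-Cox transform
  fit <- lm(z ~ . - y, data = transform(my_data, z = z))            # regress z on x1, ..., x7
  n <- length(y)
  -n/2 * log(sum(residuals(fit)^2) / n) + (lambda - 1) * sum(log(y))
}
lambda_grid <- seq(-1, 1, by = .05)
plot(lambda_grid, sapply(lambda_grid, profile_ll), type = "l",
     xlab = expression(lambda), ylab = "profile log-likelihood")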

The Box-Cox procedure

The profile log-likelihood is easy to get in R

library(MASS)
bc_res <- boxcox(y ~ ., data = my_data) # looks like lm()

The Box-Cox procedure

best_lambda <- bc_res$x[which.max(bc_res$y)] # or manually choose one in the range
cat(best_lambda)
## 0.3030303
my_data$transformed_y <- (my_data$y^best_lambda - 1)/best_lambda
head(my_data)
##       y   x1  x2    x3    x4      x5      x6   x7 transformed_y
## 1 36.98  5.1 400 51.37  4.24 1484.83 2227.25 2.06      6.554973
## 2 13.74 26.4 400 72.33 30.87  289.94  434.90 1.33      4.000597
## 3 10.08 23.8 400 71.44 33.01  320.79  481.19 0.97      3.346498
## 4  8.53 46.4 400 79.15 44.61  164.76  247.14 0.62      3.018583
## 5 36.42  7.0 450 80.47 33.84 1097.26 1645.89 0.22      6.509509
## 6 26.59 12.6 450 89.90 41.26  605.06  907.59 0.76      5.617575
final_mod <- lm(transformed_y ~ . - y, data = my_data)  # the minus sign means "except for"
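
Whatever \(\lambda\) you settle on, the point is to re-check the diagnostics on the transformed scale; a short sketch using `final_mod` from above:

# sketch: residual plot for the transformed-response model
plot(fitted.values(final_mod), studres(final_mod),
     xlab = "fitted values", ylab = "externally studentized residuals")
abline(h = 0)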

Linearization

Alternatively, sometimes the “true” model can be rewritten as a linear model.

If a “subject-matter expert” tells you that the model is

\[ y = \beta_0 e^{\beta_1 x}\epsilon \] then \[ \log y = \log \beta_0 + \beta_1 x + \log \epsilon \]
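
In R this is just a regression with a transformed response; a sketch, assuming a data frame `my_df` with positive `y`. The intercept estimates \(\log \beta_0\), and the error enters as \(\log \epsilon\).

# sketch: fit the linearized model (assumes y > 0 in my_df)
log_mod <- lm(log(y) ~ x, data = my_df)
coef(log_mod)
exp(coef(log_mod)[1]) # back-transformed estimate of beta_0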

Linearization

Another example

\[ y = \beta_0 + \beta_1 \frac{1}{x} + \epsilon \]

is \[ y = \beta_0 + \beta_1 \tilde{x} + \epsilon \]

where \(\frac{1}{x} = \tilde{x}\)
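
In R you don’t even need to create \(\tilde{x}\) by hand; a sketch, again assuming `my_df`:

# sketch: regress y on 1/x directly using I()
recip_mod <- lm(y ~ I(1/x), data = my_df)
summary(recip_mod)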

More examples are on page 178 of the textbook.