Taylor R. Brown, PhD
Everything we’ve done so far rested on these assumptions:
linear relationship between \(y\) and \(x\) (i.e. \(y =\mathbf{x}^{\intercal}\boldsymbol{\beta} + \epsilon\))
\(\epsilon_1, \ldots, \epsilon_n \overset{iid}{\sim} \text{Normal}(0, \sigma^2)\)
The residuals are our friend:
\[ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = [\mathbf{I}- \mathbf{H} ]\mathbf{y} \]
Sometimes we look at the standardized residuals \(d_i\)
\[ d_i = \frac{e_i}{\sqrt{MS_{Res}}} \]
Idea: make \(V[d_i] \approx 1\)
Why? \[ V[e_i] \approx \sigma^2 \approx MS_{Res} \]
We can do better:
mean \[ E[\mathbf{e}] = [\mathbf{I}- \mathbf{H} ]E[\mathbf{y} ]= [\mathbf{I}- \mathbf{H} ]\mathbf{X}\boldsymbol{\beta} = \mathbf{0} \]
covariance matrix \[ V[\mathbf{e}] = V[(\mathbf{I} - \mathbf{H})\mathbf{y}] = \sigma^2 (\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H})^{\intercal} = \sigma^2 (\mathbf{I} - \mathbf{H}) \]
\(V[e_i] = \sigma^2(1 - h_{ii})\)
\(\text{Cov}(e_i, e_j) = -\sigma^2 h_{ij}\) for \(i \neq j\)
\(V[\mathbf{e}]\) is not full rank!
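In R, the \(h_{ii}\) are available via hatvalues(). A minimal sketch, assuming a fitted lm() object called mod (the name is not from the slides):
h <- hatvalues(mod)            # the h_ii, i.e. the diagonal of the hat matrix
summary(mod)$sigma^2 * (1 - h) # estimated Var(e_i) = MS_Res * (1 - h_ii)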
Because \[ e_i \sim \text{Normal}\left(0, \sigma^2(1 - h_{ii}) \right) \]
it’s better to look at the (internally) studentized residuals \[ r_i = \frac{e_i}{\sqrt{ MS_{Res} (1 - h_{ii}) }} \overset{\text{approx.}}{\sim} t_{df_{Res}} \]
…still correlated, not exactly \(t\)-distributed, not exactly unit variance
The externally studentized residuals follow a t-distribution exactly (if \(\boldsymbol{\epsilon}\) is normal). These are what we will use to check diagnostics.
\[ t_i = \frac{e_i}{\sqrt{ MS_{(i),Res} (1 - h_{ii}) }} \sim t_{df_{Res}-1} \]
\(MS_{(i),Res}\) differs from \(MS_{Res}\) because it is based on all of the data except for the \(i\)th data point. This is also why we lose an extra degree of freedom!
internally versus externally…
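A minimal sketch of computing both in R, again assuming a fitted lm() object called mod (the name is not from the slides):
r_i <- rstandard(mod) # internally studentized residuals
t_i <- rstudent(mod)  # externally studentized residuals
head(cbind(r_i, t_i))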
checking for normality
checking homoskedasticity
checking linearity/missing predictors
checking for outliers
Plots:
\(t_i\) by themselves
\(t_i\) versus \(\hat{y}_i\)
\(t_i\) versus \(x_{ij}\) for any \(j\) (in particular if one predictor is time)
partial regression and partial residual plots
If \(\boldsymbol{\epsilon}\) is normal, then each \(t_i\) is exactly \(t_{df_{Res}-1}\)-distributed (although the \(t_i\) are still not independent of one another)
QQ plots are a little better for spotting heavy/thin tails or skewness…
Always plot \(t_i\) versus \(\hat{y}_i\) to check for heteroskedasticity and nonlinearities
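A sketch of these two basic checks in R, assuming a fitted lm() object called mod:
t_i <- rstudent(mod)                           # externally studentized residuals
qqnorm(t_i); qqline(t_i)                       # normality check
plot(fitted(mod), t_i); abline(h = 0, lty = 2) # heteroskedasticity / nonlinearity check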
Another example:
A partial regression plot (aka an added-variable plot) plots two sets of residuals against each other:
the residuals from regressing \(\mathbf{y}\) on all predictors except \(\mathbf{x}_i\)
the residuals from regressing \(\mathbf{x}_i\) on all of the other predictors
The plot has a roughly linear, nonzero slope if \(\mathbf{x}_i\) should be included in the model. It is roughly horizontal if \(\mathbf{x}_i\) is redundant. And it is curved if you might need to add a power of \(\mathbf{x}_i\) to the regression model.
## y x1 x2 x3 x4 x5 x6 x7
## 1 36.98 5.1 400 51.37 4.24 1484.83 2227.25 2.06
## 2 13.74 26.4 400 72.33 30.87 289.94 434.90 1.33
## 3 10.08 23.8 400 71.44 33.01 320.79 481.19 0.97
## 4 8.53 46.4 400 79.15 44.61 164.76 247.14 0.62
## 5 36.42 7.0 450 80.47 33.84 1097.26 1645.89 0.22
## 6 26.59 12.6 450 89.90 41.26 605.06 907.59 0.76
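Added-variable plots can be produced with the car package. A sketch, assuming the data frame shown above is called dat (the name is not from the slides):
library(car)
mod <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = dat) # the full model
avPlots(mod)                                                # one added-variable plot per predictor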
Some intuition:
Say we are interested in looking at \(x_1\). Pretend now that it doesn’t belong in the model.
The residuals from regressing \(y\) on everything except for \(x_1\) will roughly be \(\epsilon\). These are uncorrelated with everything, and in particular, with the residuals of \(x_1\) regressed on the others.
Some intuition:
Say we are interested in looking at \(x_1\). Pretend that it does belong in the model.
Suppose for concreteness that the model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\), and that \(x_1 = \alpha_0 + \alpha_1 x_2 + \tilde{\epsilon}\). Regressing \(y\) on everything except for \(x_1\) roughly gives you \[ y = \underbrace{\beta_0 + \beta_1 \alpha_0}_{\text{intercept}} + \underbrace{(\beta_2 + \beta_1 \alpha_1)}_{\text{slope}}x_2 + \underbrace{\epsilon + \beta_1 \tilde{\epsilon}}_{\text{residuals}} \] So the residuals from regressing \(y\) on everything else (\(\epsilon + \beta_1 \tilde{\epsilon}\)) are a function of the residuals from regressing \(x_1\) on everything else (\(\tilde{\epsilon}\)).
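A small simulation sketch (not from the slides) illustrating this: the slope in the added-variable plot for \(x_1\) equals the coefficient on \(x_1\) from the full regression.
set.seed(1)
n <- 200
x2 <- rnorm(n)
x1 <- 1 + 2 * x2 + rnorm(n)            # x1 = alpha0 + alpha1 * x2 + eps_tilde
y  <- 3 + 1.5 * x1 - 1 * x2 + rnorm(n) # y = beta0 + beta1 * x1 + beta2 * x2 + eps
e_y  <- resid(lm(y ~ x2))              # residuals of y on everything except x1
e_x1 <- resid(lm(x1 ~ x2))             # residuals of x1 on the other predictor
coef(lm(e_y ~ e_x1))["e_x1"]           # equals coef(lm(y ~ x1 + x2))["x1"], close to the true beta1 = 1.5
plot(e_x1, e_y)                        # the added-variable plot for x1: a straight, sloped cloud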
Sometimes you can’t just add predictors until the problems go away. Sometimes you have to transform your \(y\) variable.
Example:
If \(Y \sim \text{Poisson}\left(\lambda_x\right)\), then \(E[Y] = V[Y] = \lambda_x\), so the variance changes with the mean (a violation of the constant-variance assumption!)
Solution: run a regression using \(\tilde{Y} = \sqrt{Y}\) \[ V[\sqrt{Y}] \approx \left\{\frac{1}{2} E[Y]^{-1/2}\right\}^2 E[Y] = \frac{1}{4} \]
now the variance doesn’t depend on the mean!
Why though?
It’s because of something called the Delta Method, which is a way to approximate first and second moments of transformed random variables.
If \(W\) has mean \(E[W]\) and variance \(V[W]\), and \(g\) is some differentiable transformation, then under certain regularity conditions
\[ V[g(W)] \approx \{g'(E[W])\}^2 V[W] \]
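A quick simulation sketch (not from the slides) checking the square-root rule for Poisson data:
set.seed(2)
sapply(c(5, 10, 25, 50), function(l) var(sqrt(rpois(1e5, l)))) # each is roughly 1/4; the approximation improves as E[Y] grows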
You just ran a regression (peak-hour energy demand \(y\) versus total monthly energy usage \(x\)), and when you go to check your residuals, you see this:
(residual plot)
(the same residual plot after a square root transformation of \(y\))
Q: how do I choose a power for the response variable \(y^{\lambda}\) on my own?
A: Make it a parameter of the model!
Maybe this:
\[ y^{\lambda} = \mathbf{x}_i^{\intercal}\boldsymbol{\beta} + \epsilon \]
Q: what do we do about really small values of \(\lambda\) for \(y^{\lambda}\)?
Maybe this? \[ \frac{y^{\lambda}}{\lambda} = \mathbf{x}_i^{\intercal}\boldsymbol{\beta} + \epsilon \]
Or maybe this: \[ \frac{y^{\lambda}-1}{\lambda} = \mathbf{x}_i^{\intercal}\boldsymbol{\beta} + \epsilon \]
Notice that (using L’Hôpital’s rule for the second equality) \[ \lim_{\lambda \to 0}\frac{y^{\lambda}-1}{\lambda} = \lim_{\lambda \to 0}\frac{\exp[\lambda \log y]-1}{\lambda} = \lim_{\lambda \to 0}\exp[\lambda \log y] \log y = \log y < \infty \]
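A quick numerical check of the limit (with the response fixed at an arbitrary value, say 5):
y0 <- 5
sapply(c(1, 0.1, 0.01, 0.001), function(l) (y0^l - 1) / l) # approaches log(5) = 1.609...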
So we’ll go with this:
\[ \frac{y^{\lambda}-1}{\lambda} = \beta_0 + x_1\beta_1 + \cdots + x_k\beta_k + \epsilon \]
Confusion arises because there is a difference between \[ f\left(\frac{y_1^{\lambda}-1}{\lambda}, \ldots, \frac{y_n^{\lambda}-1}{\lambda} \bigg\rvert \lambda, \boldsymbol{\beta}, \sigma^2\right) \] and
\[ f\left(y_1, \ldots, y_n \mid \lambda, \boldsymbol{\beta}, \sigma^2\right) \]
We maximize the second one!
We have to solve these
\[ \frac{\partial \log f(y_1, \ldots, y_n \mid \lambda, \boldsymbol{\beta}, \sigma^2)}{\partial \boldsymbol{\beta}} \overset{\text{set}}{=} \mathbf{0} \] \[ \frac{\partial \log f(y_1, \ldots, y_n \mid \lambda, \boldsymbol{\beta}, \sigma^2)}{\partial \sigma^2} \overset{\text{set}}{=} 0 \]
\[ \frac{\partial \log f(y_1, \ldots, y_n \mid \lambda, \boldsymbol{\beta}, \sigma^2)}{\partial \lambda} \overset{\text{set}}{=} 0 \]
Solving the first two and plugging the solutions \(\hat{\boldsymbol{\beta}}(\lambda)\) and \(\hat{\sigma}^2(\lambda)\) back into the log-likelihood gives the profile log-likelihood \(\ell(\lambda)\). That’s maximized last.
The profile log-likelihood is easy to get in R
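For example, with MASS::boxcox() applied to the full model (a sketch; the exact call below, including the \(\lambda\) grid, is assumed rather than taken from the slides; mod is a fitted lm() object as in the earlier sketch):
library(MASS) # provides boxcox()
bc_res <- boxcox(mod, lambda = seq(-2, 2, length.out = 100), plotit = FALSE) # profile log-likelihood over a grid of lambda values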
best_lambda <- bc_res$x[bc_res$y == max(bc_res$y)] # or manually choose one in the range
cat(best_lambda)
## 0.3030303
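The chosen \(\lambda\) can then be applied to the response. A sketch, again assuming the data frame is called dat:
dat$transformed_y <- (dat$y^best_lambda - 1) / best_lambda # the Box-Cox transform with the chosen lambda
head(dat)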
## y x1 x2 x3 x4 x5 x6 x7 transformed_y
## 1 36.98 5.1 400 51.37 4.24 1484.83 2227.25 2.06 6.554973
## 2 13.74 26.4 400 72.33 30.87 289.94 434.90 1.33 4.000597
## 3 10.08 23.8 400 71.44 33.01 320.79 481.19 0.97 3.346498
## 4 8.53 46.4 400 79.15 44.61 164.76 247.14 0.62 3.018583
## 5 36.42 7.0 450 80.47 33.84 1097.26 1645.89 0.22 6.509509
## 6 26.59 12.6 450 89.90 41.26 605.06 907.59 0.76 5.617575
Alternatively, sometimes the “true” model can be transformed into a linear model.
If a “subject-matter expert” tells you that the model is
\[ y = \beta_0 e^{\beta_1 x}\epsilon \] then \[ \log y = \log \beta_0 + \beta_1 x + \log \epsilon \]
Another example:
\[ y = \beta_0 + \beta_1 \frac{1}{x} + \epsilon \]
becomes \[ y = \beta_0 + \beta_1 \tilde{x} + \epsilon \]
where \(\tilde{x} = \frac{1}{x}\)
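A sketch of fitting both transformed models in R (the data frame name df and the variable names y and x are assumed, not from the slides):
exp_fit   <- lm(log(y) ~ x, data = df)   # for y = beta0 * exp(beta1 * x) * eps; the intercept estimates log(beta0)
recip_fit <- lm(y ~ I(1/x), data = df)   # for y = beta0 + beta1 * (1/x) + eps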
More examples can be found on page 178.