Taylor R. Brown, PhD
Recall our regression model from a few modules ago:
\[ y_i = \mathbf{x}_i^\intercal \boldsymbol{\beta} + \epsilon_i \]
and we assumed \(\{\epsilon_i\}_{i}\) are i.i.d. normal.
Now we’ll relax that assumption and allow the errors to be correlated!
When would we need this?
If there’s a missing regressor that you can find, adding it is probably the easiest fix; if not, you might want to stop assuming the errors are independent.
Beyond visually checking, we can formally test the hypothesis that the errors are correlated. Before we do that, we need some new tools.
The process \(\epsilon_1, \epsilon_2, \ldots\) is stationary if windows’ distributions don’t depend on where in time they are.
In other words,
\[ \epsilon_{3}, \epsilon_{5}, \ldots, \epsilon_{27} \] has the same joint distribution as \[ \epsilon_{3+s}, \epsilon_{5+s}, \ldots, \epsilon_{27+s} \]
So in particular, if you look at windows of length \(1\) and \(2\): the mean and variance of \(\epsilon_t\) do not depend on \(t\), and \(\text{Cov}(\epsilon_t, \epsilon_{t+1})\) is the same for every \(t\).
Correlation within a time series is usually referred to as autocorrelation (“auto” means self).
If you take a time series course, you’ll learn about the autocorrelation function \(\rho_{\epsilon} : \mathbb{Z} \to [-1,1]\)
\[ \rho_{\epsilon}(h) = \text{Corr}\left(\epsilon_t, \epsilon_{t+h} \right) = \frac{\text{Cov}\left(\epsilon_t, \epsilon_{t+h} \right) }{\text{V}(\epsilon_t)} \] where \(h\) is the “lag.”
This definition only works for stationary processes. Where is that required? The right-hand side must not depend on \(t\) (only on the lag \(h\)), and the denominator uses \(\text{V}(\epsilon_t) = \text{V}(\epsilon_{t+h})\).
Every stationary time series has an autocorrelation function. The process that we will be using a lot is called an autoregressive process of order \(1\) (called AR(1) for short):
\[\begin{gather} \epsilon_t = \phi \epsilon_{t-1} + a_t \\ \epsilon_1 \sim N\left(0, \frac{\sigma^2_a}{1-\phi^2} \right) \end{gather}\] where \(a_t\) are iid normal with mean \(0\) and variance \(\sigma^2_a\).
Notice that we require \(-1 < \phi < 1\); otherwise the process will not be stationary.
Notice that if \(\phi = 0\), this simplifies to i.i.d. noise!
Check that the mean and variance stay the same over time (the mean is \(0\) for every \(t\)); for the variance:
\[ \text{V}[\epsilon_2] = \text{V}[\phi \epsilon_{1} + a_2] = \phi^2\frac{\sigma^2_a}{1-\phi^2} + \sigma^2_a = \frac{\sigma^2_a}{1-\phi^2} = \text{V}[\epsilon_1] \]
We use \(\text{V}[\epsilon_t] = \frac{\sigma^2_a}{1-\phi^2}\) in the next slide…
Now we find the autocorrelations. It’s easier to start with the autocovariances, because then we can use bilinearity of covariance.
\[ \text{Cov}[\epsilon_{t-1}, \epsilon_{t}] = \text{Cov}[\epsilon_{t-1},\phi \epsilon_{t-1} + a_t] = \phi \text{Cov}[\epsilon_{t-1}, \epsilon_{t-1}] = \phi \text{V}[\epsilon_t] \] (the \(a_t\) term drops out because \(a_t\) is independent of \(\epsilon_{t-1}\), and \(\text{V}[\epsilon_{t-1}] = \text{V}[\epsilon_{t}]\) by stationarity).
So \[ \rho_{\epsilon}(1) = \text{Corr}\left(\epsilon_{t-1}, \epsilon_{t} \right) = \phi \]
At higher lags:
\[\begin{align*} \text{Cov}[\epsilon_{t-2}, \epsilon_{t}] &= \text{Cov}[\epsilon_{t-2},\phi \epsilon_{t-1} + a_t] \\ &= \text{Cov}[\epsilon_{t-2},\phi \epsilon_{t-1}] \\ &= \text{Cov}[\epsilon_{t-2},\phi^2 \epsilon_{t-2} + \phi a_{t-1}] \\ &= \phi^2 \text{V}[\epsilon_t] \end{align*}\]
So \[ \rho_{\epsilon}(2) = \text{Corr}\left(\epsilon_{t-2}, \epsilon_{t} \right) = \phi^2 \]
Proceeding inductively: \(\rho_{\epsilon}(h) = \phi^h\).
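As a quick check (not from the slides), we can simulate an AR(1) in R and compare its sample autocorrelations to \(\phi^h\); the values of phi and the sample size below are arbitrary.
set.seed(1)
phi <- 0.7
# simulate a stationary AR(1) with standard normal innovations
eps <- arima.sim(model = list(ar = phi), n = 5000)
# sample autocorrelations at lags 1 through 5 versus the theoretical phi^h
sample_acf <- acf(eps, lag.max = 5, plot = FALSE)$acf[2:6]
round(rbind(sample = sample_acf, theory = phi^(1:5)), 3)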
The Durbin-Watson test tests \[ H_0: \phi = 0 \] against \[ H_A: 1 > \phi > 0. \]
The test statistic it uses is \[ d = \frac{\sum_{t=2}^T(e_t - e_{t-1})^2}{\sum_{t=1}^T e_t^2} \] where \(T\) is the sample size, and \(e_t\) are the residuals obtained after fitting a regular OLS regression.
Reject \(H_0\) when \(d\) is small! Roughly, \(d \approx 2(1 - \hat{\rho}_{\epsilon}(1))\), so small values of \(d\) indicate positive autocorrelation in the residuals.
R’s dwtest() (from the lmtest package) has an involved algorithm for finding the p-values called “Pan’s algorithm.”
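The slides don’t show the call itself, but output of this shape would come from dwtest() applied to the running example’s fit (an assumed reconstruction):
library(lmtest)
# Durbin-Watson test on the residuals of the OLS fit; the default
# alternative is "greater", i.e. positive autocorrelation (phi > 0)
dwtest(lm(y ~ x))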
##
## Durbin-Watson test
##
## data: lm(y ~ x)
## DW = 0.51098, p-value = 9.403e-10
## alternative hypothesis: true autocorrelation is greater than 0
Say we have identified the problem, and now we must fit the following model:
\[\begin{gather} y_t = \mathbf{x}_t^\intercal \boldsymbol{\beta} + \epsilon_t \\ \epsilon_t = \phi \epsilon_{t-1} + a_t \\ \epsilon_1 \sim N\left(0, \frac{\sigma^2_a}{1-\phi^2} \right) \end{gather}\]
The book mentions the Cochrane-Orcutt procedure, but MLE is better.
Recall our original OLS model from module 3:
\[ \mathbf{y} \sim \text{Normal}\left(\mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n \right) \] which is the same as \[ f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi)^{-n/2}\det[\sigma^2 \mathbf{I}_n]^{-1/2}\exp\left[-\frac{1}{2 }(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal} \left[\sigma^2 \mathbf{I}_n \right]^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \]
The only thing different now, when we have correlated errors, is the covariance matrix.
We must replace \(\sigma^2 \mathbf{I}_n\) with a general covariance matrix \(\boldsymbol{\Gamma}\).
To find \(\boldsymbol{\Gamma}\), we can just use the autocovariances we found before. Recall
\[ \text{Cov}[\epsilon_{t}, \epsilon_{t+h}] = \phi^h \text{V}[\epsilon_t] = \phi^h \frac{\sigma^2_a}{1-\phi^2}. \]
So \[ \boldsymbol{\Gamma} = \frac{\sigma^2_a}{1-\phi^2} \begin{bmatrix} 1 & \phi^1 & \phi^2 & \cdots & \phi^{n-1} \\ \phi^1 & 1 & \phi^1 & \cdots & \phi^{n-2} \\ \vdots & \vdots & \ddots & \vdots & \vdots\\ \phi^{n-2} & \cdots & \cdots & \cdots & \phi^{1} \\ \phi^{n-1} & \cdots & \cdots & \phi^1 & 1 \\ \end{bmatrix} \]
Matrices that are constant along each diagonal, like this one, are called “Toeplitz matrices.”
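In R, the base function toeplitz() builds exactly this pattern; here is a small sketch, with phi, sigma2_a, and n chosen arbitrarily for illustration.
phi <- 0.7; sigma2_a <- 1; n <- 5
# Gamma[i, j] = sigma2_a / (1 - phi^2) * phi^(|i - j|)
Gamma <- (sigma2_a / (1 - phi^2)) * toeplitz(phi^(0:(n - 1)))
Gamma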
So we have our likelihood that we can maximize with any numerical scheme:
\[ f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \boldsymbol{\Gamma}) = (2\pi)^{-n/2}\det[\boldsymbol{\Gamma}]^{-1/2}\exp\left[-\frac{1}{2 }(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal} \boldsymbol{\Gamma} ^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \]
Beware: it is very easy to program an evaluation of this poorly. \(\boldsymbol{\Gamma}\) is \(n \times n\), so naively inverting it on a large data set could be very slow.
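A common remedy (a sketch, not the book’s code) is to work on the log scale and use a Cholesky factor of \(\boldsymbol{\Gamma}\), solving triangular systems instead of forming \(\boldsymbol{\Gamma}^{-1}\); for an AR(1) specifically, R’s arima() goes further and computes the likelihood through a state-space (Kalman filter) representation.
# log-likelihood of y ~ N(X beta, Gamma) without explicitly inverting Gamma
loglik_gaussian <- function(y, X, beta, Gamma) {
  n <- length(y)
  r <- y - X %*% beta
  L <- t(chol(Gamma))               # Gamma = L %*% t(L), L lower triangular
  z <- forwardsolve(L, r)           # solves L z = r, so sum(z^2) = r' Gamma^{-1} r
  logdet <- 2 * sum(log(diag(L)))   # log det(Gamma)
  -0.5 * (n * log(2 * pi) + logdet + sum(z^2))
}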
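The fitting code isn’t shown in these slides, but the summary output below matches the structure of an “Arima” object from stats::arima(), which maximizes this likelihood; a call like the following (reusing the example’s y and x) is consistent with it.
# regression on x with AR(1) errors, fit by maximum likelihood
mod <- arima(y, order = c(1, 0, 0), xreg = x)
summary(mod)   # the component listing shown below
coef(mod)      # ar1, intercept, and the coefficient on x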
## Length Class Mode
## coef 3 -none- numeric
## sigma2 1 -none- numeric
## var.coef 9 -none- numeric
## mask 3 -none- logical
## loglik 1 -none- numeric
## aic 1 -none- numeric
## arma 7 -none- numeric
## residuals 41 ts numeric
## call 6 -none- call
## series 1 -none- character
## code 1 -none- numeric
## n.cond 1 -none- numeric
## nobs 1 -none- numeric
## model 10 -none- list
## ar1 intercept x
## 0.7426234 -4.1972843 -2.0486754
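For comparison, here is the ordinary least-squares fit, which ignores the autocorrelation entirely: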
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2457 -2.9641 -0.4851 3.5674 9.8234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.93806 0.81380 -6.068 4.17e-07 ***
## x -2.02474 0.05253 -38.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.98 on 39 degrees of freedom
## Multiple R-squared: 0.9744, Adjusted R-squared: 0.9738
## F-statistic: 1486 on 1 and 39 DF, p-value: < 2.2e-16
And now for something totally different: the bootstrap, a class of computationally intensive techniques for finding confidence or prediction intervals, or performing hypothesis tests.
“Bootstrap” is a pretty vague word that can refer to probably hundreds of different procedures, but what they all have in common is that they sample with replacement \(m\) times from the existing data set.
This approximates having \(m\) independent fresh data sets that are generated randomly under the same conditions. If we had many data sets, we wouldn’t need to derive the distribution of \(\hat{\beta}\)…we could just look at a histogram!
The book describes two ways to use bootstrapping for regression models. The first is bootstrapping cases.
Say your data set is \(\{ (y_i, \mathbf{x}_i)\}_{i=1}^n\). For \(j=1,\ldots, m\): sample \(n\) row indices with replacement, form the resampled data set \(\{ (y_i^*, \mathbf{x}_i^*)\}_{i=1}^n\), and refit the regression on it to get \(\boldsymbol{\hat{\beta}}_j^*\).
Then, look at the histogram of \(\{\boldsymbol{\hat{\beta}}_j^*\}_{j=1}^m\)!
A little experiment: generate some fake data and see how the bootstrap compares! sample() or sample.int() are the workhorses.
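The data-generating code isn’t shown in these slides; something along the following lines (the true coefficient values, sample size, and error distribution are guesses for illustration) would set up the x, y, and true_beta0 used below.
set.seed(2)
n <- 50
true_beta0 <- -3     # the "true unknown" intercept referenced later
true_beta1 <- 3
x <- rnorm(n)
y <- true_beta0 + true_beta1 * x + rnorm(n)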
random_indexes <- sample.int(length(x), replace=TRUE) # need replace=T
sort(random_indexes) # just to show that some are repeating
## [1] 1 2 2 3 3 3 4 4 6 10 10 13 13 13 14 14 16 16 16 18 20 21 22
## [24] 24 24 25 25 26 27 28 31 33 35 36 37 37 39 39 40 41 41 42 44 44 44 47
## [47] 48 50 50 50
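The coefficients printed next presumably come from refitting on the resampled pairs, e.g.:
xstar <- x[random_indexes]
ystar <- y[random_indexes]
coefficients(lm(ystar ~ xstar))   # one bootstrap replicate of the coefficient estimates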
## (Intercept) xstar
## -3.202016 3.015001
Do it over and over again. Say we’re only interested in the intercept:
bootstrap_once <- function(){
random_indexes <- sample.int(length(x), replace=TRUE) # need replace=T
xstar <- x[random_indexes]
ystar <- y[random_indexes]
return(coefficients(lm(ystar ~ xstar))[1])
}
m <- 1e3
beta0Stars <- replicate(m, bootstrap_once())
hist(beta0Stars)
abline(v=true_beta0, col = "red") # true unknown
abline(v=coefficients(lm(y ~ x))[1], col ="green") # estimated from our sample
abline(v=quantile(beta0Stars, .025), col = "blue") # bootstrap lower
abline(v=quantile(beta0Stars, .975), col = "blue") # bootstrap upper
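For reference, the classical confidence intervals, presumably from confint(lm(y ~ x)):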
## 2.5 % 97.5 %
## (Intercept) -4.279896 -1.735752
## x 2.986427 3.030270
The second way to bootstrap a regression model is bootstrapping residuals.
Say your data set is \(\{ (y_i, \mathbf{x}_i)\}_{i=1}^n\). Regress original \(\mathbf{y}\) on original \(\mathbf{X}\) to obtain residuals \(\mathbf{e}\) and coefficient estimates \(\boldsymbol{\hat{\beta}}\).
For \(j=1,\ldots, m\): sample \(n\) residuals from \(\mathbf{e}\) with replacement to get \(\mathbf{e}^*\), set \(\mathbf{y}^* = \mathbf{X}\boldsymbol{\hat{\beta}} + \mathbf{e}^*\), and regress \(\mathbf{y}^*\) on \(\mathbf{X}\) to get \(\boldsymbol{\hat{\beta}}_j^*\).
Then, look at the histogram of \(\{\boldsymbol{\hat{\beta}}_j^*\}_{j=1}^m\)!
original_beta_hat <- as.matrix(coefficients(lm(y ~ x)))
original_residuals <- residuals(lm(y ~ x))
X <- matrix(c(rep(1,length(x)), x), ncol=2)
bootstrap_resids_once <- function(){
random_indexes <- sample.int(length(x), replace=TRUE) # need replace=T
estar <- original_residuals[random_indexes]
ystar <- X %*% original_beta_hat + estar
return(coefficients(lm(ystar ~ x))[1])
}
m <- 1e3
beta0Stars <- replicate(m, bootstrap_resids_once())
hist(beta0Stars)
abline(v=true_beta0, col = "red") # true unknown
abline(v=coefficients(lm(y ~ x))[1], col ="green") # estimated from our sample
abline(v=quantile(beta0Stars, .025), col = "blue") # bootstrap lower
abline(v=quantile(beta0Stars, .975), col = "blue") # bootstrap upper
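And again the classical intervals, presumably from confint(lm(y ~ x)), for comparison with the bootstrap quantiles: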
## 2.5 % 97.5 %
## (Intercept) -4.279896 -1.735752
## x 2.986427 3.030270