Taylor R. Brown, PhD
We want to deal with the uncertainty of our estimates and of unseen data points. We do this in two ways:
hypothesis tests
intervals (prediction intervals, confidence intervals, etc.)
We don’t know \(\beta_1\), we just have \(\hat{\beta}_1\) (which is random)
We have two hypotheses: \[ H_0: \beta_1 = \beta_{1,0} \] and \[ H_A: \beta_1 \neq \beta_{1,0} \]
(\(\beta_{1,0}\) is usually \(0\))
There is no guarantee we reach the correct conclusion: \(\hat{\beta}_1\) is random, so any decision based on it can be wrong!
Two approaches:
t-test
F-test
(they are equivalent now, but they won’t be when we use more than 1 predictor)
Our model:
\[
Y = \beta_0 +\beta_1 X + \epsilon
\]
Plus a new assumption: the errors are normally distributed, \[ \epsilon \sim \text{Normal}(0, \sigma^2) \]
We showed earlier:
\(E[\hat{\beta}_1 ] = \beta_1\)
\(V[\hat{\beta}_1 ] = \sigma^2/S_{xx}\)
With normality we get \[ \hat{\beta}_1 \sim \text{Normal}\left(\beta_1, \sigma^2/S_{xx} \right) \] or
\[ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/S_{xx}} } \sim \text{Normal}\left(0, 1 \right) \]
We can also show that \[ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{ MS_{Res}/S_{xx}} } \sim t_{df_{Res}} \] Recall from module 1 that \(MS_{Res} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i )^2 }{n-2} = \frac{SS_{\text{res}} }{n-2}\).
(we still don’t know \(\beta_1\)…)
If we tentatively assume that \(H_0: \beta_1 = \beta_{1,0}\) is true, this implies that
\[ t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{ MS_{Res}/S_{xx}} } \sim t_{df_{Res}} \]
General idea: if we calculate \(t_0\) and it doesn’t look like it came from the null distribution (plotted below),
x <- seq(-3, 3, .01)
alpha <- .05
# density of the null distribution of t0 (here with 13 degrees of freedom)
plot(x, dt(x, 13), type = "l", xlab = "t0", ylab = "t density", main = "null distn")
# two-sided rejection region: reject H_0 if t0 falls outside the red lines
abline(v = qt(alpha/2, 13), col = "red")
abline(v = qt(1 - alpha/2, 13), col = "red")
then we reject our assumption about \(\beta_1\) (we keep assuming all the other things, though)
in R:
setwd("/home/t/UVa/all_teaching/summer19_6021/data")
my_data <- read.csv("data-table-B3.csv")
reg_mod_1 <- lm(y ~ x1, data = my_data)
summary(reg_mod_1)
##
## Call:
## lm(formula = y ~ x1, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7923 -1.9752 0.0044 1.7677 6.8171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.722677 1.443903 23.36 < 2e-16 ***
## x1 -0.047360 0.004695 -10.09 3.74e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.065 on 30 degrees of freedom
## Multiple R-squared: 0.7723, Adjusted R-squared: 0.7647
## F-statistic: 101.7 on 1 and 30 DF, p-value: 3.743e-11
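As a sanity check, we can rebuild the slope’s t-statistic and p-value by hand from the formula above (a sketch; the helper names below are ours, and it reuses my_data and reg_mod_1 from the previous chunk):

# t-test for the slope by hand (a sketch)
x <- my_data$x1
n <- nrow(my_data)
Sxx <- sum((x - mean(x))^2)
ms_res <- sum(resid(reg_mod_1)^2) / (n - 2)    # MS_Res
beta1_hat <- unname(coef(reg_mod_1)["x1"])
t0 <- (beta1_hat - 0) / sqrt(ms_res / Sxx)     # testing H_0: beta_1 = 0
p_value <- 2 * pt(-abs(t0), df = n - 2)        # two-sided p-value
c(t0 = t0, p_value = p_value)

These should agree with the x1 row of the summary output above.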
Goal: test to see if “anything is going on.”
\[ SS_T = SS_R + SS_{Res} \]
\[ df_T = df_R + df_{Res} \] or in this case \[ n-1 = 1 + (n-2) \]
Assuming \(H_0: \beta_1 = 0\)
\[ F_0 = \frac{ SS_R / df_R }{ SS_{Res}/df_{Res} } \sim F_{df_R, df_{Res}} \]
If \(\beta_1 \neq 0\), then \(F_0\) will tend to be larger than what the null distribution suggests!
Let’s assume we have \(22\) data points. Then \(df_R = 1\) and \(df_{Res} = 20\). If it’s true that \(H_0: \beta_1 = 0\) then this is the density of the random variable we’re calculating:
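Here is a sketch of that null density in R, mirroring the t-density plot above (the 1 and 20 degrees of freedom come from the hypothetical 22-point example):

x <- seq(.01, 6, .01)
alpha <- .05
# density of the null distribution of F0 (1 and 20 degrees of freedom)
plot(x, df(x, 1, 20), type = "l", xlab = "F0", ylab = "F density", main = "null distn")
# one-sided rejection region: reject H_0 when F0 falls beyond the red line
abline(v = qf(1 - alpha, 1, 20), col = "red")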
in R (the table below was presumably produced with anova(reg_mod_1)):
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 955.72 955.72 101.74 3.743e-11 ***
## Residuals 30 281.82 9.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
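We can also rebuild \(F_0\) and its p-value directly from the sums of squares (a sketch; it reuses reg_mod_1 and my_data from above):

# F-test by hand (a sketch)
ss_res <- sum(resid(reg_mod_1)^2)                  # SS_Res
ss_t   <- sum((my_data$y - mean(my_data$y))^2)     # SS_T
ss_r   <- ss_t - ss_res                            # SS_R
df_r   <- 1
df_res <- nrow(my_data) - 2
F0 <- (ss_r / df_r) / (ss_res / df_res)
p_value <- pf(F0, df_r, df_res, lower.tail = FALSE)
c(F0 = F0, p_value = p_value)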
Now we discuss constructing intervals
intervals for parameter estimates
intervals for “mean responses” \(\hat{\beta}_0 + \hat{\beta}_1 x\)
intervals for unseen data \(y_{new}\)
Again, we assume normality of errors, in addition to all other assumptions.
Recall that
\[ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{ MS_{Res}/S_{xx}} } \sim t_{df_{Res}} \]
this is not a test here, so we do not assume any particular value for \(\beta_1\)
\[ P\left(-t_{\alpha/2, df_{Res} } \le \frac{\hat{\beta}_1 - \beta_1}{\sqrt{ MS_{Res}/S_{xx}} } \le t_{\alpha/2, df_{Res} } \right) = 1-\alpha \] Rearranging the event inside the probability, this is equivalent to \[ P\left(\hat{\beta}_1 - t_{\alpha/2, df_{Res} } \sqrt{ MS_{Res}/S_{xx}} \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2, df_{Res} } \sqrt{ MS_{Res}/S_{xx}} \right) = 1-\alpha \]
So our interval is \[ \underbrace{\hat{\beta}_1}_{\text{point estimate}} \pm \overbrace{\underbrace{t_{\alpha/2, df_{Res} }}_{\text{multiplier}} \underbrace{\sqrt{ MS_{Res}/S_{xx}}}_{\text{standard error}}}^{\text{error}} \]
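In R, intervals like the ones below are available from confint(), which is presumably how this output was produced:

confint(reg_mod_1)   # 95% intervals by default (level = .95)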
## 2.5 % 97.5 %
## (Intercept) 30.77383383 36.67151954
## x1 -0.05694883 -0.03777032
You can show that \(SS_{Res}/\sigma^2 \sim \chi^2_{df_{Res}}\).
This means
\[ P\left(\chi^2_{1-\alpha/2, df_{Res}} \le SS_{Res}/\sigma^2 \le \chi^2_{\alpha/2,df_{Res}} \right) = 1-\alpha \] (here \(\chi^2_{\alpha/2,df_{Res}}\) denotes the upper \(\alpha/2\) quantile, so it is the larger of the two)
if you rearrange this to look like \(\text{left stuff} \le \sigma^2 \le \text{right stuff}\), then your interval becomes
\[ \left[\frac{SS_{Res} }{\chi^2_{\alpha/2,df_{Res}} }, \frac{SS_{Res} }{\chi^2_{1-\alpha/2, df_{Res}} } \right] \]
I couldn’t find a function for this, so we can do it manually
alpha <- .05
dof <- nrow(my_data) - 2                  # df_Res = n - 2
small_quantile <- qchisq(alpha/2, dof)    # lower-tail quantile
large_quantile <- qchisq(1 - alpha/2, dof) # upper-tail quantile
ssres <- sum(resid(reg_mod_1)^2)          # SS_Res
cat("lower bound: ", ssres/large_quantile,
    "\npoint estimate: ", ssres/dof,
    "\nupper bound: ", ssres/small_quantile)
## lower bound: 5.998913
## point estimate: 9.394146
## upper bound: 16.78448
We want an expected output for a hypothetical input \(x_0\) \[ E[y \mid x_0] = \beta_0 + \beta_1 x_0 \]
All we have is the random “guess”
\[ \hat{E}[y \mid x_0] = \hat{\beta}_0 + \hat{\beta}_1 x_0 \]
We will now use a little algebra trick that lets us show it’s normally distributed, and lets us find its mean and variance!
Recall
\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)
\(\hat{\beta}_1 = \sum_i c_i y_i\), where \(c_i = (x_i - \bar{x})/S_{xx}\)
so
\[\begin{align*} \hat{E}[y \mid x_0] &= \hat{\beta}_0 + \hat{\beta}_1 x_0 \\ &= \bar{y} + \hat{\beta}_1\left(x_0 - \bar{x}\right) \\ &= \sum_i \frac{y_i}{n}+ \sum_i c_i y_i\left(x_0 - \bar{x}\right) \\ &= \sum_i y_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right) \end{align*}\]
\[ \hat{E}[y \mid x_0] = \sum_i y_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right) \]
So \[\begin{align*} V\left( \hat{E}[y \mid x_0] \right) &= \sum_i V\left[ y_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right)\right] \tag{indep.} \\ &= \sum_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right)^2 V(y_i) \\ &= \sigma^2 \sum_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right)^2 \\ &= \sigma^2 \sum_i\left( \frac{1}{n^2} + 2 \frac{c_i}{n} (x_0 - \bar{x}) + c_i^2 (x_0 - \bar{x})^2 \right) \\ &= \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \end{align*}\]
because \(\sum_i c_i = 0\) and \(\sum_i c_i^2 = 1/S_{xx}\)
Through a similar line of reasoning, you can make a “z-score” with an estimated standard error, and it will be t-distributed. This yields the following interval for the mean response:
\[ \hat{\beta}_0 + \hat{\beta}_1 x_0 \pm t_{\alpha/2, df_{Res}} \sqrt{ MS_{Res} \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) } \]
in R:
#?predict.lm
new_df <- data.frame(x1=40) # the column name must match the predictor name in the formula (a common bug)
predict(reg_mod_1, new_df, interval = "confidence")
## fit lwr upr
## 1 31.82829 29.231 34.42559
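To connect this output back to the formula, here is a sketch that rebuilds the interval at \(x_0 = 40\) by hand (the variable names are ours):

# confidence interval for the mean response at x0 = 40, by hand (a sketch)
x0 <- 40
x <- my_data$x1
n <- nrow(my_data)
Sxx <- sum((x - mean(x))^2)
ms_res <- sum(resid(reg_mod_1)^2) / (n - 2)
fit <- sum(coef(reg_mod_1) * c(1, x0))                     # beta0_hat + beta1_hat * x0
se_fit <- sqrt(ms_res * (1/n + (x0 - mean(x))^2 / Sxx))    # std. error of the mean response
fit + c(-1, 1) * qt(1 - .05/2, df = n - 2) * se_fit        # should match lwr and upr above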
Now we want to predict \(y_0\) based on \(x_0\): \[ y_0 = \beta_0 + \beta_1 x_0 + \epsilon \]
And we have the same random “guess” based on \(\{y_1, \ldots, y_n, x_1, \ldots, x_n\}\):
\[ \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 \]
To find a prediction interval, we need to look at the prediction error \[ y_0 - \hat{y}_0 = (\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1)x_0 + \epsilon \] This is clearly a mean-zero normal random variable (each term has mean zero, and it is a linear combination of normals).
\[\begin{align*} V\left[ y_0 - \hat{y}_0 \right] &= \text{Cov}\left(y_0 - \hat{y}_0, y_0 - \hat{y}_0 \right) \\ &= \text{Cov}\left[y_0 , y_0 \right] -2 \text{Cov}\left[y_0 , \hat{y}_0 \right] + \text{Cov}\left[\hat{y}_0, \hat{y}_0 \right] \\ &= V[y_0] - 0 + V[\hat{y}_0] \\ &= \sigma^2 + \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \tag{previous} \end{align*}\]
A similar line of reasoning yields the following prediction interval:
\[ \hat{\beta}_0 + \hat{\beta}_1 x_0 \pm t_{\alpha/2, df_{Res}} \sqrt{ MS_{Res} \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) } \]
in R:
new_df <- data.frame(x1=40)
predict(reg_mod_1, new_df, interval = "prediction") # interval arg changed
## fit lwr upr
## 1 31.82829 25.05129 38.6053
Below, blue is a (pointwise) CI for the mean response and red is the PI.
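One way to draw such a figure in R (a sketch; the grid of x1 values is our own choice):

# scatterplot with pointwise confidence (blue) and prediction (red) bands (a sketch)
grid_df <- data.frame(x1 = seq(min(my_data$x1), max(my_data$x1), length.out = 100))
ci <- predict(reg_mod_1, grid_df, interval = "confidence")
pi <- predict(reg_mod_1, grid_df, interval = "prediction")
plot(my_data$x1, my_data$y, xlab = "x1", ylab = "y")
lines(grid_df$x1, ci[, "fit"])
lines(grid_df$x1, ci[, "lwr"], col = "blue")
lines(grid_df$x1, ci[, "upr"], col = "blue")
lines(grid_df$x1, pi[, "lwr"], col = "red")
lines(grid_df$x1, pi[, "upr"], col = "red")

Note that the prediction band is always wider than the confidence band, because of the extra \(1\) inside the square root.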