Module 2

STAT 6021

Taylor R. Brown, PhD

Quantifying Uncertainty

We want to deal with the uncertainty of our estimates and of unseen data points. We do this in two ways:

Hypothesis Testing!

We don’t know \(\beta_1\), we just have \(\hat{\beta}_1\) (which is random)


We have two hypotheses: \[ H_0: \beta_1 = \beta_{1,0} \] and \[ H_A: \beta_1 \neq \beta_{1,0} \]

(\(\beta_{1,0}\) is usually \(0\))

Hypothesis Testing

There is no guarantee that we reach the correct conclusion!

Hypothesis Testing

Two approaches: the t-test and the F-test (analysis of variance)

(they are equivalent now, but they won’t be when we use more than 1 predictor)

The t-test

Our model:

\[ Y = \beta_0 +\beta_1 X + \epsilon \]

Plus a new assumption: \(\epsilon \sim \text{Normal}\left(0, \sigma^2\right)\)

The t-test

We showed earlier that \(E[\hat{\beta}_1] = \beta_1\) and \(V[\hat{\beta}_1] = \sigma^2/S_{xx}\).

With normality we get \[ \hat{\beta}_1 \sim \text{Normal}\left(\beta_1, \sigma^2/S_{xx} \right) \] or

\[ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/S_{xx}} } \sim \text{Normal}\left(0, 1 \right) \]

The t-test

We can also show that \[ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{ MS_{Res}/S_{xx}} } \sim t_{df_{Res}} \] Recall from module 1 that \(MS_{Res} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i )^2 }{n-2} = \frac{SS_{\text{res}} }{n-2}\).

(we still don’t know \(\beta_1\)…)

The t-test

Assume tentatively that \(H_0: \beta_1 = \beta_{1,0}\) is true. This implies that

\[ t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{ MS_{Res}/S_{xx}} } \sim t_{df_{Res}} \]

The t-test

General idea: if we calculate \(t_0\) but it doesn’t look like it’s coming from the null distn.,

x <- seq(-3,3,.01)
alpha <- .05
plot(x,dt(x, 13), type = "l", xlab = "t0", ylab = "t density", main = "null distn") # 13 df, as an example
abline(v=qt(alpha/2, 13),col="red")   # lower rejection cutoff
abline(v=qt(1-alpha/2, 13),col="red") # upper rejection cutoff

then we reject our assumption about \(\beta_1\) (we keep assuming all the other things, though)

The t-test in R

setwd("/home/t/UVa/all_teaching/summer19_6021/data")
my_data <- read.csv("data-table-B3.csv")
reg_mod_1 <- lm(y ~ x1, data = my_data)
summary(reg_mod_1)  
## 
## Call:
## lm(formula = y ~ x1, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7923 -1.9752  0.0044  1.7677  6.8171 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.722677   1.443903   23.36  < 2e-16 ***
## x1          -0.047360   0.004695  -10.09 3.74e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.065 on 30 degrees of freedom
## Multiple R-squared:  0.7723, Adjusted R-squared:  0.7647 
## F-statistic: 101.7 on 1 and 30 DF,  p-value: 3.743e-11
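
A quick sketch reproducing the t statistic and two-sided p-value for x1 by hand; it should agree with the summary() output above:

Sxx <- sum((my_data$x1 - mean(my_data$x1))^2)
dof <- nrow(my_data) - 2
ms_res <- sum(resid(reg_mod_1)^2)/dof        # MS_Res
b1_hat <- coef(reg_mod_1)[["x1"]]
t0 <- (b1_hat - 0)/sqrt(ms_res/Sxx)          # testing H0: beta_1 = 0
p_value <- 2*pt(-abs(t0), dof)               # two-sided p-value
c(t0 = t0, p_value = p_value)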

The F-test or Analysis of Variance

Goal: test to see if “anything is going on.”

\[ SS_T = SS_R + SS_{Res} \]

\[ df_T = df_R + df_{Res} \] or in this case \[ n-1 = 1 + (n-2) \]

Assuming \(H_0: \beta_1 = 0\)

\[ F_0 = \frac{ SS_R / df_R }{ SS_{Res}/df_{Res} } \sim F_{df_R, df_{Res}} \]

If \(\beta_1 \neq 0\), then \(F_0\) tends to be larger than the null distribution would suggest!

Analysis of Variance

Let’s assume we have \(22\) data points. Then \(df_R = 1\) and \(df_{Res} = 20\). If it’s true that \(H_0: \beta_1 = 0\) then this is the density of the random variable we’re calculating:

x <- seq(0,10,.1)
y <- df(x, 1, 20)
cutoff <- qf(.95, df1 = 1, df2 = 20)  # upper 5% critical value
plot(x,y, type = "l", xlab = "F0", ylab = "F_{1,20} density", main = "null distn.")
abline(v=cutoff, col = "red")         # reject H0 when F0 lands to the right of this line

Analysis of Variance in R

anova(reg_mod_1)  # analysis of variance = anova
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x1         1 955.72  955.72  101.74 3.743e-11 ***
## Residuals 30 281.82    9.39                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
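
As a sketch, we can rebuild the table by hand and check that \(F_0 = t_0^2\) here (true only with a single predictor):

ss_t <- sum((my_data$y - mean(my_data$y))^2)    # SS_T
ss_res <- sum(resid(reg_mod_1)^2)               # SS_Res
ss_r <- ss_t - ss_res                           # SS_R
F0 <- (ss_r/1) / (ss_res/(nrow(my_data) - 2))   # MS_R / MS_Res
c(F0 = F0, p_value = pf(F0, 1, nrow(my_data) - 2, lower.tail = FALSE))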

Intervals!

Now we discuss constructing intervals

Intervals for parameter estimates

Again, we assume normality of errors, in addition to all other assumptions.

Recall that

\[ \frac{\hat{\beta}_1 - \beta_1}{\sqrt{ MS_{Res}/S_{xx}} } \sim t_{df_{Res}} \]

this is not a test here, so we do not assume anything about \(\beta_1\)

Intervals for the slope

\[ P\left(-t_{\alpha/2, df_{Res} } \le \frac{\hat{\beta}_1 - \beta_1}{\sqrt{ MS_{Res}/S_{xx}} } \le t_{\alpha/2, df_{Res} } \right) = 1-\alpha \] the left hand side is equivalent to \[ P\left(\hat{\beta}_1 - t_{\alpha/2, df_{Res} } \sqrt{ MS_{Res}/S_{xx}} \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2, df_{Res} } \sqrt{ MS_{Res}/S_{xx}} \right) \]

Intervals for the slope

So our interval is \[ \underbrace{\hat{\beta}_1}_{\text{point estimate}} \pm \overbrace{\underbrace{t_{\alpha/2, df_{Res} }}_{\text{multiplier}} \underbrace{\sqrt{ MS_{Res}/S_{xx}}}_{\text{standard error}}}^{\text{error}} \]

alpha <- .05
confint(reg_mod_1, level = 1-alpha)
##                   2.5 %      97.5 %
## (Intercept) 30.77383383 36.67151954
## x1          -0.05694883 -0.03777032
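
The same interval computed by hand (a sketch; it should match the x1 row from confint() above):

dof <- nrow(my_data) - 2
se_b1 <- sqrt((sum(resid(reg_mod_1)^2)/dof) / sum((my_data$x1 - mean(my_data$x1))^2))
b1_hat <- coef(reg_mod_1)[["x1"]]
c(lower = b1_hat - qt(1 - alpha/2, dof)*se_b1,
  upper = b1_hat + qt(1 - alpha/2, dof)*se_b1)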

Intervals for \(\sigma^2\)

You can show that \(SS_{Res}/\sigma^2 \sim \chi^2_{df_{Res}}\).

This means

\[ P\left(\chi^2_{1-\alpha/2, df_{Res}} \le SS_{Res}/\sigma^2 \le \chi^2_{\alpha/2,df_{Res}} \right) = 1-\alpha \]

if you rearrange this to look like \(\text{left stuff} \le \sigma^2 \le \text{right stuff}\), then your interval becomes

\[ \left[\frac{SS_{Res} }{\chi^2_{\alpha/2,df_{Res}} }, \frac{SS_{Res} }{\chi^2_{1-\alpha/2, df_{Res}} } \right] \]

Intervals for \(\sigma^2\)

I couldn’t find a function for this, so we can do it manually

alpha <- .05
dof <- nrow(my_data)-2
small_quantile <- qchisq(alpha/2, dof)
large_quantile <- qchisq(1-alpha/2, dof)
ssres <- sum(resid(reg_mod_1)^2)
cat("lower bound: ", ssres/large_quantile, 
    "\npoint estimate: ", ssres/dof, 
    "\nupper bound: ", ssres/small_quantile)
## lower bound:  5.998913 
## point estimate:  9.394146 
## upper bound:  16.78448

Intervals for the mean response

We want an expected output for a hypothetical input \(x_0\) \[ E[y \mid x_0] = \beta_0 + \beta_1 x_0 \]

All we have is the random “guess”

\[ \hat{E}[y \mid x_0] = \hat{\beta}_0 + \hat{\beta}_1 x_0 \]

We will now use a trick that lets us show it’s normally distributed, and find its mean and variance!

Intervals for the mean response

Recall that \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\) and that \(\hat{\beta}_1 = \sum_i c_i y_i\), where \(c_i = (x_i - \bar{x})/S_{xx}\),

so

\[\begin{align*} \hat{E}[y \mid x_0] &= \hat{\beta}_0 + \hat{\beta}_1 x_0 \\ &= \bar{y} + \hat{\beta}_1\left(x_0 - \bar{x}\right) \\ &= \sum_i \frac{y_i}{n}+ \sum_i c_i y_i\left(x_0 - \bar{x}\right) \\ &= \sum_i y_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right) \end{align*}\]
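
A small numeric sketch of this identity, using the earlier fit and (hypothetically) \(x_0 = 40\); both results below should be the same number:

x0 <- 40
xbar <- mean(my_data$x1)
Sxx <- sum((my_data$x1 - xbar)^2)
c_i <- (my_data$x1 - xbar)/Sxx                         # the c_i weights
sum(my_data$y * (1/nrow(my_data) + c_i*(x0 - xbar)))   # weighted sum of the y_i
coef(reg_mod_1)[[1]] + coef(reg_mod_1)[[2]]*x0         # beta0_hat + beta1_hat * x0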

Intervals for the mean response

\[ \hat{E}[y \mid x_0] = \sum_i y_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right) \]

So \[\begin{align*} V\left( \hat{E}[y \mid x_0] \right) &= \sum_i V\left[ y_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right)\right] \tag{indep.} \\ &= \sum_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right)^2 V(y_i) \\ &= \sigma^2 \sum_i \left(\frac{1}{n} + c_i (x_0 - \bar{x}) \right)^2 \\ &= \sigma^2 \sum_i\left( \frac{1}{n^2} + 2 \frac{c_i}{n} (x_0 - \bar{x}) + c_i^2 (x_0 - \bar{x})^2 \right) \\ &= \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \end{align*}\]

because \(\sum_i c_i = 0\) and \(\sum_i c_i^2 = 1/S_{xx}\)

Intervals for the mean response

Through a similar line of reasoning, you can make a “z-score” with an estimated standard error, and it will be t-distributed. This yields the following interval for the mean response:

\[ \hat{\beta}_0 + \hat{\beta}_1 x_0 \pm t_{\alpha/2, df_{Res}} \sqrt{ MS_{Res} \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) } \]

in R:

#?predict.lm
new_df <- data.frame(x1=40) # the column name must match the predictor's name (a common bug)
predict(reg_mod_1, new_df, interval = "confidence") 
##        fit    lwr      upr
## 1 31.82829 29.231 34.42559
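
The same interval by hand (a sketch plugging into the formula above; it should match the predict() output):

x0 <- 40
n <- nrow(my_data)
xbar <- mean(my_data$x1)
Sxx <- sum((my_data$x1 - xbar)^2)
ms_res <- sum(resid(reg_mod_1)^2)/(n - 2)
fit <- coef(reg_mod_1)[[1]] + coef(reg_mod_1)[[2]]*x0
err <- qt(1 - alpha/2, n - 2)*sqrt(ms_res*(1/n + (x0 - xbar)^2/Sxx))
c(fit = fit, lwr = fit - err, upr = fit + err)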

Prediction intervals!

Now we want to predict \(y_0\) based on \(x_0\): \[ y_0 = \beta_0 + \beta_1 x_0 + \epsilon \]

And we have the same random “guess” based on \(\{y_1, \ldots, y_n, x_1, \ldots, x_n\}\):

\[ \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 \]

To find a prediction interval, we need to look at the prediction error \[ y_0 - \hat{y}_0 = (\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1)x_0 + \epsilon \] Clearly a mean zero normal variate.

Prediction intervals

\[\begin{align*} V\left[ y_0 - \hat{y}_0 \right] &= \text{Cov}\left(y_0 - \hat{y}_0, y_0 - \hat{y}_0 \right) \\ &= \text{Cov}\left[y_0 , y_0 \right] -2 \text{Cov}\left[y_0 , \hat{y}_0 \right] + \text{Cov}\left[\hat{y}_0, \hat{y}_0 \right] \\ &= V[y_0] - 0 + V[\hat{y}_0] \\ &= \sigma^2 + \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \tag{previous} \end{align*}\]

(\(\text{Cov}\left[y_0 , \hat{y}_0 \right] = 0\) because \(y_0\) is a new observation, independent of the data used to compute \(\hat{y}_0\))

Prediction intervals

A similar line of reasoning yields the following prediction interval:

\[ \hat{\beta}_0 + \hat{\beta}_1 x_0 \pm t_{\alpha/2, df_{Res}} \sqrt{ MS_{Res} \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) } \]

in R:

new_df <- data.frame(x1=40) 
predict(reg_mod_1, new_df, interval = "prediction") # interval arg changed
##        fit      lwr     upr
## 1 31.82829 25.05129 38.6053
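
By hand (a sketch reusing x0, n, xbar, Sxx, ms_res, and fit from the mean-response sketch above), the only change is the extra \(1\) under the square root:

err_pi <- qt(1 - alpha/2, n - 2)*sqrt(ms_res*(1 + 1/n + (x0 - xbar)^2/Sxx))
c(fit = fit, lwr = fit - err_pi, upr = fit + err_pi)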

Interval Comparison

Blue is a CI for the mean response; red is the PI.
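
A minimal sketch that produces a comparable figure (using the same reg_mod_1 fit and a grid of \(x_0\) values):

x_grid <- data.frame(x1 = seq(min(my_data$x1), max(my_data$x1), length.out = 100))
ci_band <- predict(reg_mod_1, x_grid, interval = "confidence")
pi_band <- predict(reg_mod_1, x_grid, interval = "prediction")
plot(my_data$x1, my_data$y, xlab = "x1", ylab = "y", main = "CI (blue) vs. PI (red)")
lines(x_grid$x1, ci_band[,"fit"])                 # fitted line
lines(x_grid$x1, ci_band[,"lwr"], col = "blue")   # CI for the mean response
lines(x_grid$x1, ci_band[,"upr"], col = "blue")
lines(x_grid$x1, pi_band[,"lwr"], col = "red")    # prediction interval
lines(x_grid$x1, pi_band[,"upr"], col = "red")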