Taylor R. Brown, PhD
More predictors!
y = \beta_0 + \beta_1 x_1 + \beta_2x_2 + \cdots + \beta_k x_k + \epsilon
tip: don’t say “multivariate”; this is multiple regression (several predictors, one response), while “multivariate” refers to several responses
E[y] = \beta_0 + \beta_1 x_1 + \beta_2x_2 + \cdots + \beta_k x_k
\frac{dE[y]}{dx_3} = \beta_3
a one-unit change in x_3 changes E[y] by \beta_3
assuming all other regressors are held fixed
and ignoring the noise term
These are very flexible…
Special case example: polynomial regression
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon
# coefficients (beta_1, beta_2, beta_3); the intercept beta_0 is 0 here
betas <- c(-1, 2, -3)
x <- seq(-3, 3, .1)
y_line <- cbind(x, x^2, x^3) %*% betas          # the true mean E[y]
y <- y_line + rnorm(n = length(x), sd = 20)     # add noise
plot(x, y)
points(x, y_line, col = "red", type = "l")      # overlay the true mean curve
Special case example: interaction effects y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2 + \epsilon with (\beta_0, \beta_1, \beta_2, \beta_3) = (-2, 3, -.2, -4)
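Here is a minimal simulation sketch for the interaction model above (not from the slides; the predictor ranges, sample size, and noise level are made up). The x1*x2 formula syntax expands to x1 + x2 + x1:x2, so the x1:x2 term estimates \beta_3.
x1 <- runif(100, 0, 5)
x2 <- runif(100, 0, 5)
bet <- c(-2, 3, -.2, -4)
mean_y <- bet[1] + bet[2]*x1 + bet[3]*x2 + bet[4]*x1*x2
y <- mean_y + rnorm(100, sd = 2)         # arbitrary noise level
coef(lm(y ~ x1*x2))                      # x1:x2 estimates the interaction coefficient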
Special case example: second-order model with interactions
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5x_1x_2 + \epsilon with (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5) = (800, 10, 7, 8.5, -5, 4)
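A short sketch of what the mean surface looks like for these coefficients (the grid range here is arbitrary, not from the slides); persp() works just as well as contour().
bet <- c(800, 10, 7, 8.5, -5, 4)
x1_grid <- seq(0, 10, length.out = 50)
x2_grid <- seq(0, 10, length.out = 50)
Ey <- outer(x1_grid, x2_grid,
            function(u, v) bet[1] + bet[2]*u + bet[3]*v + bet[4]*u^2 + bet[5]*v^2 + bet[6]*u*v)
contour(x1_grid, x2_grid, Ey, xlab = "x1", ylab = "x2", main = "E[y] surface")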
To estimate the coefficients, we again look at the loss S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^n(y_i - [\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}])^2 We take derivatives with respect to all of the betas, set them all equal to 0, and solve the resulting system of equations…
It’s cleaner if we use matrices
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \hspace{10mm} \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k}\\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^{\intercal} \\ \mathbf{x}_2^{\intercal}\\ \vdots \\ \mathbf{x}_n^{\intercal} \end{bmatrix}
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} \hspace{10mm} \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2\\ \vdots\\ \epsilon_{n} \end{bmatrix}
rewriting the loss
\begin{align*} S(\beta_0, \beta_1, \ldots, \beta_k) &= S(\boldsymbol{\beta})\\ &= \sum_{i=1}^n(y_i - [\beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} + \cdots + \beta_k x_{ik}])^2 \\ &= \sum_{i=1}^n(y_i - \mathbf{x}_i^{\intercal}\boldsymbol{\beta} )^2 \\ &= (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \end{align*}
Warning: you have to overload either \boldsymbol{\beta} or \boldsymbol{\epsilon}. The book chooses the latter; we, however, always use \boldsymbol{\epsilon} for the true, unobserved random errors, never for residuals. So when we write the loss, we do not think of \boldsymbol{\beta} as the true coefficient vector; we think of it as an arbitrary input to this score function.
A few more moves
\begin{align*} S(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &= (\mathbf{y}^{\intercal} - [\mathbf{X} \boldsymbol{\beta}]^{\intercal})(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &=(\mathbf{y}^{\intercal} - \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal} )(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &= \mathbf{y}^{\intercal}\mathbf{y} - \mathbf{y}^{\intercal}\mathbf{X} \boldsymbol{\beta} \\ &\hspace{10mm} - \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal} \mathbf{y} + \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal}\mathbf{X} \boldsymbol{\beta} \\ &= \mathbf{y}^{\intercal}\mathbf{y} - 2\mathbf{y}^{\intercal}\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal}\mathbf{X} \boldsymbol{\beta} \end{align*}
The last step works because \mathbf{y}^{\intercal}\mathbf{X} \boldsymbol{\beta} is a scalar, so it equals its own transpose \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal} \mathbf{y}.
taking the derivatives of the loss S(\boldsymbol{\beta}) = \mathbf{y}^{\intercal}\mathbf{y} - 2\mathbf{y}^{\intercal}\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal}\mathbf{X} \boldsymbol{\beta}
\frac{\partial S}{\partial\boldsymbol{\beta} } = -2\mathbf{X}^{\intercal} \mathbf{y} + 2\mathbf{X}^{\intercal}\mathbf{X}\boldsymbol{\beta} \overset{\text{set}}{=} \mathbf{0}
Rearranging yields the normal equations
(\mathbf{X}^{\intercal}\mathbf{X})\boldsymbol{\hat{\beta}} = \mathbf{X}^{\intercal} \mathbf{y}
and solving for the coefficient vector finally gives us
\boldsymbol{\hat{\beta}} = (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y}
Multiplying both sides by the design matrix gives us the predictions
\hat{\mathbf{y}} \overset{\text{def}}{=} \mathbf{X}\boldsymbol{\hat{\beta}} = \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y}
\hat{\mathbf{y}} \overset{\text{def}}{=} \mathbf{X}\boldsymbol{\hat{\beta}} = \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y}
The matrix \mathbf{H} = \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} is special. It is called the hat matrix or projection matrix.
\hat{\mathbf{y}} \overset{\text{def}}{=} \mathbf{X}\boldsymbol{\hat{\beta}} = \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y}
We’ll use these quite often
\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} =(\mathbf{I}_n - \mathbf{H}) \mathbf{y}
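A self-contained numerical check (the data here are made up, not from the slides) that the closed-form \boldsymbol{\hat{\beta}}, \mathbf{H}, and \mathbf{e} agree with what lm() returns:
set.seed(1)
n <- 20
X <- cbind(1, rnorm(n), rnorm(n))              # design matrix with an intercept column
beta_true <- c(1, 2, -3)
y <- as.numeric(X %*% beta_true + rnorm(n))
beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # solve the normal equations
H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix
e <- (diag(n) - H) %*% y                       # residuals
fit <- lm(y ~ X[,2] + X[,3])
all.equal(as.numeric(beta_hat), unname(coef(fit)))   # TRUE
all.equal(as.numeric(e), unname(residuals(fit)))     # TRUE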
R
What to do?
## y x
## 1 -520.2930 -10
## 2 -397.7454 -9
## 3 -298.1390 -8
## 4 -231.1284 -7
## 5 -144.6627 -6
## 6 -156.7459 -5
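The printout above was presumably produced by something along these lines (the true coefficients, noise level, and seed are not shown on the slides, so these are guesses); mlr_model is the fit used on the following slides:
x <- -10:10
y <- -5 * x^2 + rnorm(length(x), sd = 30)      # guessed mean function and noise level
my_df <- data.frame(y = y, x = x)
head(my_df)                                    # matches the two columns shown above
my_df$xsquared <- my_df$x^2                    # the squared term used in the model
mlr_model <- lm(y ~ x + xsquared, data = my_df)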
R
##
## Call:
## lm(formula = y ~ x + xsquared, data = my_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.172 -31.730 0.934 21.017 58.935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1039 10.3340 0.591 0.562
## x -0.1384 1.1356 -0.122 0.904
## xsquared -5.0683 0.2104 -24.091 3.79e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.51 on 18 degrees of freedom
## Multiple R-squared: 0.9699, Adjusted R-squared: 0.9666
## F-statistic: 290.2 on 2 and 18 DF, p-value: 2.017e-14
R
fitted_vals <- predict(mlr_model)
plot(my_df$x, my_df$y)
points(my_df$x, fitted_vals, col = "red", type ="l")
(\mathbf{X}^{\intercal}\mathbf{X}) \boldsymbol{\hat{\beta}} = \mathbf{X}^{\intercal} \mathbf{y} If \mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 & \cdots & \mathbf{x}_k \end{bmatrix} \begin{align*} \mathbf{X}^{\intercal}\mathbf{X} &= \begin{bmatrix} \mathbf{1}^{\intercal} \\ \mathbf{x}_1^{\intercal} \\ \vdots \\ \mathbf{x}_k^{\intercal} \end{bmatrix} \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 & \cdots & \mathbf{x}_k \end{bmatrix} \\ &= \begin{bmatrix} n & \sum_i x_{i1} & \sum_i x_{i2} & \cdots & \sum_i x_{ik} \\ \sum_i x_{i1} & \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} & \cdots & \sum_i x_{i1}x_{ik} \\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \sum_i x_{ik} & \sum_i x_{i1}x_{ik} & \sum_i x_{i2}x_{ik} & \cdots & \sum_i x_{ik}^2 \end{bmatrix} \end{align*}
Rearranging the normal equations
(\mathbf{X}^{\intercal}\mathbf{X}) \boldsymbol{\hat{\beta}} = \mathbf{X}^{\intercal} \mathbf{y} \mathbf{X}^{\intercal} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}\right) = \mathbf{0}
can yield a geometrical interpretation. Choosing \hat{\boldsymbol{\beta}} chooses the error to be orthogonal to the column space of the design matrix.
It’s easier to see if n=3 and k=1:
\begin{bmatrix} \mathbf{1}^{\intercal} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}\right)\\ \mathbf{x}_1^{\intercal}\left( \mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}\right) \end{bmatrix} = \begin{bmatrix} 0\\ 0 \end{bmatrix}
they deliberately do not label the axes
\boldsymbol{\hat{\beta}} = (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y} Taking its expectation is easy; just use linearity and the definition of \mathbf{y}.
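Spelling that out: E\left[ \boldsymbol{\hat{\beta}} \right] = (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} E\left[ \mathbf{y} \right] = (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{X} \boldsymbol{\beta} = \boldsymbol{\beta}, so \boldsymbol{\hat{\beta}} is unbiased.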
Finding the variance is fun. Recall the property that for any fixed matrix \mathbf{A} and random vector \mathbf{W}, V\left[\mathbf{A} \mathbf{W} \right] = \mathbf{A} V\left[ \mathbf{W} \right]\mathbf{A}^{\intercal}
This is analogous to the univariate case: V[cX] = c^2V[X]
\begin{align*} V\left[\boldsymbol{\hat{\beta}}\right] &= V\left[(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y}\right] \\ &= \sigma^2 \left[(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \right]\left[(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \right]^{\intercal} \\ &= \sigma^2 (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^{\intercal} \mathbf{X})^{-1} \\ \end{align*} when V\left[\mathbf{y}\right] = \sigma^2 \mathbf{I}_n
So \boldsymbol{\hat{\beta}} \sim \text{Normal}\left( \boldsymbol{\beta}, \sigma^2(\mathbf{X}^{\intercal} \mathbf{X})^{-1} \right)
We still estimate the error variance with the average squared residual:
\hat{\sigma}^2 = MS_{Res} = \frac{\sum_{i=1}^n (y_i- \hat{y}_i)^2 }{n-k-1}
Note we use n-k -1 instead of n. With simple linear regression k=1.
It’s cleaner to write the sum of squares in terms of an inner product:
\hat{\sigma}^2 = MS_{Res} = \frac{(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\intercal}(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})}{(n-k-1)}
The numerator is the loss function evaluated at \hat{\boldsymbol{\beta}}.
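A quick numerical check (a sketch using the mlr_model fit) that this matches what lm() reports, along with the coefficient covariance matrix \hat{\sigma}^2(\mathbf{X}^{\intercal}\mathbf{X})^{-1} from earlier:
X <- model.matrix(mlr_model)                   # design matrix, intercept included
e <- residuals(mlr_model)
n <- nrow(X); k <- ncol(X) - 1
sigma2_hat <- sum(e^2) / (n - k - 1)
all.equal(sigma2_hat, summary(mlr_model)$sigma^2)            # TRUE
all.equal(vcov(mlr_model), sigma2_hat * solve(t(X) %*% X))   # TRUE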
R
Here’s another way to get \hat{\sigma}^2
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 15 15 0.0149 0.9043
## xsquared 1 576253 576253 580.3548 3.794e-15 ***
## Residuals 18 17873 993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
## [1] 14.75595 576253.11609 992.93250
## [1] 992.9325
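For reference, those three lines of output were presumably produced by something like this (a_tab is just a placeholder name):
a_tab <- anova(mlr_model)
names(a_tab)          # the available columns
a_tab$"Mean Sq"       # all of the mean squares
a_tab$"Mean Sq"[3]    # the residual mean square, i.e. sigma^2 hat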
R
Make your own function so you don’t have to re-write that code ever again!
# define a function: returns the residual mean square (MS_Res, i.e. sigma^2 hat);
# the [3] assumes Residuals is the third row, as it is for this two-predictor model
getSSHat <- function(model_object){
  return(anova(model_object)$'Mean Sq'[3])
}
# use/call the function on a specific model
getSSHat(mlr_model)
## [1] 992.9325
Simulate data from y = 8 - 5x_1 + 12x_2 + \epsilon and look at the misleading scatterplots
x1 <- c(2,3,4,1,5,6,7,8)
x2 <- c(1,2,5,2,6,4,3,4)
y <- 8 - 5*x1 + 12*x2 + .01*rnorm(n = length(x1))
pairs(data.frame(y,x1,x2))
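Even though the pairwise scatterplots look confusing, the multiple regression fit recovers the coefficients (a quick check, not on the slides):
coef(lm(y ~ x1 + x2))   # approximately 8, -5, and 12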
Writing \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} is the same as \mathbf{y} \sim \text{Normal}\left(\mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n \right) is the same as f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi)^{-n/2}\det[\sigma^2 \mathbf{I}_n]^{-1/2}\exp\left[-\frac{1}{2 \sigma^2 }(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right]
is the same as f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi \sigma^2)^{-n/2}\exp\left[-\frac{1}{2 \sigma^2 } \sum_{i=1}^n (y_i - [\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}])^2 \right]
The negative log of f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi)^{-n/2}\det[\sigma^2 \mathbf{I}_n]^{-1/2}\exp\left[-\frac{1}{2 \sigma^2 }(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] is - \log f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \frac{n}{2}\log(2\pi) + \frac{n}{2}\log \sigma^2 + \frac{1}{2 \sigma^2 }(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) When we take derivatives with respect to the coefficients, we get the same thing as our original loss!
The MLE estimator of \sigma^2 has a different denominator, though.
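Specifically, maximizing the likelihood over \sigma^2 gives \hat{\sigma}^2_{MLE} = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\intercal}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n}, which divides by n rather than n-k-1 and is therefore biased downward.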
We want to test all coefficients (except the intercept) at once: H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0
Running too many individual tests inflates the type I error rate!
The omnibus F-test test statistic:
F_0 = \frac{ SS_R / df_R }{ SS_{Res}/df_{Res} } \sim F_{df_R, df_{Res}}
Exactly the same formula, but the degrees of freedom change.
Find F_0…
##
## Call:
## lm(formula = y ~ x + xsquared, data = my_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.172 -31.730 0.934 21.017 58.935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1039 10.3340 0.591 0.562
## x -0.1384 1.1356 -0.122 0.904
## xsquared -5.0683 0.2104 -24.091 3.79e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.51 on 18 degrees of freedom
## Multiple R-squared: 0.9699, Adjusted R-squared: 0.9666
## F-statistic: 290.2 on 2 and 18 DF, p-value: 2.017e-14
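You can also pull F_0 straight out of the fitted object (a quick sketch):
f_info <- summary(mlr_model)$fstatistic   # value, numdf, dendf
f_info                                    # matches the F-statistic line above
pf(f_info[1], f_info[2], f_info[3], lower.tail = FALSE)   # the p-value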
t_0 = \frac{\hat{\beta}_j - \beta_j }{\sqrt{ \hat{\sigma}^2 [(\mathbf{X}^{\intercal}\mathbf{X})^{-1}]_{jj}} } \sim t_{df_{Res}}
summary()
does them all at once…
##
## Call:
## lm(formula = y ~ x + xsquared, data = my_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.172 -31.730 0.934 21.017 58.935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1039 10.3340 0.591 0.562
## x -0.1384 1.1356 -0.122 0.904
## xsquared -5.0683 0.2104 -24.091 3.79e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.51 on 18 degrees of freedom
## Multiple R-squared: 0.9699, Adjusted R-squared: 0.9666
## F-statistic: 290.2 on 2 and 18 DF, p-value: 2.017e-14
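The t statistics in the Coefficients table can be reproduced by hand with the formula above (a sketch, testing H_0: \beta_j = 0):
X <- model.matrix(mlr_model)
se <- sqrt(summary(mlr_model)$sigma^2 * diag(solve(t(X) %*% X)))
coef(mlr_model) / se    # matches the "t value" column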
Before we talk about hypothesis tests for chunks of coefficients, let’s play around with the sums of squares…
Want to show:
SS_T = SS_R + SS_{Res}
first we should check (each fact is verified numerically after this list)
\mathbf{H} is symmetric (i.e. \mathbf{H} = \mathbf{H}^{\intercal})
\mathbf{H} is idempotent (i.e. \mathbf{H}^2 = \mathbf{H})
\mathbf{I} - \mathbf{H} is symmetric and idempotent
\frac{1}{n}\mathbf{1}\mathbf{1}^{\intercal} is symmetric and idempotent
\mathbf{H} \mathbf{X} = \mathbf{X}
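A quick numerical check of these facts, using the design matrix from mlr_model:
X <- model.matrix(mlr_model)
n <- nrow(X)
H <- X %*% solve(t(X) %*% X) %*% t(X)
J <- matrix(1, n, n) / n                                  # (1/n) 1 1^T
all.equal(H, t(H))                                        # symmetric
all.equal(H %*% H, H)                                     # idempotent
all.equal((diag(n) - H) %*% (diag(n) - H), diag(n) - H)   # I - H idempotent
all.equal(J %*% J, J)                                     # (1/n) 1 1^T idempotent
all.equal(H %*% X, X)                                     # H X = X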
Before we talk about hypothesis tests for chunks of coefficients, let’s play around with the sums of squares…
Check
SS_T = (\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )^{\intercal}(\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} ) = \mathbf{y}^{\intercal}\left[\mathbf{I} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal} \right]\mathbf{y}
SS_{Res} = (\mathbf{y} - \mathbf{H}\mathbf{y} )^{\intercal}(\mathbf{y} - \mathbf{H}\mathbf{y} ) = \mathbf{y}^{\intercal}\left[\mathbf{I} - \mathbf{H} \right]\mathbf{y}
SS_{R} = (\mathbf{H}\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )^{\intercal}(\mathbf{H}\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} ) = \mathbf{y}^{\intercal}\left[\mathbf{H} - \frac{1}{n}\mathbf{1}\mathbf{1}^{\intercal} \right]\mathbf{y}
\mathbf{H}\mathbf{X} = \mathbf{X} \implies \mathbf{H}\mathbf{1} = \mathbf{1}
Finally \begin{align*} SS_T &= (\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )^{\intercal}(\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} ) \\ &= [(\mathbf{y} - \mathbf{H}\mathbf{y}) + (\mathbf{H}\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )]^{\intercal} [(\mathbf{y} - \mathbf{H}\mathbf{y}) + (\mathbf{H}\mathbf{y} - \frac{1}{n} \mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )] \\ &= SS_R + SS_{Res} + 2(\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal}(\mathbf{H}\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )\\ &= SS_R + SS_{Res} + 2(\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal}\mathbf{H}\mathbf{y} - \frac{2}{n}(\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal} \mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} \\ &= SS_R + SS_{Res} \end{align*} because of the normal equations again (\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal}\mathbf{H}\mathbf{y} = \underbrace{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\intercal}\mathbf{X}}_{0}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y} and
\underbrace{(\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal}\mathbf{1}}_{0} \mathbf{1}^{\intercal}\mathbf{y} = 0, since \mathbf{H}\mathbf{1} = \mathbf{1}.
R
Find SS_R, SS_T, SS_{Res}:
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 15 15 0.0149 0.9043
## xsquared 1 576253 576253 580.3548 3.794e-15 ***
## Residuals 18 17873 993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
be careful: the table gives sequential (Type I) sums of squares, so SS_R is the sum of the x and xsquared rows, and SS_T is the sum of all three rows…
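In code (reusing the anova() table; a_tab is just a placeholder name):
a_tab <- anova(mlr_model)
SS_Res <- a_tab$"Sum Sq"[3]
SS_R   <- sum(a_tab$"Sum Sq"[1:2])
SS_T   <- SS_R + SS_Res
c(SS_R = SS_R, SS_T = SS_T, SS_Res = SS_Res)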
First, rewrite
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
as
\mathbf{y} = \begin{bmatrix} \mathbf{X}_{1} & \mathbf{X}_{2} \end{bmatrix} \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} + \boldsymbol{\epsilon} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\epsilon}
We want to test H_0: \boldsymbol{\beta}_2 = \mathbf{0}
For the full model \mathbf{y} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\epsilon}
SS_{R}(\boldsymbol{\beta}) = \mathbf{y}^{\intercal}\left(\mathbf{H} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\right)\mathbf{y}
For the reduced model \mathbf{y} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \boldsymbol{\epsilon}
SS_{R}(\boldsymbol{\beta}_1) = \mathbf{y}^{\intercal}\left(\mathbf{H}_1 - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\right)\mathbf{y}
where \mathbf{H}_1 = \mathbf{X}_1 (\mathbf{X}_1^{\intercal}\mathbf{X}_1)^{-1}\mathbf{X}_1^{\intercal}
So our test statistic is F_0 = \frac{(SS_{R}(\boldsymbol{\beta})-SS_{R}(\boldsymbol{\beta}_1))/r }{MS_{Res}} where r is the number of elements of \boldsymbol{\beta}_2, and MS_{Res} is from the full model.
R
lil_mod <- lm(y ~ x, data = my_df)
big_mod <- lm(y ~ x + xsquared, data=my_df)
anova(lil_mod, big_mod) # smallest to largest
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ x + xsquared
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 594126
## 2 18 17873 1 576253 580.35 3.794e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regular R^2 was R^2 = 1 - \frac{SS_{Res}}{SS_T}
Adjusted R^2 has a built-in penalty
R^2_{\text{adj}} = 1 - \frac{SS_{Res}/(n-k-1)}{SS_T/(n-1)}
summary()
spits this out
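Reproducing both by hand (a sketch; n = 21 and k = 2 for this model):
a_tab <- anova(mlr_model)
SS_Res <- a_tab$"Sum Sq"[3]
SS_T <- sum(a_tab$"Sum Sq")
n <- nrow(my_df); k <- 2
1 - SS_Res / SS_T                                  # multiple R-squared
1 - (SS_Res / (n - k - 1)) / (SS_T / (n - 1))      # adjusted R-squared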
Need the distribution of \hat{y}_0 = \mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}}…
Unbiasedness E\left[ \mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}} \right] = \mathbf{x}_0^{\intercal}E\left[ \hat{\boldsymbol{\beta}} \right] = \mathbf{x}_0^{\intercal}\boldsymbol{\beta}
Variance: V[\hat{y}_0] = V[\mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}}] =\sigma^2 \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0
Prediction error variance: V[y_0 - \hat{y}_0] = V[y_0] + V[\hat{y}_0] = \sigma^2\left(1 + \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 \right), since the new observation's error is independent of \hat{\boldsymbol{\beta}}
So \hat{y}_0 = \mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}} \sim \text{Normal}\left(\mathbf{x}_0^{\intercal}\boldsymbol{\beta}, \sigma^2 \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 \right)
Confidence interval for the mean response:
\mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,df_{Res}}\sqrt{ \hat{\sigma}^2 \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 }
Prediction interval:
\mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,df_{Res}}\sqrt{ \hat{\sigma}^2\left(1 + \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 \right) }
Same code as before
x0 <- data.frame(x = 30, xsquared = 20) # columns must be named to match the regressors!
predict(mlr_model, newdata = x0, interval = "confidence")
## fit lwr upr
## 1 -99.416 -172.8025 -26.0295
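The second set of numbers below presumably comes from the same call with interval = "prediction":
predict(mlr_model, newdata = x0, interval = "prediction")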
## fit lwr upr
## 1 -99.416 -198.2505 -0.5815068