Taylor R. Brown, PhD
More predictors!
\[ y = \beta_0 + \beta_1 x_1 + \beta_2x_2 + \cdots + \beta_k x_k + \epsilon \]
tip: say “multiple regression,” not “multivariate” (multivariate regression refers to more than one response variable)
\[ E[y] = \beta_0 + \beta_1 x_1 + \beta_2x_2 + \cdots + \beta_k x_k \]
\(\frac{\partial E[y]}{\partial x_3} = \beta_3\)
so \(\beta_3\) is the change in the mean response for a one-unit change in \(x_3\),
holding all other regressors fixed,
and disregarding the noise
These are very flexible…
Special case example: polynomial regression
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon \]
Special case example: interaction effects \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2 + \epsilon \] with \((\beta_0, \beta_1, \beta_2, \beta_3) = (-2, 3, -.2, -4)\)
Special case example: second-order model with interactions
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5x_1x_2 + \epsilon \] with \((\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5) = (800, 10, 7, 8.5, -5, 4)\)
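All of these special cases are still fit with lm(); here is a minimal sketch of the formula syntax (the simulated data frame d, its column names, and the way y is generated are placeholders, not the course’s data):

# simulate a small placeholder data set
set.seed(1)
d <- data.frame(x = runif(50, -2, 2), x1 = runif(50), x2 = runif(50))
d$y <- -2 + 3*d$x1 - .2*d$x2 - 4*d$x1*d$x2 + rnorm(50)   # the interaction example's coefficients
# polynomial regression
poly_mod <- lm(y ~ x + I(x^2) + I(x^3), data = d)
# interaction effects (x1*x2 is shorthand for x1 + x2 + x1:x2)
inter_mod <- lm(y ~ x1 + x2 + x1:x2, data = d)
# second-order model with interactions
second_mod <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = d)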
To estimate the coefficients, we again look at the loss \[ S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^n(y_i - [\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}])^2 \] We take derivatives with respect to each beta, set them all equal to \(0\), and then solve the resulting system of equations…
It’s cleaner if we use matrices
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
\[ \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \hspace{10mm} \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k}\\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^{\intercal} \\ \mathbf{x}_2^{\intercal}\\ \vdots \\ \mathbf{x}_n^{\intercal} \end{bmatrix} \]
\[ \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} \hspace{10mm} \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2\\ \vdots\\ \epsilon_{n} \end{bmatrix} \]
rewriting the loss
\[\begin{align*} S(\beta_0, \beta_1, \ldots, \beta_k) &= S(\boldsymbol{\beta})\\ &= \sum_{i=1}^n(y_i - [\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}])^2 \\ &= \sum_{i=1}^n(y_i - \mathbf{x}_i^{\intercal}\boldsymbol{\beta} )^2 \\ &= (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \end{align*}\]
Warning: the notation gets overloaded for either \(\boldsymbol{\beta}\) or \(\boldsymbol{\epsilon}\). The book chooses the second option; however, we always use \(\boldsymbol{\epsilon}\) to refer to the true, unobserved random errors, not the residuals. Instead, when we write the loss, we do not think of \(\boldsymbol{\beta}\) as the true coefficient vector; we think of it as an arbitrary input vector for this score function.
A few more moves
\[\begin{align*} S(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &= (\mathbf{y}^{\intercal} - [\mathbf{X} \boldsymbol{\beta}]^{\intercal})(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &=(\mathbf{y}^{\intercal} - \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal} )(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &= \mathbf{y}^{\intercal}\mathbf{y} - \mathbf{y}^{\intercal}\mathbf{X} \boldsymbol{\beta} \\ &\hspace{10mm} - \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal} \mathbf{y} + \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal}\mathbf{X} \boldsymbol{\beta} \\ &= \mathbf{y}^{\intercal}\mathbf{y} - 2\mathbf{y}^{\intercal}\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal}\mathbf{X} \boldsymbol{\beta} \end{align*}\]
taking the derivatives of the loss \(S(\boldsymbol{\beta}) = \mathbf{y}^{\intercal}\mathbf{y} - 2\mathbf{y}^{\intercal}\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\beta}^{\intercal} \mathbf{X}^{\intercal}\mathbf{X} \boldsymbol{\beta}\)
\[ \frac{\partial S}{\partial\boldsymbol{\beta} } = -2\mathbf{X}^{\intercal} \mathbf{y} + 2\mathbf{X}^{\intercal}\mathbf{X}\boldsymbol{\beta} \overset{\text{set}}{=} 0 \] rearranging yields the normal equations \[ (\mathbf{X}^{\intercal}\mathbf{X})\boldsymbol{\hat{\beta}} = \mathbf{X}^{\intercal} \mathbf{y} \] and solving for the coefficient vector finally gives us \[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y} \] multiplying both sides by the design matrix gives us the predictions \[ \hat{\mathbf{y}} \overset{\text{def}}{=} \mathbf{X}\boldsymbol{\hat{\beta}} = \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y} \]
The matrix \[ \mathbf{H} = \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \] is special. It is called the hat matrix or projection matrix.
\[ \hat{\mathbf{y}} = \mathbf{H} \mathbf{y} \]
We’ll use these quite often
\[ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} =(\mathbf{I}_n - \mathbf{H}) \mathbf{y} \]
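A quick numerical sketch of these matrix formulas on made-up data (the toy design and coefficients are arbitrary; lm() is only used as a cross-check):

set.seed(2)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))                  # design matrix with an intercept column
y <- drop(X %*% c(1, 2, -3)) + rnorm(n)            # arbitrary true coefficients plus noise
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves the normal equations (X'X) b = X'y
H <- X %*% solve(crossprod(X)) %*% t(X)            # hat/projection matrix
y_hat <- H %*% y                                   # predictions
e <- (diag(n) - H) %*% y                           # residuals
cbind(beta_hat, coef(lm(y ~ X[, 2] + X[, 3])))     # the two columns agree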
What to do?
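The data frame printed below isn’t constructed on the slide; here is a sketch of code that produces data like it and fits the model used in the rest of these examples (the true coefficients, noise level, and seed are assumptions):

set.seed(3)
x <- -10:10                                       # 21 observations
xsquared <- x^2                                   # kept outside my_df, so head(my_df) shows only y and x
my_df <- data.frame(y = -5*x^2 + rnorm(length(x), sd = 30), x = x)
head(my_df)
mlr_model <- lm(y ~ x + xsquared, data = my_df)   # lm() finds xsquared in the workspace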
## y x
## 1 -520.2930 -10
## 2 -397.7454 -9
## 3 -298.1390 -8
## 4 -231.1284 -7
## 5 -144.6627 -6
## 6 -156.7459 -5
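The summary printout below is presumably produced by:

summary(mlr_model)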
##
## Call:
## lm(formula = y ~ x + xsquared, data = my_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.172 -31.730 0.934 21.017 58.935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1039 10.3340 0.591 0.562
## x -0.1384 1.1356 -0.122 0.904
## xsquared -5.0683 0.2104 -24.091 3.79e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.51 on 18 degrees of freedom
## Multiple R-squared: 0.9699, Adjusted R-squared: 0.9666
## F-statistic: 290.2 on 2 and 18 DF, p-value: 2.017e-14
\[ (\mathbf{X}^{\intercal}\mathbf{X}) \boldsymbol{\hat{\beta}} = \mathbf{X}^{\intercal} \mathbf{y} \] If \(\mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 & \cdots & \mathbf{x}_k \end{bmatrix}\) \[\begin{align*} \mathbf{X}^{\intercal}\mathbf{X} &= \begin{bmatrix} \mathbf{1}^{\intercal} \\ \mathbf{x}_1^{\intercal} \\ \vdots \\ \mathbf{x}_k^{\intercal} \end{bmatrix} \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 & \cdots & \mathbf{x}_k \end{bmatrix} \\ &= \begin{bmatrix} n & \sum_i x_{i1} & \sum_i x_{i2} & \cdots & \sum_i x_{ik} \\ \sum_i x_{i1} & \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} & \cdots & \sum_i x_{i1}x_{ik} \\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \sum_i x_{ik} & \sum_i x_{i1}x_{ik} & \sum_i x_{i2}x_{ik} & \cdots & \sum_i x_{ik}^2 \end{bmatrix} \end{align*}\]
Rearranging the normal equations
\[ (\mathbf{X}^{\intercal}\mathbf{X}) \boldsymbol{\hat{\beta}} = \mathbf{X}^{\intercal} \mathbf{y} \] \[ \mathbf{X}^{\intercal} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}\right) = \mathbf{0} \]
can yield a geometrical interpretation. Choosing \(\hat{\boldsymbol{\beta}}\) chooses the error to be orthogonal to the column space of the design matrix.
It’s easier to see if \(n=3\) and \(k=1\):
\[ \begin{bmatrix} \mathbf{1}^{\intercal} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}\right)\\ \mathbf{x}_1^{\intercal}\left( \mathbf{y} - \mathbf{X} \boldsymbol{\hat{\beta}}\right) \end{bmatrix} = \begin{bmatrix} 0\\ 0 \end{bmatrix} \]
they deliberately do not label the axes
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y} \] Taking its expectation is easy; just use linearity and the definition of \(\mathbf{y}\).
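Spelling that out, using \(E[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}\):
\[\begin{align*} E\left[\boldsymbol{\hat{\beta}}\right] &= (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} E\left[\mathbf{y}\right] \\ &= (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{X}\boldsymbol{\beta} \\ &= \boldsymbol{\beta}, \end{align*}\]
so \(\boldsymbol{\hat{\beta}}\) is unbiased.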
Finding the variance is fun. Recall the property that for any fixed matrix \(\mathbf{A}\) and random vector \(\mathbf{W}\) \[ V\left[\mathbf{A} \mathbf{W} \right] = \mathbf{A} V\left[ \mathbf{W} \right]\mathbf{A}^{\intercal} \]
This is analogous to the univariate case: \(V[cX] = c^2V[X]\)
\[\begin{align*} V\left[\boldsymbol{\hat{\beta}}\right] &= V\left[(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y}\right] \\ &= \sigma^2 \left[(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \right]\left[(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \right]^{\intercal} \\ &= \sigma^2 (\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^{\intercal} \mathbf{X})^{-1} \\ \end{align*}\] when \(V\left[\mathbf{y}\right] = \sigma^2 \mathbf{I}_n\)
So \[ \boldsymbol{\hat{\beta}} \sim \text{Normal}\left( \boldsymbol{\beta}, \sigma^2(\mathbf{X}^{\intercal} \mathbf{X})^{-1} \right) \]
We still estimate the error variance with the average squared residual:
\[ \hat{\sigma}^2 = MS_{Res} = \frac{\sum_{i=1}^n (y_i- \hat{y}_i)^2 }{n-k-1} \]
Note we use \(n-k-1\) instead of \(n\). With simple linear regression, \(k=1\).
It’s cleaner to write the sum of squares in terms of an inner product:
\[ \hat{\sigma}^2 = MS_{Res} = \frac{(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\intercal}(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})}{(n-k-1)} \]
The numerator is the loss function evaluated at \(\hat{\boldsymbol{\beta}}\).
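A sketch of computing \(\hat{\sigma}^2\) directly from the residuals of mlr_model, and checking it against the built-in helpers sigma() and vcov():

e <- resid(mlr_model)
n <- nrow(my_df); k <- 2
sigma2_hat <- sum(e^2) / (n - k - 1)          # MS_Res, about 992.9
c(sigma2_hat, sigma(mlr_model)^2)             # sigma() returns sqrt(MS_Res), i.e. 31.51
X <- model.matrix(mlr_model)
all.equal(sigma2_hat * solve(crossprod(X)), vcov(mlr_model))   # the variance formula from earlier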
Here’s another way to get \(\hat{\sigma}^2\)
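The printouts below are consistent with calls like these (the exact code isn’t shown on the slide, so this is an assumption):

anova(mlr_model)                   # the full ANOVA table
names(anova(mlr_model))            # the table's column names
anova(mlr_model)$'Mean Sq'         # the Mean Sq column
anova(mlr_model)$'Mean Sq'[3]      # the Residuals row: sigma^2 hat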
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 15 15 0.0149 0.9043
## xsquared 1 576253 576253 580.3548 3.794e-15 ***
## Residuals 18 17873 993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
## [1] 14.75595 576253.11609 992.93250
## [1] 992.9325
Make your own function so you don’t have to re-write that code ever again!
# define a function that pulls out the residual mean square
# (i.e. the sigma^2 hat estimate) from a fitted model
getSSHat <- function(model_object){
  aov_tab <- anova(model_object)
  # the Residuals row is always the last row, so index by nrow()
  # rather than hard-coding 3 (which only works with two predictors)
  return(aov_tab$'Mean Sq'[nrow(aov_tab)])
}
# use/call the function on a specific model
getSSHat(mlr_model)
## [1] 992.9325
Simulate data from \[ y = 8 - 5x_1 + 12x_2 + \epsilon \] and look at the misleading scatterplots
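No data set is shown for this, so here is a sketch; the strong correlation between \(x_1\) and \(x_2\) is an assumption, but it is exactly what makes the one-predictor-at-a-time scatterplots misleading:

set.seed(4)
n <- 200
x1 <- rnorm(n)
x2 <- -0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)    # x1 and x2 strongly negatively correlated
y <- 8 - 5*x1 + 12*x2 + rnorm(n)
par(mfrow = c(1, 2))
plot(x1, y)                                     # the marginal slope is nowhere near -5
plot(x2, y)
coef(lm(y ~ x1 + x2))                           # the multiple regression still recovers (8, -5, 12)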
Assuming \(\boldsymbol{\epsilon} \sim \text{Normal}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\), writing \[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \] is the same as \[ \mathbf{y} \sim \text{Normal}\left(\mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n \right) \] is the same as \[ f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi)^{-n/2}\det[\sigma^2 \mathbf{I}_n]^{-1/2}\exp\left[-\frac{1}{2 \sigma^2 }(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \]
is the same as \[ f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi \sigma^2)^{-n/2}\exp\left[-\frac{1}{2 \sigma^2 } \sum_{i=1}^n (y_i - [\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}])^2 \right] \]
The negative log of \[ f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi)^{-n/2}\det[\sigma^2 \mathbf{I}_n]^{-1/2}\exp\left[-\frac{1}{2 \sigma^2 }(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right] \] is \[ - \log f(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \frac{n}{2}\log(2\pi) + \frac{n}{2}\log \sigma^2 + \frac{1}{2 \sigma^2 }(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\intercal}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \] When we take derivatives with respect to the coefficients, we get the same thing as our original loss!
The MLE estimator of \(\sigma^2\) has a different denominator, though.
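Specifically, maximizing the likelihood over \(\sigma^2\) gives \[ \hat{\sigma}^2_{\text{MLE}} = \frac{(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\intercal}(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})}{n}, \] which divides by \(n\) rather than \(n-k-1\).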
We want to test all coefficients (except the intercept) at once: \[ H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0 \]
Too many individual tests inflate the Type I error rate!
The omnibus F-test statistic:
\[ F_0 = \frac{ SS_R / df_R }{ SS_{Res}/df_{Res} } \sim F_{df_R, df_{Res}} \hspace{5mm} \text{under } H_0 \]
Exactly the same formula as in simple linear regression, but the degrees of freedom change.
Find \(F_0\)…
##
## Call:
## lm(formula = y ~ x + xsquared, data = my_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.172 -31.730 0.934 21.017 58.935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1039 10.3340 0.591 0.562
## x -0.1384 1.1356 -0.122 0.904
## xsquared -5.0683 0.2104 -24.091 3.79e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.51 on 18 degrees of freedom
## Multiple R-squared: 0.9699, Adjusted R-squared: 0.9666
## F-statistic: 290.2 on 2 and 18 DF, p-value: 2.017e-14
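A sketch of reconstructing \(F_0\) from the ANOVA table for mlr_model:

aov_tab <- anova(mlr_model)
SS_R <- sum(aov_tab$'Sum Sq'[1:2])     # the x and xsquared rows together
MS_Res <- aov_tab$'Mean Sq'[3]
(SS_R / 2) / MS_Res                    # about 290.2, matching the F-statistic above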
\[ t_0 = \frac{\hat{\beta}_j - \beta_j }{\sqrt{ \hat{\sigma}^2 [(\mathbf{X}^{\intercal}\mathbf{X})^{-1}]_{jj}} } \sim t_{df_{Res}} \]
summary() does them all at once…
##
## Call:
## lm(formula = y ~ x + xsquared, data = my_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.172 -31.730 0.934 21.017 58.935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1039 10.3340 0.591 0.562
## x -0.1384 1.1356 -0.122 0.904
## xsquared -5.0683 0.2104 -24.091 3.79e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.51 on 18 degrees of freedom
## Multiple R-squared: 0.9699, Adjusted R-squared: 0.9666
## F-statistic: 290.2 on 2 and 18 DF, p-value: 2.017e-14
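A sketch of reconstructing the t values and p-values by hand:

se_hat <- sqrt(diag(vcov(mlr_model)))              # sqrt of sigma^2 hat * [(X'X)^{-1}]_{jj}
t0 <- coef(mlr_model) / se_hat                     # testing each beta_j = 0
cbind(t0, 2 * pt(-abs(t0), df = df.residual(mlr_model)))   # matches the t value and Pr(>|t|) columns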
Before we talk about hypothesis tests for chunks of coefficients, let’s play around with the sums of squares…
Want to show:
\[ SS_T = SS_R + SS_{Res} \]
first we should check a few facts (numerically verified in the sketch after this list)
\(\mathbf{H}\) is symmetric (i.e. \(\mathbf{H} = \mathbf{H}^{\intercal}\))
\(\mathbf{H}\) is idempotent (i.e. \(\mathbf{H}^2 = \mathbf{H}\))
\(\mathbf{I} - \mathbf{H}\) is symmetric and idempotent
\(\frac{1}{n}\mathbf{1}\mathbf{1}^{\intercal}\) is symmetric and idempotent
\(\mathbf{H} \mathbf{X} = \mathbf{X}\)
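A quick numerical check of these facts using mlr_model (each all.equal() call should return TRUE, up to floating-point error):

X <- model.matrix(mlr_model)
H <- X %*% solve(crossprod(X)) %*% t(X)
J <- matrix(1, nrow(X), nrow(X)) / nrow(X)            # (1/n) 1 1'
all.equal(H, t(H))                                    # H is symmetric
all.equal(H %*% H, H)                                 # H is idempotent
all.equal(diag(nrow(X)) - H, t(diag(nrow(X)) - H))    # I - H is symmetric
all.equal(J %*% J, J)                                 # (1/n) 1 1' is idempotent
all.equal(H %*% X, X, check.attributes = FALSE)       # H X = X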
Next, check that each sum of squares can be written as a quadratic form in \(\mathbf{y}\):
\[ SS_T = (\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )^{\intercal}(\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} ) = \mathbf{y}^{\intercal}\left[\mathbf{I} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal} \right]\mathbf{y} \]
\[ SS_{Res} = (\mathbf{y} - \mathbf{H}\mathbf{y} )^{\intercal}(\mathbf{y} - \mathbf{H}\mathbf{y} ) = \mathbf{y}^{\intercal}\left[\mathbf{I} - \mathbf{H} \right]\mathbf{y} \]
\[ SS_{R} = (\mathbf{H}\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )^{\intercal}(\mathbf{H}\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} ) = \mathbf{y}^{\intercal}\left[\mathbf{H} - \frac{1}{n}\mathbf{1}\mathbf{1}^{\intercal} \right]\mathbf{y} \]
\(\mathbf{H}\mathbf{X} = \mathbf{X} \implies \mathbf{H}\mathbf{1} = \mathbf{1}\)
Finally \[\begin{align*} SS_T &= (\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )^{\intercal}(\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} ) \\ &= [(\mathbf{y} - \mathbf{H}\mathbf{y}) + (\mathbf{H}\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )]^{\intercal} [(\mathbf{y} - \mathbf{H}\mathbf{y}) + (\mathbf{H}\mathbf{y} - \frac{1}{n} \mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )] \\ &= SS_R + SS_{Res} + 2(\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal}(\mathbf{H}\mathbf{y} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} )\\ &= SS_R + SS_{Res} + 2(\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal}\mathbf{H}\mathbf{y} - \frac{2}{n}(\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal} \mathbf{1} \mathbf{1}^{\intercal}\mathbf{y} \\ &= SS_R + SS_{Res} \end{align*}\] because of the normal equations again \[ (\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal}\mathbf{H}\mathbf{y} = \underbrace{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\intercal}\mathbf{X}}_{0}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \mathbf{y} \] and
\[ \underbrace{(\mathbf{y} - \mathbf{H}\mathbf{y})^{\intercal}\mathbf{1}}_{0} \mathbf{1}^{\intercal}\mathbf{y} \]
Find \(SS_R, SS_T, SS_{Res}\):
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 15 15 0.0149 0.9043
## xsquared 1 576253 576253 580.3548 3.794e-15 ***
## Residuals 18 17873 993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
be careful…
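One thing to watch: anova() reports \(SS_R\) split sequentially across the predictor rows, so those rows have to be added together. A sketch of recovering all three quantities:

aov_tab <- anova(mlr_model)
SS_R <- sum(aov_tab$'Sum Sq'[1:2])              # 15 + 576253
SS_Res <- aov_tab$'Sum Sq'[3]                   # 17873
SS_T <- sum((my_df$y - mean(my_df$y))^2)
all.equal(SS_T, SS_R + SS_Res)                  # TRUE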
First, rewrite
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
as
\[ \mathbf{y} = \begin{bmatrix} \mathbf{X}_{1} & \mathbf{X}_{2} \end{bmatrix} \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} + \boldsymbol{\epsilon} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\epsilon} \]
We want to test \(H_0: \boldsymbol{\beta}_2 = \mathbf{0}\)
For the full model \(\mathbf{y} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\epsilon}\)
\[ SS_{R}(\boldsymbol{\beta}) = \mathbf{y}^{\intercal}\left(\mathbf{H} - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\right)\mathbf{y} \]
For the reduced model \(\mathbf{y} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \boldsymbol{\epsilon}\)
\[ SS_{R}(\boldsymbol{\beta}_1) = \mathbf{y}^{\intercal}\left(\mathbf{H}_1 - \frac{1}{n}\mathbf{1} \mathbf{1}^{\intercal}\right)\mathbf{y} \]
where \(\mathbf{H}_1 = \mathbf{X}_1 (\mathbf{X}_1^{\intercal}\mathbf{X}_1)^{-1}\mathbf{X}_1^{\intercal}\)
So our test statistic is \[ F_0 = \frac{(SS_{R}(\boldsymbol{\beta})-SS_{R}(\boldsymbol{\beta}_1))/r }{MS_{Res}} \] where \(r\) is the number of elements of \(\boldsymbol{\beta}_2\), and \(MS_{Res}\) is from the full model.
lil_mod <- lm(y ~ x, data = my_df)
big_mod <- lm(y ~ x + xsquared, data=my_df)
anova(lil_mod, big_mod) # smallest to largest
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ x + xsquared
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 594126
## 2 18 17873 1 576253 580.35 3.794e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regular \(R^2\) was \[ R^2 = 1 - \frac{SS_{Res}}{SS_T} \]
Adjusted \(R^2\) has a built-in penalty
\[ R^2_{\text{adj}} = 1 - \frac{SS_{Res}/(n-k-1)}{SS_T/(n-1)} \]
summary() spits this out
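A sketch of computing both versions by hand for mlr_model:

SS_Res <- sum(resid(mlr_model)^2)
SS_T <- sum((my_df$y - mean(my_df$y))^2)
n <- nrow(my_df); k <- 2
1 - SS_Res / SS_T                               # multiple R-squared, about 0.9699
1 - (SS_Res / (n - k - 1)) / (SS_T / (n - 1))   # adjusted R-squared, about 0.9666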
Need the distribution of \(\hat{y}_0 = \mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}}\)…
Unbiasedness \[ E\left[ \mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}} \right] = \mathbf{x}_0^{\intercal}E\left[ \hat{\boldsymbol{\beta}} \right] = \mathbf{x}_0^{\intercal}\boldsymbol{\beta} \]
Variance: \[ V[\hat{y}_0] = V[\mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}}] =\sigma^2 \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 \]
Prediction error variance: \[ V[y_0 - \hat{y}_0] = V[y_0] + V[\hat{y}_0] = \sigma^2\left(1 + \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 \right) \] because the new observation’s error is independent of the data used to fit \(\hat{\boldsymbol{\beta}}\)
So \[ \hat{y}_0 = \mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}} \sim \text{Normal}\left(\mathbf{x}_0^{\intercal}\boldsymbol{\beta}, \sigma^2 \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 \right) \]
Confidence interval for the mean response:
\[ \mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,df_{Res}}\sqrt{ \hat{\sigma}^2 \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 } \]
Prediction interval:
\[ \mathbf{x}_0^{\intercal}\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,df_{Res}}\sqrt{ \hat{\sigma}^2\left(1 + \mathbf{x}_0^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{x}_0 \right) } \]
Same code as before
x0 <- data.frame(x = 30, xsquared = 20) #named!
predict(mlr_model, newdata = x0, interval = "confidence")
## fit lwr upr
## 1 -99.416 -172.8025 -26.0295
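The wider interval printed below is presumably the prediction interval, from the same call with interval = "prediction":

predict(mlr_model, newdata = x0, interval = "prediction")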
## fit lwr upr
## 1 -99.416 -198.2505 -0.5815068