Taylor R. Brown, PhD
How does regression work if we have this?
## 'data.frame': 150 obs. of 3 variables:
## $ y : num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ x1 : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
factor
is not string
!
data.frame
s are not matrix
s [sic]## (Intercept) x1 Speciesversicolor Speciesvirginica
## 145 1 2.5 0 1
## 146 1 2.3 0 1
## 147 1 1.9 0 1
## 148 1 2.0 0 1
## 149 1 2.3 0 1
## 150 1 1.8 0 1
A factor
column looks like strings, but they get converted to indicator or dummy variable columns.
If the factor
column has \(L\) levels, then we need \(L-1\) new columns
In our case
\[ x_{i2} = \begin{cases} 1 & \text{Species}_i = \text{versicolor} \\ 0 & \text{else} \end{cases} \]
\[ x_{i3} = \begin{cases} 1 & \text{Species}_i = \text{virginica} \\ 0 & \text{else} \end{cases} \]
lm()
usally handles this for us automatically
## 'data.frame': 150 obs. of 3 variables:
## $ y : num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ x1 : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## (Intercept) x1 Speciesversicolor Speciesvirginica
## 145 1 2.5 0 1
## 146 1 2.3 0 1
## 147 1 1.9 0 1
## 148 1 2.0 0 1
## 149 1 2.3 0 1
## 150 1 1.8 0 1
Say we have the following \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \] where \(x_2\) is a dummy and \(x_1\) is a “regular” quantitative variable
We can rewrite this as \[ y = \begin{cases} \beta_0 + \beta_1 x_1 + \epsilon & x_2 = 0 \\ \beta_0 + \beta_1 x_1 + \beta_2 + \epsilon & x_2 = 1 \end{cases} \] This represents a shift of the intercept
The model is still \(\mathbf{y} =\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \[ \mathbf{X} = \begin{bmatrix} 1 & 23 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 12.3 & 0 \\ 1 & -68 & 1 \\ \vdots & \vdots & \vdots \\ 1 & x_{11} & 1 \end{bmatrix} \]
Assume again \(x_1\) is a quantitative variable and \(x_2\) is a dummy variable.
This \[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_1 x_2 + \epsilon \] can be rewritten as
\[ y = \begin{cases} \beta_0 + \beta_1x_1 + \epsilon & x_2 = 0\\ \beta_0 + \beta_1x_1 + \beta_2 + \beta_3 x_1 + \epsilon & x_2 = 1 \end{cases} \] or \[ y = \begin{cases} \beta_0 + \beta_1x_1 + \epsilon & x_2 = 0\\ (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \epsilon & x_2 = 1 \end{cases} \]
Interpretations?
What to do
\(y = \beta_0 + \beta_1x_1 + \epsilon\)
\(y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
##
## Call:
## lm(formula = y ~ x1 + Species, data = Xdf)
##
## Coefficients:
## (Intercept) x1 Speciesversicolor
## 1.211 1.019 1.698
## Speciesvirginica
## 2.277
\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \epsilon\)
plot(Xdf[,2], Xdf$y, col = Xdf$Species, xlab = "x1", ylab = "y")
fitted_vals <- predict(lm(y ~ x1 + Species + x1:Species , data = Xdf))# : means interaction
setosa_indices <- Xdf$Species == "setosa"
versicolor_indices <- Xdf$Species == "versicolor"
virginia_indices <- Xdf$Species == "virginica"
points(Xdf[setosa_indices,2], fitted_vals[setosa_indices], type = "l")
points(Xdf[versicolor_indices,2], fitted_vals[versicolor_indices], type = "l")
points(Xdf[virginia_indices,2], fitted_vals[virginia_indices], type = "l")
We can use our hypothesis testing chops to test \(H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0\)
lil_mod <- lm(y ~ x1 , data = Xdf)
big_mod <- lm(y ~ x1 + Species + x1:Species , data = Xdf)
anova(lil_mod, big_mod)
## Analysis of Variance Table
##
## Model 1: y ~ x1
## Model 2: y ~ x1 + Species + x1:Species
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 33.845
## 2 144 18.816 4 15.029 28.755 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we didn’t reject that null, we could go on and test \(H_0: \beta_4 = \beta_5 = 0\) (just intercept shifts) or \(H_0: \beta_2 = \beta_3 = 0\) (just slope shifts)
Analysis of variance models are examples of regression models, usually written in a slightly different way, and are useful for analyzing data from designed experiments.
The one-way fixed effects ANOVA model is
\[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij} \]
Is there a difference between the groups?
The main interest is in testing the hypothesis: \(H_0 : \tau_1 = \cdots = \tau_k = 0\)
Or in other words, is there anything going on?
Visually:
## 'data.frame': 72 obs. of 2 variables:
## $ count: num 10 7 20 14 14 12 10 23 17 20 ...
## $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
Say \(k=3\)
\[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij}. \] Then this is the same as \[ y_{ij} = \beta_0 + \beta_1 \mathbf{1}(i = 2)+ \beta_2 \mathbf{1}(i = 3) + \epsilon_{ij} \]
So
The \(\sum_i \tau_i = 0\) constraint is necessary to solve this: \[ \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ 0 \end{bmatrix} \] We get
\[ \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/3 & 1/3 & -1/3 \\ 2/3 & -1/3 & -1/3 & 1/3 \\ -1/3 & 2/3 & -1/3 & 1/3 \\ -1/3 & -1/3 & 2/3 & 1/3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ 0 \end{bmatrix} \]
R
The \(F_0\) statistics are the same either way you do it
##
## Call:
## lm(formula = count ~ spray, data = InsectSprays)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.333 -1.958 -0.500 1.667 9.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5000 1.1322 12.807 < 2e-16 ***
## sprayB 0.8333 1.6011 0.520 0.604
## sprayC -12.4167 1.6011 -7.755 7.27e-11 ***
## sprayD -9.5833 1.6011 -5.985 9.82e-08 ***
## sprayE -11.0000 1.6011 -6.870 2.75e-09 ***
## sprayF 2.1667 1.6011 1.353 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.922 on 66 degrees of freedom
## Multiple R-squared: 0.7244, Adjusted R-squared: 0.7036
## F-statistic: 34.7 on 5 and 66 DF, p-value: < 2.2e-16
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 533.8 34.7 <2e-16 ***
## Residuals 66 1015 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1