Module 6

STAT 6021

Taylor R. Brown, PhD

Categorical Variables

How does regression work if we have this?

# head(Xdf)
# table(Xdf$Species)
str(Xdf)
## 'data.frame':    150 obs. of  3 variables:
##  $ y      : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ x1     : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

A factor is not a string!

data.frames are not matrices!
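
A quick way to see the factor point (a sketch, using the Xdf above): a factor is stored as integer codes plus a set of level labels, not as characters.

typeof(Xdf$Species)  # "integer" -- the underlying codes, not characters
levels(Xdf$Species)  # the labels: "setosa" "versicolor" "virginica"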

Xmat <- model.matrix(y ~ x1 + Species, data = Xdf)
tail(Xmat)
##     (Intercept)  x1 Speciesversicolor Speciesvirginica
## 145           1 2.5                 0                1
## 146           1 2.3                 0                1
## 147           1 1.9                 0                1
## 148           1 2.0                 0                1
## 149           1 2.3                 0                1
## 150           1 1.8                 0                1

The conversion

A factor column looks like a column of strings, but it gets converted into indicator (dummy) variable columns.

If the factor column has \(L\) levels, then we need \(L-1\) new columns

In our case

\[ x_{i2} = \begin{cases} 1 & \text{Species}_i = \text{versicolor} \\ 0 & \text{else} \end{cases} \]

\[ x_{i3} = \begin{cases} 1 & \text{Species}_i = \text{virginica} \\ 0 & \text{else} \end{cases} \]

lm() usually handles this for us automatically
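
As a sanity check (a sketch, reusing the Xdf and Xmat from above), we can build the two dummy columns by hand and compare with what model.matrix() produced:

x2 <- as.integer(Xdf$Species == "versicolor")  # 1 for versicolor rows, 0 otherwise
x3 <- as.integer(Xdf$Species == "virginica")   # 1 for virginica rows, 0 otherwise
by_hand <- cbind(1, Xdf$x1, x2, x3)            # intercept, x1, and the two dummies
all(by_hand == Xmat)                           # should print TRUE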

Back to our example

str(Xdf)
## 'data.frame':    150 obs. of  3 variables:
##  $ y      : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ x1     : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
tail(Xmat)
##     (Intercept)  x1 Speciesversicolor Speciesvirginica
## 145           1 2.5                 0                1
## 146           1 2.3                 0                1
## 147           1 1.9                 0                1
## 148           1 2.0                 0                1
## 149           1 2.3                 0                1
## 150           1 1.8                 0                1

A simpler example

Say we have the following \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \] where \(x_2\) is a dummy and \(x_1\) is a “regular” quantitative variable

We can rewrite this as \[ y = \begin{cases} \beta_0 + \beta_1 x_1 + \epsilon & x_2 = 0 \\ \beta_0 + \beta_1 x_1 + \beta_2 + \epsilon & x_2 = 1 \end{cases} \] This represents a shift of the intercept
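
A quick simulated illustration (hypothetical data, not the course data set): generate y with a known intercept shift and watch lm() recover it.

set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- rbinom(n, 1, 0.5)                        # dummy variable
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n, sd = 0.3) # true beta2 = 3 is the intercept shift
coef(lm(y ~ x1 + x2))                          # estimates should land near (1, 2, 3)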

A simpler example

The model is still \(\mathbf{y} =\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \[ \mathbf{X} = \begin{bmatrix} 1 & 23 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 12.3 & 0 \\ 1 & -68 & 1 \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & 1 \end{bmatrix} \]

Intercept and Slope Shifts

Assume again \(x_1\) is a quantitative variable and \(x_2\) is a dummy variable.

This \[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_1 x_2 + \epsilon \] can be rewritten as

\[ y = \begin{cases} \beta_0 + \beta_1x_1 + \epsilon & x_2 = 0\\ \beta_0 + \beta_1x_1 + \beta_2 + \beta_3 x_1 + \epsilon & x_2 = 1 \end{cases} \] or \[ y = \begin{cases} \beta_0 + \beta_1x_1 + \epsilon & x_2 = 0\\ (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \epsilon & x_2 = 1 \end{cases} \]

Interpretations?

Slope and Intercept Shifts


Example

\(y = \beta_0 + \beta_1x_1 + \epsilon\)

mod <- lm(y ~ x1, data = Xdf)  # one common line, ignoring Species
plot(Xdf$x1, Xdf$y, col = Xdf$Species, xlab = "x1", ylab = "y")  # color points by species
abline(mod)

Example

\(y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
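
Written out per species (using the dummies \(x_2\) and \(x_3\) defined earlier), this is three parallel lines:

\[ y = \begin{cases} \beta_0 + \beta_1 x_1 + \epsilon & \text{setosa} \\ (\beta_0 + \beta_2) + \beta_1 x_1 + \epsilon & \text{versicolor} \\ (\beta_0 + \beta_3) + \beta_1 x_1 + \epsilon & \text{virginica} \end{cases} \]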

mod <- lm(y ~ x1 + Species , data = Xdf)
mod
## 
## Call:
## lm(formula = y ~ x1 + Species, data = Xdf)
## 
## Coefficients:
##       (Intercept)                 x1  Speciesversicolor  
##             1.211              1.019              1.698  
##  Speciesvirginica  
##             2.277
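
Reading off the fitted coefficients: all three lines share the slope 1.019, the setosa line has intercept 1.211, the versicolor line has intercept \(1.211 + 1.698 = 2.909\), and the virginica line has intercept \(1.211 + 2.277 = 3.488\).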

Example

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \epsilon\)
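
Now each species gets both its own intercept and its own slope:

\[ y = \begin{cases} \beta_0 + \beta_1 x_1 + \epsilon & \text{setosa} \\ (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 + \epsilon & \text{versicolor} \\ (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1 + \epsilon & \text{virginica} \end{cases} \]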

plot(Xdf$x1, Xdf$y, col = Xdf$Species, xlab = "x1", ylab = "y")
fitted_vals <- predict(lm(y ~ x1 + Species + x1:Species, data = Xdf)) # : means interaction
setosa_indices <- Xdf$Species == "setosa"
versicolor_indices <- Xdf$Species == "versicolor"
virginica_indices <- Xdf$Species == "virginica"
points(Xdf$x1[setosa_indices], fitted_vals[setosa_indices], type = "l")         # setosa line
points(Xdf$x1[versicolor_indices], fitted_vals[versicolor_indices], type = "l") # versicolor line
points(Xdf$x1[virginica_indices], fitted_vals[virginica_indices], type = "l")   # virginica line
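
To read the per-species intercepts and slopes off the fit directly (a sketch; the coefficient names follow R's default labeling):

cf <- coef(lm(y ~ x1 + Species + x1:Species, data = Xdf))
cf["(Intercept)"] + cf["Speciesversicolor"]  # versicolor intercept
cf["x1"] + cf["x1:Speciesversicolor"]        # versicolor slope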

Example

We can use our hypothesis testing chops to test \(H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0\)

lil_mod <- lm(y ~ x1 , data = Xdf)
big_mod <- lm(y ~ x1 + Species + x1:Species , data = Xdf) 
anova(lil_mod, big_mod)
## Analysis of Variance Table
## 
## Model 1: y ~ x1
## Model 2: y ~ x1 + Species + x1:Species
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    148 33.845                                  
## 2    144 18.816  4    15.029 28.755 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we rejected that null (as we do here), we could go on and test \(H_0: \beta_4 = \beta_5 = 0\) (leaving just intercept shifts) or \(H_0: \beta_2 = \beta_3 = 0\) (leaving just slope shifts)
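
A sketch of those two follow-up partial \(F\) tests, reusing big_mod from above:

just_int_shifts <- lm(y ~ x1 + Species, data = Xdf)      # no interaction terms
anova(just_int_shifts, big_mod)                          # tests H0: beta4 = beta5 = 0
just_slope_shifts <- lm(y ~ x1 + x1:Species, data = Xdf) # no Species main effect
anova(just_slope_shifts, big_mod)                        # tests H0: beta2 = beta3 = 0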

Connections with Analysis of Variance

Analysis of variance models are examples of regression models, usually written in a slightly different way, and are useful for analyzing data from designed experiments.

Connections with Analysis of Variance

The one-way fixed effects ANOVA model is

\[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij} \] where \(\mu\) is the grand mean, \(\tau_i\) is the effect of treatment \(i = 1, \ldots, k\), and \(j\) indexes the observations within each treatment group.

Connections with Analysis of Variance

Is there a difference between the groups?

The main interest is in testing the hypothesis: \(H_0 : \tau_1 = \cdots = \tau_k = 0\)

Or in other words, is there anything going on?

Connections with Analysis of Variance

Visually:

str(InsectSprays)
## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
boxplot(count ~ spray, data = InsectSprays, 
        ylab = "y", xlab = "treatment/factor level")

Connections with Analysis of Variance

Say \(k=3\)

\[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij}. \] Then this is the same as \[ y_{ij} = \beta_0 + \beta_1 \mathbf{1}(i = 2)+ \beta_2 \mathbf{1}(i = 3) + \epsilon_{ij} \]

So, matching the two parameterizations group by group: \(\mu + \tau_1 = \beta_0\), \(\mu + \tau_2 = \beta_0 + \beta_1\), and \(\mu + \tau_3 = \beta_0 + \beta_2\)

Connections with Analysis of Variance

The model is overparameterized (four parameters, but only three group means), so the \(\sum_i \tau_i = 0\) constraint is necessary to solve this: \[ \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_0 + \beta_1 \\ \beta_0 + \beta_2 \\ 0 \end{bmatrix} \] We get

\[ \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/3 & 1/3 & -1/3 \\ 2/3 & -1/3 & -1/3 & 1/3 \\ -1/3 & 2/3 & -1/3 & 1/3 \\ -1/3 & -1/3 & 2/3 & 1/3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_0 + \beta_1 \\ \beta_0 + \beta_2 \\ 0 \end{bmatrix} \]
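
We can verify the inverse numerically (a sketch; MASS ships with R, and MASS::fractions just prints the entries as fractions):

A <- rbind(c(1, 1, 0, 0),
           c(1, 0, 1, 0),
           c(1, 0, 0, 1),
           c(0, 1, 1, 1))
MASS::fractions(solve(A))  # matches the matrix of thirds above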

Using R

The \(F_0\) statistics are the same either way you do it

summary(lm(count ~ spray, data = InsectSprays))
## 
## Call:
## lm(formula = count ~ spray, data = InsectSprays)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.333 -1.958 -0.500  1.667  9.333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.5000     1.1322  12.807  < 2e-16 ***
## sprayB        0.8333     1.6011   0.520    0.604    
## sprayC      -12.4167     1.6011  -7.755 7.27e-11 ***
## sprayD       -9.5833     1.6011  -5.985 9.82e-08 ***
## sprayE      -11.0000     1.6011  -6.870 2.75e-09 ***
## sprayF        2.1667     1.6011   1.353    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.922 on 66 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7036 
## F-statistic:  34.7 on 5 and 66 DF,  p-value: < 2.2e-16
summary(aov(count ~ spray, data = InsectSprays))
##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1