Module 6

STAT 6021

Taylor R. Brown, PhD

Categorical Variables

How does regression work if we have this?

# head(Xdf)
# table(Xdf$Species)
str(Xdf)
## 'data.frame':    150 obs. of  3 variables:
##  $ y      : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ x1     : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

A factor is not a string!

data.frames are not matrices!
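
A quick way to see the factor point (a sketch, using the Xdf above): a factor is stored as integer codes plus a set of level labels, not as characters.

typeof(Xdf$Species)  # "integer" -- the underlying codes, not characters
levels(Xdf$Species)  # the labels: "setosa" "versicolor" "virginica"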

Xmat <- model.matrix(y ~ x1 + Species, data = Xdf)
tail(Xmat)
##     (Intercept)  x1 Speciesversicolor Speciesvirginica
## 145           1 2.5                 0                1
## 146           1 2.3                 0                1
## 147           1 1.9                 0                1
## 148           1 2.0                 0                1
## 149           1 2.3                 0                1
## 150           1 1.8                 0                1

The conversion

A factor column looks like a column of strings, but it gets converted into indicator (dummy) variable columns.

If the factor column has \(L\) levels, then we need \(L-1\) new columns

In our case

\[ x_{i2} = \begin{cases} 1 & \text{Species}_i = \text{versicolor} \\ 0 & \text{else} \end{cases} \]

\[ x_{i3} = \begin{cases} 1 & \text{Species}_i = \text{virginica} \\ 0 & \text{else} \end{cases} \]

lm() usually handles this for us automatically
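
As a sanity check (a sketch, reusing the Xdf and Xmat from above), we can build the two dummy columns by hand and compare with what model.matrix() produced:

x2 <- as.integer(Xdf$Species == "versicolor")  # 1 for versicolor rows, 0 otherwise
x3 <- as.integer(Xdf$Species == "virginica")   # 1 for virginica rows, 0 otherwise
by_hand <- cbind(1, Xdf$x1, x2, x3)            # intercept, x1, and the two dummies
all(by_hand == Xmat)                           # should print TRUE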

Back to our example

str(Xdf)
## 'data.frame':    150 obs. of  3 variables:
##  $ y      : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ x1     : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
tail(Xmat)
##     (Intercept)  x1 Speciesversicolor Speciesvirginica
## 145           1 2.5                 0                1
## 146           1 2.3                 0                1
## 147           1 1.9                 0                1
## 148           1 2.0                 0                1
## 149           1 2.3                 0                1
## 150           1 1.8                 0                1

A simpler example

Say we have the following \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \] where \(x_2\) is a dummy and \(x_1\) is a “regular” quantitative variable

We can rewrite this as \[ y = \begin{cases} \beta_0 + \beta_1 x_1 + \epsilon & x_2 = 0 \\ \beta_0 + \beta_1 x_1 + \beta_2 + \epsilon & x_2 = 1 \end{cases} \] This represents a shift of the intercept
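
A quick simulated illustration (hypothetical data, not the course data set): generate y with a known intercept shift and watch lm() recover it.

set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- rbinom(n, 1, 0.5)                        # dummy variable
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n, sd = 0.3) # true beta2 = 3 is the intercept shift
coef(lm(y ~ x1 + x2))                          # estimates should land near (1, 2, 3)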

A simpler example

The model is still \(\mathbf{y} =\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \[ \mathbf{X} = \begin{bmatrix} 1 & 23 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 12.3 & 0 \\ 1 & -68 & 1 \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & 1 \end{bmatrix} \]

Intercept and Slope Shifts

Assume again \(x_1\) is a quantitative variable and \(x_2\) is a dummy variable.

This \[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_1 x_2 + \epsilon \] can be rewritten as

\[ y = \begin{cases} \beta_0 + \beta_1x_1 + \epsilon & x_2 = 0\\ \beta_0 + \beta_1x_1 + \beta_2 + \beta_3 x_1 + \epsilon & x_2 = 1 \end{cases} \] or \[ y = \begin{cases} \beta_0 + \beta_1x_1 + \epsilon & x_2 = 0\\ (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \epsilon & x_2 = 1 \end{cases} \]

Interpretations?

Slope and Intercept Shifts


Example

\(y = \beta_0 + \beta_1x_1 + \epsilon\)

mod <- lm(y ~ x1, data = Xdf)  # one common line, ignoring Species
plot(Xdf$x1, Xdf$y, col = Xdf$Species, xlab = "x1", ylab = "y")  # color points by species
abline(mod)

Example

\(y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
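
Written out per species (using the dummies \(x_2\) and \(x_3\) defined earlier), this is three parallel lines:

\[ y = \begin{cases} \beta_0 + \beta_1 x_1 + \epsilon & \text{setosa} \\ (\beta_0 + \beta_2) + \beta_1 x_1 + \epsilon & \text{versicolor} \\ (\beta_0 + \beta_3) + \beta_1 x_1 + \epsilon & \text{virginica} \end{cases} \]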

mod <- lm(y ~ x1 + Species , data = Xdf)
mod
## 
## Call:
## lm(formula = y ~ x1 + Species, data = Xdf)
## 
## Coefficients:
##       (Intercept)                 x1  Speciesversicolor  
##             1.211              1.019              1.698  
##  Speciesvirginica  
##             2.277
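
Reading off the fitted coefficients: all three lines share the slope 1.019, the setosa line has intercept 1.211, the versicolor line has intercept \(1.211 + 1.698 = 2.909\), and the virginica line has intercept \(1.211 + 2.277 = 3.488\).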

Example

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \epsilon\)
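
Now each species gets both its own intercept and its own slope:

\[ y = \begin{cases} \beta_0 + \beta_1 x_1 + \epsilon & \text{setosa} \\ (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 + \epsilon & \text{versicolor} \\ (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1 + \epsilon & \text{virginica} \end{cases} \]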

plot(Xdf$x1, Xdf$y, col = Xdf$Species, xlab = "x1", ylab = "y")
fitted_vals <- predict(lm(y ~ x1 + Species + x1:Species, data = Xdf)) # : means interaction
setosa_indices <- Xdf$Species == "setosa"
versicolor_indices <- Xdf$Species == "versicolor"
virginica_indices <- Xdf$Species == "virginica"
points(Xdf$x1[setosa_indices], fitted_vals[setosa_indices], type = "l")         # setosa line
points(Xdf$x1[versicolor_indices], fitted_vals[versicolor_indices], type = "l") # versicolor line
points(Xdf$x1[virginica_indices], fitted_vals[virginica_indices], type = "l")   # virginica line
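
To read the per-species intercepts and slopes off the fit directly (a sketch; the coefficient names follow R's default labeling):

cf <- coef(lm(y ~ x1 + Species + x1:Species, data = Xdf))
cf["(Intercept)"] + cf["Speciesversicolor"]  # versicolor intercept
cf["x1"] + cf["x1:Speciesversicolor"]        # versicolor slope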

Example

We can use our hypothesis testing chops to test \(H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0\)

lil_mod <- lm(y ~ x1 , data = Xdf)
big_mod <- lm(y ~ x1 + Species + x1:Species , data = Xdf) 
anova(lil_mod, big_mod)
## Analysis of Variance Table
## 
## Model 1: y ~ x1
## Model 2: y ~ x1 + Species + x1:Species
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    148 33.845                                  
## 2    144 18.816  4    15.029 28.755 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we rejected that null (as we do here), we could go on and test \(H_0: \beta_4 = \beta_5 = 0\) (leaving just intercept shifts) or \(H_0: \beta_2 = \beta_3 = 0\) (leaving just slope shifts)
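
A sketch of those two follow-up partial \(F\) tests, reusing big_mod from above:

just_int_shifts <- lm(y ~ x1 + Species, data = Xdf)      # no interaction terms
anova(just_int_shifts, big_mod)                          # tests H0: beta4 = beta5 = 0
just_slope_shifts <- lm(y ~ x1 + x1:Species, data = Xdf) # no Species main effect
anova(just_slope_shifts, big_mod)                        # tests H0: beta2 = beta3 = 0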

Connections with Analysis of Variance

Analysis of variance models are examples of regression models, usually written in a slightly different way, and are useful for analyzing data from designed experiments.

Connections with Analysis of Variance

The one-way fixed effects ANOVA model is

\[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij} \] where \(\mu\) is the grand mean, \(\tau_i\) is the effect of treatment \(i = 1, \ldots, k\), and \(j\) indexes the observations within each treatment group.

Connections with Analysis of Variance

Is there a difference between the groups?

The main interest is in testing the hypothesis: \(H_0 : \tau_1 = \cdots = \tau_k = 0\)

Or in other words, is there anything going on?

Connections with Analysis of Variance

Visually:

str(InsectSprays)
## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
boxplot(count ~ spray, data = InsectSprays, 
        ylab = "y", xlab = "treatment/factor level")

Connections with Analysis of Variance

Say \(k=3\)

\[ y_{ij} = \mu + \tau_{i} + \epsilon_{ij}. \] Then this is the same as \[ y_{ij} = \beta_0 + \beta_1 \mathbf{1}(i = 2)+ \beta_2 \mathbf{1}(i = 3) + \epsilon_{ij} \]

So, matching the two parameterizations group by group: \(\mu + \tau_1 = \beta_0\), \(\mu + \tau_2 = \beta_0 + \beta_1\), and \(\mu + \tau_3 = \beta_0 + \beta_2\)

Connections with Analysis of Variance

The model is overparameterized (four parameters, but only three group means), so the \(\sum_i \tau_i = 0\) constraint is necessary to solve this: \[ \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_0 + \beta_1 \\ \beta_0 + \beta_2 \\ 0 \end{bmatrix} \] We get

\[ \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/3 & 1/3 & -1/3 \\ 2/3 & -1/3 & -1/3 & 1/3 \\ -1/3 & 2/3 & -1/3 & 1/3 \\ -1/3 & -1/3 & 2/3 & 1/3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_0 + \beta_1 \\ \beta_0 + \beta_2 \\ 0 \end{bmatrix} \]
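
We can verify the inverse numerically (a sketch; MASS ships with R, and MASS::fractions just prints the entries as fractions):

A <- rbind(c(1, 1, 0, 0),
           c(1, 0, 1, 0),
           c(1, 0, 0, 1),
           c(0, 1, 1, 1))
MASS::fractions(solve(A))  # matches the matrix of thirds above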

Using R

The \(F_0\) statistics are the same either way you do it

summary(lm(count ~ spray, data = InsectSprays))
## 
## Call:
## lm(formula = count ~ spray, data = InsectSprays)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.333 -1.958 -0.500  1.667  9.333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.5000     1.1322  12.807  < 2e-16 ***
## sprayB        0.8333     1.6011   0.520    0.604    
## sprayC      -12.4167     1.6011  -7.755 7.27e-11 ***
## sprayD       -9.5833     1.6011  -5.985 9.82e-08 ***
## sprayE      -11.0000     1.6011  -6.870 2.75e-09 ***
## sprayF        2.1667     1.6011   1.353    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.922 on 66 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7036 
## F-statistic:  34.7 on 5 and 66 DF,  p-value: < 2.2e-16
summary(aov(count ~ spray, data = InsectSprays))
##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1