Taylor R. Brown, PhD
So far we’ve assumed \(y\) is normal. In particular it is continuous and takes values on \(\mathbb{R}\).
What if \(y_i \in \{0,1\}\)? Can we still use
\[ y_i = \mathbf{x}_i^\intercal \boldsymbol{\beta} + \epsilon_i? \]
Problems: the right-hand side \(\mathbf{x}_i^\intercal \boldsymbol{\beta} + \epsilon_i\) can take any value in \(\mathbb{R}\) while \(y_i\) is only \(0\) or \(1\), the errors \(\epsilon_i\) cannot be normally distributed, and the variance of \(y_i\) changes with its mean.
Two special functions: the logit function, and the inverse logit (aka logistic sigmoid function).
The logit function expands a probability to lie on the real line:
\[ \text{logit} : (0,1) \to \mathbb{R}, \qquad \text{logit}(p) = \log\left(\frac{p}{1-p}\right) \]
The inverse logit function “squashes” a real number into a probability (“sigmoid” means s-shaped):
\[ \text{invlogit} : \mathbb{R} \to (0,1), \qquad \text{invlogit}(x) = \frac{1}{1+\exp(-x)} \]
As the names suggest, they are inverses of each other:
\[ \text{invlogit}[\text{logit}(p)] = \frac{ 1 }{1 +(1-p)/p} = p \]
Also, note that \[ 1 - \text{invlogit}(x) = \text{invlogit}(-x) \]
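As a quick sanity check, here is a minimal R sketch of the two functions (the names logit and invlogit are the ones used in these notes; base R's qlogis() and plogis() compute the same things):
logit <- function(p) log(p / (1 - p))        # (0,1) -> R
invlogit <- function(x) 1 / (1 + exp(-x))    # R -> (0,1)
p <- 0.73
invlogit(logit(p))       # recovers 0.73
1 - invlogit(2.5)        # equals invlogit(-2.5)
invlogit(-2.5)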
So there are two ways to write our model. Each \(y_i \sim \text{Bernoulli}(\pi_i )\) where \(\pi_i = P(y_i = 1 \mid \mathbf{x}_i, \boldsymbol{\beta})\) and where
\[ \text{logit}(\pi_i) = \mathbf{x}_i^\intercal \boldsymbol{\beta} \]
or equivalently
\[ \pi_i = \text{invlogit}(\mathbf{x}_i^\intercal \boldsymbol{\beta} ) \]
First, let’s write down the likelihood:
\[ L(\mathbf{y} ; \mathbf{X}, \boldsymbol{\beta}) = \prod_{i=1}^n f(y_i ; \mathbf{x}_i, \boldsymbol{\beta}) = \prod_{i=1}^n \pi_i^{y_i}(1-\pi_i)^{1-y_i} = \prod_{i=1}^n\left( \frac{\pi_i}{1-\pi_i}\right)^{y_i} (1 - \pi_i) \]
Now the log-likelihood from \(L(\mathbf{y} ; \mathbf{X}, \boldsymbol{\beta}) = \prod_{i=1}^n\left( \frac{\pi_i}{1-\pi_i}\right)^{y_i} (1 - \pi_i)\)
\[\begin{align*} \log L(\mathbf{y} ; \mathbf{X}, \boldsymbol{\beta}) &= \sum_{i=1}^n y_i \log \left( \frac{\pi_i}{1-\pi_i}\right) + \log(1-\pi_i) \\ &= \sum_{i=1}^n y_i \text{logit} \left( \pi_i \right) + \log(1-\pi_i) \\ &= \sum_{i=1}^n y_i \mathbf{x}_i^\intercal \boldsymbol{\beta} + \log( 1 - \text{invlogit}(\mathbf{x}_i^\intercal \boldsymbol{\beta} )) \\ &= \sum_{i=1}^n y_i \mathbf{x}_i^\intercal \boldsymbol{\beta} + \log( \text{invlogit}( - \mathbf{x}_i^\intercal \boldsymbol{\beta} )) \\ &= \sum_{i=1}^n y_i \mathbf{x}_i^\intercal \boldsymbol{\beta} - \log( 1 + \exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}] ) \end{align*}\]
Now take the derivatives of \(\ell(\boldsymbol{\beta}) = \log L(\mathbf{y} ; \mathbf{X}, \boldsymbol{\beta}) = \sum_{i=1}^n y_i \mathbf{x}_i^\intercal \boldsymbol{\beta} - \log( 1 + \exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}])\)
\[ \nabla \ell(\boldsymbol{\beta}) = \sum_{i=1}^n y_i \mathbf{x}_i - \frac{\exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}]}{( 1 + \exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}] ) }\mathbf{x}_i \overset{\text{set}}{=} \mathbf{0} \]
There is no closed-form solution for \(\boldsymbol{\beta}\).
We need an iterative algorithm (e.g. gradient descent/ascent, Newton's method, Fisher scoring); see the sketch below.
If a maximizer exists, these algorithms are guaranteed to find it, because \(\ell\) is concave (\(-\ell\) is convex).
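Here is a minimal R sketch of that idea, assuming a design matrix X (with an intercept column) and a 0/1 response vector y; optim() minimizes, so we hand it \(-\ell\) and its gradient:
negLogLik <- function(beta, y, X) {
  eta <- drop(X %*% beta)
  -sum(y * eta - log1p(exp(eta)))    # -sum_i [ y_i x_i'beta - log(1 + e^{x_i'beta}) ]
}
negGrad <- function(beta, y, X) {
  eta <- drop(X %*% beta)
  -drop(t(X) %*% (y - plogis(eta)))  # -sum_i [ y_i - invlogit(x_i'beta) ] x_i
}
fit <- optim(rep(0, ncol(X)), negLogLik, negGrad, y = y, X = X, method = "BFGS")
fit$par  # should agree closely with coef(glm(y ~ X - 1, family = binomial))
glm() itself uses Fisher scoring (a Newton-type method) rather than BFGS, but any of these converges to the same maximizer when it exists.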
In R:
mod <- glm(chocolate ~ sugar_amount, family = binomial(link = "logit"), data = myDf)
plot(myDf$sugar_amount, myDf$chocolate)
lines(myDf$sugar_amount, predict(mod, type = "response"), col = "red")
Here predict(mod, type = "response") gives the fitted probabilities \(\hat{\pi}_i\), i.e. our \(\hat{y}_i\).
In R, predicting over a grid of sugar_amount values shows the whole fitted curve:
left <- -13
right <- 13
plot(myDf$sugar_amount, myDf$chocolate, xlim = c(left, right))
grid <- seq(left, right, .001)
preds <- predict(mod, list(sugar_amount = grid), type = "response")
lines(grid, preds, type = "l", col = "red")
What do you think \(\hat{\beta}_1\) is?
In R, summary(mod) gives:
##
## Call:
## glm(formula = chocolate ~ sugar_amount, family = binomial(link = "logit"),
## data = myDf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.299 -1.066 -0.914 1.254 1.519
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2445 0.2208 -1.107 0.268
## sugar_amount 0.1585 0.1338 1.185 0.236
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 116.41 on 84 degrees of freedom
## Residual deviance: 114.96 on 83 degrees of freedom
## AIC: 118.96
##
## Number of Fisher Scoring iterations: 4
Say you have an estimate \(\hat{\beta}_3 = -1.6\). What does that even mean?
Suppressing the row index \(i\), recall that \[ \log\left( \frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3} \tag{1} \]
When only the third predictor changes… \[ \log\left( \frac{ \tilde{\pi}}{1-\tilde{\pi}} \right) = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 \tilde{x}_{3} \tag{2} \]
Subtracting (1) from (2) \[ \log\left(\frac{ \tilde{\pi}}{1-\tilde{\pi}} \bigg/ \frac{\pi}{1-\pi} \right) = \beta_3 (\tilde{x}_{3} - x_{3}) \]
For a one unit change in \(x_3\), \(\beta_3\) tells us the change in the log-odds, or the log odds ratio.
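For example, with the hypothetical \(\hat{\beta}_3 = -1.6\) from above, a one unit increase in \(x_3\) (holding the other predictors fixed) multiplies the estimated odds of \(y = 1\) by \(\exp(-1.6) \approx 0.20\), i.e. the odds drop by roughly 80%. In R:
exp(-1.6)   # estimated odds ratio for a one unit increase in x_3, about 0.202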
We now learn about how to test parameters.
All of these methods will be asymptotic, which means they’re only valid when you have “enough” rows/observations.
However, they can all be applied to other types of generalized linear models (GLMs) (e.g. Poisson regression, Gamma regression, multinomial regression, etc.)
Here’s a slide from my old STAT 3120 class: \[ -2 \log \left( \frac{ \sup_{\theta \in \Theta_0} f(x_1, \ldots, x_n ; \theta)}{ \sup_{\theta \in \Theta} f(x_1, \ldots, x_n ; \theta)}\right) \overset{\text{approx.}}{\sim} \chi^2_{\nu} \]
where \(\nu\) is the number of parameters involved in the null hypothesis
The numerator is the likelihood with the maximum likelihood estimates plugged in for the smaller model.
The denominator is the likelihood with the maximum likelihood estimates plugged in for the larger model.
The book writes it this way:
Under \(H_0: \text{RM is correct}\), the test statistic has the following sampling distribution:
\[ LR = 2 \log \left(\frac{L(FM)}{L(RM) }\right) \overset{\text{approx.}}{\sim} \chi^2_{\nu} \] where \(\nu\) is the number of extra parameters in the full model. This fraction/test-statistic is called the deviance.
This gives us the likelihood-ratio test, which assumes that the sample size is large.
\[ LR = 2 \log\left( \frac{L(FM)}{L(RM) } \right)\overset{\text{approx.}}{\sim} \chi^2_{\nu} \]
We reject the null that the smaller model is true when the test statistic is “large”
\[ LR > \chi^2_{\nu, \alpha} \]
or equivalently when the “p-value is low” \[ P(\chi^2_{\nu} \ge LR) < \alpha. \]
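In R, the critical value and the p-value come from qchisq() and pchisq(); a minimal sketch with a made-up test statistic LR = 3.2 and \(\nu = 1\):
LR <- 3.2                                 # hypothetical test statistic
nu <- 1                                   # number of extra parameters
alpha <- 0.05
qchisq(1 - alpha, df = nu)                # critical value, about 3.84
pchisq(LR, df = nu, lower.tail = FALSE)   # p-value, about 0.073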
The null model (which is also the null hypothesis in this case because it’s smaller):
\[ \text{logit}(\pi_i) = \beta_0 \]
versus the alternative model
\[ \text{logit}(\pi_i) = \beta_0 + \beta_1 x_i \]
reducedMod <- glm(chocolate ~ 1, data = myDf, family = binomial(link = "logit"))
fullMod <- glm(chocolate ~ sugar_amount, data = myDf, family = binomial(link = "logit"))
anova(reducedMod, fullMod) # remember: models must be "nested"
## Analysis of Deviance Table
##
## Model 1: chocolate ~ 1
## Model 2: chocolate ~ sugar_amount
## Resid. Df Resid. Dev Df Deviance
## 1 84 116.41
## 2 83 114.96 1 1.4433
## [1] 0.2296061
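The lone number printed at the bottom appears to be the p-value for the deviance difference of 1.4433 on 1 degree of freedom. A sketch of two ways to get it, the second of which asks anova() to do the test itself:
pchisq(1.4433, df = 1, lower.tail = FALSE)   # about 0.2296
anova(reducedMod, fullMod, test = "Chisq")   # adds a Pr(>Chi) column to the table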
A goodness-of-fit test uses the same methodology to test your chosen model:
\[ \text{logit}(\pi_i) = \beta_0 + \beta_1x_{i1} + \cdots + \beta_k x_{ik}, \] which is assumed true under the null hypothesis, against a saturated model:
\[ \text{logit}(\pi_i) = \beta_i \]
This is subtle…each row gets its very own parameter.
Each row gets its own coefficient, so \(\boldsymbol{\beta} \in \mathbb{R}^n\) and the design matrix is the \(n \times n\) identity: \[ \mathbf{X}\boldsymbol{\beta} = \begin{bmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1\\ \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{bmatrix} \]
And then we solve the same score equation we mentioned before. Note that with this design matrix, \(\mathbf{x}_i^\intercal \boldsymbol{\beta} = \beta_i\). We must solve
\[ \nabla \ell(\boldsymbol{\beta}) = \sum_{i=1}^n y_i \mathbf{x}_i - \frac{\exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}]}{( 1 + \exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}] ) }\mathbf{x}_i \overset{\text{set}}{=} \mathbf{0}. \]
Looking at the \(j\)th component:
\[ \sum_{i=1}^n y_i x_{i,j} = \sum_{i=1}^n\frac{\exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}]}{( 1 + \exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}] ) }x_{i,j} \]
which can only be true if for each observation \(i\): \[ y_i = \frac{\exp[ \beta_i]}{( 1 + \exp[ \beta_i ] ) } = \text{invlogit}[\beta_i] \]
or in other words each \(\beta_i\) is plus or minus infinity (if \(y_i\) is \(1\) or \(0\), respectively).
A goodness-of-fit test uses the following test statistic:
\[ 2 \log\left( \frac{L(\text{saturated model})}{L( \text{model} ) } \right) \] This has a special name: the residual deviance. This will be discussed in a few slides!
Instead of comparing your model to the smallest (null) model or the largest (saturated) model, hypothesis testing can also be done for two reasonable nested models. Say
\[ \mathbf{X} \boldsymbol{\beta} = \underbrace{\mathbf{X}_1}_{p-r \text{ cols}} \boldsymbol{\beta}_1 + \underbrace{\mathbf{X}_2}_{r \text{ cols}} \boldsymbol{\beta}_2 \]
and our hypotheses are
\[ H_0: \boldsymbol{\beta}_2 = \mathbf{0} \]
\[ H_a: \boldsymbol{\beta}_2 \neq \mathbf{0} \]
Again, under the null \[ LR = 2 \log\left( \frac{L(FM)}{L(RM) } \right)\overset{\text{approx.}}{\sim} \chi^2_{\nu} \] and here \(\nu = r\)
myDf$sugar_amount_squared <- myDf$sugar_amount^2
reducedMod <- glm(chocolate ~ 1, data = myDf, family = binomial(link = "logit"))
fullMod <- glm(chocolate ~ sugar_amount + sugar_amount_squared, data = myDf, family = binomial(link = "logit"))
(test_results <- anova(reducedMod, fullMod)) # parentheses around assignment add printing
## Analysis of Deviance Table
##
## Model 1: chocolate ~ 1
## Model 2: chocolate ~ sugar_amount + sugar_amount_squared
## Resid. Df Resid. Dev Df Deviance
## 1 84 116.41
## 2 82 112.10 2 4.3039
## [1] 0.1162556
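Again, the trailing number appears to be the p-value for the deviance difference of 4.3039 on 2 degrees of freedom:
pchisq(4.3039, df = 2, lower.tail = FALSE)   # about 0.1163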
We also have tests on individual coefficients. The null and alternative hypotheses would be of the form
\[ H_0: \beta_j = 0 \text{ versus } H_a: \beta_j \neq 0 \]
These will be called Wald tests, and they are based on a theorem you prove in a mathematical statistics class which shows that MLEs are asymptotically unbiased, achieve the lowest possible asymptotic variance, and are approximately normally distributed!
Asymptotic unbiasedness is the easier part: \(E\left[\hat{\boldsymbol{\beta}} \right] \approx \boldsymbol{\beta}\). The covariance matrix has to do with the matrix of second derivatives
\[ V\left[\hat{\boldsymbol{\beta}} \right] \approx [-\mathbf{G}]^{-1} \]
\(\ell(\hat{\boldsymbol{\beta}}) = \log L( \hat{\boldsymbol{\beta}} )\) is the log-likelihood with the MLE plugged in
\(\mathbf{G}_{ij} = \frac{\partial^2 \log L(\hat{\boldsymbol{\beta}})}{\partial \beta_i \partial \beta_j}\) is the Hessian matrix
\(\mathbf{G}\) is negative definite because the log-likelihood is concave
that means \(-\mathbf{G}\) is positive definite (akin to Fisher Information Matrix)
\([-\mathbf{G}]^{-1}\) is akin to the Cramér–Rao bound
Why this is so cool:
the Hessian matrix is output from many optimization algorithms
this approach is justified for nearly all maximum likelihood estimation procedures, which is the most common estimation technique
data sets are usually large, so the large-sample requirement is not very restrictive
no model-specific derivations are needed (the estimators are always approximately normally distributed)
Why does the book write \(-\mathbf{G}^{-1} = -[\nabla^2 \ell( \hat{\boldsymbol{\beta}} )] ^{-1} = (\mathbf{X}^\intercal \mathbf{V} \mathbf{X})^{-1}\)?
Recall
\[ \nabla \ell(\boldsymbol{\beta}) = \sum_{i=1}^n y_i \mathbf{x}_i - \frac{\exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}]}{( 1 + \exp[ \mathbf{x}_i^\intercal \boldsymbol{\beta}] ) }\mathbf{x}_i \]
So \[\begin{align*} \nabla^2 \ell( \hat{\boldsymbol{\beta}} ) &= -\sum_{i=1}^n \frac{\exp[ -\mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}} ] }{( 1 + \exp[- \mathbf{x}_i^\intercal \hat{\boldsymbol{\beta}} ] )^2 }\mathbf{x}_i\mathbf{x}_i^\intercal \\ &= -\mathbf{X}^\intercal \begin{bmatrix} \frac{\exp[ -\mathbf{x}_1^\intercal \hat{\boldsymbol{\beta}} ] }{( 1 + \exp[- \mathbf{x}_1^\intercal \hat{\boldsymbol{\beta}} ] )^2} & \cdots & 0\\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{\exp[ -\mathbf{x}_n^\intercal \hat{\boldsymbol{\beta}} ] }{( 1 + \exp[- \mathbf{x}_n^\intercal \hat{\boldsymbol{\beta}} ] )^2} \end{bmatrix} \mathbf{X} = -\mathbf{X}^\intercal \mathbf{V} \mathbf{X} \end{align*}\]
Each diagonal entry of \(\mathbf{V}\) equals \(\hat{\pi}_i(1-\hat{\pi}_i)\), so \(-\mathbf{G} = -\nabla^2 \ell( \hat{\boldsymbol{\beta}} ) = \mathbf{X}^\intercal \mathbf{V} \mathbf{X}\), which gives the expression above.
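A minimal R sketch of this calculation, assuming the fitted object fullMod from earlier; the hand-built \((\mathbf{X}^\intercal \mathbf{V} \mathbf{X})^{-1}\) should agree with what vcov() reports:
X <- model.matrix(fullMod)              # the design matrix
pihat <- fitted(fullMod)                # invlogit(x_i' beta-hat)
V <- diag(pihat * (1 - pihat))          # the diagonal weight matrix from above
solve(t(X) %*% V %*% X)                 # estimated covariance matrix of beta-hat
vcov(fullMod)                           # should match
sqrt(diag(vcov(fullMod)))               # the Std. Error column in summary()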
In R, standard errors come from taking the square roots of the diagonal elements of this covariance matrix, and an approximate confidence interval is \[ \hat{\beta}_j \pm z_{\alpha/2} \underbrace{\text{SE}(\hat{\beta}_j)}_{ \text{standard error}} \]
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -0.5159848 0.55175058
## sugar_amount -0.1225894 0.47543798
## sugar_amount_squared -0.2543933 0.01497578
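(The "Waiting for profiling to be done..." message suggests the intervals above came from confint(), which profiles the likelihood rather than using the Wald formula.) The Wald-style interval from the formula above can also be computed by hand; a minimal sketch, again assuming fullMod:
est <- coef(fullMod)
se <- sqrt(diag(vcov(fullMod)))
cbind(lower = est - qnorm(0.975) * se, upper = est + qnorm(0.975) * se)
confint.default(fullMod)   # computes the same Wald intervals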
In R, summary gives us the Wald hypothesis tests for individual parameters:
##
## Call:
## glm(formula = chocolate ~ sugar_amount + sugar_amount_squared,
## family = binomial(link = "logit"), data = myDf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.211 -1.120 -0.753 1.182 1.931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.01469 0.27094 0.054 0.957
## sugar_amount 0.16632 0.14967 1.111 0.266
## sugar_amount_squared -0.10298 0.06721 -1.532 0.125
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 116.41 on 84 degrees of freedom
## Residual deviance: 112.10 on 82 degrees of freedom
## AIC: 118.1
##
## Number of Fisher Scoring iterations: 4
Why was there a null deviance, and a residual deviance?
This is related to the sum-of-squares decomposition we had with regular linear regression: \[ \sum_{i=1}^n(y_i - \bar{y})^2 = \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 \]
Null deviance is kind of like \(SS_T\) \[ 2 \log\left(\frac{L(\text{saturated model})}{L(\text{null model}) }\right) \]
Residual deviance is kind of like \(SS_{Res}\) \[ 2 \log\left( \frac{L(\text{saturated model})}{L(\text{your model}) }\right) \]
So \[ 2 \log\left(\frac{L(\text{saturated model})}{L(\text{null model})}\right) = 2 \log\left(\frac{L(\text{saturated model})}{L(\text{your model})}\right) + 2 \log \left( \frac{L(\text{your model})}{L(\text{null model})}\right) \]
kind of like \(SS_T = SS_{Res} + SS_R\)!
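A quick numerical check with the fitted object from earlier (glm stores both deviances on the fitted object):
fullMod$null.deviance                        # 116.41, the null deviance
deviance(fullMod)                            # 112.10, the residual deviance
fullMod$null.deviance - deviance(fullMod)    # 4.3039, the LR statistic from anova() above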
Assume one predictor, no intercept \[ \pi = \text{invlogit}(x \beta) \]
What do we think \(\beta\) is?
Recall the S-shape of the inverse logit curve: a larger \(\beta\) makes the curve steeper. So for a linearly separable data set, making \(\beta\) bigger pushes every fitted probability closer to its observed \(y_i\), and the likelihood keeps increasing without bound. In other words, the MLE is \(\beta = \infty\).
In R:
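A minimal sketch of the kind of fit that produces output like the one below; the data here are made up (sepXDat and sepYDat are simulated so that a threshold on x separates the classes perfectly), so the coefficients won't match the ones shown exactly:
set.seed(1)
sepXDat <- rnorm(201, sd = .01)
sepYDat <- as.integer(sepXDat > 0)                              # perfectly separable
sepMod <- glm(sepYDat ~ sepXDat, family = binomial(link = "logit"))
sepMod   # huge coefficients; R warns that fitted probabilities are 0 or 1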
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call: glm(formula = sepYDat ~ sepXDat, family = binomial(link = "logit"))
##
## Coefficients:
## (Intercept) sepXDat
## -14.95 2989.09
##
## Degrees of Freedom: 200 Total (i.e. Null); 199 Residual
## Null Deviance: 278.6
## Residual Deviance: 1.292e-06 AIC: 4
So R isn't making a mistake, and the math isn't wrong, but in practice this probably means you are overfitting.