Taylor R. Brown, PhD
Classification doesn’t have to involve only binary variables: think of blood type, the genre of a movie, a person’s personality type, or the topic of a tweet.
It’s relatively easy to extend logistic regression to classify \(y_i\) that take on more than two values.
Alternatively, you might consider a one-hot representation of each \(y_i\). For example, instead of saying \(y_i = 2\), you could say \[ y_i = \underbrace{ \begin{bmatrix} 0 & 0 & 1 & \cdots & 0 \end{bmatrix}^\intercal}_{m+1 \text{ elements}} \]
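In R, one quick way to build this kind of indicator (one-hot) matrix is nnet::class.ind(); a minimal sketch with a made-up response vector (the values below don’t come from any data set in these notes):
library(nnet)        # class.ind() builds a 0/1 indicator matrix; nnet is used again later for multinom()
y <- c(0, 2, 1, 2)   # made-up categorical responses
class.ind(y)         # one row per observation, one column per category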
For this situation, we need to use the softmax function, which is analogous to a logistic sigmoid function. It maps a vector to another vector: \[ \text{softmax} : \mathbb{R}^{m+1} \to [0,1]^{m+1} \]
It is defined as
\[ \text{softmax}(\mathbf{z}) = \begin{bmatrix} \text{softmax}(\mathbf{z})_0 \\ \text{softmax}(\mathbf{z})_1 \\ \vdots \\ \text{softmax}(\mathbf{z})_m \end{bmatrix} = \begin{bmatrix} \frac{\exp(z_0)}{\sum_{l=0}^{m}\exp(z_l)} \\ \frac{\exp(z_1)}{\sum_{l=0}^{m}\exp(z_l)}\\ \vdots \\ \frac{\exp(z_m)}{\sum_{l=0}^{m}\exp(z_l)} \end{bmatrix} \]
Notice that this vector sums to \(1\), so it gives us probabilities for \(m+1\) categories!
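Here is a quick numerical check that the output of the softmax is a probability vector. This is a minimal sketch; softmax() below is our own helper, not a function from a package (subtracting max(z) is a standard trick to avoid numerical overflow):
softmax <- function(z) exp(z - max(z)) / sum(exp(z - max(z)))
probs <- softmax(c(1.2, -0.3, 0.5))
probs         # three numbers in [0, 1]
sum(probs)    # 1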
In multinomial logistic regression, you have each \(\mathbf{x}_i \in \mathbb{R}^p\), but now you have more \(p\)-dimensional coefficient vectors because you have more categories.
Assume you have \(\boldsymbol{\beta}^{(0)}, \boldsymbol{\beta}^{(1)}, \ldots, \boldsymbol{\beta}^{(m)}\). For each row \(i\) you could take the softmax of \(\begin{bmatrix}\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(0)} & \mathbf{x}_i^\intercal \boldsymbol{\beta}^{(1)} & \cdots & \mathbf{x}_i^\intercal \boldsymbol{\beta}^{(m)} \end{bmatrix}^\intercal\) yielding
\[ \begin{bmatrix} P(y_i = 0) \\ P(y_i = 1) \\ \vdots \\ P(y_i = m) \end{bmatrix} = \begin{bmatrix} \frac{\exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(0)}] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \\ \frac{\exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(1)}] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \\ \vdots \\ \frac{\exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(m)}] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \end{bmatrix}. \]
However, the likelihood is not identifiable because multiple parameter collections give the same likelihood. For example, shifting all of the coefficient vectors by the same vector \(\mathbf{c}\) produces the same likelihood. This can be seen by multiplying the numerator and denominator of each element of the vector by the constant \(\exp[-\mathbf{x}_i^\intercal \mathbf{c}]\); nothing changes:
\[ \begin{bmatrix} \frac{\exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(0)}] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \\ \frac{\exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(1)}] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \\ \vdots \\ \frac{\exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(m)}] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \end{bmatrix} = \begin{bmatrix} \frac{\exp[\mathbf{x}_i^\intercal (\boldsymbol{\beta}^{(0)}-\mathbf{c})] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal (\boldsymbol{\beta}^{(k)}-\mathbf{c})] } \\ \frac{\exp[\mathbf{x}_i^\intercal (\boldsymbol{\beta}^{(1)}-\mathbf{c})] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal (\boldsymbol{\beta}^{(k)}-\mathbf{c})] } \\ \vdots \\ \frac{\exp[\mathbf{x}_i^\intercal (\boldsymbol{\beta}^{(m)} - \mathbf{c})] }{ \sum_{k=0}^{m} \exp[\mathbf{x}_i^\intercal (\boldsymbol{\beta}^{(k)} -\mathbf{c}) ] } \end{bmatrix}. \]
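You can also verify this invariance numerically. The sketch below reuses the softmax() helper from above; the predictor row, the coefficient vectors, and the shift \(\mathbf{c}\) are all made up:
xi <- c(1, 2.5)                                           # one (made-up) row of predictors
betas <- list(c(0.1, -0.2), c(1.0, 0.3), c(-0.5, 0.7))    # beta^(0), beta^(1), beta^(2)
cc <- c(10, -4)                                           # an arbitrary shift vector c
softmax(sapply(betas, function(b) sum(xi * b)))           # original probabilities
softmax(sapply(betas, function(b) sum(xi * (b - cc))))    # identical after shifting every beta by c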
The way to fix this is to constrain the parameters. Fixing one of them will lead to identifiability, because shifting all of them will no longer be permitted.
There are two common choices for which parameter to fix; we will fix \(\boldsymbol{\beta}^{(0)} = \mathbf{0}\), making category \(0\) the baseline.
So, here is the multinomial logistic regression model we will use:
\[ \begin{bmatrix} P(y_i = 0) \\ P(y_i = 1) \\ \vdots \\ P(y_i = m) \end{bmatrix} = \begin{bmatrix} \frac{1}{1 + \sum_{k=1}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \\ \frac{\exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(1)}] }{ 1 + \sum_{k=1}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \\ \vdots \\ \frac{\exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(m)}] }{ 1 + \sum_{k=1}^{m} \exp[\mathbf{x}_i^\intercal \boldsymbol{\beta}^{(k)}] } \end{bmatrix}. \] (note that we are summing from \(k=1\) to \(m\) now)
Using category \(0\) as a baseline, we can get logits (log-odds) using some simple algebra on the previous equation:
\[ \begin{bmatrix} \log\left( \frac{ P(y_i = 1)}{P(y_i = 0) }\right) \\ \vdots \\ \log\left( \frac{P(y_i = m)}{P(y_i = 0) }\right) \end{bmatrix} = \begin{bmatrix} \mathbf{x}_i^\intercal \boldsymbol{\beta}^{(1)} \\ \vdots \\ \mathbf{x}_i^\intercal \boldsymbol{\beta}^{(m)} \end{bmatrix}. \]
R
First let’s get some data and see what format it’s in
library(foreign) # for read.dta() to read in STATA files
library(nnet) # for multinom()
ml <- read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
head(ml, n = 3)
## id female ses schtyp prog read write math science socst
## 1 45 female low public vocation 34 35 41 29 26
## 2 108 male middle public general 34 33 41 36 36
## 3 15 male high public vocation 39 39 44 26 42
## honors awards cid
## 1 not enrolled 0 1
## 2 not enrolled 0 1
## 3 not enrolled 0 1
with(ml, table(prog))  # presumably the call that produced this output (the call itself was not shown)
## prog
##  general academic vocation 
##       45      105       50 
summary(ml$write)      # likewise, presumably a summary of the write scores
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.00   45.75   54.00   52.77   60.00   67.00
R
ml$prog2 <- relevel(ml$prog, ref = "academic") #specify academic as baseline/0
myModel <- multinom(prog2 ~ write, data = ml)
## # weights: 9 (4 variable)
## initial value 219.722458
## final value 185.510837
## converged
summary(myModel)  # presumably the call that produced the output below (it shows standard errors, so summary() rather than print())
## Call:
## multinom(formula = prog2 ~ write, data = ml)
##
## Coefficients:
## (Intercept) write
## general 2.712485 -0.06600782
## vocation 5.359000 -0.11780909
##
## Std. Errors:
## (Intercept) write
## general 1.132805 0.02100133
## vocation 1.115266 0.02161889
##
## Residual Deviance: 371.0217
## AIC: 379.0217
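As a sanity check, we can plug the printed coefficients into the model equation by hand and compare against predict(). This is a sketch that assumes the myModel object fit above is still in the workspace; write = 54 is just the median write score:
b1 <- c(2.712485, -0.06600782)   # "general"  coefficients from the summary above
b2 <- c(5.359000, -0.11780909)   # "vocation" coefficients from the summary above
x0 <- c(1, 54)                   # intercept and write = 54
denom <- 1 + exp(sum(x0 * b1)) + exp(sum(x0 * b2))
c(academic = 1, general = exp(sum(x0 * b1)), vocation = exp(sum(x0 * b2))) / denom
predict(myModel, newdata = data.frame(write = 54), type = "probs")  # should agree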
If our response \(y_i \in \{0, 1, 2, \ldots, m\}\) is ordinal (think “order”), then the ordering of the categories matters (e.g. low, medium, and high blood pressure).
We can use ordinal logistic regression for this setup.
In the case of ordinal logistic regression, things change a little. Assume \(\mathbf{x}_i\) doesn’t have an intercept term. Each group will get its own intercept:
\[ \begin{bmatrix} P(y_i \le 0) \\ P(y_i \le 1) \\ \vdots \\ P(y_i \le m-1) \end{bmatrix} = \text{invlogit}\left( \begin{bmatrix} \alpha_0 + \mathbf{x}_i^\intercal \boldsymbol{\beta} \\ \alpha_1 + \mathbf{x}_i^\intercal \boldsymbol{\beta} \\ \vdots \\ \alpha_{m-1} + \mathbf{x}_i^\intercal \boldsymbol{\beta} \end{bmatrix} \right) = \begin{bmatrix} \frac{\exp\left[ \alpha_0 + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right]}{ 1 + \exp\left[ \alpha_0 + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right] } \\ \frac{\exp\left[ \alpha_1 + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right]}{ 1 + \exp\left[ \alpha_1 + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right] } \\ \vdots \\ \frac{\exp\left[ \alpha_{m-1} + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right]}{ 1 + \exp\left[ \alpha_{m-1} + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right] } \end{bmatrix} \]
A couple of things to notice about this display: only the intercepts \(\alpha_0, \ldots, \alpha_{m-1}\) change across the rows while the slope vector \(\boldsymbol{\beta}\) is shared by every category (the proportional odds assumption), and the display stops at \(P(y_i \le m-1)\) because \(P(y_i \le m) = 1\) automatically.
Also notice that \[ \frac{\exp\left[ \alpha_k + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right]}{ 1 + \exp\left[ \alpha_k + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right] } = P(y_i \le k) \le P(y_i \le k+1) = \frac{\exp\left[ \alpha_{k+1} + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right]}{ 1 + \exp\left[ \alpha_{k+1} + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right] } \] so \[ \alpha_k + \mathbf{x}_i^\intercal \boldsymbol{\beta} \le \alpha_{k+1} + \mathbf{x}_i^\intercal \boldsymbol{\beta} \] so \[ \alpha_k \le \alpha_{k+1}. \]
and that is why we restrict \[ \alpha_0 < \alpha_1 < \cdots < \alpha_{m-1}. \]
Also notice that \(P(y_i \le k) = \frac{\exp\left[ \alpha_k + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right]}{ 1 + \exp\left[ \alpha_k + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right] }\) implies \[ 1 - P(y_i \le k) = \frac{1}{ 1 + \exp\left[ \alpha_k + \mathbf{x}_i^\intercal \boldsymbol{\beta} \right] } \] so \[ \log \left( \frac{P(y_i \le k)}{1 - P(y_i \le k)} \right) = \alpha_k + \mathbf{x}_i^\intercal \boldsymbol{\beta}. \]
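Here is a minimal sketch of how the cumulative probabilities turn into category probabilities; the intercepts, slope, and predictor value below are all made up (plogis() is R’s invlogit):
alphas <- c(-1, 0.5, 2)                  # alpha_0 < alpha_1 < alpha_2, so four categories 0, ..., 3
beta <- 0.8                              # the shared slope
x <- 1.3                                 # a made-up predictor value
cumProbs <- plogis(alphas + x * beta)    # P(y <= 0), P(y <= 1), P(y <= 2)
probs <- diff(c(0, cumProbs, 1))         # P(y = 0), ..., P(y = 3)
probs
sum(probs)                               # 1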
R
First let’s get some data
carsdata <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
header=F, # open the url in a web browser to see why we need this
stringsAsFactors=F) # import string variables as characters not factors.
colnames(carsdata) <- c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class")
head(carsdata)
## buying maint doors persons lug_boot safety class
## 1 vhigh vhigh 2 2 small low unacc
## 2 vhigh vhigh 2 2 small med unacc
## 3 vhigh vhigh 2 2 small high unacc
## 4 vhigh vhigh 2 2 med low unacc
## 5 vhigh vhigh 2 2 med med unacc
## 6 vhigh vhigh 2 2 med high unacc
R
A little data cleaning. First, the raw class counts (the call below was presumably run but not shown in the original):
table(carsdata$class)
## 
##   acc  good unacc vgood 
##   384    69  1210    65
carsdata$class <- factor(carsdata$class, levels=c("unacc", "acc", "good", "vgood"), ordered=TRUE) # order it better!
table(carsdata$class) # our ordinal y variable
##
## unacc acc good vgood
## 1210 384 69 65
R
with(carsdata, table(lug_boot))  # presumably the call that produced this output (not shown in the original)
## lug_boot
##   big   med small 
##   576   576   576 
library(MASS)  # polr() lives in MASS; presumably loaded but not shown
summary(polr(class ~ lug_boot, data = carsdata))  # matches the Call: shown below
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = class ~ lug_boot, data = carsdata)
##
## Coefficients:
## Value Std. Error t value
## lug_bootmed -0.2078 0.1227 -1.693
## lug_bootsmall -0.7517 0.1316 -5.713
##
## Intercepts:
## Value Std. Error t value
## unacc|acc 0.5449 0.0861 6.3287
## acc|good 2.1896 0.1113 19.6742
## good|vgood 2.9607 0.1421 20.8424
##
## Residual Deviance: 2853.20
## AIC: 2863.20
R
It looks like polr() gives you the negative of the coefficients: it parameterizes the model as \(\text{logit}\, P(y_i \le k) = \alpha_k - \mathbf{x}_i^\intercal \boldsymbol{\beta}\), so the signs of the reported slope coefficients are flipped relative to our formulation.
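As a quick check, we can reproduce a cumulative probability by hand with the numbers printed above (plogis() is invlogit); the minus sign reflects polr()’s \(\alpha_k - \mathbf{x}_i^\intercal \boldsymbol{\beta}\) parameterization:
plogis(0.5449 - (-0.7517))  # estimated P(class <= "unacc") when lug_boot = "small"
plogis(0.5449 - 0)          # estimated P(class <= "unacc") when lug_boot = "big" (the reference level)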
It’s worth mentioning now that we can expand the original logistic regression model in another way.
Originally, we assumed \(y_i \sim \text{Binomial}(1, \pi_i)\). But it’s also easy to work in situations where we assume each \(y_i \sim \text{Binomial}(n_i, \pi_i)\). Most of the time \(n_i = N\) for all \(i\).
This isn’t the same as having multiple categories; it’s just modeling an observation that is a count. For example, if we wanted to model the number of birdies each golfer makes in a round, we could take \(y_i \sim \text{Binomial}(18, \pi_i)\), since a round has 18 holes.
This changes the likelihood from \(L(\mathbf{y} ; \mathbf{X}, \boldsymbol{\beta}) = \prod_{i=1}^n \pi_i^{y_i}(1-\pi_i)^{1-y_i}\) to
\[ L(\mathbf{y} ; \mathbf{X}, \boldsymbol{\beta}) = \prod_{i=1}^n {n_i \choose y_i} \pi_i^{y_i}(1-\pi_i)^{n_i-y_i} \]
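In R, this version of the model can be fit with glm() by supplying a two-column response of successes and failures. Here is a sketch with a made-up golfer data set (the data frame, its columns, and the model name are all hypothetical):
golfers <- data.frame(birdies = c(3, 5, 1, 4, 2),             # made-up birdie counts out of 18 holes
                      skill   = c(0.2, 1.1, -0.7, 0.9, 0.0))  # a made-up predictor
golfMod <- glm(cbind(birdies, 18 - birdies) ~ skill,          # successes and failures, n_i = 18
               family = binomial, data = golfers)
coef(golfMod)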
Now we’ll get a little more general. We will discuss generalized linear models.
We’ll need to keep track of a bunch of Greek letters, and they’ll be relatively unfamiliar, but once you get past that it’s not that bad…
Every type of regression model we will learn, and many more, are special cases of GLMs. The normal regression model, logistic regression, Poisson regression, Exponential regression, and Gamma regression are some examples.
The reason they can all be tied together is because the response’s distribution is a particular example of an exponential family distribution. An exponential family distribution has the form \[ f(y_i \mid \theta_i, \phi) = \exp\left\{ [y_i \theta_i - b(\theta_i)]/a(\phi) + h(y_i, \phi) \right\} \] where \(\theta_i\) is called the natural (or canonical) parameter, \(\phi\) is a dispersion parameter, and \(a(\cdot)\), \(b(\cdot)\), and \(h(\cdot)\) are known functions.
Bernoulli distribution for logistic regression \[ f(y_i \mid \theta_i, \phi) = \exp\left\{ [y_i \theta_i - b(\theta_i)]/a(\phi) + h(y_i, \phi) \right\} \]
\[ \pi_i^{y_i}(1-\pi_i)^{1-y_i} = \exp\left[\log\left( \pi_i^{y_i}(1-\pi_i)^{1-y_i}\right) \right] = \exp\left[y_i \underbrace{\log \left(\frac{\pi_i}{1-\pi_i} \right)}_{\theta_i} + \underbrace{\log (1 - \pi_i)}_{-b(\theta_i) } \right] \] and \(a(\phi) = 1\).
\[ f(y_i \mid \theta_i, \phi) = \exp\left\{ [y_i \theta_i - b(\theta_i)]/a(\phi) + h(y_i, \phi) \right\} \]
Poisson regression
\[ \frac{e^{-\lambda} \lambda^{y_i} }{y_i!} = \exp\left[ y_i \underbrace{\log \lambda}_{\theta_i } - \underbrace{\lambda}_{b(\theta_i)} - \underbrace{\log(y_i!)}_{-h(y_i, \phi)} \right] \] again \(a(\phi) = 1\).
\[ f(y_i \mid \theta_i, \phi) = \exp\left\{ [y_i \theta_i - b(\theta_i)]/a(\phi) + h(y_i, \phi) \right\} \]
A normal distribution
\[\begin{align*} (2\pi \sigma^2)^{-1/2}\exp\left[-\frac{1}{2\sigma^2}(y_i - \mu)^2 \right] &= \exp\left[-\frac{1}{2\sigma^2}(y_i^2 - 2y_i \mu + \mu^2) - \frac{1}{2}\log(2\pi \sigma^2)\right] \\ &= \exp\left[\frac{y_i \mu}{\sigma^2} -\frac{\mu^2}{2\sigma^2} -\frac{y_i^2}{2\sigma^2} - \frac{1}{2}\log(2\pi \sigma^2)\right] \\ &= \exp\left[\frac{y_i \mu - \mu^2/2}{\sigma^2} - \underbrace{\left( \frac{y_i^2}{2\sigma^2} + \frac{1}{2}\log(2\pi \sigma^2) \right) }_{-h(y_i, \phi)} \right] \end{align*}\]
\[ f(y_i \mid \theta_i, \phi) = \exp\left\{ [y_i \theta_i - b(\theta_i)]/a(\phi) + h(y_i, \phi) \right\} \]
Gamma regression (Exponential is a special case)
\[\begin{align*} &\frac{1}{\Gamma(\alpha)\beta^{\alpha}}y_i^{\alpha-1}\exp\left[-\frac{y_i}{\beta} \right] \\ &= \exp\left[-\frac{y_i}{\beta} + (\alpha - 1)\log y_i - \log \Gamma(\alpha) - \alpha \log \beta \right] \\ &= \exp\left[ \frac{y_i\left( -1/\alpha \beta \right)}{1/\alpha} - \alpha \log \beta + (\alpha - 1)\log y_i - \log \Gamma(\alpha) \right] \\ &= \exp\left[ \frac{y_i\left( -1/\alpha \beta \right)}{1/\alpha} + \alpha \log\left(\frac{\alpha}{\alpha\beta} \right) + (\alpha - 1)\log y_i - \log \Gamma(\alpha) \right] \\ &= \exp\left[ \frac{y_i\left( -1/\alpha \beta \right) + \log\left( 1/(\alpha\beta) \right) }{1/\alpha} + \underbrace{\alpha\log(\alpha) + (\alpha - 1)\log y_i - \log \Gamma(\alpha)}_{h(y_i, \phi)} \right] \end{align*}\]
One set of cool properties is that you can find the means and variances by taking derivatives (easier than performing a different integration for each random variable!): \(E[y_i] = b'(\theta_i)\) and \(\text{V}[y_i] = a(\phi)\, b''(\theta_i)\).
For logistic regression, we found
\[ \pi_i^{y_i}(1-\pi_i)^{1-y_i} = \exp\left[\log\left( \pi_i^{y_i}(1-\pi_i)^{1-y_i}\right) \right] = \exp\left[y_i \underbrace{\log \left(\frac{\pi_i}{1-\pi_i} \right)}_{\theta_i} + \underbrace{\log (1 - \pi_i)}_{-b(\theta_i) } \right] \] and \(a(\phi) = 1\). Checking the moments: \(b(\theta_i) = -\log(1-\pi_i) = \log\left(1 + e^{\theta_i}\right)\), so \(E[y_i] = b'(\theta_i) = \frac{e^{\theta_i}}{1 + e^{\theta_i}} = \pi_i\) and \(\text{V}[y_i] = a(\phi)\, b''(\theta_i) = \pi_i(1-\pi_i)\).
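A quick numerical check of these two identities, using only base R (the step sizes are arbitrary):
b <- function(theta) log(1 + exp(theta))                         # b(theta) for the Bernoulli
theta <- 0.7
piHat <- plogis(theta)                                           # pi = exp(theta) / (1 + exp(theta))
d1 <- (b(theta + 1e-5) - b(theta - 1e-5)) / 2e-5                 # approximates b'(theta)
d2 <- (b(theta + 1e-4) - 2 * b(theta) + b(theta - 1e-4)) / 1e-8  # approximates b''(theta)
c(d1, piHat)                  # mean check:     b'(theta)  = pi
c(d2, piHat * (1 - piHat))    # variance check: b''(theta) = pi (1 - pi)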
For the Poisson regression, we found
\[ \frac{e^{-\lambda} \lambda^{y_i} }{y_i!} = \exp\left[ y_i \underbrace{\log \lambda}_{\theta_i } - \underbrace{\lambda}_{b(\theta_i)} - \underbrace{\log(y_i!)}_{-h(y_i, \phi)} \right] \] so \(b(\theta_i) = e^{\theta_i} = \lambda\), which gives \(E[y_i] = b'(\theta_i) = \lambda\) and \(\text{V}[y_i] = a(\phi)\, b''(\theta_i) = \lambda\).
For the regular normal regression, we found \(\theta_i = \mu\), \(b(\theta) = \mu^2/2\), \(a(\phi) = \sigma^2\), and
\[ \exp\left[\frac{y_i \mu - \mu^2/2}{\sigma^2} - \underbrace{\left( \frac{y_i^2}{2\sigma^2} + \frac{1}{2}\log(2\pi \sigma^2) \right) }_{-h(y_i, \phi)} \right], \]
so \(E[y_i] = b'(\theta_i) = \mu\) and \(\text{V}[y_i] = a(\phi)\, b''(\theta_i) = \sigma^2 \cdot 1 = \sigma^2\).
For the Gamma regression example, we found \(\theta_i = -1/(\alpha\beta)\) and \(a(\phi) = 1/\alpha\): \[ \exp\left[ \frac{y_i\left( -1/\alpha \beta \right) + \log\left( 1/(\alpha\beta) \right) }{1/\alpha} + \underbrace{\alpha\log(\alpha) + (\alpha - 1)\log y_i - \log \Gamma(\alpha)}_{h(y_i, \phi)} \right] \]
so \(b(\theta_i) = -\log(-\theta_i)\), which gives \(E[y_i] = b'(\theta_i) = -1/\theta_i = \alpha\beta\) and \(\text{V}[y_i] = a(\phi)\, b''(\theta_i) = \frac{1}{\alpha} \cdot \frac{1}{\theta_i^2} = \alpha\beta^2\).
Another cool thing about recognizing exponential families is that they tell you which link functions you should use to help squash your linear predictor \(\mathbf{x}_i^\intercal \boldsymbol{\beta}\) or stretch your mean \(E[y_i]\).
In the logistic regression we used \(\text{invlogit}\) to squash, and \(\text{logit}\) to stretch.
We call any stretchy function a link function. The special link function suggested by the exponential family’s \(\theta_i\) parameter is called the canonical link function. Some other link functions are given on page 452.
NB1: “logistic link” should be replaced with “logit link”
NB2: The canonical link for the gamma rv is missing a negative sign, but it’s more common to use this version (which is called the “inverse link”)
More cool things about GLMs:
\(\hat{\boldsymbol{\beta}}\) is asymptotically normal and unbiased. The hard part is finding the covariance matrix; it has the same form as before with logistic regression:
\[ \text{V}\left[\hat{\boldsymbol{\beta}} \right] = [-\mathbf{G}]^{-1} \] where \(\mathbf{G}\) is the Hessian, or matrix of second derivatives, of the log-likelihood.
(In software, you can get this from vcov() or by asking for it with a hessian=TRUE argument.)
First, the log-likelihood \[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \log f(y_i \mid \theta_i, \phi) = \sum_{i=1}^n \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + h(y_i, \phi) \]
then the first derivative (with respect to \(\boldsymbol{\beta}\)!). Using the chain rule, we first differentiate with respect to \(\theta_i\) (recall \(b'(\theta_i) = E[y_i] = g^{-1}(\mathbf{x}_i^\intercal \boldsymbol{\beta})\)): \[ \sum_{i=1}^n \frac{d}{d\theta_i}\frac{y_i \theta_i - b(\theta_i)}{a(\phi)} = \sum_{i=1}^n \frac{y_i - g^{-1}(\mathbf{x}_i^\intercal \boldsymbol{\beta})}{a(\phi)} \]
and then with respect to the betas. With the canonical link, \(\theta_i = \mathbf{x}_i^\intercal \boldsymbol{\beta}\), so \(\partial \theta_i / \partial \boldsymbol{\beta} = \mathbf{x}_i\) and \[ \nabla \ell(\boldsymbol{\beta}) = \frac{1}{a(\phi)}\sum_{i=1}^n \mathbf{x}_i \left[ y_i - g^{-1}(\mathbf{x}_i^\intercal \boldsymbol{\beta} ) \right] \]
and then another derivative
\[ \nabla^2 \ell(\boldsymbol{\beta}) = -\frac{1}{a(\phi)}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^\intercal\, [g^{-1}]^{'}(\mathbf{x}_i^\intercal \boldsymbol{\beta} ) = - \frac{1}{a(\phi)} (\mathbf{X}^\intercal \mathbf{V} \mathbf{X} ) \] where \(\mathbf{V}\) is the diagonal matrix with \(V_{ii} = [g^{-1}]^{'}(\mathbf{x}_i^\intercal \boldsymbol{\beta})\).
so the variance is the negative of the inverse of this matrix above (as seen on page 455): \[ \text{V}(\hat{\boldsymbol{\beta}}) = a(\phi) (\mathbf{X}^\intercal \mathbf{V} \mathbf{X} )^{-1}. \]
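For logistic regression (canonical link, \(a(\phi) = 1\)) this formula matches what vcov() reports for a fitted glm. Below is a sketch on simulated data; every object name is made up:
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
X <- model.matrix(fit)
V <- diag(fitted(fit) * (1 - fitted(fit)))  # [g^{-1}]'(x_i' beta-hat) for the logit link
solve(t(X) %*% V %*% X)                     # a(phi) (X' V X)^{-1} with a(phi) = 1
vcov(fit)                                   # essentially the same matrix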
R
All of the GLMs we’ve discussed can be estimated with the R function glm(). Just change the family= argument.
The syntax for how to fill in that argument can be found in R’s documentation for the family objects (see ?family).
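For instance (a sketch; the data frame df and its columns are made up and carry no real signal):
df <- data.frame(x = rnorm(100), counts = rpois(100, lambda = 3), amount = rgamma(100, shape = 2))
poisMod  <- glm(counts ~ x, family = poisson(link = "log"),       data = df)  # Poisson regression
gammaMod <- glm(amount ~ x, family = Gamma(link = "inverse"),     data = df)  # Gamma regression
normMod  <- glm(amount ~ x, family = gaussian(link = "identity"), data = df)  # ordinary normal regression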
Predictions are available after you have your parameter estimates. Just plug in a predictor vector \(\mathbf{x}_0\)
\[ \hat{y}_0 = g^{-1}(\mathbf{x}_0^\intercal \hat{\boldsymbol{\beta}} ). \]
You can use the covariance matrix from before: \[ \text{V}(g(\hat{y}_0)) = \text{V}(\mathbf{x}_0^\intercal\hat{\boldsymbol{\beta}}) = a(\phi) \mathbf{x}_0^\intercal(\mathbf{X}^\intercal \mathbf{V} \mathbf{X} )^{-1}\mathbf{x}_0 \]
Because \[ -z_{\alpha/2} \le \frac{g(\hat{y}_0) - g(y_0) }{\sqrt{\text{V}(\mathbf{x}_0^\intercal\hat{\boldsymbol{\beta}})} } \le z_{\alpha/2} \] has probability \(1-\alpha\), an interval for the mean response is \[ \left[ g^{-1}\left(\mathbf{x}_0^\intercal \hat{\boldsymbol{\beta}} - z_{\alpha/2}\sqrt{\text{V}(\mathbf{x}_0^\intercal\hat{\boldsymbol{\beta}})} \right) , g^{-1}\left(\mathbf{x}_0^\intercal \hat{\boldsymbol{\beta}} + z_{\alpha/2}\sqrt{\text{V}(\mathbf{x}_0^\intercal\hat{\boldsymbol{\beta}})} \right) \right] \]
R
predict() gave us fitted values last module, but it can also predict on new observations and calculate the standard errors \(\sqrt{\text{V}(\mathbf{x}_0^\intercal\hat{\boldsymbol{\beta}})}\):
mod <- glm(chocolate ~ sugar_amount, family = binomial(link = "logit"), data = myDf)
myNewData <- data.frame(sugar_amount = c(4, 4.5))
predict(mod, type = "response", newdata = myNewData, se.fit = T)
## $fit
## 1 2
## 0.5961598 0.6150836
##
## $se.fit
## 1 2
## 0.1414920 0.1539358
##
## $residual.scale
## [1] 1
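To build the interval for the mean response given earlier, one common approach is to predict on the link scale and then transform the endpoints with \(g^{-1}\). Here is a sketch reusing the hypothetical mod and myNewData objects from above (1.96 is \(z_{0.025}\)):
linkPred <- predict(mod, newdata = myNewData, type = "link", se.fit = TRUE)
lower <- plogis(linkPred$fit - 1.96 * linkPred$se.fit)  # g^{-1}(x_0' beta-hat - z * se)
upper <- plogis(linkPred$fit + 1.96 * linkPred$se.fit)  # g^{-1}(x_0' beta-hat + z * se)
cbind(lower, upper)                                     # approximate 95% intervals for the mean response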