Taylor R. Brown, PhD
When are predictions high-variance? \[ V[\hat{\mathbf{y}}] = \sigma^2 \mathbf{H} \] So in particular \[ V[\hat{y}_i] = \sigma^2 \mathbf{H}_{ii} \]
This has to do with “extrapolating” (more on this later)
When are predictions high-variance? \[ V[\mathbf{e}] = \sigma^2 [\mathbf{I} - \mathbf{H}] \] So in particular \[ V[e_i] = \sigma^2(1- \mathbf{H}_{ii}) \]
high-leverage points (\(\mathbf{H}_{ii}\) near \(1\)) make \(V[e_i] \approx 0\), i.e. we claim to predict that observation with (nearly) full certainty (not believable!)
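A quick numerical check of the first formula (a sketch with simulated data; all names below are invented): for the training points, predict() with se.fit = TRUE returns \(\sqrt{\hat{\sigma}^2 \mathbf{H}_{ii}}\), so squaring it should recover \(\hat{\sigma}^2\) times the hat values.

# sketch: check V[y_hat_i] = sigma^2 * H_ii, with sigma^2 replaced by its estimate
set.seed(1)
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
sigma2_hat <- summary(fit)$sigma^2
se_fit <- predict(fit, se.fit = TRUE)$se.fit
all.equal(se_fit^2, sigma2_hat * hatvalues(fit), check.attributes = FALSE)  # TRUE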
Recall \[ \mathbf{H} = \mathbf{H}^2 \] So \[ \mathbf{H}_{ii} = \sum_{l \neq i} \mathbf{H}_{i,l}^2 + \mathbf{H}_{ii}^2 \ge \mathbf{H}_{ii}^2 \ge 0 \] So \[ 0 \le \mathbf{H}_{ii} \le 1. \]
Recall \[ \mathbf{H} \mathbf{X} = \mathbf{X}. \]
Looking at the first column (the column of \(1\)s for the intercept) \[ \mathbf{H} \mathbf{1} = \mathbf{1} \]
So row-sums of the hat matrix are equal to \(1\).
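A minimal sketch (simulated data, invented names) confirming both facts by building \(\mathbf{H}\) directly:

# sketch: leverages lie in [0, 1] and the rows of H sum to 1
set.seed(2)
n <- 25
X <- cbind(1, rnorm(n), rnorm(n))      # design matrix with an intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
range(diag(H))                         # within [0, 1]
range(rowSums(H))                      # numerically all equal to 1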
We know
\[ \hat{\mathbf{y}} = \mathbf{H}\mathbf{y} \]
and if you look at one of the rows you have
\[ \hat{y}_j = \sum_{i=1}^n y_i \mathbf{H}_{j,i} \]
So \(\mathbf{H}_{j,i}\) is the leverage of obs. \(i\) on fitted value \(j\).
Rule of thumb: flag row \(j\) when \(\mathbf{H}_{jj} > 3 \frac{k+1}{n}\). More on this in module 8!
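A sketch of how the rule of thumb might be applied in R (invented data and names; hatvalues() extracts the \(\mathbf{H}_{jj}\)):

# sketch: flag rows whose leverage exceeds 3(k+1)/n
set.seed(3)
df <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
df$y <- 1 + df$x1 + rnorm(40)
fit <- lm(y ~ x1 + x2, data = df)
h <- hatvalues(fit)                         # the H_jj values
cutoff <- 3 * length(coef(fit)) / nrow(df)  # 3(k+1)/n
which(h > cutoff)                           # flagged rows (possibly none)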
Q: what does \(\mathbf{H}_{ii}\) mean, though?
A: it’s related to how far away the \(i\)th row/observation is from the typical observation
I.e. how far is \(\mathbf{x}_i^{\intercal} = (1, x_{i1}, x_{i2}, \ldots, x_{ik})\) from \(\bar{\mathbf{x}}^{\intercal}\)
(Squared) Euclidean distance?
\[ [1-1]^2 + [x_{i1} - \bar{x}_{1}]^2 + \cdots + [x_{ik} - \bar{x}_{k}]^2 \]
If our model has an intercept, this can be written as \[ (\mathbf{x}_i - \bar{\mathbf{x}})^{\intercal}(\mathbf{x}_i - \bar{\mathbf{x}}) \]
How far away is the \(i\)th row/observation from the typical observation?
This is better
\[ \frac{[x_{i1} - \bar{x}_{1}]^2}{ s_1^2 } + \cdots + \frac{[x_{ik} - \bar{x}_{k}]^2}{s_k^2 } \]
This can be written as \[ (\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1})^{\intercal} \begin{bmatrix} s_1^{2} & \cdots & 0\\ \vdots & \ddots & \vdots \\ 0 & \cdots & s_k^2 \end{bmatrix}^{-1} (\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1}) \]
e.g. \(s_1^2 = \sum_{i}(x_{i1} - \bar{x}_1)^2/(n-1)\)
How far away is the \(i\)th row/observation from the typical observation?
In general we have Mahalanobis distance \[ (\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1})^{\intercal} \mathbf{S}^{-1} (\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1}) \] where \[ \mathbf{S} = \begin{bmatrix} s_1^{2} & \cdots & s_{1k}\\ \vdots & \ddots & \vdots \\ s_{k1} & \cdots & s_k^2 \end{bmatrix} \]
How does this relate to \[ \mathbf{H} = \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal} \]
\[ \mathbf{H}_{i,j} = \mathbf{X}_i^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}_j \] where \(\mathbf{X}_i^{\intercal}\) and \(\mathbf{X}_j^{\intercal}\) are rows \(i\) and \(j\) of \(\mathbf{X}\).
We will show:
\(\mathbf{H}_{i,j}\) is the similarity between observations \(i\) and \(j\)
\(\mathbf{H}_{i,i}\) is the distance of observation \(i\) from the center.
They’re not exactly the same. Recall \[ \mathbf{X}^\intercal \mathbf{X} = \begin{bmatrix} n & n\bar{\mathbf{x}}_{-1}^{\intercal} \\ n\bar{\mathbf{x}}_{-1} & \mathbf{C} \end{bmatrix} \] where \(\mathbf{C}_{jk} = \sum_{i=1}^n x_{ij} x_{ik}\). The inverse of this block matrix is
\[ \begin{bmatrix} 1/n+\bar{\mathbf{x}}_{-1}^{\intercal}( \mathbf{C}- n\bar{\mathbf{x}}_{-1}\bar{\mathbf{x}}_{-1}^{\intercal})^{-1}\bar{\mathbf{x}}_{-1} & -\bar{\mathbf{x}}_{-1}^{\intercal}( \mathbf{C}- n\bar{\mathbf{x}}_{-1}\bar{\mathbf{x}}_{-1}^{\intercal})^{-1} \\ - ( \mathbf{C}- n\bar{\mathbf{x}}_{-1}\bar{\mathbf{x}}_{-1}^{\intercal})^{-1}\bar{\mathbf{x}}_{-1} & ( \mathbf{C}- n\bar{\mathbf{x}}_{-1}\bar{\mathbf{x}}_{-1}^{\intercal})^{-1} \end{bmatrix} \] which equals \[ \begin{bmatrix} 1/n+\bar{\mathbf{x}}_{-1}^{\intercal}( [n-1] \mathbf{S})^{-1}\bar{\mathbf{x}}_{-1} & -\bar{\mathbf{x}}_{-1}^{\intercal}( [n-1] \mathbf{S})^{-1} \\ - ([n-1] \mathbf{S})^{-1}\bar{\mathbf{x}}_{-1} & ( [n-1] \mathbf{S})^{-1} \end{bmatrix} \]
So
\[\begin{align*} \mathbf{H}_{i,j} &= \mathbf{X}_i^{\intercal}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}_j \\ &= \begin{bmatrix} 1 & \mathbf{x}_{i,-1}^\intercal \end{bmatrix} \begin{bmatrix} 1/n+\bar{\mathbf{x}}_{-1}^{\intercal}( [n-1] \mathbf{S})^{-1}\bar{\mathbf{x}}_{-1} & -\bar{\mathbf{x}}_{-1}^{\intercal}( [n-1] \mathbf{S})^{-1} \\ - ([n-1] \mathbf{S})^{-1}\bar{\mathbf{x}}_{-1} & ( [n-1] \mathbf{S})^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ \mathbf{x}_{j,-1} \end{bmatrix}\\ &= \begin{bmatrix} \frac{1}{n} - (\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1})^{\intercal}( [n-1] \mathbf{S})^{-1}\bar{\mathbf{x}}_{-1} & (\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1})^{\intercal}( [n-1] \mathbf{S})^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ \mathbf{x}_{j,-1} \end{bmatrix}\\ &= \frac{1}{n} + (\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1})^{\intercal} \mathbf{S}^{-1}(\mathbf{x}_{j,-1} - \bar{\mathbf{x}}_{-1})/(n-1) \end{align*}\]
In particular \(\mathbf{H}_{i,i} = \frac{1}{n} + \underbrace{(\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1})^{\intercal}\mathbf{S}^{-1}(\mathbf{x}_{i,-1} - \bar{\mathbf{x}}_{-1})}_{\text{Mahalanobis dist.} }/(n-1)\)
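This identity is easy to verify numerically; a sketch with simulated data (invented names), using base R’s mahalanobis(), which returns the squared distance:

# sketch: H_ii = 1/n + Mahalanobis_i / (n - 1)
set.seed(4)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
Xm <- cbind(x1, x2)                                    # predictors without the intercept
md <- mahalanobis(Xm, colMeans(Xm), cov(Xm))           # squared Mahalanobis distances
all.equal(unname(hatvalues(fit)), 1/n + md / (n - 1))  # TRUE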
When you have linear dependence between your predictors, bad things can happen!
Example: trying to predict your salary next year (\(y\)) using years spent in school (\(x_1\)) and decades spent in school (\(x_2\)).
Linear dependence because \[ x_1 - 10 \times x_2 = 0 \]
Near-linear dependence is bad, too
Why?
Recall \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\intercal \mathbf{X})^{-1}\mathbf{X}^\intercal \mathbf{y} \]
\((\mathbf{X}^\intercal \mathbf{X})^{-1}\), and hence \(\hat{\boldsymbol{\beta}}\), doesn’t exist unless \(\mathbf{X}^\intercal \mathbf{X}\) is invertible
\(\mathbf{X}^\intercal \mathbf{X}\) is invertible if and only if it’s of full rank
\(\text{rank}(\mathbf{X}^\intercal \mathbf{X}) = \text{rank}(\mathbf{X})\), so \(\mathbf{X}\) must have full column rank (no linear dependence among its columns)
R has your back
##
## Call:
## lm(formula = y ~ ., data = lil_df)
##
## Coefficients:
## (Intercept) x1 x2
## 0.72423 0.05491 NA
See the NA. It doesn’t make a number up.
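The data frame lil_df isn’t shown; a sketch like the following (made-up numbers, echoing the years/decades example) reproduces the NA behavior, though the estimates will differ:

# sketch: exact linear dependence makes lm() drop a column (NA coefficient)
set.seed(5)
lil_df <- data.frame(x1 = sample(12:20, 30, replace = TRUE))  # years of school
lil_df$x2 <- lil_df$x1 / 10                                   # decades of school: x1 - 10*x2 = 0
lil_df$y  <- 1 + 0.5 * lil_df$x1 + rnorm(30)
coef(lm(y ~ ., data = lil_df))                                # x2 comes back NA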
Why is near-linear dependence bad?
Recall \[ V[\hat{\boldsymbol{\beta}}] = \sigma^2(\mathbf{X}^\intercal\mathbf{X})^{-1} \]
Main idea: when the columns of \(\mathbf{X}\) are nearly linearly dependent, the diagonal and off-diagonal entries of this matrix will be large.
For example, let’s look at the variance of \(\hat{\beta}_3\). So we’re just interested in one of the diagonals.
Without loss of generality, let’s put \(\mathbf{x}_3\) in the first column of \(\mathbf{X}\)
\[ \mathbf{X}^\intercal \mathbf{X} = \begin{bmatrix} \mathbf{x}_3^\intercal \\ \mathbf{x}_{-3}^\intercal \end{bmatrix} \begin{bmatrix} \mathbf{x}_3 & \mathbf{x}_{-3} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_3^\intercal\mathbf{x}_3 & \mathbf{x}_3^\intercal\mathbf{x}_{-3} \\ \mathbf{x}_{-3}^\intercal\mathbf{x}_3 & \mathbf{x}_{-3}^\intercal\mathbf{x}_{-3} \end{bmatrix} \]
Again using the (second) formula for the inverse of a block matrix \[ V[\hat{\beta}_3] = \frac{\sigma^2}{\mathbf{x}_3^\intercal\mathbf{x}_3 - \mathbf{x}_3^\intercal\mathbf{x}_{-3} [\mathbf{x}_{-3}^\intercal\mathbf{x}_{-3}]^{-1}\mathbf{x}_{-3}^\intercal\mathbf{x}_3 } \]
Can you recognize anything special about the denominator?
Hint: think about running an auxiliary regression of \(\mathbf{x}_3\) on \(\mathbf{x}_{-3}\)…
So \[\begin{align*} V[\hat{\beta}_3] &= \frac{\sigma^2}{\mathbf{x}_3^\intercal\mathbf{x}_3 - \mathbf{x}_3^\intercal\mathbf{x}_{-3} [\mathbf{x}_{-3}^\intercal\mathbf{x}_{-3}]^{-1}\mathbf{x}_{-3}^\intercal\mathbf{x}_3 } \\ &= \frac{\sigma^2}{\mathbf{x}_3^\intercal\left[\mathbf{I} - \mathbf{H}_{-3} \right]\mathbf{x}_3 } \\ &= \frac{\sigma^2}{SS_{T,3} - SS_{R,3} } \\ &= \frac{\sigma^2}{SS_{T,3} }\left[\frac{1}{1 - R^2_{3} }\right] \end{align*}\]
We call \(\text{VIF}_3 = \frac{1}{1 - R^2_{3}}\) the variance inflation factor for the third coefficient.
NB: the covariances increase with the VIF, too!
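A sketch (simulated data, invented names) showing that the auxiliary-regression definition of the VIF matches what car::vif() reports:

# sketch: VIF_3 = 1 / (1 - R^2_3), where R^2_3 comes from regressing x3 on the rest
library(car)
set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)            # x3 is nearly a linear combo of x1 and x2
y  <- 1 + x1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)
r2_3 <- summary(lm(x3 ~ x1 + x2))$r.squared   # auxiliary regression
1 / (1 - r2_3)                                # manual VIF for x3
vif(fit)["x3"]                                # same number from car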
Remember, we want to be alerted if \[ \sum_{j=1}^p t_j \mathbf{x}_j \approx \mathbf{0} \] for \(t_1, \ldots, t_p\) not all zero, where \(\mathbf{x}_j\) is the \(j\)th column of \(\mathbf{X}\).
Looking at the correlation matrix of these \(p\) variables would only alert us to pairwise correlations.
Each VIF tells us about how a particular column relates to all other columns.
There is cause for concern if any VIF is higher than \(10\) (or maybe even \(5\)).
We can also look at the determinant of \(\mathbf{X}^\intercal\mathbf{X}\).
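A sketch of the determinant idea, computed on the correlation scale of the predictors so the result doesn’t depend on units (invented data): values near 0 signal near-linear dependence, values near 1 signal nearly orthogonal columns.

# sketch: determinant of the predictor correlation matrix as a collinearity check
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)       # almost a copy of x1
x3 <- rnorm(n)
det(cor(cbind(x1, x2, x3)))          # near 0: strong multicollinearity
det(cor(cbind(x1, rnorm(n), x3)))    # near 1: little multicollinearity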
R: Pinot Noir!
## Loading required package: carData
## Clarity Aroma Body Flavor Oakiness Quality Region
## 1 1 3.3 2.8 3.1 4.1 9.8 1
## 2 1 4.4 4.9 3.5 3.9 12.6 1
## 3 1 3.9 5.3 4.8 4.7 11.9 1
## 4 1 3.9 2.6 3.1 3.6 11.1 1
## 5 1 5.6 5.1 5.5 5.1 13.3 1
## 6 1 4.6 4.7 5.0 4.1 12.8 1
my_mod <- lm(Quality ~ Clarity + Aroma + Body + Flavor + Oakiness + Region, data = my_df)
vif(my_mod)
## Clarity Aroma Body Flavor Oakiness Region
## 1.268415 2.898856 2.094459 2.687421 1.172503 1.776852
## Clarity Aroma Body Flavor Oakiness
## Clarity 1.00000000 0.0619021 -0.3083783 -0.08515993 0.18321471
## Aroma 0.06190210 1.0000000 0.5489102 0.73656121 0.20164445
## Body -0.30837826 0.5489102 1.0000000 0.64665917 0.15210591
## Flavor -0.08515993 0.7365612 0.6466592 1.00000000 0.17976051
## Oakiness 0.18321471 0.2016444 0.1521059 0.17976051 1.00000000
## Region -0.02032013 0.6201006 0.4354403 0.50591814 -0.05956862
## Region
## Clarity -0.02032013
## Aroma 0.62010056
## Body 0.43544030
## Flavor 0.50591814
## Oakiness -0.05956862
## Region 1.00000000
To understand multicollinearity, we need to think about where it comes from:
data collection method
constraints on model or population
model specification
an overdefined model
Then we can decide on how to deal with it.
Example:
A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time (\(y\)) are the number of cases of product stocked (\(x_1\)) and the distance walked by the route driver (\(x_2\)).
E.g.
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \]
what if \(0 < x < 1\)? Then \(x\) and \(x^2\) are nearly linearly related.
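A quick simulation (a sketch, invented names) of how severe this is when \(x\) is confined to \((0, 1)\):

# sketch: x and x^2 are nearly collinear when 0 < x < 1
library(car)                                  # for vif()
set.seed(8)
x <- runif(200)                               # x between 0 and 1
cor(x, x^2)                                   # very high, around 0.97
y <- 1 + x + x^2 + rnorm(200, sd = 0.5)
vif(lm(y ~ x + I(x^2)))                       # large VIFs for both terms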
E.g.
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
where \(\mathbf{X}\) is “fat” (i.e. \(n < p\)).
“Dimension reduction” and “variable selection” are huge areas of research