Module 1

STAT 6021

Taylor R. Brown, PhD

A brief tour

Communicating

My email: trb5me@virginia.edu

slack: https://stat6021.slack.com/

Other Information: Collab (e.g. syllabus)

Getting started

Download Instructions for R:

Download R from here: http://archive.linux.duke.edu/cran/ See the first box at the top, and choose among the Windows, Mac, and Linux options.

Download RStudio Desktop from here: https://www.rstudio.com/products/rstudio/download/

Getting started

Assignments are graded with “gradeR”.

Questions?

Questions before we start the material?

Regression!

We want to develop a framework for modelling the dependence of a response \(Y\) on a predictor \(X\). The general form of such a model is


\[ Y = f(X) + \epsilon \]


Regression!


How do we get an idea for what \(f(\cdot)\) should be?


Typically we plot \(y_1, \ldots, y_n\) against \(x_1, \ldots, x_n\). This is called a scatterplot.

A scatterplot
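As a quick illustration (a sketch only, using R's built-in cars data set, which is not part of the course materials), a scatterplot takes one line:

plot(cars$speed, cars$dist,
     xlab = "speed", ylab = "stopping distance",
     main = "A scatterplot")   # cars ships with base R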


Theoretical/scientific justification for the choice of \(f(\cdot)\) is often useful as well

A Workhorse: Simple Linear Regression

When the relationship looks linear, we can use the simple linear regression (SLR) model.


\[ Y = \beta_0 +\beta_1 X + \epsilon \]
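As a sketch of what this model says about data, we can simulate observations from it in R. The parameter values below are arbitrary choices made for illustration, not values from the course:

set.seed(1)                              # for reproducibility
n <- 50
beta0 <- 2; beta1 <- 0.5; sigma <- 1     # arbitrary "true" values, illustration only
x <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = sigma)
y <- beta0 + beta1 * x + eps             # Y = beta_0 + beta_1 X + epsilon
plot(x, y)                               # the simulated data look roughly linear
abline(a = beta0, b = beta1, col = "red")   # the true line (not an estimate)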

The Conditional Distribution of \(Y\) given \(X\)


\[\begin{align*} E[Y \mid X] &= E[\beta_0 +\beta_1 X + \epsilon \mid X ] \\ &= \beta_0 +\beta_1 X + E[\epsilon \mid X] \\ &= \beta_0 + \beta_1 X \end{align*}\]


\[\begin{align*} V[Y \mid X ] &= V[\beta_0 +\beta_1 X + \epsilon \mid X ] \\ &= V[\epsilon \mid X ] \\ &= \sigma^2 \end{align*}\]

The Conditional Distribution of \(Y\) given \(X\)

Estimation


Consider the SLR model \(Y = \beta_0 + \beta_1 X + \epsilon\).


We don’t know \(\beta_0, \beta_1, \sigma^2\), so we have to use data to estimate these.

Estimation


To estimate \(\beta_0\) and \(\beta_1\) we minimize the loss: \[ S(\beta_0, \beta_1) = \sum_{i=1}^n \left[ y_i - (\beta_0 + \beta_1 x_i)\right]^2 \]
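A minimal sketch of this loss in R, evaluated on simulated data (the candidate coefficients below are made up just to show that a better-fitting line gives a smaller loss):

S <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)   # the least-squares loss

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)   # simulated data with beta_0 = 2, beta_1 = 0.5

S(2, 0.5, x, y)   # loss near the true coefficients
S(0, 1, x, y)     # a worse candidate line gives a larger loss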

Estimation

This is the loss-minimizing line:

Estimation

Estimation


The \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize this are called the least squares estimates.


The estimated regression line is then \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

Chain rule to the rescue

\[\begin{align*} \frac{\partial}{\partial \beta_0} S(\beta_0, \beta_1) &= \frac{\partial}{\partial \beta_0}\sum_{i=1}^n ( y_i - \beta_0 - \beta_1 x_i)^2 \\ &= \sum_{i=1}^n \frac{\partial}{\partial \beta_0}( y_i - \beta_0 - \beta_1 x_i)^2 \\ &= \sum_{i=1}^n 2 ( y_i - \beta_0 - \beta_1 x_i) (-1) \\ &\overset{\text{set}}{=}0 \end{align*}\]

\[\begin{align*} \frac{\partial}{\partial \beta_1} S(\beta_0, \beta_1) &= \frac{\partial}{\partial \beta_1}\sum_{i=1}^n ( y_i - \beta_0 - \beta_1 x_i)^2 \\ &= \sum_{i=1}^n \frac{\partial}{\partial \beta_1}( y_i - \beta_0 - \beta_1 x_i)^2 \\ &= \sum_{i=1}^n 2 ( y_i - \beta_0 - \beta_1 x_i) (-x_i) \\ &\overset{\text{set}}{=}0 \end{align*}\]

The normal equations


The normal equations are what you get when you divide out the \(-2\)'s and rearrange:

\[ n\hat{\beta}_0 + \left(\sum_i x_i \right)\hat{\beta}_1 = \sum_i y_i \]


\[ \left(\sum_i x_i \right)\hat{\beta}_0 + \left(\sum_i x_i^2 \right) \hat{\beta}_1 = \sum_i x_i y_i \]


Sometimes people divide through by \(n\):

\[ \hat{\beta}_0 + \bar{x}\hat{\beta}_1 = \bar{y} \]


\[ \bar{x}\hat{\beta}_0 + \frac{\sum_i x_i^2 }{n} \hat{\beta}_1 = \frac{\sum_i x_i y_i}{n} \]
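Viewed as a \(2 \times 2\) linear system in \(\hat{\beta}_0\) and \(\hat{\beta}_1\), the normal equations can be solved numerically. A sketch with simulated data (values chosen arbitrarily for illustration):

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)   # simulated data
n <- length(y)

A <- matrix(c(n,      sum(x),
              sum(x), sum(x^2)), nrow = 2, byrow = TRUE)
b <- c(sum(y), sum(x * y))
solve(A, b)          # (beta0_hat, beta1_hat)
coef(lm(y ~ x))      # lm() gives the same answer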

The final solution


Finishing off the work, we solve for \(\hat{\beta}_0\) and \(\hat{\beta}_1\): \[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]


\[ \hat{\beta}_1 = \frac{\sum_i x_i y_i - n \bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{\sum_i(x_i - \bar{x})(y_i - \bar{y})}{\sum_j (x_j - \bar{x})^2} \]
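These closed-form expressions are easy to check numerically; a sketch, again with simulated data:

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)   # simulated data

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
beta1_hat <- Sxy / Sxx
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)
coef(lm(y ~ x))                # matches the closed-form answer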

The fitted values


The fitted (predicted) values, denoted \(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n\), are what you get when you plug \(x_1, \ldots, x_n\) into the estimated regression line \(\hat{\beta}_0 + \hat{\beta}_1 x\):


\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \] for \(i = 1, \ldots, n\).
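In R the fitted values can be computed directly from this formula or extracted from a fitted lm object; a short sketch with simulated data:

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)   # simulated data
fit <- lm(y ~ x)

y_hat <- coef(fit)[1] + coef(fit)[2] * x        # beta0_hat + beta1_hat * x_i
all.equal(unname(y_hat), unname(fitted(fit)))   # TRUE: same as lm's fitted values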

A Trick


\[ \hat{\beta}_1 = \sum_i c_i y_i \]

where \[ c_i = \frac{(x_i - \bar{x}) }{\sum_{j=1}^n (x_j - \bar{x})^2 } = \frac{(x_i - \bar{x}) }{S_{xx}} . \]

The following equation is true by definition: \(\sum(x_i - \bar{x})^2 = S_{xx}\).

A Trick


Why:


\[\begin{align*} \hat{\beta}_1 &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{S_{xx}} \\ &= \frac{\sum_i (x_i - \bar{x})y_i - \sum (x_i - \bar{x})\bar{y}}{S_{xx}}\\ &= \frac{\sum_i (x_i - \bar{x})y_i }{S_{xx}} \\ &= \sum_i c_i y_i \end{align*}\]

because \(\sum(x_i - \bar{x}) = 0\).
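A quick numerical check of this identity (simulated data again, chosen arbitrarily):

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)   # simulated data

Sxx <- sum((x - mean(x))^2)
ci  <- (x - mean(x)) / Sxx     # the weights c_i
sum(ci * y)                    # beta1_hat written as sum_i c_i y_i
coef(lm(y ~ x))[2]             # same value from lm()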

showing unbiasedness with our new trick

We drop the conditioning on \(x_1, \ldots, x_n\) from the notation; all expectations and variances below are conditional on the observed \(x\)'s.

\[\begin{align*} E[\hat{\beta}_1 ] &= E[ \sum_i c_i y_i ] \\ &= \sum_i c_i E[y_i] \\ &= \sum_i c_i E[\beta_0 + \beta_1 x_i + \epsilon_i ] \\ &= \sum_i c_i [\beta_0 + \beta_1 x_i ] \\ &= \beta_0 \sum_i c_i + \beta_1 \sum_i c_i x_i \\ &= \beta_0 \sum_i \frac{(x_i - \bar{x}) }{S_{xx}} + \beta_1 \sum_i \frac{(x_i - \bar{x}) }{S_{xx}} x_i \\ &= \beta_1 \sum_i x_i \frac{(x_i - \bar{x}) }{S_{xx}} \\ &= \beta_1 \sum_i \frac{(x_i - \bar{x})^2 }{S_{xx}} \\ &= \beta_1 \end{align*}\]

calculating the variance with our new trick

\[\begin{align*} V[\hat{\beta}_1 ] &= V[ \sum_i c_i y_i ] \\ &= \sum_i c_i^2 V[y_i] \\ &= \sum_i c_i^2 V[\beta_0 + \beta_1 x_i + \epsilon_i ] \\ &= \sum_i c_i^2 \sigma^2 \\ &= \sigma^2 \sum_i \frac{(x_i - \bar{x})^2 }{S_{xx}^2} \\ &= \sigma^2 \frac{S_{xx} }{S_{xx}^2} \\ &= \frac{\sigma^2 }{S_{xx}} \end{align*}\]

The second equality uses the assumption that the \(y_i\) (equivalently, the errors \(\epsilon_i\)) are uncorrelated.
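Both facts can be checked by simulation: hold the \(x\)'s fixed, repeatedly generate new errors, and look at the mean and variance of the resulting \(\hat{\beta}_1\)'s. A sketch (the parameter values are arbitrary choices for illustration):

set.seed(1)
n <- 50; beta0 <- 2; beta1 <- 0.5; sigma <- 1   # arbitrary "true" values
x   <- runif(n, 0, 10)                          # held fixed across replications
Sxx <- sum((x - mean(x))^2)

beta1_hats <- replicate(10000, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)   # new errors each time
  sum((x - mean(x)) * y) / Sxx                    # beta1_hat via the c_i trick
})

mean(beta1_hats)      # close to beta1 = 0.5
var(beta1_hats)       # close to sigma^2 / Sxx
sigma^2 / Sxx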

mean of intercept

Similarly \[ E[\hat{\beta}_0]= E[\bar{y} - \hat{\beta}_1 \bar{x}] = E[\bar{y}] - \beta_1\bar{x} = \beta_0 \]

\[ V[\hat{\beta}_0]= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) \]

Estimating the third parameter


Estimating \(\sigma^2\) can be done by (nearly) averaging the squared residuals

Estimating the third parameter

\[ \hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i )^2 }{n-2} = \frac{SS_{\text{res}} }{n-2}. \] Note we use \(n-2\) instead of \(n\) because two parameters, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), were estimated to obtain the residuals.

In the next chapter, we call \(\hat{\sigma}^2 = \text{MS}_{Res}\).
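A sketch in R comparing this estimate to what lm() reports (summary(fit)$sigma is the residual standard error, so its square is \(\hat{\sigma}^2\); the data are simulated for illustration):

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)   # simulated data
fit <- lm(y ~ x)
n <- length(y)

SS_res <- sum(resid(fit)^2)    # sum of squared residuals
SS_res / (n - 2)               # sigma^2 hat
summary(fit)$sigma^2           # lm's residual standard error, squared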

Sums of Squares

\[ SS_T = SS_R + SS_{Res} \] or \[ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \]

Breaking up the total sum of squares into a "good" (explained) piece and a "bad" (residual) piece has a few other uses

Sums of Squares

\[\begin{align*} \sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n (y_i -\hat{y}_i + \hat{y}_i - \bar{y})^2 \\ &=\sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + 2 \sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \end{align*}\]

use the normal equations on that last piece:

\[\begin{align*} \sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) &= \sum_{i=1}^n (y_i - \hat{y}_i)\hat{y}_i - \bar{y} \sum_{i=1}^n (y_i - \hat{y}_i) \end{align*}\]

it’s clearer after you write

\[\begin{align*} \sum_{i=1}^n (y_i - \hat{y}_i)\hat{y}_i &= \sum_{i=1}^n (y_i - \hat{y}_i)(\hat{\beta}_0 + \hat{\beta}_1 x_i ) \\ &= \hat{\beta}_0 \sum_{i=1}^n (y_i - \hat{y}_i) + \hat{\beta}_1\sum_{i=1}^n (y_i - \hat{y}_i) x_i \end{align*}\]

By the normal equations, \(\sum_{i=1}^n (y_i - \hat{y}_i) = 0\) and \(\sum_{i=1}^n (y_i - \hat{y}_i) x_i = 0\), so the entire cross term vanishes and \(SS_T = SS_{Res} + SS_R\).

Coefficient of Determination

\[ R^2 = \frac{SS_R}{SS_T} \] “the proportion of variation explained by the regressor x”
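Both the decomposition and the definition of \(R^2\) are easy to verify numerically; a sketch with simulated data (arbitrary parameter values, for illustration only):

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)   # simulated data
fit <- lm(y ~ x)

SS_T   <- sum((y - mean(y))^2)
SS_res <- sum((y - fitted(fit))^2)
SS_R   <- sum((fitted(fit) - mean(y))^2)

all.equal(SS_T, SS_R + SS_res)   # the decomposition holds numerically
SS_R / SS_T                      # R^2
summary(fit)$r.squared           # same number from lm()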


R examples!

setwd("/home/t/UVa/all_teaching/summer19_6021/data")
my_data <- read.csv("data-table-B3.csv")
head(my_data) # always check that it worked
##       y  x1  x2  x3   x4   x5 x6 x7    x8   x9  x10 x11
## 1 18.90 350 165 260 8.00 2.56  4  3 200.3 69.9 3910   1
## 2 17.00 350 170 275 8.50 2.56  4  3 199.6 72.9 3860   1
## 3 20.00 250 105 185 8.25 2.73  1  3 196.7 72.2 3510   1
## 4 18.25 351 143 255 8.00 3.00  2  3 199.9 74.0 3890   1
## 5 20.07 225  95 170 8.40 2.76  1  3 194.1 71.8 3365   0
## 6 11.20 440 215 330 8.20 2.88  4  3 184.5 69.0 4215   1

R examples!

reg_mod_1 <- lm(y ~ x1, data = my_data)
summary(reg_mod_1)  # print a lot
## 
## Call:
## lm(formula = y ~ x1, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7923 -1.9752  0.0044  1.7677  6.8171 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.722677   1.443903   23.36  < 2e-16 ***
## x1          -0.047360   0.004695  -10.09 3.74e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.065 on 30 degrees of freedom
## Multiple R-squared:  0.7723, Adjusted R-squared:  0.7647 
## F-statistic: 101.7 on 1 and 30 DF,  p-value: 3.743e-11

R examples!

plot(my_data$x1, my_data$y)
abline(reg_mod_1)  # overlay the best fit line