Course Introduction and Review


  • What is a regression model?

  • Descriptive statistics -- numerical

  • Descriptive statistics -- graphical

  • Inference about a population mean

  • Difference between two population means

What is course about?

  • It is a course on applied statistics.

  • Hands-on: we use R, an open-source statistics software environment as well as ipython.

  • We will start out with a review of introductory statistics to see R in action.

  • Main topic is (linear) regression models: these are the bread and butter of applied statistics.

What is a regression model?

A regression model is a model of the relationships between some covariates (predictors) and an outcome.

Specifically, regression is a model of the average outcome given or having fixed the covariates.

Heights of mothers and daughters

  • We will consider the heights of mothers and daughters collected by Karl Pearson in the late 19th century.

  • One of our goals is to understand height of the daughter, D, knowing the height of the mother, M.

  • A mathematical model might look like $$ D = f(M) + \varepsilon$$ where $f$ gives the average height of the daughter of a mother of height M and $\varepsilon$ is error: not every daughter has the same height.

  • A statistical question: is there any relationship between covariates and outcomes -- is $f$ just a constant?

Let's create a plot of the heights of the mother/daughter pairs. The data is in an R package that can be downloaded from CRAN with the command:


If the package is not installed, then you will get an error message when calling library(alr3).

In [1]:
%%R -h 600 -w 600
M = heights$Mheight
D = heights$Dheight
plot(M, D, pch = 23, bg = "red", cex = 2)
Loading required package: car

In the first part of this course we'll talk about fitting a line to this data. Let's do that and remake the plot, including this "best fitting line".

In [2]:
%%R -h 600 -w 600
plot(M, D, pch = 23, bg = "red", cex = 2)
height.lm = lm(D ~ M)
abline(height.lm, lwd = 3, col = "yellow")

Linear regression model

  • How do find this line? With a model.

  • We might model the data as $$ D = \beta_0+ \beta_1 M + \varepsilon. $$

  • This model is linear in $\beta_1$, the coefficient of M (the mother's height), it is a simple linear regression model.

  • Another model: $$ D = \beta_0 + \beta_1 M + \beta_2 M^2 + \beta_3 F + \varepsilon $$ where $F$ is the height of the daughter's father.

  • Also linear (in the coefficients of $M,M^2,F$).

  • Which model is better? We will need a tool to compare models... more to come later.

A more complex model

  • Our example here was rather simple: we only had one independent variable.

  • Independent variables are sometimes called features or covariates.

  • In practice, we often have many more than one independent variable.


This example considers the effect of right-to-work legislation (which varies by state) on various factors. A description of the data can be found here.

The variables are:

  • Income: income for a four-person family

  • COL: cost of living for a four-person family

  • PD: Population density

  • URate: rate of unionization in 1978

  • Pop: Population

  • Taxes: Property taxes in 1972

  • RTWL: right-to-work indicator

In a study like this, there are many possible questions of interest. Our focus will be on the relationship between RTWL and Income. However, we should recognize that other variables have an effect on Income. Let's look at some of these relationships.

In [3]:
url = ""
rtw.table <- read.table(url, header=T, sep='\t')
         City COL   PD URate     Pop Taxes Income RTWL
1     Atlanta 169  414  13.6 1790128  5128   2961    1
2      Austin 143  239  11.0  396891  4303   1711    1
3 Bakersfield 339   43  23.7  349874  4166   2122    0
4   Baltimore 173  951  21.0 2147850  5001   4654    0
5 Baton Rouge  99  255  16.0  411725  3965   1620    1
6      Boston 363 1257  24.4 3914071  4928   5634    0

A graphical way to visualize the relationship between Income and RTWL is the boxplot.

In [4]:
%%R -h 600 -w 600
boxplot(Income ~ RTWL, col='orange', pch=23, bg='red')

One variable that may have an important effect on the relationship between is the cost of living COL. It also varies between right-to-work states.

In [5]:
%%R -h 600 -w 600
boxplot(COL ~ RTWL, col='orange', pch=23, bg='red')

We may want to include more than one plot in a given display. The first line of the code below achieves this.

In [6]:
%%R -h 600 -w 600
plot(URate, COL, pch=23, bg='red')
plot(URate, Income, pch=23, bg='red')
plot(URate, Pop, pch=23, bg='red')
plot(COL, Income, pch=23, bg='red')

R has a builtin function that will try to display all pairwise relationships in a given dataset, the function pairs.

In [7]:
%%R -h 600 -w 600
pairs(rtw.table, pch=23, bg='red')

In looking at all the pairwise relationships. There is a point that stands out from all the rest. This data point is New York City, the 27th row of the table. (Note that R uses 1-based instead of 0-based indexing for rows and columns of arrays.)

In [8]:
pairs(rtw.table[-27,], pch=23, bg='red')
       City COL   PD URate     Pop Taxes Income RTWL
27 New York 323 6908  39.2 9561089  5260   4862    0

Right-to-work example

Building a model

Some of the main goals of this course:

  • Build a statistical model describing the effect of RTWL on Income.

  • This model should recognize that other variables also affect Income.

  • What sort of statistical confidence do we have in our conclusion about RTWL and Income?

  • Is the model adequate do describe this dataset?

  • Are there other (simpler, more complicated) models?

Numerical descriptive statistics

Mean of a sample

Given a sample of numbers $X=(X_1, \dots, X_n)$ the sample mean, $\overline{X}$ is $$ \overline{X} = \frac1n \sum_{i=1}^n X_i.$$

There are many ways to compute this in R.

In [9]:
X = c(1,3,5,7,8,12,19)
[1]  1  3  5  7  8 12 19
[1] 7.857143
[1] 7.857143
[1] 7.857143

We'll also illustrate thes calculations with part of an example we consider below, on differences in blood pressure between two groups.

In [10]:
%%R -h 600 -w 600
url = ''
calcium.table = read.table(url, header=T, skip=28, nrow=21)
treated = Decrease[(Treatment == 'Calcium')]
placebo = Decrease[(Treatment == 'Placebo')]
 [1]  7 -4 18 17 -3 -5  1 10 11 -2

Standard deviation of a sample

Given a sample of numbers $X=(X_1, \dots, X_n)$ the sample standard deviation $S_X$ is $$ S^2_X = \frac{1}{n-1} \sum_{i=1}^n (X_i-\overline{X})^2.$$

In [11]:
S2 = sum((treated - mean(treated))^2) / (length(treated)-1)
[1] 8.743251
[1] 8.743251

Median of a sample

Given a sample of numbers $X=(X_1, \dots, X_n)$ the sample median is the middle of the sample: if $n$ is even, it is the average of the middle two points. If $n$ is odd, it is the midpoint.

In [12]:
[1]  1  3  5  7  8 12 19
[1] 7

Quantiles of a sample

Given a sample of numbers $X=(X_1, \dots, X_n)$ the $q$-th quantile is a point $x_q$ in the data such that $q \cdot 100\%$ of the data lie to the left of $x_q$.


The $0.5$-quantile is the median: half of the data lie to the right of the median.

In [13]:

Graphical statistical summaries

We've already seen a boxplot. Another common statistical summary is a histogram.

In [14]:
%%R -h 600 -w 600
hist(treated, main='Treated group', xlab='Decrease', col='orange')

Inference about a population mean

A testing scenario

  • Suppose we want to determine the efficacy of a new drug on blood pressure.

  • Our study design is: we will treat a large patient population (maybe not so large: budget constraints limit it $n=20$) with the drug and measure their blood pressure before and after taking the drug.

  • One way to conclude that the drug is effective if the blood pressure has decreased. That is, if the average difference between before and after is positive.

Setting up the test

  • The null hypothesis, $H_0$ is: the average difference is less than zero.

  • The alternative hypothesis, $H_a$, is: the average difference is greater than zero.

  • Sometimes (actually, often), people will test the alternative, $H_a$: the average difference is not zero vs. $H_0$: the average difference is zero.

  • The test is performed by estimating the average difference and converting to standardized units.

Drawing from a box

  • Formally, could set up the above test as drawing from a box of differences in blood pressure.

  • A box model is a useful theoretical device that describes the experiment under consideration. In our example, we can think of the sample of decreases drawn 20 patients at random from a large population (box) containing all the possible decreases in blood pressure.

A simulated box model

  • In our box model, we will assume that the decrease is an integer drawn at random from $-3$ to 6.

  • We will draw 20 random integers from -3 to 6 with replacement and test whether the mean of our "box" is 0 or not.

In [15]:
mysample = as.integer(10*runif(20))-3
 [1]  3 -2 -3  6  6  5  5 -1 -1  0 -1  2 -3  5 -2  0  6  0  0  4

The test is usually a $T$ test that has the form $$ T = \frac{\overline{X}-0}{S_X/\sqrt{n}} $$

In [16]:
T = (mean(mysample) - 0) / (sd(mysample) / sqrt(20))
[1] 2.003954

This $T$ value is compared to a table for the appropriate $T$ distribution (in this case there are 19 degrees of freedom) and the 5% cutoff is

In [17]:
cutoff = qt(0.975, 19)
[1] 2.093024

The result of the test is

In [18]:
result = (abs(T) > cutoff)

If result is TRUE, then we reject $H_0$ the mean is 0, while if it is FALSE we do not reject. Of course, in this example we know the mean in our "box" is not 0, it is 1.5.

This rule can be visualized with the $T$ density. The total grey area is 0.05=5%, and the cutoff is chosen to be symmetric around zero and such that this area is exactly 5%.

For a test of size $\alpha$ we write this cutoff $t_{n-1,1-\alpha/2}$.

In [20]:

As mentioned above, sometimes tests are one-sided. If the null hypothesis we tested was that the mean is less than 0, then we would reject this hypothesis if our observed mean was much smaller than 0. This corresponds to a positive $T$ value.

In [22]:
The rejection rules are affected by the degrees of freedom. Here is the rejection region when we only have 5 samples from our "box".
In [24]:

Confidence intervals

  • Instead of testing a particular hypothesis, we might be interested in coming up with a reasonable range for the mean of our "box".

  • Statistically, this is done via a confidence interval.

  • If the 5% cutoff is $q$ for our test, then the 95% confidence interval is $$ [\bar{X} - q S_X / \sqrt{n}, \bar{X} + q S_X / \sqrt{n}] $$ where we recall $q=t_{n-1,0.975}$ with $n=20$.

  • If we wanted 90% confidence interval, we would use $q=t_{19,0.95}$. Why?

In [25]:
L = mean(mysample) - cutoff*sd(mysample)/sqrt(20)
U = mean(mysample) + cutoff*sd(mysample)/sqrt(20)
            L        U
1 -0.06444809 2.964448

Note that the endpoints above depend on the data. Not every interval will cover the true mean of our "box" which is 1.5. Let's take a look at 100 intervals of size 90%. We would expect that roughly 90 of them cover 1.5.

In [26]:
cutoff = qt(0.95, 19)
L = c()
U = c()
covered = c()
for (i in 1:100) {
   mysample = as.integer(10*runif(20))-3
   l = mean(mysample) - cutoff*sd(mysample)/sqrt(20)
   u = mean(mysample) + cutoff*sd(mysample)/sqrt(20)
   L = c(L, l)
   U = c(U, u)
   covered = c(covered, (l < 1.5) * (u > 1.5))
[1] 82

A useful picture is to plot all these intervals so we can see the randomness in the intervals, why the true mean of the box is unchanged.

In [27]:
%%R -h 600 -w 600
mu = 1.5
plot(c(1, 100), c(-2.5+mu,2.5+mu), type='n', ylab='Confidence Intervals', xlab='Sample')
for (i in 1:100) {
   if (covered[i] == TRUE) {
       lines(c(i,i), c(L[i],U[i]), col='green', lwd=2)
   else {
      lines(c(i,i), c(L[i],U[i]), col='red', lwd=2)
abline(h=mu, lty=2, lwd=4)