This is a course on applied statistics.

Hands-on: we use R, an open-source statistics software environment.

Course notes will be Jupyter notebooks.

We will start out with a review of introductory statistics to see `R` in action.

The main topic is *(linear) regression models*: these are the *bread and butter* of applied statistics.

A regression model is a model of the relationships between some
*covariates (predictors)* and an *outcome*.

Specifically, regression is a model of the *average* outcome *given or having fixed* the covariates.
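
To make "average outcome given the covariates" concrete, here is a minimal sketch on synthetic data (the numbers are made up for illustration and are not Pearson's data). It computes the average outcome at each fixed covariate value and compares it with what a linear fit predicts at those values.

```r
# Synthetic data: an assumption for illustration only
set.seed(1)
x <- sample(c(60, 62, 64), size = 300, replace = TRUE)  # covariate with three possible values
y <- 30 + 0.5 * x + rnorm(300, sd = 2)                   # outcome = line in x plus noise

# Average outcome with the covariate held fixed at each value
cond_means <- tapply(y, x, mean)
print(cond_means)

# A linear regression estimates these averages, assuming they lie on a line
fit <- lm(y ~ x)
fitted_means <- predict(fit, newdata = data.frame(x = c(60, 62, 64)))
print(fitted_means)
```

The two sets of numbers should be close because the synthetic averages really do lie on a line; with real data that is an assumption to be checked.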

We will consider the heights of mothers and daughters collected by Karl Pearson in the late 19th century.

One of our goals is to understand the height of the daughter, `D`, knowing the height of the mother, `M`.

A mathematical model might look like $$ D = f(M) + \varepsilon$$ where $f$ gives the average height of the daughter of a mother of height `M` and $\varepsilon$ is *error*: not *every* daughter has the same height.

A statistical question: is there *any* relationship between covariates and outcomes -- is $f$ just a constant?
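
One possible way to approach this question, sketched on synthetic data: if we assume $f$ is a line, then "$f$ is a constant" means the slope is zero, and `summary` of an `lm` fit reports a test of exactly that. The names `M_sim` and `D_sim` and the numbers used to generate them are hypothetical.

```r
# Synthetic heights: assumptions for illustration, not the Pearson data
set.seed(2)
M_sim <- rnorm(200, mean = 62, sd = 2.5)         # hypothetical mothers' heights
D_sim <- 30 + 0.5 * M_sim + rnorm(200, sd = 2)   # hypothetical daughters' heights

# Fit a line; a zero slope would mean f is constant
fit <- lm(D_sim ~ M_sim)

# Row of the coefficient table for the slope: estimate, std. error, t value, p-value
slope_row <- summary(fit)$coefficients["M_sim", ]
print(slope_row)
p_value <- slope_row["Pr(>|t|)"]
```

A small p-value is evidence against a constant $f$, under the assumptions of the linear model.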

Let's create a plot of the heights of the mother/daughter pairs. The data is in an `R` package that can be downloaded from CRAN with the command:

```
install.packages("alr3")
```

If the package is not installed, then you will get an error message when calling `library(alr3)`.
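
If you would rather check before loading, base `R` provides `requireNamespace`, which returns `TRUE` or `FALSE` instead of raising an error. A minimal sketch (the variable names `pkg` and `installed` are just for illustration):

```r
pkg <- "alr3"

# TRUE if the package can be loaded, FALSE otherwise (no error is raised)
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  message(pkg, " is installed; library(", pkg, ") should succeed.")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") first.")
}
```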

```
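library(alr3)           # package containing the heights data
data(heights)           # load the mother/daughter heights
M = heights$Mheight     # mothers' heights
D = heights$Dheight     # daughters' heights
plot(M, D, pch = 23, bg = "red", cex = 2)   # scatterplot of the pairs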
```

```
plot(M, D, pch = 23, bg = "red", cex = 2)
height.lm = lm(D ~ M)                         # least-squares fit of D on M
abline(height.lm, lwd = 3, col = "yellow")    # overlay the fitted line
```

How do we find this line? With a model.

We might model the data as $$ D = \beta_0+ \beta_1 M + \varepsilon. $$

This model is *linear* in $\beta_1$, the coefficient of `M` (the mother's height); it is a *simple linear regression model*.

Another model: $$ D = \beta_0 + \beta_1 M + \beta_2 M^2 + \beta_3 F + \varepsilon $$ where $F$ is the height of the daughter's father.

Also linear (in the coefficients of $M,M^2,F$).
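
As a sketch of how such a model could be fit in `R`: the Pearson data used above has no father's height, so the example below uses synthetic data, and the names `M_sim`, `F_sim`, `D_sim` and all generating numbers are hypothetical. The point is only that the model stays linear in the coefficients, so `lm` can fit it.

```r
# Synthetic data: assumptions for illustration only
set.seed(3)
n <- 200
M_sim <- rnorm(n, mean = 62, sd = 2.5)   # hypothetical mothers' heights
F_sim <- rnorm(n, mean = 68, sd = 2.5)   # hypothetical fathers' heights (named F_sim because F means FALSE in R)
D_sim <- 10 + 0.4 * M_sim + 0.4 * F_sim + rnorm(n, sd = 2)

# I() makes ^ mean arithmetic squaring inside the formula
quad_fit <- lm(D_sim ~ M_sim + I(M_sim^2) + F_sim)

# One estimated coefficient per term: intercept, M, M^2, F
print(coef(quad_fit))
```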

Which model is better? We will need a tool to compare models... more to come later.
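
One common tool for comparing nested linear models is an $F$ test, available in `R` through `anova`. The sketch below is only a preview on synthetic data and is not necessarily the comparison used later in the course; all variable names and generating numbers are hypothetical.

```r
# Synthetic data: assumptions for illustration only
set.seed(4)
n <- 200
M_sim <- rnorm(n, mean = 62, sd = 2.5)   # hypothetical mothers' heights
F_sim <- rnorm(n, mean = 68, sd = 2.5)   # hypothetical fathers' heights
D_sim <- 10 + 0.4 * M_sim + 0.4 * F_sim + rnorm(n, sd = 2)

small_fit <- lm(D_sim ~ M_sim)                         # simple linear regression
large_fit <- lm(D_sim ~ M_sim + I(M_sim^2) + F_sim)    # adds M^2 and F

# F test: do the two extra terms reduce the residual sum of squares
# by more than chance alone would explain?
comparison <- anova(small_fit, large_fit)
print(comparison)
```

The comparison is only valid because the smaller model is a special case of the larger one (set $\beta_2 = \beta_3 = 0$).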

Our example here was rather simple: we only had one independent variable.

Independent variables are sometimes called *features* or *covariates*.

In practice, we often have many more than one independent variable.

This example considers the effect of right-to-work legislation (which varies by state) on various factors. A description of the data can be found here.

The variables are:

Income: income for a four-person family

COL: cost of living for a four-person family

PD: Population density

URate: rate of unionization in 1978

Pop: Population

Taxes: Property taxes in 1972

RTWL: right-to-work indicator

We are interested in the relationship between `RTWL` and `Income`. However, we should recognize that other variables have an effect on `Income`. Let's look at some of these relationships.

```
url = "http://www1.aucegypt.edu/faculty/hadi/RABE4/Data4/P005.txt"
rtw.table <- read.table(url, header=TRUE, sep='\t')
print(head(rtw.table))
```

A graphical way to visualize the relationship between `Income` and `RTWL` is the *boxplot*.

```
attach(rtw.table) # makes variables accessible in top namespace
boxplot(Income ~ RTWL, col='orange', pch=23, bg='red')
```
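
As an alternative to `attach`, formula-based functions such as `boxplot` accept a `data` argument, which keeps the column names out of the top namespace. A minimal sketch on a small synthetic table (the data frame `toy` and its values are made up; only the column names mirror the ones above):

```r
# Synthetic table: an assumption for illustration, not the right-to-work data
set.seed(5)
toy <- data.frame(
  RTWL   = rep(c(0, 1), each = 10),
  Income = c(rnorm(10, mean = 4500, sd = 300), rnorm(10, mean = 4200, sd = 300))
)

# Columns are looked up inside `toy`, so no attach() is needed
bp <- boxplot(Income ~ RTWL, data = toy, col = 'orange', pch = 23, bg = 'red')
```

The rest of these notes keep using `attach`, so the later cells refer to columns such as `COL` and `URate` directly.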

Another variable that may affect `Income` is the cost of living, `COL`. It also varies with right-to-work status.

```
boxplot(COL ~ RTWL, col='orange', pch=23, bg='red')
```

```
par(mfrow=c(2,2))
plot(URate, COL, pch=23, bg='red')
plot(URate, Income, pch=23, bg='red')
plot(URate, Pop, pch=23, bg='red')
plot(COL, Income, pch=23, bg='red')
```

`R` has a built-in function, `pairs`, that will try to display all pairwise relationships in a given dataset.

```
pairs(rtw.table, pch=23, bg='red')
```

(Note that `R` uses 1-based instead of 0-based indexing for rows and columns of arrays.)

```
print(rtw.table[27,])
pairs(rtw.table[-27,], pch=23, bg='red')
```
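
The two indexing forms used in the last cell can be seen on a tiny example. The data frame `toy` below is made up for illustration: a positive row index selects that row (counting from 1), and a negative row index drops that row rather than counting from the end.

```r
toy <- data.frame(id = c("a", "b", "c"), value = c(10, 20, 30))

first_row <- toy[1, ]        # row 1 is the first row (1-based indexing)
without_second <- toy[-2, ]  # negative index: every row except row 2

print(first_row)
print(without_second)
```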