*Key ideas:* Nested dependence structures

This notebook demonstrates how regression models for nested data can be estimated using GEE.

The data are from a study of GCSE (General Certificate of Seconday Education) exam scores in the UK. We know the school in which each student is enrolled. In addition, the schools are clustered into "local educational areas" (LEA). Thus we have two nested levels of clustering (schools within LEA's). The top level (LEA) can be handled by treating it as the GEE `group`

. The inner level (school) can be handled using a nested dependence structure.

All of the models here use a linear link and a Gaussian model for the data.

The data are available from: http://www.bristol.ac.uk/cmm/learning/support/datasets/

Here are all the libraries that we will need:

In [1]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
```

The data file has no column labels, so we attach them manually.

In [2]:

```
data = pd.read_csv("a-level-chemistry.txt", header=None, sep=' ')
data.columns = ["Board", "A-Score", "Gtot", "Gnum",
"Gender", "Age", "Inst_Type", "LEA",
"Institute", "Student"]
```

The GCSE has several components, and not all students take all of the components. The variable 'Gtot' in the data set is the sum of exam scores over all the components taken by a given student. We divide the score sum by the number of components taken, in order to obtain a mean score that is not inflated just because a student takes more exams.

In [3]:

```
data["GCSE"] = data["Gtot"] / data["Gnum"]
```

Our primary focus here is on the nested dependence structure. For comparison, we will also consider a single-level dependence structure determined only by the school. Since the school codes are recycled across the LEA's, we need to create a new variable that combines the two codes.

In [4]:

```
data["School"] = [str(x) + str(y) for x,y in
zip(data["LEA"], data["Institute"])]
us = set(data["School"])
us = {x: k for k,x in enumerate(list(us))}
data["School"] = [us[x] for x in data["School"]]
```

These are all the variables in the analysis.

In [5]:

```
qdata = data[["GCSE", "Gender", "Age", "LEA", "Institute", "School"]]
qdata = np.asarray(qdata)
```

In [6]:

```
family = sm.families.Gaussian()
```

The initial model considers only the school effect, and does not account for the fact that the schools are nested in LEA's.

In [7]:

```
ex = sm.cov_struct.Exchangeable()
model1 = sm.GEE.from_formula("GCSE ~ Age + Gender", "LEA",
data=data, family=family, cov_struct=ex)
result1 = model1.fit()
print result1.summary()
```

The dependence among subjects within a school is fairly weak:

In [8]:

```
print ex.summary()
```

Next we fit the model using the nested dependence structure. Since the iterations are somewhat slow, we limit the number of iterations. This may result in a warning message.

In [9]:

```
ne = sm.cov_struct.Nested()
model2 = sm.GEE.from_formula("GCSE ~ Age + Gender", "LEA",
data=data, family=family, cov_struct=ne,
dep_data=data["Institute"])
result2 = model2.fit() #maxiter=10)
print result2.summary()
```

The nested dependence structure captures more of the dependence in the data compared to the exchangeable dependence structure. Component 1 is the variation among LEA's. It is fairly weak, suggesting that the schools within a LEA are fairly heterogeneous. The variance among schools within an LEA is stronger, suggesting that students within a school share similar characteristics. Variation among students (residual variance in the table below) is the greatest contributor to the overall variance.

In [10]:

```
print ne.summary()
```