*Key ideas:* GEE, ordinal data

This notebook demonstrates how ordinal regression models can be estimated using GEE. Ordinal regression models have a dependent variable that takes on finitely many values that are ordered.

The data are from a study of A-level exam scores in the UK. A student receives one of 6 possible ordered scores on the exam, so the data are ordinal. The students are clustered based on the school that they attended prior to taking the exam. In addition to gender and age, we also have access to the student's score on a previous exam. We can use regression analysis to assess how these three predictors are related to the score on the A-level exam. GEE allows us to do this while accounting for the ordered structure of the response, and the (presumed) non-independence among students who attend the same school. This non-independence could be driven by socioeconomic and other factors that cluster within neighborhoods or schools.

The data are available from: http://www.bristol.ac.uk/cmm/learning/support/datasets/

First we import all the libraries that we will need:

In [15]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
```

`momeg.zip`

, which contains several files. Here we will focus on the chemistry scores, contained in the file `a-level-chemistry.txt`

.

In [16]:

```
data = pd.read_csv("a-level-chemistry.txt", header=None, sep=' ')
data.columns = ["Board", "A-Score", "Gtot", "Gnum",
"Gender", "Age", "Inst_Type", "LEA",
"Institute", "Student"]
print data.head()
print data.describe()
```

Here is a plot of the marginal distribution of exam scores.

In [17]:

```
sc = np.unique(data["A-Score"])
sc_n = [sum(data["A-Score"] == x) for x in sc]
plt.bar(sc-0.25, sc_n, 0.5)
plt.gca().set_xticks(sc)
plt.gca().set_xticklabels([str(x) for x in sc])
plt.xlim(-1, 11)
plt.xlabel("A-level score", size=17)
plt.ylabel("Number of students", size=17)
```

Out[17]:

We will use the student's score on the GCSE as one predictor of the A-level score. This can be viewed as a control for baseline achievement. The GCSE is a more basic exam than the A-level exam, but there should be a positive association between GCSE and A-level scores.

The GCSE has several components, and not all students take all of the components. The variable 'Gtot' in the data set is the sum of scores over all the components taken by a given student. We divide the sum by the number of components taken to obtain a mean score that is not inflated just because a student takes more exams.

In [18]:

```
data["GCSE"] = data["Gtot"] / data["Gnum"]
```

In [19]:

```
data["School"] = [str(x) + str(y) for x,y in
zip(data["LEA"], data["Institute"])]
us = set(data["School"])
us = {x: k for k,x in enumerate(list(us))}
data["School"] = [us[x] for x in data["School"]]
```

These are all the variables in the analysis.

In [20]:

```
endog = data["A-Score"]
exog = data[["Gender", "Age", "GCSE"]]
groups = data["School"]
```

We will convert the GCSE score to a Z-score since we don't really care about the scale.

In [21]:

```
mn = exog.loc[:, "GCSE"].mean()
sd = exog.loc[:, "GCSE"].std()
exog.loc[:, "GCSE"] = (exog.loc[:, "GCSE"] - mn) / sd
```

In [22]:

```
ind = sm.cov_struct.Independence()
model0 = sm.OrdinalGEE(endog, exog, groups, cov_struct=ind)
result0 = model0.fit()
print result0.summary()
print ind.summary()
```

In [23]:

```
ex = sm.cov_struct.Exchangeable()
model1 = sm.OrdinalGEE(endog, exog, groups, cov_struct=ex)
result1 = model1.fit(maxiter=5)
print result1.summary()
print ex.summary()
```

The dependence is weak, perhaps due to the oversimplicity of the dependence model:

In [24]:

```
ev = [{"Gender": 1}, {"Gender": 0}]
fig = result1.plot_distribution(None, ev)
ax = fig.get_axes()[0]
leg = plt.figlegend(ax.get_lines(), ["Females", "Males"], "upper center", ncol=2)
leg.draw_frame(False)
```

In [25]:

```
ev = [{"GCSE": -1, "Gender": 1}, {"GCSE": 1, "Gender": 0}]
fig = result1.plot_distribution(None, ev)
ax = fig.get_axes()[0]
leg = plt.figlegend(ax.get_lines(), ["-1SD GCSE", "+1SD GCSE"], "upper center", ncol=2)
leg.draw_frame(False)
```

In [29]:

```
dep = sm.cov_struct.GlobalOddsRatio("ordinal")
```

In [ ]:

```
model2 = sm.OrdinalGEE(endog, exog, groups, cov_struct=dep)
result2 = model2.fit(start_params=result1.params, maxiter=2)
print result2.summary()
```

In [ ]:

```
print dep.summary()
```