*Key ideas:* GEE, ordinal data

This notebook demonstrates how ordinal regression models can be estimated using GEE. Ordinal regression models have a dependent variable that takes on finitely many values that are ordered.

The data are from a study of A-level exam scores in the UK. A student receives one of 6 possible ordered scores on the exam, so the data are ordinal. The students are clustered based on the school that they attended prior to taking the exam. In addition to gender and age, we also have access to the student's score on a previous exam. We can use regression analysis to assess how these three predictors are related to the score on the A-level exam. GEE allows us to do this while accounting for the ordered structure of the response, and the (presumed) non-independence among students who attend the same school. This non-independence could be driven by socioeconomic and other factors that cluster within neighborhoods or schools.

The data are available from: http://www.bristol.ac.uk/cmm/learning/support/datasets/

First we import all the libraries that we will need:

In [15]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
```

The data are available from the link http://www.bristol.ac.uk/cmm/learning/support/datasets/ as the archive `momeg.zip`

, which contains several files. Here we will focus on the chemistry scores, contained in the file `a-level-chemistry.txt`

.

In [16]:

```
data = pd.read_csv("a-level-chemistry.txt", header=None, sep=' ')
data.columns = ["Board", "A-Score", "Gtot", "Gnum",
"Gender", "Age", "Inst_Type", "LEA",
"Institute", "Student"]
print data.head()
print data.describe()
```

Here is a plot of the marginal distribution of exam scores.

In [17]:

```
sc = np.unique(data["A-Score"])
sc_n = [sum(data["A-Score"] == x) for x in sc]
plt.bar(sc-0.25, sc_n, 0.5)
plt.gca().set_xticks(sc)
plt.gca().set_xticklabels([str(x) for x in sc])
plt.xlim(-1, 11)
plt.xlabel("A-level score", size=17)
plt.ylabel("Number of students", size=17)
```

Out[17]:

We will use the student's score on the GCSE as one predictor of the A-level score. This can be viewed as a control for baseline achievement. The GCSE is a more basic exam than the A-level exam, but there should be a positive association between GCSE and A-level scores.

The GCSE has several components, and not all students take all of the components. The variable 'Gtot' in the data set is the sum of scores over all the components taken by a given student. We divide the sum by the number of components taken to obtain a mean score that is not inflated just because a student takes more exams.

In [18]:

```
data["GCSE"] = data["Gtot"] / data["Gnum"]
```

There are two levels of clustering in these data. The "LEA" is a grouping of schools in a local area, and the "Institute" variable is the actual school that a student attends. Institute codes are recycled across the LEA's, so here we create a composite code that reflects both the LEA and the institute.

In [19]:

```
data["School"] = [str(x) + str(y) for x,y in
zip(data["LEA"], data["Institute"])]
us = set(data["School"])
us = {x: k for k,x in enumerate(list(us))}
data["School"] = [us[x] for x in data["School"]]
```

These are all the variables in the analysis.

In [20]:

```
endog = data["A-Score"]
exog = data[["Gender", "Age", "GCSE"]]
groups = data["School"]
```

We will convert the GCSE score to a Z-score since we don't really care about the scale.

In [21]:

```
mn = exog.loc[:, "GCSE"].mean()
sd = exog.loc[:, "GCSE"].std()
exog.loc[:, "GCSE"] = (exog.loc[:, "GCSE"] - mn) / sd
```

As a basic initial model we will treat all the indicators as being independent. Although this is clearly not correct, the GEE estimates and standard errors are still meaningful.

In [22]:

```
ind = sm.cov_struct.Independence()
model0 = sm.OrdinalGEE(endog, exog, groups, cov_struct=ind)
result0 = model0.fit()
print result0.summary()
print ind.summary()
```

Next we fit the same model with an exhangeable covariance structure. This dependence structure treats all binary indicators from a given cluster as having the same pairwise correlation. Since some indicators are "between subjects" and others are "within subjects", this may not reflect the true dependence very well. Here we limit the iterations so there will be a convergence warning.

In [23]:

```
ex = sm.cov_struct.Exchangeable()
model1 = sm.OrdinalGEE(endog, exog, groups, cov_struct=ex)
result1 = model1.fit(maxiter=5)
print result1.summary()
print ex.summary()
```

The dependence is weak, perhaps due to the oversimplicity of the dependence model:

Next we plot the conditional probability distributions of the A-level exam scores for females and males. Age and the GCSE score are held fixed at their mean values.

In [24]:

```
ev = [{"Gender": 1}, {"Gender": 0}]
fig = result1.plot_distribution(None, ev)
ax = fig.get_axes()[0]
leg = plt.figlegend(ax.get_lines(), ["Females", "Males"], "upper center", ncol=2)
leg.draw_frame(False)
```

Here is a plot of the fitted distribution of A-level exam scores for females of average age, when the GCSE exam score is either 1 standard deviation below or 1 standard deviation above the mean.

In [25]:

```
ev = [{"GCSE": -1, "Gender": 1}, {"GCSE": 1, "Gender": 0}]
fig = result1.plot_distribution(None, ev)
ax = fig.get_axes()[0]
leg = plt.figlegend(ax.get_lines(), ["-1SD GCSE", "+1SD GCSE"], "upper center", ncol=2)
leg.draw_frame(False)
```

Now we can try a more sophisticated dependence model which accounts for the differing correlations between within-subject and between-subject indicators (as well as for different degrees of correlation among the within-subject indicators).

In [29]:

```
dep = sm.cov_struct.GlobalOddsRatio("ordinal")
```

Here we fit the model again using the new dependence structure. The parameter estimates from the exchangeable model can be used as starting values. This currently takes quite a while to run to convergence, so here we limit the number of iterations. Since we limit the number of iterations, we get a convergence warning message. We can ignore this message here since we have done this deliberately.

In [ ]:

```
model2 = sm.OrdinalGEE(endog, exog, groups, cov_struct=dep)
result2 = model2.fit(start_params=result1.params, maxiter=2)
print result2.summary()
```

We have already captured much more of the dependence structure than when working with the exchangeable correlation. However due to the large sample size, this does not impact the parameter estimates or standard errors very much.

In [ ]:

```
print dep.summary()
```