In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')  # replaces pd.set_option('display.mpl_style', 'default'), which was removed from pandas
%matplotlib inline
```

In [2]:

```
df = pd.read_csv('GPA.txt')
```

In [3]:

```
df['GPA'].value_counts()
```

Out[3]:

This tells us something about the underlying algorithm: 4.0 is *by far* the most common score, so this algorithm doesn't output a smooth range of scores. The next most common scores are the other round numbers: 0.0, 1.0, 2.0, and 3.0. Interesting (weird?).

I'd be interested to know what's going on here.
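As a rough sanity check on that claim, here's one way to measure what share of scores are exactly round numbers. The values below are invented stand-ins for `df['GPA']`, since `GPA.txt` isn't included here:

```python
import pandas as pd

# Invented sample scores -- a stand-in for df['GPA'] from GPA.txt
gpa = pd.Series([4.0, 4.0, 3.7, 3.0, 2.0, 3.33, 4.0, 1.0, 0.0, 2.5])

# Fraction of scores that are exactly 0.0, 1.0, 2.0, 3.0, or 4.0
round_share = gpa.isin([0.0, 1.0, 2.0, 3.0, 4.0]).mean()
print(round_share)
```

On the real data, a high `round_share` would confirm that the algorithm snaps most scores to round numbers rather than producing a smooth distribution.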

Before I start, how much data do we have available? Let's see!

In [4]:

```
totals_by_author_count = df.groupby('AuthorCount').size()  # number of rows per author count
totals_by_author_count.plot(kind='bar', figsize=(13, 3))
```

Out[4]:

So let's just look at author counts up to 10, say. (also, what's an author count of 0? Interesting.)

There are a ton of 4s, so this is obviously a score that this GPA algorithm thinks is important. Notice that as we go down, the percentage of 4s allocated decreases *dramatically*.

In [5]:

```
fours_by_author_count = df.groupby('AuthorCount')['GPA'].apply(lambda gpa: (gpa == 4).mean())
fours_by_author_count[:11].plot(kind='bar', figsize=(13, 3))
```

Out[5]:

Let's look at the median!

In [6]:

```
medians_by_author_count = df.groupby('AuthorCount')['GPA'].median()
medians_by_author_count[:11].plot(kind='bar', figsize=(13, 3))
```

Out[6]:

This is pretty interesting. It's quite different from the "percentage of 4s" graph -- it's much, much flatter. The highest median GPA is about 3.4, and the lowest is around 2.7.
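One plausible reason the medians are so flat: the median doesn't move at all until the mass at 4.0 crosses 50% of a group, while the "percentage of 4s" tracks it directly. A tiny illustration with two invented groups:

```python
import numpy as np

# Two made-up groups with very different shares of 4s
few_fours = np.array([2.0, 2.5, 3.0, 3.2, 4.0])   # 20% fours
many_fours = np.array([2.0, 2.5, 3.0, 4.0, 4.0])  # 40% fours

# Doubling the share of 4s leaves the median untouched: both are 3.0
print(np.median(few_fours), np.median(many_fours))
```

So two author counts can have wildly different fractions of 4s and still show nearly identical medians.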

I wonder what happens if we find the median GPA, excluding all the 4s?

In [7]:

```
medians_by_author_count_without_4 = df.groupby('AuthorCount')['GPA'].apply(lambda gpa: gpa[gpa != 4].median())
medians_by_author_count_without_4[:11].plot(kind='bar', figsize=(13, 3))
```

Out[7]:

This distribution is super different! It's hard to tell from the graph, but the maximum is actually at 4 authors, and then it drops off around 9 or 10. So it seems like the author count strongly influences whether your GPA is 4, but has a much weaker effect on GPAs below 4.

Let's look more closely at the GPA distributions for a few different author counts:

In [8]:

```
def draw_hist(sub_df, ax=None, title=None):
    gpa_col = sub_df['GPA']
    ax.set_title(title)
    gpa_col[gpa_col != 4].hist(bins=20, ax=ax)

_, axes = plt.subplots(3, 2, figsize=(20, 10))
draw_hist(df[df.AuthorCount == 1], ax=axes[0][0], title="Author count 1")
draw_hist(df[df.AuthorCount == 2], ax=axes[1][0], title="Author count 2")
draw_hist(df[df.AuthorCount == 3], ax=axes[2][0], title="Author count 3")
draw_hist(df[df.AuthorCount == 4], ax=axes[0][1], title="Author count 4")
draw_hist(df[df.AuthorCount == 5], ax=axes[1][1], title="Author count 5")
draw_hist(df[df.AuthorCount == 6], ax=axes[2][1], title="Author count 6")
```

Hmm. I can't draw any really good conclusions from this. Let's try plotting a few GPA percentiles, across numbers of authors:

In [9]:

```
def get_percentile(p):
    series = df.groupby('AuthorCount')['GPA'].apply(lambda gpa: np.percentile(gpa[gpa != 4], p))[:11]
    series.name = str(p)
    return series

ax = plt.axes()
get_percentile(90).plot(kind='line', figsize=(13, 3), ax=ax)
get_percentile(80).plot(kind='line', figsize=(13, 3), ax=ax)
get_percentile(70).plot(kind='line', figsize=(13, 3), ax=ax)
get_percentile(60).plot(kind='line', figsize=(13, 3), ax=ax)
get_percentile(50).plot(kind='line', figsize=(13, 3), ax=ax)
```

Out[9]:

These all look pretty flat to me! The median is certainly highest at 4 authors, but the 90th percentile is very, very flat across different numbers of authors.
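As an aside, `np.percentile` also accepts a list of percentiles, so the five curves above could be computed in one call per group. A sketch on invented scores (not the real data):

```python
import numpy as np

# Made-up scores standing in for one author-count group's GPAs
gpa = np.array([2.0, 2.5, 3.0, 3.3, 3.6, 3.8, 4.0, 4.0])
non_fours = gpa[gpa != 4]  # drop the spike at 4.0, as above

# All five percentiles of the non-4 scores in a single call
pcts = np.percentile(non_fours, [50, 60, 70, 80, 90])
print(pcts)
```

That would make it easy to build one DataFrame of percentiles per author count and plot all the lines at once.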