On Goodreads, the popular social site for readers, you can rate books and thereby contribute to each book's overall average rating. Most people use the social aspects of Goodreads to decide what to read next: they see what their friends like or recommend, and books land on the "to-read" pile based on those relationships and tastes. But the average ratings are still there; every book page shows a rating between 1 and 5 stars. The problem with these ratings is that they are heavily clustered around the 4.0 mark, which makes them nearly useless for judging what all the people who rated a book actually thought of it. I wanted to see if I could squeeze some more information out of these averages.
The first step was to obtain a sample of ratings. I used the [Best Books Ever](https://www.goodreads.com/list/show/1.Best_Books_Ever) list on Goodreads and extracted the ratings of these books. This gave me a sample of 20899 ratings, which I will investigate in this article.
I will use [Pandas](http://pandas.pydata.org) to analyze the data and create pretty graphs.
After loading the sample into a DataFrame object, let's take a look at how the ratings are distributed.
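In case you want to follow along, here is a minimal sketch of that loading step. The file name ratings.csv and the column layout are my assumptions, inferred from the code used later in this article.

import pandas as pd

# Assumed layout: one row per book with its average 'rating',
# the total vote 'count' and the per-star counts 'r1' to 'r5'
df = pd.read_csv('ratings.csv')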
df['rating'].hist()
As you can see, the distribution is very uneven: most books receive extremely high ratings. We don't see a bimodal distribution over the full sample. Bimodal distributions are common in rating systems, because many people feel strongly about the items they rate and pick the lowest or the highest possible rating, depending on their opinion and emotions. On Goodreads, people mostly seem to like everything.
Let's look at how the single ratings are distributed. For my sample I also retrieved the rating distributions through the Goodreads API, so I know how many people gave a book 5 stars, 4 stars, and so on.
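If you want to reproduce this step: the Goodreads book API packs the distribution into a single rating_dist string. Here is a minimal sketch of unpacking it; the exact format shown in the comment is an assumption based on what the API returned at the time of writing.

# Unpack a Goodreads rating_dist string into per-star counts
# (assumed format: "5:1234|4:567|3:89|2:10|1:2|total:1902")
def parse_rating_dist(dist):
    counts = {}
    for part in dist.split('|'):
        stars, votes = part.split(':')
        if stars != 'total':
            counts['r' + stars] = int(votes)
    return counts

parse_rating_dist("5:1234|4:567|3:89|2:10|1:2|total:1902")['r5']
1234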
# Total number of votes per star level across the whole sample
sums = [x.sum() for x in (df.r5, df.r4, df.r3, df.r2, df.r1)]
# Average number of votes per book at each star level
rl = [x / float(len(df)) for x in sums]
rl.reverse()  # order from 1 star up to 5 stars
s = pd.Series(rl, index=list("12345"))
s.plot()
As you can see, there are about twice as many 5 star ratings as there are 3 star ratings.
What I take from the analysis so far is that most people who rate a book think very highly of it. If they pick their books based on recommendations from friends and reviews from strangers on Goodreads, that is quite a success for Goodreads!
Now, let's see what we can do with the average ratings. If you add up all the votes ("number of 5 star ratings * 5 + number of 4 star ratings * 4 + [and so on]") and divide by the total number of votes, you end up with the average as displayed on Goodreads. It's the most basic algorithm there is, and many sites use it. Is it a good way to present the information? I don't think so, especially since Goodreads rounds an average of 3.5 up to a displayed 4 stars. That's not very helpful if you want to see what all the readers of a book think about it.
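To make this concrete: with the per-star counts in our sample, the displayed average can be reconstructed in one line (assuming, as the rest of this article does, that the count column holds the total number of votes).

# Recompute the displayed average from the per-star counts
# (the leading 5.0 keeps the division in floating point)
avg = (5.0*df.r5 + 4*df.r4 + 3*df.r3 + 2*df.r2 + df.r1) / df['count']
# This should match df['rating'] up to rounding
(avg - df['rating']).abs().max()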
But there is still valuable information in there; we just need a way to stretch this huge clump of ratings so we can see it. One way to do this is to group the ratings into a fixed number of bins. Each bin has a lower and an upper boundary, and the average rating of a book determines which bin it falls into.
To start, I picked 11 bins, numbered 0 to 10. They are wider at the lower and upper ends and narrower in the middle, where most of the ratings are.
Another reason to base this normalization attempt on the average rating alone is that the average is available wherever a rating is displayed, while using the vote distribution would require an API call for every book.
bins = [
[0, 3.3], # 0
[3.31, 3.6], # 1
[3.61, 3.7], # 2
[3.71, 3.8], # 3
[3.81, 3.9], # 4
[3.91, 4.0], # 5
[4.01, 4.1], # 6
[4.11, 4.2], # 7
[4.21, 4.3], # 8
[4.31, 4.5], # 9
[4.51, 5] # 10
]
The index of the bin determines the normalized rating. For example, a book with an average rating of 3.69 falls into the third bin, because 3.69 is higher than 3.61 and lower than 3.70. Since the index starts at 0, the normalized rating for 3.69 is 2. In other words, a book that Goodreads displays with 4 stars actually falls into the lower end of the scale.
Next, we calculate the normalized rating for every book in our sample set.
def get_bin(rating):
    # Return the index (0-10) of the bin this average rating falls into
    for index, (lower, upper) in enumerate(bins):
        if lower <= rating <= upper:
            return index
    return 0

df['new'] = df['rating'].apply(get_bin)
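As a quick sanity check, the worked example from above:

get_bin(3.69)
2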
And then we see how this changes our rating distribution.
df['new'].hist()
Much better already! There is still a huge spike at the higher end, but the mean is much closer to the middle now.
df['rating'].mean()
4.0187157280252652
df['new'].mean()
5.569453083879611
At this point I suspect that books with just a few votes might skew these results. If most people tend to rate books very favorably, it's plausible that books collect a lot of high ratings first. Let's see if that's true.
The threshold of 200 votes is one I chose arbitrarily.
df[df['count']<200]['new'].hist()
Here we looked at the distribution of normalized ratings for books with fewer than 200 votes. As you can see, a huge majority of these books have an average rating of 4.51 or higher, based on the bins I defined above.
Now let's check what the distribution looks like for books with 200 or more votes.
df[df['count']>=200]['new'].hist()
That's more like it. The huge spike at the end is much smaller now, and the means look better, meaning lower, too.
df[df['count']>=200]['new'].mean()
5.3593405939415577
df[df['count']>=200]['rating'].mean()
3.9885728738915662
Still, this only fixes the graph, not the normalized ratings. When the vote distribution is available, we could apply weights, for example counting each 1 star vote twice. But as I mentioned above, retrieving the vote distribution requires an API call for every book. Another option would be to define separate sets of bins based on how many votes a book has: if a book has fewer than 200 votes, we could use much smaller bins for ratings above 4.0 and larger ones for ratings below.
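As a sketch of the first option, here is what counting every 1 star vote twice could look like, using the per-star columns from the sample above. Both the double weight and the column names are just the assumptions from earlier, not a recommendation.

# Weighted average in which every 1 star vote is counted twice,
# pulling inflated averages down
def weighted_rating(row):
    votes = 5*row.r5 + 4*row.r4 + 3*row.r3 + 2*row.r2 + 2*row.r1
    total = row.r5 + row.r4 + row.r3 + row.r2 + 2*row.r1
    return votes / float(total)

df['weighted'] = df.apply(weighted_rating, axis=1)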