Exploring the review scores data

There's a lot going on in this dataset. This notebook follows my intuitions in an attempt to get a sense of the data.

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import savgol_filter

pd.set_option('display.precision', 2)
np.set_printoptions(precision=2)

con = sqlite3.connect('../pitchfork.db')
reviews = pd.read_sql('SELECT * FROM reviews', con)
genres = pd.read_sql('SELECT * FROM genres', con)
con.close()

Score and Best New Music distributions

In [2]:
print('\nAverages:')
print(np.mean(reviews[['best_new_music', 'score']]))

print('\nStandard Deviation:')
print(np.std(reviews[['best_new_music', 'score']]))

g = reviews.groupby('score')
info = g['best_new_music'].agg(['sum','count']).reset_index()

plt.plot(info['score'], savgol_filter(info['count'], 5, 1), label = 'All Reviews') 
plt.plot(info['score'], savgol_filter(info['sum'], 5, 1), label = "Best New Music") 
plt.legend(loc = 'best')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()
Averages:
best_new_music    0.05
score             7.01
dtype: float64

Standard Deviation:
best_new_music    0.22
score             1.29
dtype: float64

Scores are roughly normally distributed, with a slight negative skew. The average release gets a score of about 7.0, and about 5% of releases are Best New Music, though this figure is artificially low because it includes releases from before Best New Music existed. The distribution drops off pretty sharply at about 8.0; no wonder Pitchfork has a special page for 8.0+ reviews...
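To put rough numbers on those claims, here is a quick sketch: it measures the skew of the score distribution and recomputes the Best New Music share using only reviews published on or after the first BNM-flagged review. It assumes pub_date is a 'YYYY-MM-DD' string (as in the query output further down), so plain string comparison orders dates correctly.

In [ ]:
from scipy.stats import skew

print('Skew of scores:', skew(reviews['score'].dropna()))

# restrict to the era in which Best New Music existed;
# pub_date is assumed to be a 'YYYY-MM-DD' string, so >= compares dates correctly
first_bnm = reviews.loc[reviews['best_new_music'] == 1, 'pub_date'].min()
bnm_era = reviews.loc[reviews['pub_date'] >= first_bnm]
print('First Best New Music review:', first_bnm)
print('Best New Music share since then:', bnm_era['best_new_music'].mean())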

There is also effectively no Best New Music below 8.0, and most releases above 8.5 are categorized as Best New Music. Makes you kind of feel bad for all the 8.3s and 8.4s that narrowly missed the cutoff...
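A rough way to check that is to bucket scores and look at the Best New Music rate within each bucket; the cut points at 8.0 and 8.5 below are just the ones suggested by the plot.

In [ ]:
# share of reviews tagged Best New Music within each score bucket
buckets = pd.cut(reviews['score'], bins=[0, 7.9, 8.4, 10],
                 labels=['below 8.0', '8.0-8.4', '8.5 and up'],
                 include_lowest=True)
print(reviews.groupby(buckets)['best_new_music'].mean())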

Sanity check: what are the 10.0 albums that are not best new music?

In [3]:
idx = (reviews.best_new_music == 0) & (reviews.score == 10.0)
reviews.loc[idx, ['artist', 'title', 'pub_date'] ]
Out[3]:
artist title pub_date
200 bob dylan blood on the tracks 2016-10-30
355 brian eno another green world 2016-09-18
451 stevie wonder songs in the key of life 2016-08-21
530 nina simone in concert 2016-07-30
654 neil young tonight's the night 2016-06-26
706 kate bush hounds of love 2016-06-12
857 prince sign "o" the times 2016-04-30
858 prince 1999 2016-04-30
861 prince, the revolution purple rain 2016-04-29
862 prince dirty mind 2016-04-29
1207 david bowie "heroes" 2016-01-22
1209 david bowie low 2016-01-22
5965 can tago mago [40th anniversary edition] 2011-12-09
6219 nirvana nevermind [20th anniversary edition] 2011-09-27
8619 the beatles the beatles 2009-09-10
8621 the beatles abbey road 2009-09-10
8624 the beatles rubber soul 2009-09-09
8625 the beatles revolver 2009-09-09
8626 the beatles sgt. pepper's lonely hearts club band 2009-09-09
8627 the beatles magical mystery tour 2009-09-09
9530 r.e.m. murmur [deluxe edition] 2008-11-24
10206 otis redding otis blue: otis redding sings soul [collector'... 2008-05-09
10831 joy division unknown pleasures 2007-10-29
11306 sonic youth daydream nation: deluxe edition 2007-06-13
12632 wire pink flag 2006-05-05
13166 bruce springsteen born to run: 30th anniversary edition 2005-11-18
13357 neutral milk hotel in the aeroplane over the sea 2005-09-26
13726 dj shadow endtroducing... [deluxe edition] 2005-06-09
14437 pavement crooked rain, crooked rain: la's desert origins 2004-10-25
14557 the clash london calling: 25th anniversary legacy edition 2004-09-21
15024 boards of canada music has the right to children 2004-04-26
15107 james brown live at the apollo [expanded edition] 2004-03-30
15259 various artists no thanks!: the 70s punk rebellion 2004-02-10
15383 television marquee moon 2003-12-09
15913 glenn branca the ascension 2003-06-19
17009 elvis costello & the attractions this year's model 2002-05-09
17066 wilco yankee hotel foxtrot 2002-04-21
17199 ...and you will know us by the trail of dead source tags and codes 2002-02-28
17544 john coltrane the olatunji concert: the last live recording 2001-10-15
17900 radiohead kid a 2000-10-02
18071 pink floyd animals 2000-04-25
18219 bonnie prince billy i see a darkness 1999-09-30

All of these are either reissued classics that somehow missed the "best new reissue" tag, or highly-rated albums reviewed before the Best New Music label was introduced.
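As a hedged sanity check of that explanation, this sketch splits the non-BNM 10.0s by whether they were published before the first Best New Music review (again assuming the 'YYYY-MM-DD' pub_date strings compare like dates):

In [ ]:
first_bnm = reviews.loc[reviews['best_new_music'] == 1, 'pub_date'].min()
tens = reviews.loc[idx, 'pub_date']
print('Published before Best New Music existed:', (tens < first_bnm).sum())
print('Published on/after that date (presumably reissues):', (tens >= first_bnm).sum())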

Ratings by Genre

In [4]:
genre_data = pd.merge(reviews[['reviewid','score']], genres, 
                  on = 'reviewid')

g = genre_data.groupby('genre')
table = g['score'].agg(['count', 'mean', 'std']).reset_index()

# plot the average at each level of count
avgline = table.groupby('count')['mean'].mean().reset_index()
avgline['mean'] = savgol_filter(avgline['mean'], 5, 1)
plt.plot(avgline['count'], avgline['mean'],'k--')

plt.plot(table['count'],table['mean'],'o', alpha = 1)

# label each genre, offset above or below its point depending on whether it
# sits above or below the trend line, to reduce overlap
for j, row in table.iterrows():
    curr_avg = float(avgline.loc[avgline['count'] == row['count'], 'mean'])
    jitter = np.random.uniform(0.1, high = 0.5)
    if row['mean'] < curr_avg: jitter*= -1.0
    
    plt.plot([row['count'], row['count']], [row['mean'], row['mean'] + jitter], 'k-', alpha = 0.1)
    plt.text(row['count'], row['mean'] + jitter, row['genre'], 
             ha = 'center', va = 'center')

plt.ylabel('Average Score')
plt.xlabel('Number of Reviews')
plt.ylim([5, 10])
plt.show()

Ratings by Author

In [5]:
g = reviews.groupby('author')
table = g.score.agg(('mean','std','count'))
table['ratio'] = table['mean'] / table['count']

# remove authors with only a handful of reviews
table = table.loc[table['count'] > 15]

# plot the average at each level of count
avgline = table.groupby('count')['mean'].mean().reset_index()
avgline['mean'] = savgol_filter(avgline['mean'], 5, 1)
plt.plot(avgline['count'], avgline['mean'],'k--')

# plot each author as a point
plt.plot(table['count'], table['mean'],'o', alpha = 0.5)

# identify some standouts
items = [
         table['mean'].idxmax(), 
         table['mean'].idxmin(),
         table['count'].idxmax()
        ]

for idx in items:
    
    x, y = table.loc[idx, 'count'], table.loc[idx, 'mean']
    curr_avg = float(avgline.loc[avgline['count'] == x, 'mean'])
        
    jitter = np.random.uniform(0.1, high = 0.5)
    if y < curr_avg: jitter*= -1.0
    
    plt.plot([x, x], [y, y + jitter], 'k-', alpha = 0.1)
    plt.text(x, y + jitter, idx, ha = 'center', va = 'center')

plt.ylabel('Average Score')
plt.xlabel('Number of Reviews')
plt.ylim([5, 10])
plt.show()