Exploring the review scores data

There's a lot going on in this dataset. This notebook follows my intuitions in an attempt to get a sense of the data.

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import savgol_filter

pd.set_option('display.precision', 2)
np.set_printoptions(precision=2)

con = sqlite3.connect('../pitchfork.db')
reviews = pd.read_sql('SELECT * FROM reviews', con)
genres = pd.read_sql('SELECT * FROM genres', con)
con.close()

Score and Best New Music distributions

In [2]:
print('\nAverages:')
print(np.mean(reviews[['best_new_music', 'score']]))

print('\nStandard Deviation:')
print(np.std(reviews[['best_new_music', 'score']]))

g = reviews.groupby('score')
info = g['best_new_music'].agg(['sum','count']).reset_index()

plt.plot(info['score'], savgol_filter(info['count'], 5, 1), label = 'All Reviews') 
plt.plot(info['score'], savgol_filter(info['sum'], 5, 1), label = "Best New Music") 
plt.legend(loc = 'best')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()
Averages:
best_new_music    0.05
score             7.01
dtype: float64

Standard Deviation:
best_new_music    0.22
score             1.29
dtype: float64

Scores are roughly normally distributed, with a slight negative skew. The average release gets a score of about 7.0, and about 5% of releases are Best New Music, though this figure is artificially low because it includes releases from before Best New Music existed. The distribution drops off pretty sharply at about 8.0; no wonder Pitchfork has a special page for 8.0+ reviews...
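To put rough numbers on those claims, here is a quick sketch: it measures the skew of the score distribution and recomputes the Best New Music share using only reviews published on or after the first BNM-flagged review. It assumes pub_date is a 'YYYY-MM-DD' string (as in the query output further down), so plain string comparison orders dates correctly.

In [ ]:
from scipy.stats import skew

print('Skew of scores:', skew(reviews['score'].dropna()))

# restrict to the era in which Best New Music existed;
# pub_date is assumed to be a 'YYYY-MM-DD' string, so >= compares dates correctly
first_bnm = reviews.loc[reviews['best_new_music'] == 1, 'pub_date'].min()
bnm_era = reviews.loc[reviews['pub_date'] >= first_bnm]
print('First Best New Music review:', first_bnm)
print('Best New Music share since then:', bnm_era['best_new_music'].mean())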

There is also effectively no Best New Music below 8.0, and most releases above 8.5 are categorized as Best New Music. Makes you kind of feel bad for all the 8.3s and 8.4s that narrowly missed the cutoff...
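A rough way to check that is to bucket scores and look at the Best New Music rate within each bucket; the cut points at 8.0 and 8.5 below are just the ones suggested by the plot.

In [ ]:
# share of reviews tagged Best New Music within each score bucket
buckets = pd.cut(reviews['score'], bins=[0, 7.9, 8.4, 10],
                 labels=['below 8.0', '8.0-8.4', '8.5 and up'],
                 include_lowest=True)
print(reviews.groupby(buckets)['best_new_music'].mean())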

Sanity check: what are the 10.0 albums that are not best new music?

In [3]:
idx = (reviews.best_new_music == 0) & (reviews.score == 10.0)
reviews.loc[idx, ['artist', 'title', 'pub_date'] ]
Out[3]:
artist title pub_date
200 bob dylan blood on the tracks 2016-10-30
355 brian eno another green world 2016-09-18
451 stevie wonder songs in the key of life 2016-08-21
530 nina simone in concert 2016-07-30
654 neil young tonight's the night 2016-06-26
706 kate bush hounds of love 2016-06-12
857 prince sign "o" the times 2016-04-30
858 prince 1999 2016-04-30
861 prince, the revolution purple rain 2016-04-29
862 prince dirty mind 2016-04-29
1207 david bowie "heroes" 2016-01-22
1209 david bowie low 2016-01-22
5965 can tago mago [40th anniversary edition] 2011-12-09
6219 nirvana nevermind [20th anniversary edition] 2011-09-27
8619 the beatles the beatles 2009-09-10
8621 the beatles abbey road 2009-09-10
8624 the beatles rubber soul 2009-09-09
8625 the beatles revolver 2009-09-09
8626 the beatles sgt. pepper's lonely hearts club band 2009-09-09
8627 the beatles magical mystery tour 2009-09-09
9530 r.e.m. murmur [deluxe edition] 2008-11-24
10206 otis redding otis blue: otis redding sings soul [collector'... 2008-05-09
10831 joy division unknown pleasures 2007-10-29
11306 sonic youth daydream nation: deluxe edition 2007-06-13
12632 wire pink flag 2006-05-05
13166 bruce springsteen born to run: 30th anniversary edition 2005-11-18
13357 neutral milk hotel in the aeroplane over the sea 2005-09-26
13726 dj shadow endtroducing... [deluxe edition] 2005-06-09
14437 pavement crooked rain, crooked rain: la's desert origins 2004-10-25
14557 the clash london calling: 25th anniversary legacy edition 2004-09-21
15024 boards of canada music has the right to children 2004-04-26
15107 james brown live at the apollo [expanded edition] 2004-03-30
15259 various artists no thanks!: the 70s punk rebellion 2004-02-10
15383 television marquee moon 2003-12-09
15913 glenn branca the ascension 2003-06-19
17009 elvis costello & the attractions this year's model 2002-05-09
17066 wilco yankee hotel foxtrot 2002-04-21
17199 ...and you will know us by the trail of dead source tags and codes 2002-02-28
17544 john coltrane the olatunji concert: the last live recording 2001-10-15
17900 radiohead kid a 2000-10-02
18071 pink floyd animals 2000-04-25
18219 bonnie prince billy i see a darkness 1999-09-30

All of these are either reissued classics that somehow missed the "best new reissue" tag, or highly-rated albums reviewed before the Best New Music label was introduced.
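As a hedged sanity check of that explanation, this sketch splits the non-BNM 10.0s by whether they were published before the first Best New Music review (again assuming the 'YYYY-MM-DD' pub_date strings compare like dates):

In [ ]:
first_bnm = reviews.loc[reviews['best_new_music'] == 1, 'pub_date'].min()
tens = reviews.loc[idx, 'pub_date']
print('Published before Best New Music existed:', (tens < first_bnm).sum())
print('Published on/after that date (presumably reissues):', (tens >= first_bnm).sum())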

Ratings by Genre

In [4]:
genre_data = pd.merge(reviews[['reviewid','score']], genres, 
                  on = 'reviewid')

g = genre_data.groupby('genre')
table = g['score'].agg(['count', 'mean', 'std']).reset_index()

# plot the average at each level of count
avgline = table.groupby('count')['mean'].mean().reset_index()
avgline['mean'] = savgol_filter(avgline['mean'], 5, 1)
plt.plot(avgline['count'], avgline['mean'],'k--')

plt.plot(table['count'],table['mean'],'o', alpha = 1)

# label each genre, offset above or below its point depending on whether it
# sits above or below the trend line, to reduce overlap
for j, row in table.iterrows():
    curr_avg = float(avgline.loc[avgline['count'] == row['count'], 'mean'])
    jitter = np.random.uniform(0.1, high = 0.5)
    if row['mean'] < curr_avg: jitter*= -1.0
    
    plt.plot([row['count'], row['count']], [row['mean'], row['mean'] + jitter], 'k-', alpha = 0.1)
    plt.text(row['count'], row['mean'] + jitter, row['genre'], 
             ha = 'center', va = 'center')

plt.ylabel('Average Score')
plt.xlabel('Number of Reviews')
plt.ylim([5, 10])
plt.show()

Ratings by Author

In [5]:
g = reviews.groupby('author')
table = g.score.agg(('mean','std','count'))
table['ratio'] = table['mean'] / table['count']

# remove authors with only a handful of reviews
table = table.loc[table['count'] > 15]

# plot the average at each level of count
avgline = table.groupby('count')['mean'].mean().reset_index()
avgline['mean'] = savgol_filter(avgline['mean'], 5, 1)
plt.plot(avgline['count'], avgline['mean'],'k--')

# plot each author as a point
plt.plot(table['count'], table['mean'],'o', alpha = 0.5)

# identify some standouts
items = [
         table['mean'].idxmax(), 
         table['mean'].idxmin(),
         table['count'].idxmax()
        ]

for idx in items:
    
    x, y = table.loc[idx, 'count'], table.loc[idx, 'mean']
    curr_avg = float(avgline.loc[avgline['count'] == x, 'mean'])
        
    jitter = np.random.uniform(0.1, high = 0.5)
    if y < curr_avg: jitter*= -1.0
    
    plt.plot([x, x], [y, y + jitter], 'k-', alpha = 0.1)
    plt.text(x, y + jitter, idx, ha = 'center', va = 'center')

plt.ylabel('Average Score')
plt.xlabel('Number of Reviews')
plt.ylim([5, 10])
plt.show()