There's a lot going on in this dataset. This notebook follows my intuitions in an attempt to get a sense of the data.
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import savgol_filter
pd.set_option('display.precision', 2)
np.set_printoptions(precision=2)
con = sqlite3.connect('../pitchfork.db')
reviews = pd.read_sql('SELECT * FROM reviews', con)
genres = pd.read_sql('SELECT * FROM genres', con)
con.close()
print('\nAverages:')
print(reviews[['best_new_music', 'score']].mean())
print('\nStandard Deviation:')
# ddof=0 reproduces np.std's population standard deviation (pandas defaults to ddof=1)
print(reviews[['best_new_music', 'score']].std(ddof=0))
g = reviews.groupby('score')
info = g['best_new_music'].agg(['sum','count']).reset_index()
plt.plot(info['score'], savgol_filter(info['count'], 5, 1), label = 'All Reviews')
plt.plot(info['score'], savgol_filter(info['sum'], 5, 1), label = "Best New Music")
plt.legend(loc = 'best')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()
Averages:
best_new_music    0.05
score             7.01
dtype: float64

Standard Deviation:
best_new_music    0.22
score             1.29
dtype: float64
Scores are roughly normally distributed, with a slight negative skew. The average release gets a score of about 7.0, and about 5% of releases are best new music, though this figure is artificially low because I included releases from before best new music was around. The distribution drops off pretty sharply at about 8.0; no wonder Pitchfork has a special page for 8.0+ reviews...
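A rough way to check both claims, the negative skew and the deflated best-new-music rate, is sketched below. It runs on a small made-up frame rather than the real `reviews` table. It assumes that `pub_date` parses as a date and that the earliest review flagged `best_new_music` marks when the label started; both are assumptions, not facts taken from the database.

```python
import pandas as pd

# Hypothetical stand-in for the `reviews` table (all values are made up).
toy = pd.DataFrame({
    'score': [2.0, 6.5, 7.0, 7.2, 7.5, 7.8, 8.0, 8.6],
    'best_new_music': [0, 0, 0, 0, 0, 0, 0, 1],
    'pub_date': pd.to_datetime([
        '2000-01-10', '2001-03-02', '2002-06-15', '2003-02-01',
        '2004-05-20', '2005-07-04', '2006-09-09', '2004-01-01',
    ]),
})

# Negative skew means a longer tail toward low scores.
skew = toy['score'].skew()

# Assumption: the label "starts" at the first review flagged best_new_music.
bnm_start = toy.loc[toy['best_new_music'] == 1, 'pub_date'].min()
era = toy[toy['pub_date'] >= bnm_start]

rate_all = toy['best_new_music'].mean()   # share over every review
rate_era = era['best_new_music'].mean()   # share once the label exists
print(f'skew={skew:.2f}, all={rate_all:.3f}, since label={rate_era:.3f}')
```

On the real data the same steps would use `reviews` in place of `toy`, after converting `pub_date` with `pd.to_datetime` if it is stored as text.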
There is also effectively no best new music below 8.0, and most releases above 8.5 are categorized as best new music. Makes you kind of feel bad for all the 8.3s and 8.4s that narrowly missed the cutoff...
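The cutoff can also be read off as a table instead of a plot. The sketch below is illustrative only: the frame is made up, and the band edges (8.0 and 8.5) are simply the thresholds mentioned above, not anything Pitchfork defines.

```python
import pandas as pd

# Hypothetical stand-in for the `reviews` table (all values are made up).
toy = pd.DataFrame({
    'score': [6.0, 7.5, 7.9, 8.0, 8.3, 8.4, 8.5, 8.7, 9.0, 9.5],
    'best_new_music': [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
})

# Bands are left-closed: [0, 8.0), [8.0, 8.5), [8.5, 10.01).
# The top edge sits just above 10 so a perfect score lands in the last band.
toy['band'] = pd.cut(toy['score'], bins=[0, 8.0, 8.5, 10.01], right=False,
                     labels=['below 8.0', '8.0 to 8.4', '8.5 and up'])

# Share of best new music and number of reviews in each band.
share = toy.groupby('band', observed=False)['best_new_music'].agg(['mean', 'count'])
print(share)
```

The `mean` column is the fraction of reviews in each band that carry the label, so a jump between the last two rows is what the "narrowly missed" 8.3s and 8.4s would look like in numbers.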
Sanity check: what are the 10.0 albums that are not best new music?
idx = (reviews.best_new_music == 0) & (reviews.score == 10.0)
reviews.loc[idx, ['artist', 'title', 'pub_date']]
| | artist | title | pub_date |
---|---|---|---|
200 | bob dylan | blood on the tracks | 2016-10-30 |
355 | brian eno | another green world | 2016-09-18 |
451 | stevie wonder | songs in the key of life | 2016-08-21 |
530 | nina simone | in concert | 2016-07-30 |
654 | neil young | tonight's the night | 2016-06-26 |
706 | kate bush | hounds of love | 2016-06-12 |
857 | prince | sign "o" the times | 2016-04-30 |
858 | prince | 1999 | 2016-04-30 |
861 | prince, the revolution | purple rain | 2016-04-29 |
862 | prince | dirty mind | 2016-04-29 |
1207 | david bowie | "heroes" | 2016-01-22 |
1209 | david bowie | low | 2016-01-22 |
5965 | can | tago mago [40th anniversary edition] | 2011-12-09 |
6219 | nirvana | nevermind [20th anniversary edition] | 2011-09-27 |
8619 | the beatles | the beatles | 2009-09-10 |
8621 | the beatles | abbey road | 2009-09-10 |
8624 | the beatles | rubber soul | 2009-09-09 |
8625 | the beatles | revolver | 2009-09-09 |
8626 | the beatles | sgt. pepper's lonely hearts club band | 2009-09-09 |
8627 | the beatles | magical mystery tour | 2009-09-09 |
9530 | r.e.m. | murmur [deluxe edition] | 2008-11-24 |
10206 | otis redding | otis blue: otis redding sings soul [collector'... | 2008-05-09 |
10831 | joy division | unknown pleasures | 2007-10-29 |
11306 | sonic youth | daydream nation: deluxe edition | 2007-06-13 |
12632 | wire | pink flag | 2006-05-05 |
13166 | bruce springsteen | born to run: 30th anniversary edition | 2005-11-18 |
13357 | neutral milk hotel | in the aeroplane over the sea | 2005-09-26 |
13726 | dj shadow | endtroducing... [deluxe edition] | 2005-06-09 |
14437 | pavement | crooked rain, crooked rain: la's desert origins | 2004-10-25 |
14557 | the clash | london calling: 25th anniversary legacy edition | 2004-09-21 |
15024 | boards of canada | music has the right to children | 2004-04-26 |
15107 | james brown | live at the apollo [expanded edition] | 2004-03-30 |
15259 | various artists | no thanks!: the 70s punk rebellion | 2004-02-10 |
15383 | television | marquee moon | 2003-12-09 |
15913 | glenn branca | the ascension | 2003-06-19 |
17009 | elvis costello & the attractions | this year's model | 2002-05-09 |
17066 | wilco | yankee hotel foxtrot | 2002-04-21 |
17199 | ...and you will know us by the trail of dead | source tags and codes | 2002-02-28 |
17544 | john coltrane | the olatunji concert: the last live recording | 2001-10-15 |
17900 | radiohead | kid a | 2000-10-02 |
18071 | pink floyd | animals | 2000-04-25 |
18219 | bonnie prince billy | i see a darkness | 1999-09-30 |
All of these are either reissued classics that somehow missed the "best new reissue" label or highly rated albums reviewed before the best new music label existed.
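Part of that reading can be checked mechanically. The sketch below flags titles that announce themselves as reissues, using a few rows copied from the table above. The keyword list is a hypothetical heuristic, a guess at what marks a reissue title, and not a field in the database.

```python
import pandas as pd

# A few rows copied from the table above.
perfect = pd.DataFrame({
    'artist': ['nirvana', 'the beatles', 'radiohead', 'sonic youth'],
    'title': ['nevermind [20th anniversary edition]', 'abbey road',
              'kid a', 'daydream nation: deluxe edition'],
    'pub_date': ['2011-09-27', '2009-09-10', '2000-10-02', '2007-06-13'],
})

# Hypothetical heuristic: keywords that tend to appear in reissue titles.
pattern = r'edition|anniversary|deluxe|expanded'
perfect['reissue_like'] = perfect['title'].str.contains(pattern, case=False, regex=True)
print(perfect[['artist', 'title', 'reissue_like']])
```

A title match only catches reissues that say so in the name. "abbey road" was reviewed in 2009 with no such keyword, so the unflagged rows still need a manual look.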
genre_data = pd.merge(reviews[['reviewid', 'score']], genres, on='reviewid')
g = genre_data.groupby('genre')
table = g['score'].agg(['count', 'mean', 'std']).reset_index()
# plot the average at each level of count
avgline = table.groupby('count')['mean'].mean().reset_index()
avgline['mean'] = savgol_filter(avgline['mean'], 5, 1)
plt.plot(avgline['count'], avgline['mean'],'k--')
plt.plot(table['count'],table['mean'],'o', alpha = 1)
for j, row in table.iterrows():
    # smoothed average at this review count, used to pick the label direction
    curr_avg = avgline.loc[avgline['count'] == row['count'], 'mean'].iloc[0]
    jitter = np.random.uniform(0.1, high=0.5)
    # push the label away from the trend line
    if row['mean'] < curr_avg:
        jitter *= -1.0
    plt.plot([row['count'], row['count']], [row['mean'], row['mean'] + jitter], 'k-', alpha=0.1)
    plt.text(row['count'], row['mean'] + jitter, row['genre'],
             ha='center', va='center')
plt.ylabel('Average Score')
plt.xlabel('Number of Reviews')
plt.ylim([5, 10])
plt.show()
g = reviews.groupby('author')
table = g.score.agg(('mean','std','count'))
table['ratio'] = table['mean'] / table['count']
# remove authors with only a handful of reviews
table = table.loc[table['count'] > 15]
# plot the average at each level of count
avgline = table.groupby('count')['mean'].mean().reset_index()
avgline['mean'] = savgol_filter(avgline['mean'], 5, 1)
plt.plot(avgline['count'], avgline['mean'],'k--')
# plot each author as a point
plt.plot(table['count'], table['mean'],'o', alpha = 0.5)
# identify some standouts
items = [
table['mean'].idxmax(),
table['mean'].idxmin(),
table['count'].idxmax()
]
for idx in items:
    x, y = table.loc[idx, 'count'], table.loc[idx, 'mean']
    # smoothed average at this review count, used to pick the label direction
    curr_avg = avgline.loc[avgline['count'] == x, 'mean'].iloc[0]
    jitter = np.random.uniform(0.1, high=0.5)
    # push the label away from the trend line
    if y < curr_avg:
        jitter *= -1.0
    plt.plot([x, x], [y, y + jitter], 'k-', alpha=0.1)
    plt.text(x, y + jitter, idx, ha='center', va='center')
plt.ylabel('Average Score')
plt.xlabel('Number of Reviews')
plt.ylim([5, 10])
plt.show()
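A note on the dashed trend lines: they come from `savgol_filter(..., 5, 1)`, a Savitzky-Golay filter with a 5-point window and a degree-1 polynomial. Away from the edges, that should be the same as a centered 5-point moving average. The sketch below checks this on a made-up series; the data and variable names are illustrative only.

```python
import numpy as np
from scipy.signal import savgol_filter

# Made-up noisy series standing in for the per-count averages.
rng = np.random.default_rng(0)
y = 7.0 + rng.normal(scale=0.5, size=20)

smoothed = savgol_filter(y, 5, 1)  # window of 5 points, degree-1 polynomial

# Centered 5-point moving average, defined only where the full window fits.
moving_avg = np.convolve(y, np.ones(5) / 5, mode='valid')

# Compare the interior points; the two edge points on each side are handled
# by savgol_filter's default mode='interp', which fits a line to the end window.
print(np.allclose(smoothed[2:-2], moving_avg))
```

The filter treats its input as evenly spaced samples, so in the plots above it smooths over neighbouring rows of `avgline` rather than over equal steps in review count.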