This assignment uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.
Description of the data: yelp.json is the original format of the file. yelp.csv contains the same data, in a more convenient format. Both of the files are in this repo, so there is no need to download the data from the Kaggle website.
Read yelp.csv into a DataFrame.
# access yelp.csv using a relative path
import pandas as pd
yelp = pd.read_csv('../data/yelp.csv')
yelp.head(1)
| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
Ignore the yelp.csv file, and construct this DataFrame yourself from yelp.json. This involves reading the data into Python, decoding the JSON, converting it to a DataFrame, and adding individual columns for each of the vote types.
# read the data from yelp.json into a list of rows
# each row is decoded into a dictionary using json.loads()
import json
with open('../data/yelp.json', 'r') as f:
    data = [json.loads(row) for row in f]
# show the first review
data[0]
{u'business_id': u'9yKzy9PApeiPPOUJEtnvkg', u'date': u'2011-01-26', u'review_id': u'fWKvX83p0-ka4JS3dc6E5A', u'stars': 5, u'text': u'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!', u'type': u'review', u'user_id': u'rLtl8ZkDX5vH5nAx9C3q5Q', u'votes': {u'cool': 2, u'funny': 0, u'useful': 5}}
# convert the list of dictionaries to a DataFrame
yelp = pd.DataFrame(data)
yelp.head(1)
| | business_id | date | review_id | stars | text | type | user_id | votes |
|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | {u'funny': 0, u'useful': 5, u'cool': 2} |
# add DataFrame columns for cool, useful, and funny
yelp['cool'] = [row['votes']['cool'] for row in data]
yelp['useful'] = [row['votes']['useful'] for row in data]
yelp['funny'] = [row['votes']['funny'] for row in data]
# drop the votes column
yelp.drop('votes', axis=1, inplace=True)
yelp.head(1)
| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
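As an alternative to the three list comprehensions above, the nested votes dictionary can probably be flattened in one step. The following is a minimal sketch, assuming pandas 1.0 or later (where pd.json_normalize is available); the two sample records and the names sample_lines, records, and flat are made up for illustration and are not taken from yelp.json.

```python
import json
import pandas as pd

# Two made-up records in the same one-JSON-object-per-line layout as yelp.json
# (the values are illustrative, not taken from the dataset)
sample_lines = [
    '{"review_id": "r1", "stars": 5, "votes": {"cool": 2, "funny": 0, "useful": 5}}',
    '{"review_id": "r2", "stars": 1, "votes": {"cool": 0, "funny": 1, "useful": 3}}',
]
records = [json.loads(line) for line in sample_lines]

# json_normalize expands the nested votes dictionary into the columns
# votes.cool, votes.funny, and votes.useful
flat = pd.json_normalize(records)

# strip the 'votes.' prefix so the columns are named cool, funny, and useful
flat = flat.rename(columns=lambda name: name.replace('votes.', ''))
print(flat.columns.tolist())
```

With this approach there is no separate votes column left to drop, because the nested dictionary never becomes a column of its own.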
Explore the relationship between each of the vote types (cool/useful/funny) and the number of stars.
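# treat stars as a categorical variable and look for differences between groups
# (select the vote columns explicitly so that text columns are excluded from the mean)
yelp.groupby('stars')[['cool', 'useful', 'funny']].mean()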
| stars | cool | useful | funny |
|---|---|---|---|
| 1 | 0.576769 | 1.604806 | 1.056075 |
| 2 | 0.719525 | 1.563107 | 0.875944 |
| 3 | 0.788501 | 1.306639 | 0.694730 |
| 4 | 0.954623 | 1.395916 | 0.670448 |
| 5 | 0.944261 | 1.381780 | 0.608631 |
# correlation matrix
%matplotlib inline
import seaborn as sns
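# restrict to the numeric columns, since correlations are undefined for text columns
sns.heatmap(yelp[['stars', 'cool', 'useful', 'funny']].corr())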
<matplotlib.axes._subplots.AxesSubplot at 0x1b95d0f0>
# multiple scatter plots
# note: the 'size' parameter was renamed to 'height' in seaborn 0.9
sns.pairplot(yelp, x_vars=['cool', 'useful', 'funny'], y_vars='stars', height=6, aspect=0.7, kind='reg')
<seaborn.axisgrid.PairGrid at 0xe983a20>
Define cool/useful/funny as the features, and stars as the response.
feature_cols = ['cool', 'useful', 'funny']
X = yelp[feature_cols]
y = yelp.stars
Fit a linear regression model and interpret the coefficients. Do the coefficients make intuitive sense to you? Explore the Yelp website to see if you detect similar trends.
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)
list(zip(feature_cols, linreg.coef_))
[('cool', 0.27435946858859317), ('useful', -0.14745239099400748), ('funny', -0.13567449053705449)]
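One way to read these coefficients: each is the change in predicted stars associated with one additional vote of that type, holding the other two vote counts fixed. The sketch below only does that arithmetic with the coefficient values printed above; the helper name predicted_star_change is hypothetical, and the intercept is left out because it is not shown in the output, so only differences in predictions can be computed.

```python
# coefficients copied from the fitted model's output above
coefs = {
    'cool': 0.27435946858859317,
    'useful': -0.14745239099400748,
    'funny': -0.13567449053705449,
}

def predicted_star_change(vote_changes):
    """Change in predicted stars for the given change in vote counts,
    with all other features held fixed (hypothetical helper)."""
    return sum(coefs[name] * change for name, change in vote_changes.items())

# one extra 'cool' vote
print(predicted_star_change({'cool': 1}))

# one extra vote of each type: the three effects nearly cancel out
print(predicted_star_change({'cool': 1, 'useful': 1, 'funny': 1}))
```

These are associations in this sample rather than causal effects, which is worth keeping in mind when judging whether the signs make intuitive sense.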
Evaluate the model by splitting the data into training and testing sets and computing the RMSE. Does the RMSE make intuitive sense to you?
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
# define a function that accepts a list of features and returns testing RMSE
def train_test_rmse(feature_cols):
    X = yelp[feature_cols]
    y = yelp.stars
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))
# calculate RMSE with all three features
train_test_rmse(['cool', 'useful', 'funny'])
1.1842905282165919
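To judge whether an RMSE value makes intuitive sense, it can help to see what the metric computes: the square root of the mean squared difference between actual and predicted values, which is therefore expressed in the units of the response (here, stars). The following is a small sketch using NumPy directly; the function name rmse and the toy ratings are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, computed by hand."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# toy ratings (not from the dataset): errors are 1, 0, -2, 0
actual = [5, 4, 1, 3]
predicted = [4, 4, 3, 3]

# squared errors are 1, 0, 4, 0, so the mean is 1.25 and the RMSE is sqrt(1.25)
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single prediction that is off by two stars contributes four times as much as one that is off by one star.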
Try removing some of the features and see if the RMSE improves.
print(train_test_rmse(['cool', 'useful']))
print(train_test_rmse(['cool', 'funny']))
print(train_test_rmse(['useful', 'funny']))
1.19623908761
1.19426732565
1.20982720239
Think of some new features you could create from the existing data that might be predictive of the response. Figure out how to create those features in Pandas, add them to your model, and see if the RMSE improves.
# new feature: review length (number of characters)
yelp['length'] = yelp.text.apply(len)
# new features: whether or not the review contains 'love' or 'hate'
yelp['love'] = yelp.text.str.contains('love', case=False).astype(int)
yelp['hate'] = yelp.text.str.contains('hate', case=False).astype(int)
# add new features to the model and calculate RMSE
train_test_rmse(['cool', 'useful', 'funny', 'length', 'love', 'hate'])
1.1584039830984094
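A few more text-derived features could be built in the same way. The sketch below is illustrative only: the toy DataFrame and the feature names word_count, exclamations, love_substring, and love_word are hypothetical and were not part of the model evaluated above. It also shows that a plain substring match on 'love' will match words such as 'Gloves' and 'lovely', whereas a regular expression with word boundaries matches only the whole word.

```python
import pandas as pd

# toy reviews (made up for illustration)
toy = pd.DataFrame({'text': ['I love it!!',
                             'Gloves were lovely',
                             'I hate waiting in line']})

# number of whitespace-separated words in each review
toy['word_count'] = toy.text.str.split().str.len()

# number of exclamation marks in each review
toy['exclamations'] = toy.text.str.count('!')

# substring match: also fires on 'Gloves' and 'lovely'
toy['love_substring'] = toy.text.str.contains('love', case=False).astype(int)

# whole-word match using regex word boundaries
toy['love_word'] = toy.text.str.contains(r'\blove\b', case=False).astype(int)

print(toy)
```

Whether any of these features would lower the testing RMSE on the Yelp data is an open question that would have to be checked with train_test_rmse.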
Compare your best RMSE on the testing set with the RMSE for the "null model", which is the model that ignores all features and simply predicts the mean response value in the testing set.
# split the data (outside of the function)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# create a NumPy array with the same shape as y_test
y_null = np.zeros_like(y_test, dtype=float)
# fill the array with the mean of y_test
y_null.fill(y_test.mean())
# calculate null RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_null)))
1.21232761249
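A useful way to think about the null RMSE: when every prediction is the mean of the values being scored, the RMSE is exactly the population standard deviation of those values. The sketch below checks this on a toy array (the names y_toy and null_rmse are made up, and the values are not from the dataset).

```python
import numpy as np

# toy response values (not from the dataset)
y_toy = np.array([1, 2, 3, 4, 5], dtype=float)

# null predictions: an array of the same shape, filled with the mean
y_null_toy = np.full_like(y_toy, y_toy.mean())

# RMSE of the null predictions
null_rmse = np.sqrt(np.mean((y_toy - y_null_toy) ** 2))

# NumPy's default standard deviation (ddof=0) gives the same number
print(null_rmse, y_toy.std())
```

So the null RMSE above can be read as the spread of star ratings in the testing set, and a model is only useful to the extent that its RMSE falls below that spread.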
Instead of treating this as a regression problem, treat it as a classification problem and see what testing accuracy you can achieve with KNN.
# import and instantiate KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=50)
# classification models will automatically treat the response value (1/2/3/4/5) as unordered categories
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))
0.3524
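By analogy with the null RMSE, an accuracy figure is easier to interpret next to a simple baseline: the accuracy obtained by always predicting the most frequent class. This baseline was not part of the original assignment; the sketch below uses made-up labels and a hypothetical helper named majority_class_accuracy, and for simplicity it picks the majority class from the same labels it scores.

```python
from collections import Counter

def majority_class_accuracy(labels):
    """Return the most frequent label and the accuracy achieved
    by predicting that label for every observation."""
    counts = Counter(labels)
    most_common_label, count = counts.most_common(1)[0]
    return most_common_label, count / float(len(labels))

# toy star ratings (not from the dataset): 4 appears in 4 of 8 rows
toy_labels = [5, 4, 4, 3, 4, 1, 5, 4]
label, accuracy = majority_class_accuracy(toy_labels)
print(label, accuracy)
```

Applying the same idea to y_test would show how much of the KNN accuracy comes from the features rather than from the class distribution.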
Figure out how to use linear regression for classification, and compare its classification accuracy with KNN's accuracy.
# use linear regression to make continuous predictions
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
# round its predictions to the nearest integer
y_pred_class = y_pred.round()
# calculate classification accuracy of the rounded predictions
print(metrics.accuracy_score(y_test, y_pred_class))
0.3456
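One caveat with rounding: a linear model's predictions are not constrained to the 1 to 5 range, so a rounded prediction may not be a valid star rating at all. Whether that happens with this model is not shown above; the sketch below uses made-up predictions to illustrate clipping the rounded values to the valid range with np.clip.

```python
import numpy as np

# made-up continuous predictions, two of them outside the 1-5 range after rounding
y_pred_toy = np.array([0.4, 3.6, 5.7, 2.5])

# round to the nearest integer (NumPy rounds exact halves to the nearest even value)
rounded = y_pred_toy.round()

# force every prediction into the valid set of star ratings
clipped = np.clip(rounded, 1, 5).astype(int)
print(clipped)
```

Clipping can only turn an impossible prediction into a possible one, so it should not reduce classification accuracy.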