by Keith Qu
It may be a good idea to segregate the data by business type (restaurant, hardware store, etc.). It could be easier and less computationally intensive per category. But it would be interesting to find very general external features that can help determine star ratings for all businesses. These are factors that business owners have no direct control over, but knowing their effects can help with forming a plan to counteract negative customer sentiment.
The analysis has three stages. First, we look at review text alone to see how accurately we can predict the star rating from it. Then we add in weather effects: since star ratings are highly subjective, users may be influenced by many things when choosing a rating. Finally, we'll see if the day of the week on which a review is written can help predict ratings. If these factors have an effect on sentiment, they will undoubtedly affect the review text as well, but there may also be subtle additional effects on the star rating.
The goal isn't so much to painstakingly tune a NN for the last bit of accuracy, but rather to see if adding one or two engineered features can produce a significant improvement regardless of model. Or if I can embarrass myself with a complete lack of improvement!
I picked weather and day of week since they are known to have effects on customer activity (how many customers visit), but let's see if they can also help predict ratings.
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
from keras.models import Sequential
from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,
                          Convolution1D, MaxPooling1D, Bidirectional,
                          GlobalMaxPooling1D, Embedding, BatchNormalization,
                          SpatialDropout1D)
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from keras.optimizers import SGD
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
%matplotlib inline
PATH = "/d/data/yelpdata/dataset/"
#PATH = "d:\\data\\yelpdata\\dataset\\"
WEAT = f'{PATH}processed_weather/'
#businesses = pd.read_csv(f'{PATH}business_on.csv', index_col=0)
reviews = pd.read_csv(f'{PATH}review_on.csv', index_col=0)
reviews = reviews[['stars','text']]
reviews['text'].fillna('empty', inplace=True)
def clean_up(t):
    t = t.strip().lower()
    words = t.split()
    # first get rid of the stopwords, or a lemmatized stopword might not
    # be recognized as a stopword
    # (build the stopword set once per call instead of once per word)
    stop_words = set(stopwords.words('english'))
    imp_words = ' '.join(w for w in words if w not in stop_words)
    # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to
    # return only the base words (as opposed to stemming which can return
    # non-words). e.g. ponies -> poni with stemming, and pony with lemmatizing
    final_words = ''
    lemma = WordNetLemmatizer()
    for (w,tag) in pos_tag(word_tokenize(imp_words)):
        if tag.startswith('J'):
            final_words += ' '+ lemma.lemmatize(w, pos='a')
        elif tag.startswith('V'):
            final_words += ' '+ lemma.lemmatize(w, pos='v')
        elif tag.startswith('N'):
            final_words += ' '+ lemma.lemmatize(w, pos='n')
        elif tag.startswith('R'):
            final_words += ' '+ lemma.lemmatize(w, pos='r')
        else:
            final_words += ' '+ w
    return final_words
# what a great name. do_stuff
def do_stuff (df):
    text = df['text'].copy()
    text.replace(to_replace={r'[^\x00-\x7F]':' '},inplace=True,regex=True)
    text.replace(to_replace={r'[^a-zA-Z]': ' '},inplace=True,regex=True)
    # Then lower case, tokenize and lemmatize
    # with over 600,000 entries, this is going to be one hell of a long apply...
    text = text.apply(lambda t:clean_up(t))
    return text
# bidirectional LSTM, as described by Zhou et al. (2016) http://www.aclweb.org/anthology/C16-1329
def lstm_model (X_train, y_train,test, val='no'):
    model = Sequential()
    model.add(Embedding(50000,300,input_length=500,weights=[emb_matrix]))
    model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))
    model.add(Dense(5,activation='softmax'))
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=3)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)
    pred = model.predict(test)
    return pred
# converging to a very conventional convolutional NN model to convert non-conversational text to star ratings
# uh... with a non-convex loss function
# an LSTM network could do better, but it would also take significantly longer to run
#
# (not actually using the CNN model here)
def cnn_model (X_train, y_train, test, val='no'):
    model=Sequential()
    model.add(Embedding(50000,128,input_length=500))
    model.add(Convolution1D(128,5,activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    model.add(Convolution1D(128,5,activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    model.add(Convolution1D(128,5,activation='relu'))
    # only 15 timesteps are left here for a 500-token input, so the original
    # MaxPooling1D(35) + Flatten cannot work; pool over whatever is left instead
    model.add(GlobalMaxPooling1D())
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(5,activation='softmax'))
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=5)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=5,validation_split=0.2)
    pred = model.predict(test)
    return pred
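Stacked convolutions and pooling layers shrink the sequence quickly, so pool sizes have to be chosen with the input length in mind. The sketch below traces the number of timesteps through three Conv1D (kernel size 5) + max-pooling stages. With a final pool of 35, the stack fits a 1000-token input but leaves nothing for 500 tokens. The helper names are hypothetical, and the formulas assume 'valid' padding with a pooling stride equal to the pool size.

```python
def conv_valid(length, kernel_size):
    """Timesteps left after a 1D convolution with 'valid' padding, stride 1."""
    return length - kernel_size + 1


def pool_valid(length, pool_size):
    """Timesteps left after max pooling with stride equal to pool_size."""
    return max((length - pool_size) // pool_size + 1, 0)


def trace_lengths(input_length, kernel_size=5, pools=(5, 5, 35)):
    """Return the sequence length at the input and after each conv + pool stage."""
    lengths = [input_length]
    for pool_size in pools:
        after_conv = conv_valid(lengths[-1], kernel_size)
        lengths.append(pool_valid(after_conv, pool_size))
    return lengths


print(trace_lengths(1000))  # [1000, 199, 39, 1]
print(trace_lengths(500))   # [500, 99, 19, 0]
```

A global pooling layer sidesteps the problem, since it reduces over however many timesteps remain.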
#data = do_stuff(reviews)
#data.to_csv(f'{PATH}review_on_processed_text.csv')
data = pd.Series.from_csv(f'{PATH}review_on_processed_text.csv', index_col=0)
stars = reviews['stars']
del reviews
enc = LabelEncoder()
enc.fit(stars)
y = enc.transform(stars)
dummy_y = np_utils.to_categorical(y)
data.fillna('empty', inplace=True)
tok = Tokenizer(num_words=50000)
tok.fit_on_texts(data)
sequenced = tok.texts_to_sequences(data)
padded = pad_sequences(sequenced,maxlen=500)
# getting the pretrained weight matrix
# based on https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout
# by which I mean it's pretty much just that...
EMBED_FILE = '/d/data/glove.42B.300d.txt'
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBED_FILE))
embed_size = 300
max_features = 50000
maxlen = 500
# wrap in list(): newer NumPy versions do not accept a dict view in np.stack
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
word_index = tok.word_index
nb_words = min(50000, len(word_index))
emb_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: emb_matrix[i] = embedding_vector
del embedding_vector, embeddings_index
X_train, X_test, y_train, y_test = train_test_split(padded, dummy_y, test_size=0.2, random_state=202)
#del data,emb_mean,emb_std,embed_size
# normally 1 comes before 2, but... this just starts at 2
pred2 = lstm_model (X_train, y_train, X_test, val='yes')
Train on 405993 samples, validate on 101499 samples
Epoch 1/3
405993/405993 [==============================] - 1226s 3ms/step - loss: 1.0164 - acc: 0.5483 - val_loss: 0.9206 - val_acc: 0.5885
Epoch 2/3
405993/405993 [==============================] - 1215s 3ms/step - loss: 0.9203 - acc: 0.5910 - val_loss: 0.8964 - val_acc: 0.5967
Epoch 3/3
405993/405993 [==============================] - 1224s 3ms/step - loss: 0.8912 - acc: 0.6035 - val_loss: 0.8804 - val_acc: 0.6102
Maybe one more epoch would've been helpful, but for my purposes now that's fine.
roc_auc_score(y_test,pred2)
0.8826335410697205
preds2 = np.argmax(pred2, axis=1)
ys = np.argmax(y_test, axis=1)
print(classification_report(ys,preds2))
             precision    recall  f1-score   support
          0       0.77      0.74      0.75     15267
          1       0.49      0.39      0.43     13048
          2       0.52      0.45      0.49     22102
          3       0.54      0.64      0.59     39179
          4       0.70      0.69      0.70     37278
avg / total       0.61      0.61      0.61    126874
confusion_matrix (ys, preds2)
array([[11294,  2525,   705,   423,   320],
       [ 2516,  5032,  3991,  1252,   257],
       [  534,  2225, 10040,  8539,   764],
       [  213,   308,  4067, 25051,  9540],
       [  187,    78,   409, 10706, 25898]])
While 0.61 precision/recall isn't great, the 0.8826 AUC score is very very okay, one of the better kinds of okay. Also, validation accuracy kept pace with training accuracy (0.61 vs. 0.60 in the last epoch), so there is no sign of overfitting, which is always helpful. A benefit, no doubt, of using all 630,000 reviews of businesses in the Toronto area.
The AUC score suggests that most of the predicted ratings are not too far off, and we can see that the vast majority of incorrect predictions are within 1 star of the actual rating. Additionally, 1 and 5 star ratings had the greatest precision and recall, so our model is decent at picking up extreme sentiment (or the users are effusive in praise and unrestrained in condemnation). If I had split this into a positive/negative binary problem, obviously the accuracy would be a lot higher (at a glance, over 86% if we consider 3 to be negative and over 90% if we consider it to be positive), but it is interesting to try to pick up on the subtle differences between, say, a 4 and a 5 star rating.
Let's see if adding in weather can increase accuracy.
Star ratings are neither objective nor scientific. We humans often make bizarre, irrational and otherwise inconsistent choices due to many internal and external factors. Let's consider weather as one of the external factors, especially with regard to giving a star rating for a business. While good weather and a good mood might influence me to leave a more positive review as well as a higher star rating, there is really no way to know the sort of review I would have left had the weather been different (the old problem of not knowing probabilities conditional on histories that haven't happened).
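The "at a glance" figures can be checked directly from the confusion matrix above. This is a small sketch in plain Python; the function names are hypothetical, and classes 0 to 4 stand for 1 to 5 stars.

```python
# Confusion matrix reported above: rows are true classes 0-4 (1-5 stars),
# columns are predicted classes.
cm = [
    [11294, 2525, 705, 423, 320],
    [2516, 5032, 3991, 1252, 257],
    [534, 2225, 10040, 8539, 764],
    [213, 308, 4067, 25051, 9540],
    [187, 78, 409, 10706, 25898],
]


def share(cm, keep):
    """Fraction of all predictions whose (true, predicted) pair satisfies keep."""
    total = sum(sum(row) for row in cm)
    kept = sum(
        count
        for true, row in enumerate(cm)
        for pred, count in enumerate(row)
        if keep(true, pred)
    )
    return kept / total


def binary_accuracy(cm, first_positive):
    """Accuracy after collapsing classes into negative (< first_positive)
    and positive (>= first_positive)."""
    return share(cm, lambda t, p: (t >= first_positive) == (p >= first_positive))


acc_3_negative = binary_accuracy(cm, 3)   # 3 stars counted as negative
acc_3_positive = binary_accuracy(cm, 2)   # 3 stars counted as positive
within_one = share(cm, lambda t, p: abs(t - p) <= 1)

print(acc_3_negative, acc_3_positive, within_one)
```

Collapsing predictions after the fact is not the same as training a binary classifier, so these numbers are only a rough guide to how a dedicated binary model would do.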
What we can do is see if the review text matches with the score, and if knowing the weather conditions can improve the accuracy of our star predictions.
reviews_w = pd.read_csv(f'{PATH}review_on.csv', index_col=0)
reviews_w = reviews_w[['stars','date','text']]
weather = pd.read_csv(f'{WEAT}all_weather.csv', index_col='Unnamed: 0')
weather['Year'] = weather['Year'].astype(int)
weather['Month'] = weather['Month'].astype(int)
weather['Day'] = weather['Day'].astype(int)
weather['Temp (°C)'] = weather['Temp (°C)'].astype(float)
reviews_w['date'] = pd.to_datetime(reviews_w['date'])
reviews_w.head()
| | stars | date | text |
|---|---|---|---|
| 0 | 4 | 2012-05-11 | Who would have guess that you would be able to... |
| 1 | 4 | 2015-10-27 | Always drove past this coffee house and wonder... |
| 2 | 3 | 2013-02-09 | Not bad!! Love that there is a gluten-free, ve... |
| 3 | 5 | 2016-04-06 | Love this place! Peggy is great with dogs and... |
| 4 | 4 | 2013-05-01 | This is currently my parents new favourite res... |
Let's get the temperature at noon (12:00), in the afternoon (16:00) and at night (20:00). Other possible features would be the number of hours described as raining or snowing, or adding in more hourly temperature snippets (like for 0:00 and 4:00). But 3 temperatures is already more than enough just to see if it'll work at all.
It is notable that this is the weather for the day that the user wrote the review, rather than the day they visited the business.
A few missing values, but interpolation should provide good estimates.
weather['Temp (°C)']=weather['Temp (°C)'].interpolate()
weather[weather['Temp (°C)'].isnull()]
| | Date/Time | Year | Month | Day | Time | Data Quality | Temp (°C) | Temp Flag | Dew Point Temp (°C) | Dew Point Temp Flag | ... | Wind Spd Flag | Visibility (km) | Visibility Flag | Stn Press (kPa) | Stn Press Flag | Hmdx | Hmdx Flag | Wind Chill | Wind Chill Flag | Weather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 2006-01-01 00:00 | 2006 | 1 | 1 | 00:00 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1.0 | 2006-01-01 01:00 | 2006 | 1 | 1 | 01:00 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 25 columns
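The two rows still missing after `interpolate()` sit at the very start of the series (2006-01-01 00:00 and 01:00), where there is no earlier reading to interpolate from. The plain-Python sketch below shows the idea of linear interpolation and why leading gaps survive it. It is a simplified stand-in, not how pandas implements it; the function name and sample values are made up, and pandas may treat trailing gaps differently.

```python
def interpolate_linear(values):
    """Fill interior runs of None along a straight line between the nearest
    known values on either side. Gaps at the start or end are left as None
    in this sketch, since only one neighbour is available there."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Made-up hourly temperatures with two leading gaps and one interior gap.
hourly = [None, None, 10.0, None, None, 16.0, 15.0]
result = interpolate_linear(hourly)
print(result)
```

None of the three hours used as features (12:00, 16:00, 20:00) falls in those first two rows, so the leftover gaps should not matter here.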
weather [(weather['Year'] == 2012) & (weather['Month'] == 5) & (weather['Day'] == 11) & (weather['Time'] == '09:00')]['Temp (°C)'].values[0]
15.8
def get_noon(d):
    year = d.year
    month = d.month
    day = d.day
    noon = "12:00"
    return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == noon)]['Temp (°C)'].values[0])
def get_afternoon(d):
    year = d.year
    month = d.month
    day = d.day
    afternoon = "16:00"
    return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == afternoon)]['Temp (°C)'].values[0])
def get_night(d):
    year = d.year
    month = d.month
    day = d.day
    night = "20:00"
    return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == night)]['Temp (°C)'].values[0])
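This is slow: apply calls the function once per review, and each call filters the entire weather table with a boolean mask. Apply is probably no faster than manual iteration either, since it is just iterating with the extra overhead of a function call. But it's already done...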
#reviews_w['noon'] = reviews_w['date'].apply(lambda d: get_noon(d))
#reviews_w['afternoon'] = reviews_w['date'].apply(lambda d: get_afternoon(d))
#reviews_w['night'] = reviews_w['date'].apply(lambda d: get_night(d))
#reviews_w.to_csv(f'{PATH}augmented_comments.csv')
reviews_w = pd.read_csv(f'{PATH}augmented_comments.csv', index_col=0)
new_features = reviews_w[['noon','afternoon','night']]
new_features_array = np.array(new_features)
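The per-review temperature lookups above rescan the whole weather table on every call. One possible alternative, sketched here in plain Python, is to index the hourly readings once by (year, month, day, time) and then do constant-time dictionary lookups. The `hourly_temps` table, its temperature values and the `temps_for` helper are all made up for illustration.

```python
from datetime import date

# Hypothetical miniature of the hourly weather table, built once:
# (year, month, day, time) -> temperature in °C. Values are invented.
hourly_temps = {
    (2012, 5, 11, "12:00"): 17.9,
    (2012, 5, 11, "16:00"): 19.4,
    (2012, 5, 11, "20:00"): 16.2,
}


def temps_for(d, times=("12:00", "16:00", "20:00")):
    """Return [noon, afternoon, night] temperatures for one date,
    with None wherever no reading exists."""
    return [hourly_temps.get((d.year, d.month, d.day, t)) for t in times]


print(temps_for(date(2012, 5, 11)))
print(temps_for(date(2012, 5, 12)))  # no readings stored for this day
```

In pandas the same idea would probably be a merge on the date columns, or a dictionary built once from the weather frame.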
def lstm_model2 (X_train, y_train,test, val='no'):
    model = Sequential()
    model.add(Embedding(50000,300,input_length=503, weights=[emb_matrix]))
    model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))
    model.add(Dense(5,activation='softmax'))
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=3)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)
    pred = model.predict(test)
    return pred
The simplest possible way to add in the new features: just append them directly to the existing vectorized features.
XX = np.concatenate((padded,new_features_array),axis=1)
# padded is reused for the day-of-week features further down, so keep it for now
X_train, X_test, y_train, y_test = train_test_split(XX, dummy_y, test_size=0.2, random_state=202)
pred3 = lstm_model2 (X_train, y_train, X_test, val='yes')
Train on 405993 samples, validate on 101499 samples
Epoch 1/3
405993/405993 [==============================] - 1132s 3ms/step - loss: 1.0049 - acc: 0.5525 - val_loss: 0.9282 - val_acc: 0.5897
Epoch 2/3
405993/405993 [==============================] - 1095s 3ms/step - loss: 0.9188 - acc: 0.5904 - val_loss: 0.9006 - val_acc: 0.5972
Epoch 3/3
405993/405993 [==============================] - 1091s 3ms/step - loss: 0.8900 - acc: 0.6034 - val_loss: 0.8928 - val_acc: 0.6014
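One thing worth keeping in mind with this approach: everything in `XX` passes through the `Embedding` layer, which treats each position as a word index. The three appended temperatures are therefore looked up as if they were token ids rather than used as numbers. The toy sketch below imitates that lookup in plain Python with a made-up 4-row embedding table. Truncating floats to integers is my assumption about how the cast behaves.

```python
# Toy embedding table: row i is the vector for token id i (values made up).
embedding_table = [
    [0.0, 0.0],   # id 0: padding
    [0.1, 0.9],   # id 1: most frequent word
    [0.8, 0.2],   # id 2
    [0.5, 0.5],   # id 3
]


def embed(sequence):
    """Look up each entry as a token id, truncating floats to integers
    (an assumption about how a float input is cast)."""
    return [embedding_table[int(value)] for value in sequence]


tokens = [0, 0, 3, 1]        # padded word ids
temps = [2.7, 3.1, 1.4]      # noon, afternoon, night in °C (made up)
row = tokens + temps         # same idea as np.concatenate(..., axis=1)

vectors = embed(row)
# 3.1 °C is read as token id 3, so it gets the very same vector as the
# real word with id 3 at position 2.
print(vectors[5] == vectors[2])
```

Sub-zero temperatures would presumably not even be valid ids. A common alternative is to give numeric features their own input that bypasses the embedding and joins the text branch before the final dense layer. Whether that would change the results here is untested.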
# y_test is still needed for the evaluation below, so only free the rest
del X_train, X_test, y_train
roc_auc_score(y_test,pred3)
0.8816730152568608
preds3 = np.argmax(pred3, axis=1)
ys = np.argmax(y_test, axis=1)
print(classification_report(ys,preds3))
             precision    recall  f1-score   support
          0       0.69      0.82      0.75     15267
          1       0.52      0.26      0.34     13048
          2       0.47      0.61      0.53     22102
          3       0.56      0.60      0.58     39179
          4       0.74      0.62      0.68     37278
avg / total       0.61      0.60      0.60    126874
confusion_matrix (ys, preds3)
array([[12551,  1335,  1027,   159,   195],
       [ 3768,  3333,  5228,   558,   161],
       [  963,  1441, 13517,  5613,   568],
       [  438,   202,  7674, 23675,  7190],
       [  415,    69,  1172, 12324, 23298]])
It doesn't seem like weather helps that much, at least not in this implementation. It's outright awful at recalling 2 star ratings. Maybe most of the weather effect has already gone into the comment itself, maybe the effect is insignificant, or maybe a change in implementing weather effects would help. Maybe I should look at how the weather differs from average rather than just a simple temperature.
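The "differs from average" idea could look something like this: subtract the average temperature for the calendar month, so that a mild January day and a cool July day show up as positive and negative deviations instead of raw degrees. This is only a sketch of one possible feature; the readings and names are made up.

```python
from collections import defaultdict
from statistics import mean

# Made-up noon readings: (year, month, day, temperature in °C)
readings = [
    (2012, 1, 10, -4.0),
    (2013, 1, 10, -8.0),
    (2012, 7, 10, 27.0),
    (2013, 7, 10, 23.0),
]

# Average temperature per calendar month across all years.
by_month = defaultdict(list)
for _, month, _, temp in readings:
    by_month[month].append(temp)
monthly_mean = {month: mean(temps) for month, temps in by_month.items()}


def anomaly(month, temp):
    """How much warmer (+) or colder (-) than that month's average."""
    return temp - monthly_mean[month]


print(anomaly(1, -4.0))   # a mild January day
print(anomaly(7, 23.0))   # a cool July day
```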
For a few ratings, there seems to be a tradeoff between precision and recall between the two models, but I can't be sure of how consistent that is.
Next up is the day of the week. I suspect this could be useful! But then I also suspected weather would be as well!
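The day-of-week feature can be sketched in plain Python: parse the review date, take `weekday()`, and one-hot encode the result into seven columns. The function name is hypothetical, and the date is that of the first review shown earlier.

```python
from datetime import datetime


def weekday_one_hot(date_string):
    """Map 'YYYY-MM-DD' to a 7-element one-hot list, Monday first."""
    day = datetime.strptime(date_string, "%Y-%m-%d").weekday()  # Monday == 0
    return [1 if i == day else 0 for i in range(7)]


encoded = weekday_one_hot("2012-05-11")
print(encoded)  # [0, 0, 0, 0, 1, 0, 0], a Friday
```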
# at this point you'd think i would be smart enough to write a function that accepts
# a customizable input_length but obviously i'm not
def lstm_model3 (X_train, y_train,test, val='no'):
    model = Sequential()
    model.add(Embedding(50000,300,input_length=507, weights=[emb_matrix]))
    model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))
    model.add(Dense(5,activation='softmax'))
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=3)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)
    pred = model.predict(test)
    return pred
reviews_d = pd.read_csv(f'{PATH}review_on.csv', index_col=0)
reviews_d = reviews_d['date']
# vectorized replacement for the per-row strptime loop; Monday == 0 ... Sunday == 6
reviews_d = pd.to_datetime(reviews_d).dt.weekday
enc = LabelEncoder()
enc.fit(reviews_d)
dow = enc.transform(reviews_d)
dummy_dow = np_utils.to_categorical(dow)
X3 = np.concatenate((padded,dummy_dow),axis=1)
# Remember when 16 gb of RAM was more than enough for pretty much anything?
del padded, dummy_dow, data
X_train, X_test, y_train, y_test = train_test_split(X3, dummy_y, test_size=0.2, random_state=202)
pred4 = lstm_model3 (X_train, y_train, X_test, val='yes')
Train on 405993 samples, validate on 101499 samples
Epoch 1/3
405993/405993 [==============================] - 1114s 3ms/step - loss: 1.0068 - acc: 0.5516 - val_loss: 0.9247 - val_acc: 0.5856
Epoch 2/3
405993/405993 [==============================] - 1107s 3ms/step - loss: 0.9121 - acc: 0.5944 - val_loss: 0.8907 - val_acc: 0.6033
Epoch 3/3
405993/405993 [==============================] - 1147s 3ms/step - loss: 0.8840 - acc: 0.6064 - val_loss: 0.8784 - val_acc: 0.6111
roc_auc_score(y_test,pred4)
0.8835195453706615
preds4 = np.argmax(pred4, axis=1)
ys = np.argmax(y_test, axis=1)
print(classification_report(ys,preds4))
             precision    recall  f1-score   support
          0       0.77      0.75      0.76     15267
          1       0.51      0.37      0.43     13048
          2       0.53      0.47      0.50     22102
          3       0.54      0.67      0.60     39179
          4       0.72      0.65      0.68     37278
avg / total       0.61      0.61      0.61    126874
confusion_matrix (ys, preds4)
array([[11421,  2288,   701,   391,   466],
       [ 2622,  4869,  4012,  1218,   327],
       [  520,  2133, 10494,  8167,   788],
       [  186,   271,  4311, 26426,  7985],
       [  170,    52,   405, 12543, 24108]])
So again, not much difference from just looking at the comment text. A big takeaway from all of this is that 2 star ratings seem to be the most ambiguous, followed by 3 star ratings.
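The "most ambiguous" ranking can be read straight off the last confusion matrix by computing recall per class (the diagonal entry divided by its row total). A small sketch with hypothetical helper names, where class index + 1 is the star rating:

```python
# Confusion matrix from the day-of-week run: rows are true classes 0-4.
cm = [
    [11421, 2288, 701, 391, 466],
    [2622, 4869, 4012, 1218, 327],
    [520, 2133, 10494, 8167, 788],
    [186, 271, 4311, 26426, 7985],
    [170, 52, 405, 12543, 24108],
]


def recall_per_class(cm):
    """Correct predictions for each true class divided by that row's total."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


recalls = recall_per_class(cm)
# Star ratings ordered from hardest to easiest to recall.
hardest_first = [c + 1 for c in sorted(range(len(cm)), key=lambda c: recalls[c])]

print([round(r, 2) for r in recalls])
print(hardest_first[:2])  # [2, 3]
```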
Also, the GloVe embeddings don't really seem to do much here except take up RAM. The pretrained weights seem to have a very slightly positive effect on accuracy, as long as the embedding layer is left trainable. 630,000+ reviews is a lot of text to train on, probably enough to get a very good picture of the semantic relationships in Yelp reviews.
It would probably make more sense to look at business types separately, especially restaurants. The kinds of things people talk about in a restaurant review would seem to be different from what they would write for a hardware store. Similarly, mood effects from weather or day of week could differ for different types of businesses. Or the effects may simply not be strong enough.
While weather and day of week don't seem to have a huge behavioral effect when it comes to rating businesses (or the effects have already been expressed in the review text), it could be worth exploring other factors that might affect consumer perceptions. For example, people might give more favorable ratings on holidays, and less favorable ratings if their favorite political candidate loses an election, if economic conditions worsen, if there has been a swine flu or mad cow outbreak, or if recent news events have been very negative.
If businesses better understand what affects their customers' moods, they will be better equipped to counteract the kinds of negative sentiment that might hurt their ratings.