Here, we conduct an exploratory analysis of Yelp data. The goal is to frame a few interesting questions and try to answer them using the data.
Yelp is a website that publishes crowd-sourced public reviews. In other words, it offers a service where people can review many types of businesses: restaurants, bars, automotive services and so on.
The dataset is a rich source of information on users & businesses, and lets us explore many interesting topics such as evolving food trends, the best food by region, or cheap yet good beauty parlours nearby. The users' data can also be treated as a social graph to find interesting connections amongst users based on their interests, friend circles and so on. Many research articles have made use of this evolving data to derive interesting insights. However, while many have concentrated on the Reviews, Businesses & Users data, only a few papers have made use of Tips. Here, among other things, we shall explore the Tips data, its relation to the other datasets and its importance.
We ask the following questions:
NOTE: The internals of the code are not explained in much detail here, as this intends to give a concise view of what we explored and how we did it. Head over to the blog post for an in-depth explanation of why we are doing what we are doing, or select the 'open' option under the 'file' menu above and inspect the code yourself.
Before we proceed further, let's load the data analysis and visualization libraries
import pandas as pd
import numpy as np  # pd.np is deprecated and removed in recent pandas
from matplotlib import pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, show, output_notebook #, output_file
from bokeh.charts import Area, defaults
from bokeh.layouts import row, gridplot, column
from bokeh.models import Range1d
Output the plots to the notebook for inline visualization
Load the Yelp Data set, clean it and keep it ready
from yelp_analysis import Yelp
yp = Yelp(data_folder='./data/')
yp.preProcess_data(names_folder='./data/names')
The above code reads the datasets with Spark (although not all into memory), cleans the text, assigns genders to users, corrects data formats (e.g., strings to dates / integers), etc.
Note that the original users dataset has no gender attribute; it contains only the first name of each user. Hence we download the dataset containing all the baby names of US citizens since 1880, maintained by the US government, and infer genders from that data.
After inferring gender for all 686,556 unique users in our dataset, we have identified the gender of (333,658 + 291,805) / 686,556 = 91.1% of the people. Overall we have a 48.6% female population, a 42.5% male population and 8.9% of unknown gender.
We assign the labels "0, 1, 2" to "female, male & unknown" genders respectively.
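The inference itself boils down to a majority lookup over the name counts. Here is a minimal sketch; the name-to-gender table and counts below are made up for illustration (the real table is built from the downloaded baby-names files), and the `threshold` parameter is an assumption, not the exact cutoff used in the code.

```python
# Sketch of gender inference from a first-name lookup table.
# Labels follow the convention: 0 = female, 1 = male, 2 = unknown.
name_counts = {
    # name: (times registered as female, times registered as male)
    # illustrative numbers only, not the real baby-names data
    'mary': (4_000_000, 15_000),
    'john': (20_000, 5_000_000),
    'taylor': (300_000, 110_000),
}

def infer_gender(first_name, threshold=0.9):
    """Return 0 (female), 1 (male) or 2 (unknown) for a first name."""
    counts = name_counts.get(first_name.lower())
    if counts is None:
        return 2                      # name not in the table
    female, male = counts
    total = female + male
    if female / total >= threshold:
        return 0
    if male / total >= threshold:
        return 1
    return 2                          # ambiguous name

print(infer_gender('Mary'))   # 0
print(infer_gender('John'))   # 1
print(infer_gender('Taylor')) # 2 (ambiguous)
```

Ambiguous names (neither gender dominates) and names absent from the table both fall into the "unknown" bucket, which is where the 8.9% above comes from.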
In this task we will find out how actively people tip. Tipsters are generous in the sense that they not only provide a review but go the extra mile to leave small TL;DR messages for businesses. We shall also look at the monthly contribution activity of Reviews. Before we filter this data by Region & Gender, let's find out how this rate varied over the years.
For this, we do the following:
- Convert the dates to end-of-month format (i.e., 2013-12-15 => 2013-12-31)
- Group by the end-of-month attribute and aggregate the counts for each month in each year
# Get the growth rates of tips & reviews.
# WARNING: this will take some time...
tg = yp.get_overall_growth_rate('tips')
rg = yp.get_overall_growth_rate('reviews')
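The end-of-month bucketing these helpers perform can be sketched in pandas (the sample dates below are illustrative):

```python
import pandas as pd

# A few sample timestamps standing in for tip/review dates.
dates = pd.to_datetime(['2013-12-15', '2013-12-02', '2014-01-20'])
df = pd.DataFrame({'date': dates})

# Snap each date to its end-of-month (2013-12-15 => 2013-12-31) ...
df['month_end'] = df['date'] + pd.offsets.MonthEnd(0)

# ... and aggregate the counts for each month.
counts = df.groupby('month_end').size()
print(counts)
# 2013-12-31    2
# 2014-01-31    1
```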
Plot the overall growth rates of tips & reviews
# rename the columns for plotting convenience
rg.columns = ['dt', 'reviews']
tg.columns = ['dt', 'tips']
# set the datetime as Index
tg.set_index('dt', inplace=True)
rg.set_index('dt', inplace=True)
df = yp.ut.align_timeseries_data(dfs=[tg, rg])
defaults.width = 1000
defaults.height = 450
area = Area(df.reset_index(), x='months', y=['reviews', 'tips'], xlabel='years', color=['orange', 'green'],
ylabel='no. of tips / reviews given per month', title='Monthly growth of Tips & Reviews')
# output_file('reviews_tips_monthly_growth.html')
show(area)
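The two series don't necessarily cover the same months, hence the alignment step before plotting. A minimal sketch of what a helper like `yp.ut.align_timeseries_data` could do (the actual implementation may differ): outer-join the frames on their datetime index and fill the months one source doesn't cover with zeros.

```python
import pandas as pd

def align_timeseries(dfs):
    """Outer-join DataFrames on their datetime index, filling gaps with 0."""
    return pd.concat(dfs, axis=1).fillna(0)

# Illustrative monthly counts covering different spans.
tips = pd.DataFrame({'tips': [5, 7]},
                    index=pd.to_datetime(['2013-01-31', '2013-02-28']))
revs = pd.DataFrame({'reviews': [20]},
                    index=pd.to_datetime(['2013-02-28']))

df = align_timeseries([tips, revs])
print(df)
#             tips  reviews
# 2013-01-31     5      0.0
# 2013-02-28     7     20.0
```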
We observe that the tipping trend hit its peak somewhere in mid-2013 and has dropped overall throughout the years since. So fewer people are generous these days. 😥
Reviewers, on the other hand, register healthy growth throughout the years and are far higher in number compared to the tipsters. 💃
Here, we shall see the monthly contribution to Reviews & Tips filtered by Region & Gender.
The dataset has data from 4 countries (USA, UK, Canada and Germany (Deutschland)), although most of the data is from the US.
Now that we have a gender attribute, it is interesting to see how each gender contributes to tips & reviews and, more interestingly, how this varies by region.
For that we do the following:
- Obtain the business ids local to each country and the user_ids of the reviewers that reviewed those businesses
- Keep only the users with a known gender (gender != 2)
# WARNING: Getting these results takes some time on a single machine
mtg_usa, ftg_usa = yp.get_growth_rate(country='USA', byGender=True, source='tips')
mtg_uk, ftg_uk = yp.get_growth_rate(country='UK', byGender=True, source='tips')
mtg_can, ftg_can = yp.get_growth_rate(country='CAN', byGender=True, source='tips')
mtg_deu, ftg_deu = yp.get_growth_rate(country='DEU', byGender=True, source='tips')
obtaining growth rates for the country USA
obtaining local business ids
obtaining local business users
getting male & female local users
getting male & female reviewing/tipping dates
getting male date counts
getting female date counts
normalizing the counts...
converting to datetime...
returning male & female date growth counts
obtaining growth rates for the country UK
obtaining local business ids
obtaining local business users
getting male & female local users
getting male & female reviewing/tipping dates
getting male date counts
getting female date counts
normalizing the counts...
converting to datetime...
returning male & female date growth counts
obtaining growth rates for the country CAN
obtaining local business ids
obtaining local business users
getting male & female local users
getting male & female reviewing/tipping dates
getting male date counts
getting female date counts
normalizing the counts...
converting to datetime...
returning male & female date growth counts
obtaining growth rates for the country DEU
obtaining local business ids
obtaining local business users
getting male & female local users
getting male & female reviewing/tipping dates
getting male date counts
getting female date counts
normalizing the counts...
converting to datetime...
returning male & female date growth counts
dfs = [mtg_usa, ftg_usa, mtg_uk, ftg_uk, mtg_can, ftg_can, mtg_deu, ftg_deu]
df = yp.ut.align_timeseries_data(dfs).reset_index()
# let's rename the columns properly for ease of identification after plotting
df.columns = ['months', 'male_USA', 'female_USA', 'male_UK', 'female_UK', 'male_CAN', 'female_CAN',
'male_DEU', 'female_DEU']
df.set_index('months', inplace=True)
defaults.width = 500
defaults.height = 400
area_usa = Area(df.reset_index(), x='months', y=['male_USA', 'female_USA'], xlabel='years', title='USA',
ylabel='no. of tips given (norm)',)
area_uk = Area(df.reset_index(), x='months', y=['male_UK', 'female_UK'], xlabel='years', title='UK',
ylabel='no. of tips given (norm)',)
area_can = Area(df.reset_index(), x='months', y=['male_CAN', 'female_CAN'], xlabel='years', title='CANADA',
ylabel='no. of tips given (norm)', )
area_deu = Area(df.reset_index(), x='months', y=['male_DEU', 'female_DEU'], xlabel='years',
title='DEUTSCHLAND', ylabel='no. of tips given (norm)', background_fill_color='yellow',
background_fill_alpha=0.2)
area_usa.y_range = Range1d(0, 0.05)
area_uk.y_range = Range1d(0, 0.05)
area_can.y_range = Range1d(0, 0.05)
area_deu.y_range = Range1d(0, 0.15)
show(gridplot([[area_usa, area_uk], [area_can, area_deu]]))
Lots of insights can be gathered from the above data. First of all, the Germany (Deutschland) plot is highlighted because its Y-axis scale is much larger than that of the other three. The other three countries share the same scale for easier comparison. Since most of the data is from the US, the counts are normalized so that the individual contributions from each region can be observed at the same scale. Finally, we observe the following:
Let's find out what reviews have to offer 😉
# WARNING: Computing this takes some time on a single machine.
mrg_usa, frg_usa = yp.get_growth_rate(country='USA', byGender=True, source='reviews')
mrg_uk, frg_uk = yp.get_growth_rate(country='UK', byGender=True, source='reviews')
mrg_can, frg_can = yp.get_growth_rate(country='CAN', byGender=True, source='reviews')
mrg_deu, frg_deu = yp.get_growth_rate(country='DEU', byGender=True, source='reviews')
dfs = [mrg_usa, frg_usa, mrg_uk, frg_uk, mrg_can, frg_can, mrg_deu, frg_deu]
df = yp.ut.align_timeseries_data(dfs).reset_index()
# let's rename the columns properly for ease of identification after plotting
df.columns = ['months', 'male_USA', 'female_USA', 'male_UK', 'female_UK', 'male_CAN', 'female_CAN',
'male_DEU', 'female_DEU']
df.set_index('months', inplace=True)
defaults.width = 500
defaults.height = 400
area_usa = Area(df.reset_index(), x='months', y=['male_USA', 'female_USA'], xlabel='years', title='USA',
ylabel='no. of reviews given (norm)',)
area_uk = Area(df.reset_index(), x='months', y=['male_UK', 'female_UK'], xlabel='years', title='UK',
ylabel='no. of reviews given (norm)', background_fill_color='yellow', background_fill_alpha=0.2)
area_can = Area(df.reset_index(), x='months', y=['male_CAN', 'female_CAN'], xlabel='years', title='CANADA',
ylabel='no. of reviews given (norm)', )
area_deu = Area(df.reset_index(), x='months', y=['male_DEU', 'female_DEU'], xlabel='years',
title='DEUTSCHLAND', ylabel='no. of reviews given (norm)')
area_usa.y_range = Range1d(0, 0.03)
area_uk.y_range = Range1d(0, 0.07)
area_can.y_range = Range1d(0, 0.03)
area_deu.y_range = Range1d(0, 0.03)
show(gridplot([[area_usa, area_uk], [area_can, area_deu]]))
Following the same procedure as above for deriving the data, we see the following:
Here we find out whether:
This is to see whether there is any pattern in the way a given user writes his/her reviews & tips. Thus, we are looking at those who have reviewed & tipped at least once, and we want to see their behaviour roughly.
The length of text varies amongst users: some prefer short texts while some go full Sherlock 👓🔍, and this behaviour could plausibly show up in both reviews and tips. Here's how we do it:
# select distinct users in tips
uni_user_tips = yp.tips.select('user_id').distinct()

# group 'tips' by user_id, take the avg word count over each user's tips
# and rename the resulting column
user_tip_wc = yp.tips.groupBy('user_id')\
                     .agg({'wordCount': 'mean'})\
                     .withColumnRenamed('avg(wordCount)', 'avg_tip_wc')

# join 'reviews' with the table above to keep only the common users,
# group by user_id, take the avg word count over each user's reviews
# and rename the resulting column
user_rev_wc = yp.reviews.join(uni_user_tips, 'user_id')\
                        .groupBy('user_id')\
                        .agg({'wordCount': 'mean'})\
                        .withColumnRenamed('avg(wordCount)', 'avg_rev_wc')
# join both the above tables
rv_tips_users = user_rev_wc.join(user_tip_wc, 'user_id')\
.select('avg_rev_wc', 'avg_tip_wc')\
.toPandas()
x, y = rv_tips_users.avg_rev_wc, rv_tips_users.avg_tip_wc
p = figure(plot_width=800, plot_height=650, title='TIPS vs. REVIEWS word counts by USER')
p.scatter(x, y, fill_alpha=0.6, line_color=None)
p.xaxis.axis_label = 'avg. word count of REVIEWS'
p.yaxis.axis_label = 'avg. word count of TIPS'
show(p)
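For readers without a Spark setup, the same per-user aggregation can be sketched in pandas on toy data (column names mirror the Spark code above; the numbers are made up):

```python
import pandas as pd

# Toy stand-ins for yp.tips / yp.reviews with a precomputed wordCount column.
tips = pd.DataFrame({'user_id': ['u1', 'u1', 'u2'],
                     'wordCount': [10, 14, 8]})
reviews = pd.DataFrame({'user_id': ['u1', 'u2', 'u3'],
                        'wordCount': [120, 90, 60]})

# Average tip word count per user.
user_tip_wc = tips.groupby('user_id')['wordCount'].mean().rename('avg_tip_wc')

# Average review word count, restricted to users that have tipped at least
# once (this drops u3, who never tipped).
user_rev_wc = (reviews[reviews.user_id.isin(tips.user_id)]
               .groupby('user_id')['wordCount'].mean().rename('avg_rev_wc'))

# Join both aggregates on user_id.
rv_tips_users = pd.concat([user_rev_wc, user_tip_wc], axis=1)
print(rv_tips_users)
#          avg_rev_wc  avg_tip_wc
# user_id
# u1            120.0        12.0
# u2             90.0         8.0
```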
From this L-shaped curve, we can make out that:
- the average word count of reviews is roughly 10 times more than that of tips.
uni_biz_tips = yp.tips.select('business_id').distinct()
biz_tip_wc = yp.tips.groupBy('business_id')\
.agg({'wordCount': 'mean'})\
.withColumnRenamed('avg(wordCount)', 'avg_tip_wc')
biz_rev_wc = yp.reviews.join(uni_biz_tips, 'business_id')\
.groupBy('business_id')\
.agg({'wordCount': 'mean'})\
.withColumnRenamed('avg(wordCount)', 'avg_rev_wc')
rv_tips_biz = biz_rev_wc.join(biz_tip_wc, 'business_id')\
.select('avg_rev_wc', 'avg_tip_wc')\
.toPandas()
x, y = rv_tips_biz.avg_rev_wc, rv_tips_biz.avg_tip_wc
p = figure(plot_width=800, plot_height=650, title='TIPS vs. REVIEWS word counts by BUSINESS')
p.scatter(x, y, fill_alpha=0.6, line_color=None)
p.xaxis.axis_label = 'avg. word count of REVIEWS'
p.yaxis.axis_label = 'avg. word count of TIPS'
show(p)
Doing the same operation, but this time from a business' perspective, reveals a somewhat similar pattern as before, only more tightly clustered.
We can observe that most businesses got reviewed within fewer than 200 words.
Also, the businesses that received the longest tips got small-to-average length reviews.
We can plot word clouds of the most frequently used words in Reviews / Tips. We can also filter by Gender / Region if interested.
We can see one such example below. More examples here.
yp.plot_wordCloud(yp.tips, 'text')
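Under the hood, a word cloud is just scaled word frequencies. A minimal sketch of the counting step (toy texts, stop-word list abbreviated for illustration):

```python
from collections import Counter

# Toy tip texts standing in for the 'text' column of yp.tips.
texts = ['great food and great service',
         'food was cold',
         'service with a smile']

STOPWORDS = {'and', 'was', 'with', 'a'}  # abbreviated for illustration

# Tokenize, lowercase, drop stop words, then count.
words = [w for t in texts for w in t.lower().split() if w not in STOPWORDS]
freqs = Counter(words)
print(freqs.most_common(3))
# [('great', 2), ('food', 2), ('service', 2)]
```

A word-cloud renderer then simply draws each word at a size proportional to its count.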
Gender diversity is an important property everywhere: not only for population statistics but even for businesses trying to attract customers.
There are a few incentives for finding this:
Businesses can take advantage of this diversity by offering gender-targeted services.
On the other hand, some can also fix their shortcomings to attract males & females equally and reduce the diversity gap.
Here we'll see the user growth by gender over the years and how it relates to the number of tips / reviews logged on Yelp every month.
We try to answer the following:
For that, we:
- Convert the yelping_since attribute to datetime
- Group by yelping_since and gender
# **** WARNING *** This takes around 4 to 5 minutes on a quad core intel i7 laptop
gc = yp.get_gender_counts('users')
plt_width, plt_height = 800, 500
p1 = figure(x_axis_type='datetime', title='user growth over the years',
plot_width=plt_width, plot_height=plt_height)
p1.xaxis.axis_label = 'years'
p1.yaxis.axis_label = 'males / females joining per month'
p1.y_range = Range1d(0, 6500)
p1.line(gc.date, gc.male_count, legend='male', color='red')
p1.line(gc.date, gc.female_count, legend='female', color='green')
show(p1)
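The counting step behind a helper like `yp.get_gender_counts` can be sketched in pandas (toy users table; gender labels as defined earlier, 0 = female, 1 = male, 2 = unknown):

```python
import pandas as pd

# Toy stand-in for the users table after gender inference.
users = pd.DataFrame({
    'yelping_since': pd.to_datetime(['2015-03-10', '2015-03-22',
                                     '2015-04-01', '2015-04-15']),
    'gender': [0, 1, 0, 0],      # 0 = female, 1 = male, 2 = unknown
})

# Bucket joins by month and count each gender.
month = users.yelping_since.dt.to_period('M')
gc = (users.groupby([month, 'gender']).size()
           .unstack(fill_value=0)
           .rename(columns={0: 'female_count', 1: 'male_count'}))
print(gc)
# gender         female_count  male_count
# yelping_since
# 2015-03                   1           1
# 2015-04                   2           0
```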
We see that the growth pattern is identical for both sexes: it hit its peak somewhere in early/mid 2015 and has been dropping ever since, back to pre-2011 levels.
The rate of male users registering per month was initially higher, until females overtook them in 2011 and maintained their dominance from then on.
Let us find out their individual monthly contributions.
# **** WARNING *** This will take some time if you run it on a laptop
tc = yp.get_gender_counts('tips', 'month')
rc = yp.get_gender_counts('reviews', 'month')
# normalize the counts
rc.male_count /= rc.male_count.sum()
rc.female_count /= rc.female_count.sum()
tc.male_count /= tc.male_count.sum()
tc.female_count /= tc.female_count.sum()
plt_width, plt_height = 800, 500
p1 = figure(x_axis_type='datetime', title=' Reviews & Tips in the form of female-male ratio over the years',
plot_width=plt_width, plot_height=plt_height)
p1.xaxis.axis_label = 'years'
p1.yaxis.axis_label = 'ratio of tips / reviews entered per month'
p1.line(rc.date, rc.female_count / rc.male_count, legend='reviews', color='green')
p1.line(tc.date, tc.female_count / tc.male_count, legend='tips', color='red')
show(p1)
The female to male ratio for both tips & reviews indicates how actively those genders are contributing every month. A ratio of < 1 indicates male domination and > 1 indicates female dominance.
We observe that the ratio for reviews fluctuated during the early years and then flattened out near 1, indicating roughly equal contribution, until around 2015, after which females have been reviewing more actively than their male counterparts.
The same can be said for tips, which initially saw more male contribution, with females becoming more active on the scene just after 2012.
It can also be observed that the tips ratio shows a stronger increasing trend than the reviews ratio.
This gives an overall sense that females are more active on Yelp in terms of both reviewing & tipping. However, the rate at which they do so varies significantly, as we saw from the reviews' & tips' growth rates in Task 1.
However, it is important to understand that this data is not representative of all people, but rather of a small subset which may or may not be properly sampled. Yelp is constantly expanding to new regions and businesses, and the validity of the results we obtained applies only to the data we have analyzed, not to the whole Yelp user base in general.
For the final task, we develop a text classifier to guess the rating of a review from its text alone. We perform 5-level fine-grained multi-class classification using a shallow Convolutional Neural Network.
As we know, there have been significant technical advancements in the field of Artificial Intelligence in recent years, and deep learning (or neural networks in general) has offered us a new way of looking at things. In particular, Convolutional Neural Networks vastly improved the state of the art in Computer Vision, Image Recognition and Speech systems. They have also proven successful for text classification in recent years.
In this task, we use a shallow CNN to perform fine-grained sentiment classification. Shallow CNNs have proven to be as effective as deep ones for sentiment classification tasks, albeit with much less computational overhead and complexity.
As features for this model, we use GloVe word embeddings, simply because they are as good as Word2Vec (if not better) while also providing smaller-dimensional embeddings (so this can run on my laptop). However, the code accepts any type of word embeddings, such as Google's Word2Vec or Facebook's FastText. You can also train your own Word2Vec embeddings from Yelp reviews (see 'text_classification.py').
The architecture is similar to the paper mentioned above, which is also partly related to my thesis work. 😊 See more about its inner workings here.
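To make the idea concrete, here is a minimal numpy sketch of the forward pass of such a shallow CNN over word embeddings. The weights are random and the dimensions are toy-sized, purely for illustration; the real model is built in 'text_classification.py' on top of Keras/TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim = 10, 8          # toy: 10 tokens, 8-dim embeddings
n_filters, width = 4, 3           # 4 convolution filters of width 3
n_classes = 5                     # 5-level fine-grained ratings

x = rng.normal(size=(seq_len, emb_dim))           # embedded review (toy)
W = rng.normal(size=(n_filters, width, emb_dim))  # conv filters (random)
U = rng.normal(size=(n_filters, n_classes))       # output-layer weights

# 1D convolution over the token axis + ReLU: each filter slides over
# `width`-token windows of the embedded sequence.
conv = np.array([[np.maximum((x[i:i + width] * W[f]).sum(), 0)
                  for i in range(seq_len - width + 1)]
                 for f in range(n_filters)])

pooled = conv.max(axis=1)                 # max-over-time pooling
logits = pooled @ U
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the 5 ratings

print(probs.shape)  # (5,) -- a probability per rating level
```

Max-over-time pooling is what lets a fixed-size classifier handle variable-length reviews: each filter reports only its strongest activation anywhere in the text.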
from text_classification import TextClassification
Using TensorFlow backend.
# Initialize it
tc = TextClassification()
# Load the data (adjust the sample size for testing...)
# tc.load_data(training_samples=1000, validation_samples=200, testing_samples=100)
# My laptop barely survived with 100,000 examples when run locally,
# so you might want to set it to a smaller value if you don't have much RAM
# (e.g., when running in a container). Feel free to increase it otherwise.
tc.load_data(training_samples=100000, validation_samples=20000, testing_samples=10000)
_ = tc.build_network()
holy_grail obtained... loading embeddings...
<text_classification.TextClassification at 0x118cbe978>
tc.train(num_epochs=3, batch_size=500)
/Users/Vivek/anaconda/envs/dataScience/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Train on 93825 samples, validate on 20000 samples
Epoch 1/3
93825/93825 [==============================] - 1923s - loss: 4.8303 - categorical_accuracy: 0.3704 - val_loss: 1.1511 - val_categorical_accuracy: 0.4997
Epoch 2/3
93825/93825 [==============================] - 1903s - loss: 1.0753 - categorical_accuracy: 0.5654 - val_loss: 0.8596 - val_categorical_accuracy: 0.6284
Epoch 3/3
93825/93825 [==============================] - 1893s - loss: 0.8420 - categorical_accuracy: 0.6393 - val_loss: 0.8190 - val_categorical_accuracy: 0.6475
# test it on the heldout dataset ('tc.x_test')
tc.test()
categorical_accuracy: 0.6511
confusion_matrix:
[[ 982  178   48   24   42]
 [ 214  272  233   83   27]
 [  63  139  455  444   83]
 [  25   34  215 1371  779]
 [  25   12   38  783 3431]]
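As a sanity check, the reported test accuracy can be recovered from the confusion matrix itself: correct predictions sit on the diagonal (rows = true rating, columns = predicted rating).

```python
import numpy as np

# Confusion matrix from the test run above.
cm = np.array([[ 982,  178,   48,   24,   42],
               [ 214,  272,  233,   83,   27],
               [  63,  139,  455,  444,   83],
               [  25,   34,  215, 1371,  779],
               [  25,   12,   38,  783, 3431]])

# Accuracy = correct (diagonal) predictions over all predictions.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.6511
```

Note also that most of the mass off the diagonal sits right next to it, i.e., the model's mistakes are mostly off by a single star.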
That's all there is to it. 😅 Hope you have enjoyed it and taken away a point or two. Follow the blog posts if you want to know more. If you find any mistakes, please report them by email.