import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
from networkx.algorithms import bipartite
import pandas as pd
import re
import matplotlib
font = {'family' : 'DejaVu Sans', 'weight' : 'normal', 'size' : 22}
matplotlib.rc('font', **font)
def degree(g, nodes=None, as_list=True):
deg = dict(g.degree())
if nodes: deg = dict(g.degree(nodes))
if as_list: return list(deg.values())
return deg
def degree_plot(g, nodes=None, filename=None, title=''):
deg = degree(g, nodes=nodes)
bins = 100
if len(nodes) < 100:
bins = len(nodes)
hist = np.histogram(deg, bins=bins)
freqs, edges = hist[0], hist[1]
n = freqs.size
means = [(edges[i] + edges[i+1]) / 2 for i in range(n)]
# SCATTER PLOT
plt.figure(figsize=[15,10])
plt.plot(means, freqs, ".", markersize=20)
plt.xlabel("k")
plt.ylabel("frequency")
plt.title("Degree distribution for %s" % title)
if filename: plt.savefig('plots/%s.svg' % filename, format='svg', bbox_inches="tight")
plt.show()
# LOG LOG PLOT
plt.figure(figsize=[15,10])
plt.loglog(means, freqs, ".", markersize=20)
plt.xlabel("log(k)")
plt.ylabel("log(frequency)")
plt.title("Log-log degree distribution for %s" % title)
if filename: plt.savefig('plots/log_%s.svg' % filename, format='svg', bbox_inches="tight")
plt.show()
The dataset used in this project comes from the data provided for the Yelp Dataset Challenge. This dataset consists of about 1.5 million users and about 200 thousand businesses from North America. Additionally the dataset includes just under 6 million reviews, made by users of the Yelp service, to businesses. The businesses included in the dataset are both restaurants as well as businesses offering other services, such as postal delivery.
It is interesting to study the patterns in peoples' opinions and behaviour and with a dataset as large as the yelp dataset, this is made possible.
The purpose of this project is to investigate properties of Yelp’s Elite users. For this paper, the focus will lie on Yelp’s two primary claims about their Elite users:
Yelp states that its Elite users have high connectivity, which means that they are connected with many other users and interact often with members of their Yelp community.
Yelp claims that its Elite users make up the “true heart of the Yelp community.” Third, Yelp claims that its users have high contribution, which means that the user has made a large impact on the site with meaningful and high-quality reviews.
The first goal of our project is to analyze whether the above claims about Yelp’s Elite users are quantifiably valid. For this, we will specify several characteristics which we expect Elite users to have based on these claims. We will then perform analyses on Yelp’s dataset in order to determine whether these properties are truly represented among the Elite users. The secondary goal of our project is to find which properties are most indicative of Elite status on Yelp.
The analyses for the first goal can be used for this purpose as well. This kind of information may be useful for those who are interested in becoming Elite members on Yelp. In order to become a member of the “Elite squad,” a user must go through an application process. Despite the suggestions presented above, Yelp doesn’t provide any specific criteria on exactly what characteristics a user must have to become Elite. The mystery behind the selection process for Elite users is well-documented.
import pandas as pd
def cleanup(N, dataset, chunk_size=100000):
'''
Cleans up a JSON file by adding a trailing comma to each line,
which is missing from the Yelp dataset files.
A chunk size must be specified, since all the lines in the data
files cannot be stored in memory at the same time, due to being very large!
'''
for k in range(N):
dirty_path = 'yelp_dataset/yelp_academic_dataset_%s.json' % dataset
clean_path = "cleaned/%s%i.json" % (dataset, k)
dirty_file = open(dirty_path, "r")
clean_file = open(clean_path, "w")
start = chunk_size * k
end = chunk_size * (k+1)
content = ''
i = 0
for line in dirty_file:
if i == end:
break
elif i >= start:
s = line.replace('\n', ',\n')
content += s
i += 1
if content:
payload = '{"data" : \n[%s]}' % (content[:-2] + '\n')
clean_file.write(payload)
else:
print("No more content.")
print('Iteration', k, 'done')
def read_json_to_df(N, dataset):
# Create dataframe from JSON files
df_matrix = [None] * N
for i in range(N):
path = "cleaned/business%i.json" % i
df_matrix[i] = pd.DataFrame(list(pd.read_json(path).data))
return pd.concat(df_matrix)
# A node class for storing data.
class Node:
def __init__(self, Data, Type):
self.Data = Data
self.Type = Type
def to_string(self):
return "Node (%s), Data: " % (self.Type, self.Data)
def __hash__(self):
return hash(self.Data)
def __eq__(self, other):
return (
self.__class__ == other.__class__ and
self.Data == other.Data
)
# Clean business JSON files
N = 2 # There are about 200k restaurants, therefore 2 chunks of 100k elements is sufficient
dataset = 'business'
cleanup(N, dataset)
# Make dataframe from JSON data
df = read_json_to_df(N, dataset)
# Restaurants will contain the keywords 'restaurant'
# and/or 'food' in the 'category' attribute.
keywords = ['restaurant', 'food']
idx = df.categories.str.lower().str.contains("|".join(keywords)).fillna(False)
rest = df[idx]
# Only include Toronto restaurants
rest.city = rest.city.str.lower()
rest = rest[rest.city == 'toronto']
# Drop attributes irrelevant to the analysis
rest = rest.drop(['city', 'attributes', 'categories', 'address', 'neighborhood', 'is_open', 'hours'], axis=1)
# Save dataset to CSV
rest.to_csv('toronto2/toronto_restaurants.csv', header=False)
# Clean business JSON files
N = 60 # There are about 6M reviews, therefore 60 chunks of 100k elements is sufficient
dataset = 'review'
cleanup(N, dataset)
# Make dataframe from JSON data
df = read_json_to_df(N, dataset)
# Filter out reviews of businesses outside Toronto
reviews = df[df.business_id.isin(rest.business_id)]
# Drop attributes irrelevant to the analysis
reviews = reviews.drop(['cool', 'funny', 'useful'], axis=1)
# Save dataset to CSV
reviews.to_csv('toronto2/toronto_reviews.csv')
# Clean business JSON files
N = 30 # A guess
dataset = 'user'
cleanup(N, dataset)
# Make dataframe from JSON data
df = read_json_to_df(N, dataset)
# Filter out users not in the Toronto reviews
toronto_users = df[df.user_id.isin(reviews.user_id)]
# Drop attributes irrelevant to the analysis
toronto_users = toronto_users.drop(['compliment_cool', 'compliment_cute',
'compliment_funny', 'compliment_hot', 'compliment_list',
'compliment_more', 'compliment_note', 'compliment_photos',
'compliment_plain', 'compliment_profile', 'compliment_writer', 'cool',
'funny', 'fans'], axis=1)
# Save to CSV
toronto_users.to_csv('toronto/toronto_users.csv', index=False)
For this project the restaurants in Toronto were the main focus, as Toronto is a big city with more than a sufficient amount of data to perform a serious analysis, but small enough for various graph algorithms to be carried out. The users considered in this project were all the users who left a review on a business in Toronto.
The social network was created by extracting the friends of each user who made a review on a Toronto-based restaurant, and then creating a link between each user. Some users in the social network will not have placed a review on a Toronto-based restaurant, and will only be in the network due to their friendship with someoneone who has.
The Toronto Yelp review network was modelled as an undirected graph, containing user nodes where the edges between two user nodes represent the fact two users have reviewed the same restaurant.
Detecting how important the elite users were for the network was done by removing them from the network in small chunks, and then observing how the largest connected subcomponent shrinks. The elite users were deleted based on their degree centrality.
# Constant Strings
USER = 'user'
ELITE_USER = 'elite_user'
BIZ = 'biz'
# Read in data
biz = pd.read_csv('toronto/toronto_biz.csv')
user = pd.read_csv('toronto/toronto_users.csv')
reviews = pd.concat([
pd.read_csv('toronto/toronto_reviews_1.csv'),
pd.read_csv('toronto/toronto_reviews_2.csv'),
pd.read_csv('toronto/toronto_reviews_3.csv'),
pd.read_csv('toronto/toronto_reviews_4.csv')
])
# Extract elite users
elite_user = user[~user.elite.str.contains('None')]
elite_ids = set(elite_user.user_id)
print('#Reviews:', len(reviews))
print('#Users:', len(set(reviews.user_id)))
print('#Elite users:', len(elite_user))
print('#Businesses:', len(set(reviews.business_id)))
#Reviews: 379875 #Users: 84624 #Elite users: 7499 #Businesses: 9678
The Toronto Yelp friends network was used to model the friendships of users, whom have reviewed one or more restaurants in Toronto.
# Create friend list
users_with_friends = user[user.friends != 'None']
friend_list = dict()
for row in users_with_friends.itertuples():
friend_list[row.user_id] = re.split(r',', row.friends)
len(friend_list)
49362
social_network = nx.Graph()
for uid in users_with_friends.user_id:
a = Node(uid, ELITE_USER if uid in elite_ids else USER)
for fid in friend_list[uid]:
b = Node(fid, ELITE_USER if fid in elite_ids else USER)
social_network.add_edge(a, b)
N,L = len(social_network.nodes()), len(social_network.edges())
print('Nodes:', len(social_network.nodes()))
print('Edges:', len(social_network.edges()))
Nodes: 1552431 Edges: 3214980
user_nodes = [n for n in list(social_network.nodes()) if n.Type == USER]
elite_nodes = [n for n in list(social_network.nodes()) if n.Type == ELITE_USER]
print('Regular User Nodes:', len(user_nodes))
print('Elite User Nodes:', len(elite_nodes))
print('Elite to regular user ratio:', len(elite_nodes) / len(user_nodes))
Regular User Nodes: 1544962 Elite User Nodes: 7469 Elite to regular user ratio: 0.004834423112024762
degree_plot(social_network, nodes=user_nodes, title="regular user nodes", filename='degree_social_regular')
degree_plot(social_network, nodes=elite_nodes, title="elite nodes", filename='degree_social_elite')
L = max(nx.connected_component_subgraphs(social_network), key=len)
print("Nodes in largest subcomponent:", len(L.nodes()))
print("Edges in largest subcomponent:", len(L.edges()))
Nodes in largest subcomponent: 1534169 Edges in largest subcomponent: 3202099
ev = nx.eigenvector_centrality_numpy(social_network)
ev_avg_elite = np.mean([ev[node]for node in ev if node.Type == ELITE_USER])
ev_avg_reg = np.mean([ev[node]for node in ev if node.Type == USER])
ev_avg_all = np.mean([ev[node]for node in ev if node.Type == USER or node.Type == ELITE_USER])
ev_avg_elite, ev_avg_reg, ev_avg_all
dg = nx.degree_centrality(social_network)
dg_avg_elite = np.mean([dg[node]for node in dg if node.Type == ELITE_USER])
dg_avg_reg = np.mean([dg[node]for node in dg if node.Type == USER])
dg_avg_all = np.mean([dg[node]for node in dg if node.Type == USER or node.Type == ELITE_USER])
dg_avg_elite, dg_avg_reg, dg_avg_all
import random, copy
def robustness_analysis(graph, nodes_, one_percent_of_users, verbose=False):
nodes = copy.copy(nodes_)
k = 100
random.shuffle(nodes)
# Initialize array for LCC sizes
lcc_values = np.zeros(k)
# Start loop
for i in range(k):
# Remove 1 percent of users, randomly chosen
for j in range(one_percent_of_users):
node = nodes.pop()
graph.remove_node(node)
# Compute largest connected subcomponent
mcc = len(max(nx.connected_component_subgraphs(graph), key=len))
lcc_values[i] = mcc
if verbose:
print(str(i) + " percent removed.")
print("Network size: " + str(len(graph)))
print("Largest connected component: " + str(mcc))
return lcc_values
# The number of users to remove each round in the robustness analysis
one_percent = int(len(elite_user)*0.01)
# The robustness analysis takes over an hour, run at your own risk
lcc_random = robustness_analysis(social_network, social_network.nodes(), one_percent)
lcc_regular = robustness_analysis(social_network, social_network.nodes(), one_percent)
lcc_elite = robustness_analysis(social_network, social_network.nodes(), one_percent)
# We have saved the result in files, which can be read here instead.
lcc_random = pd.read_csv('toronto/LCC_random.txt', header=None, names=['data'])
lcc_reg = pd.read_csv('toronto/LCC_regular.txt', header=None, names=['data'])
lcc_elite = pd.read_csv('toronto/LCC_elite.txt', header=None, names=['data'])
x_ax = 75 * np.arange(0,100)
plt.figure(figsize=[15,10])
plt.plot(x_ax, 1 - lcc_random.data / lcc_random.data.max(), linewidth=4)
plt.plot(x_ax, 1 - lcc_reg.data / lcc_reg.data.max(), linewidth=4)
plt.plot(x_ax, 1 - lcc_elite.data / lcc_elite.data.max(), linewidth=4)
plt.legend(['Random users', 'Non-elite users', 'Elite users'])
plt.xlabel('Users removed')
plt.ylabel('Fraction of original network size lost')
plt.savefig('plots/robustness_plot.svg', format='svg', bbox_inches="tight")
plt.title('Robustness analysis, all friends')
plt.show()
# Creating the network
user_ids = set(user.user_id)
social_network = nx.Graph()
for uid in users_with_friends.user_id:
a = Node(uid, ELITE_USER if uid in elite_ids else USER)
for fid in friend_list[uid]:
# Only include user if they left a review in Toronto
if fid in user_ids:
b = Node(fid, ELITE_USER if fid in elite_ids else USER)
social_network.add_edge(a, b)
# Extract number of nodes and edges
N,L = len(social_network.nodes()), len(social_network.edges())
print('Nodes:', N)
print('Edges:', L)
# Extract nodes based on type
user_nodes = [n for n in list(social_network.nodes()) if n.Type == USER]
elite_nodes = [n for n in list(social_network.nodes()) if n.Type == ELITE_USER]
all_nodes = list(social_network.nodes())
Nodes: 13677 Edges: 9437
one_percent = int(len(elite_nodes) / 100)
lcc_elite = robustness_analysis(social_network, elite_nodes, one_percent)
lcc_reg = robustness_analysis(social_network, user_nodes, one_percent)
lcc_random = robustness_analysis(social_network, social_network.nodes(), one_percent)
# Save results to a CSV file
lcc = pd.DataFrame(np.array([lcc_elite, lcc_reg, lcc_random]).T, columns=['elite', 'reg', 'random'])
lcc.to_csv('lcc_only_toronto.csv')
# Plot results on the robustness analysis
x_ax = one_percent * np.arange(0,100)
plt.figure(figsize=[15,10])
plt.plot(x_ax, 1 - lcc_random / lcc_random.max(), linewidth=4)
plt.plot(x_ax, 1 - lcc_reg / lcc_reg.max(), linewidth=4)
plt.plot(x_ax, 1 - lcc_elite / lcc_elite.max(), linewidth=4)
plt.legend(['Random users', 'Non-elite users', 'Elite users'])
plt.xlabel('Users removed')
plt.ylabel('Fraction of original network size lost')
plt.savefig('plots/robustness_only_toronto_plot.svg', format='svg', bbox_inches="tight")
plt.title('Robustness analysis, only Toronto friends')
plt.show()
The Toronto Yelp review network was modelled as an undirected graph, containing user nodes where the edges between two user nodes represent the fact two users have reviewed the same restaurant.
# Create a NetworkX graph for the review network
review_network = nx.Graph()
# For each review, create a node for the user and business and a link between them
for r in reviews.itertuples():
a = Node(r.user_id, ELITE_USER if r.user_id in elite_ids else USER)
b = Node(r.business_id, BIZ)
review_network.add_edge(a, b, weight=r.stars)
# Show the number of nodes and edges
print('Nodes:', len(review_network.nodes()))
print('Edges:', len(review_network.edges()))
Nodes: 94291 Edges: 379875
# Separate nodes based on their type
review_biz_nodes = [n for n in list(review_network.nodes()) if n.Type == BIZ]
review_user_nodes = [n for n in list(review_network.nodes()) if n.Type == USER]
review_elite_nodes = [n for n in list(review_network.nodes()) if n.Type == ELITE_USER]
degree_plot(review_network, review_user_nodes, title="all Toronto users", filename='reviews_degree_normal_users')
degree_plot(review_network, review_elite_nodes, title="Toronto Elite users", filename='reviews_degree_elite_users')
degree_plot(review_network, review_biz_nodes, title="Toronto restaurants", filename='reviews_degree_all_biz')
# Degree centrality
deg = nx.degree(review_network)
deg_elite_user = [deg[n] for n in deg if n.Type == ELITE_USER]
deg_user = [deg[n] for n in deg if n.Type == USER]
elite_avg_deg = np.mean(deg_elite_user)
user_avg_deg = np.mean(deg_user)
all_user_deg = np.mean(deg_elite_user + deg_user)
# Show results
print('Normal user mean degree centrality', user_avg_deg)
print('Elite user mean degree centrality', elite_avg_deg)
print('All users mean degree centrality', all_user_deg)
ratio = elite_avg_deg / user_avg_deg
print('Ratio degree (Elite : Normal): %.2f' % ratio)
Normal user mean degree centrality 3.0597274271561394 Elite user mean degree centrality 19.19215895452727 All users mean degree centrality 4.489446440389524 Ratio degree (Elite : Normal): 6.27
On average, elite users have over 6 times as many friend connections on Yelp, as regular users
# Eigenvalue centrality
ev = nx.eigenvector_centrality_numpy(review_network)
ev_elite_user = [ev[n] for n in ev if n.Type == ELITE_USER]
ev_user = [ev[n] for n in ev if n.Type == USER]
elite_avg_ev = np.mean(ev_elite_user)
user_avg_ev = np.mean(ev_user)
all_user_ev = np.mean(ev_elite_user + ev_user)
# Show results
print('Normal user mean EV centrality', user_avg_ev)
print('Elite user mean EV centrality', elite_avg_ev)
print('All users mean EV centrality', all_user_ev)
ratio = elite_avg_ev / user_avg_ev
print('Ratio EV (Elite : Normal): %.2f' % ratio)
Normal user mean EV centrality 0.0005151721690592373 Elite user mean EV centrality 0.0034574370628518815 All users mean EV centrality 0.0007759271614785319 Ratio EV (Elite : Normal): 6.71
The eigenvalue centrality has already been calculated beforehand, and is stored in the user dataframe.
# Create new column in the user dataframe
user['ev'] = 0
ev_user = {n.Data: ev[n] for n in ev if (n.Type == ELITE_USER) or (n.Type == USER)}
# Insert the eigenvalue of the user in the dataframe. This takes several minutes...
i = 1
p = int(len(ev_user) / 100)
for k in ev_user:
if i % p == 0: print('%i percent done' % (i/p))
eigenvalue = ev_user[k]
user.loc[user.user_id == k, 'ev'] = eigenvalue
i += 1
# Plot the eigenvalue of the user vs. the average rating the user
plt.figure(figsize=[15,10])
plt.scatter(user.average_stars, user.ev, edgecolors='black')
plt.xlabel('Yelp average rating')
plt.ylabel('Eigenvalue centrality')
plt.title('Eigenvalue centrality vs. average user rating for Yelp users in Toronto')
plt.savefig('plots/user_rating_ev.svg', format='svg', bbox_inches="tight")
plt.show()
The eigenvalue centrality has already been calculated beforehand, and is stored in the biz dataframe.
ev_biz = {n.Data: ev[n] for n in ev if n.Type == BIZ}
deg_biz = {n.Data: deg[n] for n in deg if n.Type == BIZ}
biz['ev'] = 0.0
for k in ev_biz:
eigenvalue = ev_biz[k]
biz.loc[biz.business_id == k, 'ev'] = eigenvalue
plt.figure(figsize=[15,10])
plt.scatter(biz.stars, biz.ev, edgecolors='black')
plt.xlabel('Yelp rating')
plt.ylabel('Eigenvalue centrality')
plt.title('Eigenvalue centrality vs Yelp rating for restaurants in Toronto')
plt.savefig('plots/biz_rating_ev.svg', format='svg', bbox_inches="tight")
plt.show()
plt.figure(figsize=[15,10])
plt.scatter(biz.ev, biz.review_count, edgecolors='black')
plt.xlabel('Eigenvector centrality score')
plt.ylabel('Review count')
plt.title('Eigenvector centrality vs. number of review for restaurants in Toronto')
plt.savefig('plots/biz_ev_count.svg', format='svg', bbox_inches="tight")
plt.show()
For this section, the differences in ratings between the regular users and the elite users were investigated.
Are elite users overall harsher in their reviews? Or is it the other way around? Let us find out!
# Extract ratings for elite, regular, and all users
elite_stars = np.array(reviews[reviews.user_id.isin(elite_user.user_id)].stars)
regular_stars = np.array(reviews[~reviews.user_id.isin(elite_user.user_id)].stars)
all_stars = np.array(reviews.stars)
# Histogram data for regular users ratings
reg = np.histogram(regular_stars, bins=[1,2,3,4,5,6])[0]
reg = reg / sum(reg)
# Histogram data for elite users ratings
elit = np.histogram(elite_stars, bins=[1,2,3,4,5,6])[0]
elit = elit / sum(elit)
# Histogram data for all users ratings
all_ = np.histogram(all_stars, bins=[1,2,3,4,5,6])[0]
all_ = all_ / sum(all_)
# Plot the histogram data
plt.figure(figsize=[15,10])
x = np.array([1,2,3,4,5])
dx = 1/12 # x-axis space
plt.bar(x - 2*dx, height=elit, width=2*dx, color='#FFF571', edgecolor='black')
plt.bar(x, height=reg, width=2*dx, color='#FF588A', edgecolor='black')
plt.bar(x + 2*dx, height=all_, width=2*dx, color='#32A9CC', edgecolor='black')
plt.xlabel('Review stars')
plt.ylabel('Fraction of users')
plt.legend(['Elite users', 'Regular users', 'All users'])
plt.title('Review distribution for Elite- and regular Yelp users in Toronto')
plt.savefig('plots/review_dist.svg', format='svg', bbox_inches="tight")
plt.show()
Elite users are more moderate and peak at 4 stars, where regular users are more critical and over-enthustiastic, i.e. giving 1 star reviews, and 5 star reviews.
Are elite users harsher in their reviews? Or is it the other way around? Let us find out!
# Get ids of businesses
biz_ids = [b.Data for b in review_biz_nodes]
# Only regular user reviews
reg_biz_graph = nx.subgraph(review_network, review_user_nodes + review_biz_nodes)
reg_weights_dict = reg_biz_graph.degree(review_biz_nodes, weight='weight')
reg_degrees_dict = reg_biz_graph.degree(review_biz_nodes)
reg_biz_ratings_dict = {
node.Data: reg_weights_dict[node] / reg_degrees_dict[node]
for node in review_biz_nodes
if reg_degrees_dict[node] > 0
}
# Only elite reviews
elite_biz_graph = nx.subgraph(review_network, review_elite_nodes + review_biz_nodes)
elite_weights_dict = elite_biz_graph.degree(review_biz_nodes, weight='weight')
elite_degrees_dict = elite_biz_graph.degree(review_biz_nodes)
elite_biz_ratings_dict = {
node.Data: elite_weights_dict[node] / elite_degrees_dict[node]
for node in review_biz_nodes
if elite_degrees_dict[node] > 0
}
# All user reviews
all_weights_dict = review_network.degree(review_biz_nodes, weight='weight')
all_degrees_dict = review_network.degree(review_biz_nodes)
all_biz_ratings_dict = {
node.Data: all_weights_dict[node] / all_degrees_dict[node]
for node in review_biz_nodes
if all_degrees_dict[node] > 0
}
# Comparison REGULAR AND ELITE
deltas_elite_reg = {
biz_id: elite_biz_ratings_dict[biz_id] - reg_biz_ratings_dict[biz_id]
for biz_id in biz_ids
if biz_id in reg_biz_ratings_dict.keys()
and biz_id in elite_biz_ratings_dict.keys()
}
# Comparison ALL AND ELITE
deltas_elite_all = {
biz_id: elite_biz_ratings_dict[biz_id] - all_biz_ratings_dict[biz_id]
for biz_id in biz_ids
if biz_id in all_biz_ratings_dict.keys()
and biz_id in elite_biz_ratings_dict.keys()
}
# Comparison ALL AND REG
deltas_reg_all = {
biz_id: reg_biz_ratings_dict[biz_id] - all_biz_ratings_dict[biz_id]
for biz_id in biz_ids
if biz_id in all_biz_ratings_dict.keys()
and biz_id in reg_biz_ratings_dict.keys()
}
plt.figure(figsize=[20,5])
plt.hist(np.array(list(deltas_elite_all.values())), bins=200, edgecolor='black')
plt.xlabel('Delta (stars)')
plt.title('Elite reviews compared to all user reviews')
plt.savefig('plots/delta_elite_all.svg', format='svg', bbox_inches="tight")
plt.show()
plt.figure(figsize=[20,5])
plt.hist(np.array(list(deltas_elite_reg.values())), bins=200, edgecolor='black')
plt.xlabel('Delta (stars)')
plt.title('Elite reviews compared to regular user reviews')
plt.savefig('plots/delta_elite_reg.svg', format='svg', bbox_inches="tight")
plt.show()
plt.figure(figsize=[20,5])
plt.hist(np.array(list(deltas_reg_all.values())), bins=200, edgecolor='black')
plt.xlabel('Delta (stars)')
plt.title('Regular user reviews compared to all user reviews')
plt.savefig('plots/delta_all_reg.svg', format='svg', bbox_inches="tight")
plt.show()
Talk about how you've worked with text, including regular expressions, unicode, etc.
Describe which network science tools and data analysis strategies you've used, how those network science measures work, and why the tools you've chosen are right for the problem you're solving.
How did you use the tools to understand your dataset?
A variety of tools went into our text analysis. The review text was already nicely given to us in the data provided by Yelp and so there was no need to use regular expressions or be concerned with unicode in order to acquire the relevant text. First, the langdetect package was used to filter out the small portion of reviews not written in English (mostly written in French since Toronto is in Canada).
Still, the reviews contained many misspellings and slang which led to uninformative results. So, comparing words to the set of English words provided by the nltk package was used to further filter out non-English words. That set of words does not include the conjugated forms, however, and so each word was additionally stemmed using the SnowballStemmer or WordNetLemmatizer provided by nltk in order to better not lose the conjugated forms. Running this pipeline for all of the words across all of the reviews, we were able to create counts of words for each tier of stars and across elite and regular users, ten total documents.
For each of the words in a document, a TF-IDF score was generated as follows:
$TFIDF(word, document) = TF(word, document) \times IDF(word)$
$TF(word, document) = \text{frequency of word in document}$
$IDF(word) = log(\frac{\#\text{ documents}}{\text{# documents containing word}})$
These TF-IDF scores were then used to generate word clouds through the $\textit{wordcloud}$ package.
As each word was analyzed, we also generated a sentiment score for that word. The sentiment score came from the labMT 1.0 data from the Mechanical Turk study which contains 10,222 words and their evaluated average happiness score [2]. Storing the sentiment scores in each review allowed us to generate the graphs shown.
As the reviews were parsed, simple additional statistics such as review length were stored and averaged to create the other figures.
The dataset was quite large, and thus we faced many issues in cleaning and preprocessing it. However, we managed to effectively parse out what we needed and collapse the data into a manageable subset. There is also so much potential for the data and we were able to come up with a couple of key questions to address.
Our network plots showed interesting results and we were able to connect those and centrality measures with their real-world implications. We were also able to confirm prior results from different cities in our analysis of the network robustness. Our results offered a quantitative measure of the differences between Elite vs. regular users and some evaluation of Yelp's claims.
For the text analysis, we managed to effectively filter the data and remove reviews written in foreign languages as well as filter out non-English words which were mostly misspellings and cluttered the word clouds. This required finding and implementing various lemmatizers and toolboxes until the text had been effectively filtered. Additionally, there was a lot of data to be reviewed and working with it all in memory was very slow. So, a strategy to periodically store the data, merging it with previously stored data, was employed.
The Review network is still slightly inconclusive. We had hoped that a better analysis would come of it. An idea we would like to try but requires taking a subgraph of the graph is making a review graph based on users who reviewed the same restaurant.
Additionally, more could be done with the text analysis. Further filtering could be done to potentially show more interesting results. Some words appeared very prominently in the word clouds but did not have an obvious reason for appearing. Upon further research, some of these words were found to correspond to single establishments and unique events of theirs. This clutters the word clouds but also is an interesting result that may benefit from additional analysis.
** References **
[1] Crain, Heh, Winston, 2014. An Analysis of the “Elite” Users on Yelp-.com, Stanford University, CA.
[2] Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12): e26752. https://doi.org/10.1371/journal.pone.0026752