By Jay Sayre, Serguei Balanovich, Sebastian Gehrmann
Please visit our website for more information
Thursday, December 12, 11:59pm
%matplotlib inline
import json
import numpy as np
import copy
import pandas as pd
import networkx as nx
import requests
import scipy
from pattern import web
import matplotlib.pyplot as plt
import matplotlib.pylab as plt2
from scipy.stats import pearsonr
from datetime import datetime
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics.pairwise import cosine_similarity
from myalchemy import MyAlchemy
from sklearn import svm, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.utils.extmath import density
from sklearn import metrics
# set some nicer defaults for matplotlib
from matplotlib import rcParams
#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
(0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
(0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
(0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
(0.4, 0.6509803921568628, 0.11764705882352941),
(0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
(0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
(0.4, 0.4, 0.4)]
rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
"""
Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
"""
ax = axes or plt.gca()
ax.spines['top'].set_visible(top)
ax.spines['right'].set_visible(right)
ax.spines['left'].set_visible(left)
ax.spines['bottom'].set_visible(bottom)
#turn off all ticks
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')
#now re-enable visibles
if top:
ax.xaxis.tick_top()
if bottom:
ax.xaxis.tick_bottom()
if left:
ax.yaxis.tick_left()
if right:
ax.yaxis.tick_right()
"P/r/eddict It!" is the final project by Jay Sayre, Serguei Balanovich, and Sebastian Gehrmann created for the course CS109 - Data Science at Harvard University (cs109.org).
The course covered many interesting topics in data science, from scraping, cleaning, and visualizing data to constructing statistical models that predict outcomes or make valuable suggestions. One lesson briefly discussed the algorithm that generates the front page of the social media website Reddit. This caught our attention: Reddit lets users submit links and vote posts "up" or "down", moving each post's overall score in a way that allows content to go viral quickly. Because of this unique feature, Reddit seemed like a very interesting place from which to construct a large dataset and analyze it for trends in posts, votes, and comments. To generate the front page and bring the most viral posts forward, Reddit ranks content with an algorithm that takes into account both posting time and score.
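For reference, here is a sketch of that ranking as Reddit has open-sourced it (the constants are Reddit's published ones; the function name and comments are our own):

from datetime import datetime
from math import log10

def hot_rank(score, date):
    #sketch of Reddit's open-sourced 'hot' ranking: standing decays with age,
    #and each order of magnitude of score buys back 45000 seconds (~12.5 hours)
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else (-1 if score < 0 else 0)
    #age in seconds relative to Reddit's epoch (Dec 8, 2005, 07:46:43 UTC)
    seconds = (date - datetime(2005, 12, 8, 7, 46, 43)).total_seconds()
    return round(sign * order + seconds / 45000.0, 7)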
One example of a post that gained national attention was, in fact, one of the visualizations from Homework 5, thanks to its unusually high score in the DataIsBeautiful subreddit. Following that attention, almost every big media website, including Yahoo, covered the visualization, which shows bipartisanship in the U.S. Senate.
We found it really interesting that a single post could generate such huge media attention, and we wondered whether it was at all possible to predict or approximate that kind of success. Since Reddit is split into a variety of different topic-based forums known as subreddits, each with its own culture and community, we needed to take a subset of them and investigate, on an individual basis, what made any given post successful, both compared to other posts in the same subreddit and compared to posts in general. There are two types of posts on Reddit: self posts, which consist solely of text, and link posts, which have only a title that links out to some website or image. Given the nature of this project and our interest in the textual analysis of Reddit posts, we chose to concentrate on textual posts and on whether we could predict success using more advanced classification techniques than those used during the course.
*This process book shows only the most important code we needed to progress in building the models and understanding the data. There are 7 other IPython notebooks containing our experimental code, data scraping that takes far too long to rerun (sometimes upwards of 30 hours for a single cell), or code that was useful but extraneous. To keep the process book concise and readable, we have separated all of that out, but we have placed strategic markers throughout this file to indicate where the additional files are worth a look. Finally, as a word of caution, we advise against running this notebook with less than 1GB of free RAM, as it will take up much of the computer's resources.*
We begin by constructing a mental model of Reddit and considering the most important elements we will need to collect before proceeding with statistical analysis. There are more than 278,000 subreddits, each with its own rules and community, which means each requires a different approach to predicting a post's success. Depending on the size of the community, a different score or number of comments counts as success: a subreddit dedicated to jokes will have very different indicators of success than one dedicated to science.
We wanted to compile a list of roughly 10 largely-text based subreddits that have a large community and cover diverse topics. We found most of the subreddits we will look at here.
This is our list:
Reddit offers an API for accessing the site, which can be found here. We used it to download every dataset available for our list of subreddits. Reddit allows access to the top 1,000 posts in each of the following listings: top (all / week / day), new, and hot.
At first we used a wrapper for the API called praw to download our data, but we found it very slow, which was problematic given the sheer amount of data we required. We therefore wrote our own functions to access the Reddit API directly. Reddit's API has some constraints we ran into: it serves only 25-100 posts per call, and requests must be spaced at least 2 seconds apart. Ultimately, our scraping algorithms downloaded a list of 44,000 submissions across all subreddits.
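A minimal sketch of the kind of call our scraper makes (the endpoint and pagination are Reddit's public JSON API; the function and User-Agent string are our own):

import time
import requests

def fetch_listing(subreddit, listing='top', t='all', pages=10):
    #page through a subreddit listing 100 posts at a time,
    #sleeping 2 seconds between requests as Reddit requires
    posts, after = [], None
    headers = {'User-Agent': 'cs109-preddict-it/0.1'}
    for _ in range(pages):
        params = {'limit': 100, 't': t, 'after': after}
        url = 'http://www.reddit.com/r/%s/%s.json' % (subreddit, listing)
        data = requests.get(url, params=params, headers=headers).json()['data']
        posts.extend(child['data'] for child in data['children'])
        after = data['after']
        if after is None:
            break
        time.sleep(2)
    return posts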
The Big Concern
We initially hoped to take all of the entries for a given subreddit in the "top all" listing, i.e. the top-scoring posts (where score $= \text{upvotes} - \text{downvotes}$) in the subreddit for all time. However, we discovered that Reddit's servers only deliver the top 1,000 results for any listing, which introduced some problems. Reddit will not serve more than this limit on the website either, so manually scraping the data was not an option. To overcome this, we scraped every post available in each listing the API provides: "top day", "top week", "top all", "hot", and "new". The algorithm used to rank hot posts was discussed in class, while new simply gives us the newest 1,000 (or fewer) posts in the subreddit. Scraping from these different listings was our way of seeing both high- and low-scoring posts. Admittedly, this method is imperfect: the "correct" way would be to follow new posts over a long period (a few months or more) and build a time series of how each post's score changes, but given our limited time we went with this approach. There was a lot of overlap between the listings, leaving us with fewer unique entries than we had thought - about 26,000. Fortunately, posts appear to reach peak popularity within roughly 72 hours, so it is unlikely that we misclassified many posts that would only become popular later. It is always possible, but it does not seem to be a huge, overarching concern.
*Cleaning the data*
After spending much time trying to perform some rudimentary data analysis and visualization, we kept hitting numerous errors with no obvious cause. After first trying to remove offending elements from our dataset by hand, we ultimately decided it would be best to clean all the data systematically. Although the Reddit API worked fine most of the time, titles were sometimes stored as data types other than strings, and scores were not always stored as ints or floats. We had to drop all offending posts from our dataset and convert the incorrectly typed values to the appropriate types before proceeding.
Additionally, we did not want to look at moderator posts, because they are not community-driven and do not behave like normal posts; they stand out among the others and are more successful by nature, and there was no reason to let them influence our model. We also removed posts with media in them, since we wanted to examine only the effect of text on posts, not links or images, unlike what these Stanford researchers did. For all of the small changes we made while cleaning the dataset, please refer to the code in the notebook redditscraping.ipynb; a condensed sketch of the type checks follows.
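This is a sketch of the kind of checks that notebook performs (the column names match our tables, but the real filters there are more involved):

def clean_posts(df):
    #keep only rows whose title is a real string and whose score parses as an int
    df = df[df['title'].map(lambda t: isinstance(t, basestring))]
    df = df[df['score'].map(lambda s: str(s).lstrip('-').isdigit())]
    df['score'] = df['score'].astype(int)
    return df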
Even after we downloaded all of the posts we wanted to look at, much important information was still missing. To compile a single comprehensive dataset, we had to complete the following three steps.
*Part 1 - merging files*
Since we downloaded data from different subreddits and only got 1,000 entries at a time, we created a ton of .csv files. The first step was to merge these by opening them sequentially and saving them into a single large table (and ultimately a single CSV file), roughly as sketched below.
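With pandas the merge itself is only a few lines (a sketch; the glob pattern is illustrative, and the real file layout lives in our scraping notebooks):

import glob
import pandas as pd

#concatenate every per-subreddit, per-listing CSV into one table
frames = [pd.read_csv(f, encoding='utf-8') for f in glob.glob('Data/*.csv')]
big = pd.concat(frames, ignore_index=True)
big.to_csv('Data/full.csv', encoding='utf-8', index=False)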
*Part 2 - downloading extra information*
Although many different fields were collected in the original download, some information was still missing. For instance, we had to make separate API calls for each user's karma, to check whether there was some valuable correlation there that we could use later. We therefore expanded the table and downloaded this information into it; a sketch of the per-user call follows.
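Each user's karma comes from a separate "about" call (the endpoint and field names are Reddit's; the helper itself is a sketch):

import requests

def get_karma(username):
    #return (comment karma, link karma) for one Reddit user
    url = 'http://www.reddit.com/user/%s/about.json' % username
    headers = {'User-Agent': 'cs109-preddict-it/0.1'}
    data = requests.get(url, headers=headers).json()['data']
    return data['comment_karma'], data['link_karma']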
*Part 3 - downloading comments*
For each of our 44,000 submissions, we also wanted to examine and predict the scores of comments. The text of a comment is often related to the text of its post, and we wanted to add the top-ranked comments to our text analysis to gather as much data about every post as possible at the scraping stage. The Reddit API let us download the top 200 comments for each post; we did this and merged all the comments of each subreddit into one file (a sketch of the per-post call is below). For the code for this portion of the project, please refer to the notebook datapreparation.ipynb.
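The per-post comment call looks roughly like this (again Reddit's public JSON endpoint; flattening of nested replies is omitted):

import requests

def fetch_comments(subreddit, post_id, limit=200):
    #fetch up to `limit` comments for a single post
    url = 'http://www.reddit.com/r/%s/comments/%s.json' % (subreddit, post_id)
    headers = {'User-Agent': 'cs109-preddict-it/0.1'}
    listing = requests.get(url, params={'limit': limit}, headers=headers).json()
    #listing[0] is the post itself, listing[1] the comment tree; kind 't1' marks comments
    return [c['data'] for c in listing[1]['data']['children'] if c['kind'] == 't1']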
We found an API for a natural language processing service called Alchemy. Alchemy offers a wide range of features for text analysis including sentiment analysis and keyword extraction.
We wanted to use Alchemy's results to help with our analysis of post texts and titles. The advantage is that we can not only look at raw text but actually analyze it without much effort, using the "concepts" or "keywords" of posts for further text analysis that would be impossible with the raw titles alone. Though the library did nothing for us at first, by the end of the project we found a great application for the Alchemy keywords.
Alchemy provides a class designed for use in Python, which unfortunately was also quite slow. Given the size of our dataset, we needed better performance, so we wrote our own class for it; the code is in the Python file myalchemy.py.
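The gist of myalchemy.py is a thin wrapper over AlchemyAPI's REST endpoints, along these lines (the endpoint names follow AlchemyAPI's documentation as we recall it; the real class supports more methods, parses the JSON into the tuples shown later, and handles error codes):

import requests

class MyAlchemy(object):
    #minimal wrapper around AlchemyAPI's text-analysis REST calls
    ENDPOINTS = {'keywords': 'TextGetRankedKeywords',
                 'concepts': 'TextGetRankedConcepts'}

    def __init__(self, apikey):
        self.apikey = apikey

    def run_method(self, text, method):
        url = 'http://access.alchemyapi.com/calls/text/%s' % self.ENDPOINTS[method]
        params = {'apikey': self.apikey, 'text': text, 'outputMode': 'json'}
        return requests.get(url, params=params).json()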
After gathering and cleaning all the data and running some basic experiments with Alchemy, we had all the necessary data, plus a basic understanding of its shape, sparsity, and nature, to take our first in-depth look at it. The hypothesis we set out to test was whether it is possible to predict a post's success given the history of previous posts and their properties. We therefore started by looking for correlations between different attributes of posts to see whether anything was obviously useful.
We began with looking at simple metrics. For instance, how many distinct authors are there in our data set?
big_table = pd.read_csv('Data/full.csv', encoding='utf-8')
big_table = big_table[big_table['author'] != "deleted"]
print "Number of posts: ", len(big_table)
print "Number of distinct authors: ", len(big_table .groupby('author'))
Number of posts:  43749
Number of distinct authors:  20044
We have roughly 44,000 posts by 20,000 authors, so the average successful user has approximately two submissions in the lists we investigate. This suggests that some users have many more successful posts: as in most communities, there probably exist power users with numerous successful posts. To investigate this, we look at the 10 most successful users and plot a histogram of the number of posts per individual user.
def get_author_stats():
author_table = big_table.groupby('author')
author_count = author_table['author'].count()
author_count.sort()
return author_count
author_count = get_author_stats()
author_count[-10:]
author
Vladith            69
maxwellhill        74
UserName           75
mepper             78
shadowbanmeplz     81
AdelleChattre      83
FredFltStn         87
wattmeter          91
BurtonDesque      178
drewiepoodle      276
dtype: int64
plt.hist(author_count, bins = 20, log=True)
plt.title("Distribution of number of submissions")
remove_border()
We found similar results while plotting the number of comments and the score of a post. Only a very small fraction of the posts and users are actually successful but there are some outstanding individual entities. You can find the other visualizations in the datavisualization.ipynb notebook.
Now, we try to measure the rate of successful users with more than one post. We can look at this ratio for every type of data that we have and see whether there are differences in the activity of people.
types = list(big_table['type'].unique())
'''
returns:
- the number of active users with more than 2 posts
- the number of distinct authors
- the ratio of active/distinct users
for a subreddit
'''
def get_sub_stats(subreddit):
author_table = subreddit.groupby('author')
dist_authors = len(subreddit.groupby('author'))
#print "Number of distinct authors: ", dist_authors
successful_authors = subreddit[author_table.author.transform(lambda x: x.count() > 1).astype('bool')]
authorset = set()
for a in successful_authors.index:
authorset.add(successful_authors.ix[a]['author'])
active_users = len(authorset)
#print "number of authors with more than 1 submission in the top 1000: ", active_users
if dist_authors >0:
succ_ratio = float(active_users) / dist_authors
else:
succ_ratio = 0
return active_users, dist_authors, succ_ratio
#get the values for all types of data
authorstats = {}
for ctype in types:
curr_df = big_table[big_table['type'] == ctype]
authorstats[ctype] = get_sub_stats(curr_df)
del curr_df #reduce memory
'''
plots a scatterplot for a list of subreddit stats calculated before
X-Axis: Number of active users (users with more than one post)
Y-Axis: Success ratio
'''
def plot_author_success(successlist):
xvals = [value[0] for key, value in successlist.iteritems()]
yvals = [value[2] for key, value in successlist.iteritems()]
labellist = [key for key, value in successlist.iteritems()]
fig, ax = plt.subplots()
ax.scatter(xvals, yvals)
for i, txt in enumerate(labellist):
ax.annotate(txt, (xvals[i],yvals[i]))
plt.title("Active Users with their success rate")
plt.xlabel("No. distinct users")
plt.ylabel("fraction of users with multiple posts")
remove_border()
plot_author_success(authorstats)
It is evident from this plot that the listings with the most active users are actually top_day and top_week. If post scores were essentially random, one would expect top_day to have a success rate similar to new. Instead, some users reliably provide higher-quality posts, which suggests that authors carry properties indicative of their posts' success - as long as we manage to find these authors in the model later!
Now, we consider whether all of the subreddits we looked at have the same fraction of successful users, or whether some have user bases with more success. Perhaps some subreddits have smaller user bases, or perhaps power users and experienced users know how to earn more points in their respective forums. An interesting point to explore indeed - we proceed to plot this for all of the subreddits we are studying.
subreddits = list(big_table['subreddit'].unique())
sr_stats = {}
for ctype in subreddits:
curr_df = big_table[big_table['subreddit'] == ctype]
sr_stats[ctype] = get_sub_stats(curr_df)
del curr_df #reduce memory
plot_author_success(sr_stats)
del sr_stats #reduce memory
It is clear from this plot that the story subreddits have more successful users. We assume this reflects a very active user base in which users post frequently. The question subreddits seem far more random in which posts succeed - if we begin predicting subreddits separately, we should take this into account and probably begin our focus on the less random subreddits.
The next step in the analysis of the data was to look at the combination of the two most important measurements of success - the number of comments and the score of a post. We begin by plotting this relationship for all the data in our dataset.
#regression line
m_fit,b_fit = plt2.polyfit(big_table.comments, big_table.score, 1)
plt2.plot(big_table.comments, big_table.score, 'yo', big_table.comments, m_fit*big_table.comments+b_fit, color='purple', alpha=0.3)
plt.title("Comments versus Score")
plt.xlabel("Comments")
plt.ylabel("Score")
plt.xlim(-10, max(big_table.comments) * 1.05)
plt.ylim(-10, max(big_table.score) * 1.05 )
remove_border()
It is evident that there is a weak linear correlation between the number of comments and the score. We can also see that the line of best fit does not really fit most posts, owing to the large number of unsuccessful posts in the bottom-left part of the plot. There also seems to be a magic ceiling of roughly 2,000-2,500 score that posts rarely cross, no matter how many comments they have or which subreddit they are in. We suspect this is because only a small share of the Reddit community actually participates in up- and downvoting and votes on every post. The majority rarely vote on anything except truly great content, which is what lets a post break the 2,500-score barrier. We might want to investigate later what causes people who generally do not vote to vote on content.
This visualization, however, does not help us understand the correlation between score and comments: the data we are looking at is very sparse and is missing all the older posts with scores below 2,000. We cannot access those posts with the API, so we need to look at smaller slices of the data.
In addition to this graph, we plotted this relationship for all subreddits. These plots can be found in the notebook datavisualization.ipynb.
Next, we investigate whether there is a correlation between the comments and the score of a post when both are very low, since much of our data consists of exactly this kind of post.
big_table_filtered = big_table[big_table['comments'] < 50] #only look at posts with <50 comments
big_table_filtered = big_table_filtered[big_table_filtered['score'] < 100] # and less than 100 score
plt.scatter(big_table_filtered.comments, big_table_filtered.score, alpha=0.2)
plt.title("Comments versus Score")
plt.xlabel("Comments")
plt.ylabel("Score")
plt.xlim(-1, max(big_table_filtered.comments) * 1.05)
plt.ylim(-1, max(big_table_filtered.score) * 1.05 )
remove_border()
del big_table_filtered
Looking at the filtered data, with the successful posts removed, there seems to be no correlation between comments and score: almost every combination of the two occurs in this range. Since we want to predict a post's success while it is still in this bottom-left part of the chart, the comment/score ratio is probably not an optimal indicator of whether a post can and will be successful.
The penultimate test we attempted in this domain, before moving on to our "bag-of-words" classifier work, was whether it makes a difference if an author includes descriptive text in the post. This so-called "self" text is shown when readers open a post to comment on it.
def split_selftext_DataFrame(df):
'''
returns a list with 0 if a post has no selftext and a 1 if it has
'''
is_string_list = []
i = 0
for idx, record in df['selftext'].iteritems():
if type(record) == float: #missing selftext is read in as NaN, which has type float
is_string_list.append(0)
else:
is_string_list.append(1)
return is_string_list
big_table['islink'] = split_selftext_DataFrame(big_table)
big_table_link = big_table[big_table['islink'] == 0]
big_table_self = big_table[big_table['islink'] == 1]
def plot_link_vs_self(table_link, table_self):
'''
plots a scatterplot of scores and comments for two different datasets
'''
p1 = plt.scatter(table_link.comments, table_link.score, color='red', alpha = 0.2)
p2 = plt.scatter(table_self.comments, table_self.score, color='blue', alpha = 0.2)
plt.legend([p1, p2], ["no self text", "self texts"])
plt.title("Comments versus Score ")
plt.xlabel("Comments")
plt.ylabel("Score")
plt.ylim(-10, 5000)
plt.xlim(-10, 30000)
remove_border()
plot_link_vs_self(big_table_link, big_table_self)
del big_table_link
del big_table_self
This visualization clearly shows that, in our set of subreddits, link posts achieve not only higher scores but also higher comment counts than self posts - the ones we want to focus on. It does not, however, show any correlation between score and comments for either group.
Some further visualizations that helped us understand our data on an even deeper level can be found in the notebook datavisualization.ipynb
Let's now begin some exploratory statistical analysis of the data. We found a paper by Stanford researchers that also investigated the predictability of scores; later, we will look into everything they tried and what is possible with our dataset.
First we tried some simple linear regression models on the data, beginning with the impact of a poster's karma and link karma on the score of their posts. One would expect these measures to be somewhat correlated with score, but let's see just how much. We will plot both values on log scales.
logkrm = np.log(big_table['karma'])
loglinkkrm = np.log(big_table['link_karma'])
logscore = np.log(big_table['score'])
plt.scatter(logkrm, logscore, c='g')
plt.title("Karma versus Score - Both on a Logarithimic Scale")
plt.xlabel("Karma (Log)")
plt.ylabel("Score (Log)")
plt.xlim(-0.5, 16)
plt.ylim(-0.5, 10)
remove_border()
plt.show()
r_row, p_value = pearsonr(big_table['karma'], big_table['score'])
print "Pearson coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
plt.scatter(loglinkkrm, logscore, c='g')
plt.title("Link Karma versus Score - Both on a Logarithimic Scale")
plt.xlabel("Link Karma (Log)")
plt.ylabel("Score (Log)")
plt.xlim(-0.5, 16)
plt.ylim(-0.5, 10)
remove_border()
plt.show()
r_row, p_value = pearsonr(big_table['link_karma'], big_table['score'])
print "Pearson r coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
del logkrm, loglinkkrm, logscore
Pearson coefficient is 0.0868347116634 with a p-value of 5.50788323065e-74
Pearson r coefficient is 0.159307162395 with a p-value of 1.51416044271e-246
Based on these results, we were skeptical of the effect either karma score has on the score of a post. Sure, as we found out earlier, there seem to be a few "power users" with a lot of high-scoring posts, but they also have quite a few low-scoring ones. Keep in mind that this is across all subreddits, and the result could be larger or smaller for any individual subreddit. Another possibility is that some "power users" are deleted, null, or lost because of some other complication. It is quite clear, however, that there is no failsafe way to be a consistently successful user: even if most of the posts you create become successful, some posts will always flop, and the almost square-shaped cloud of points above reaffirms this point. One more thing to check, if we ultimately want to build a multiple regression model, is whether karma and link karma are themselves correlated. Let's check this.
r_row, p_value = pearsonr(big_table['karma'], big_table['link_karma'])
print "Pearson r coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
Pearson r coefficient is 0.0810603052473 with a p-value of 1.11139557717e-64
Interestingly, this is not as highly correlated as one might expect, and the predictive value of both measurements upon the score of one's post seems quite low. Therefore, we do not think that there will be any evident gains from including these statistics in our final model.
We still haven't considered title length in our data exploration. Maybe this will give us more information.
big_table['length'] = big_table['title'].map(lambda t: len(str(t))) #title length of every post
plt.scatter(big_table['length'], big_table['score'], c='g')
plt.title("Post Title Length versus Post Score")
plt.xlabel("Title Length")
plt.ylabel("Score")
plt.xlim(0, 300)
plt.ylim(0, 9000)
remove_border()
plt.show()
r_row, p_value = pearsonr(big_table['length'], big_table['score'])
print "Pearson r coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
Pearson r coefficient is 0.194747106686 with a p-value of 0.0
From the scatter, the most popular posts seem to cluster at shorter-to-moderate title lengths, and the weak Pearson r suggests that title length carries at most a little explanatory power.
Now, let's take a first pass at modeling the effect of time on a post's score, starting with a simple linear regression model. Reddit stores the creation time as a UNIX timestamp in coordinated universal time. We convert this to a usable format, find the latest post, and create a new variable that measures days before that post.
We would like to give credit to this stackexchange post for conversion help and this post for checking the time between two dates.
dates = list(big_table['time_created'])
#Function to return the time between dates
def convertdate(dates, which):
dts = []
for date in dates:
dts.append(datetime.utcfromtimestamp(date))
currenttime = datetime.now()
until = max(dts)
days = []
hrs = []
for date in dts:
days.append((until-date).days)
hrs.append((until-date).total_seconds()/3600.0)
#print "Last post in the data set has a date/time of", until.strftime('%Y-%m-%d %H:%M:%S')
if which == 'days':
return days
elif which == 'hours':
return hrs
else:
print 'Enter days or hours'
big_table['daysfrom'] = convertdate(dates, 'days')
big_table['hoursfrom'] = convertdate(dates, 'hours')
# Color each scatter plot point according to subreddit type
df = big_table
#Set the colors of each category for a nicer looking graph
colors = ['c', 'g', 'y', 'b', 'r', 'm', 'k', 'w']
talldf = df[df['type'] == types[0]]
talldf['color'] = colors[0]
tallcol = list(talldf['color'])
newdf = df[df['type'] == types[1]]
newdf['color'] = colors[1]
newcol= list(newdf['color'])
hotdf = df[df['type'] == types[2]]
hotdf['color'] = colors[2]
hotcol= list(hotdf['color'])
tweekdf = df[df['type'] == types[3]]
tweekdf['color'] = colors[3]
tweekcol= list(tweekdf['color'])
tdaydf = df[df['type'] == types[4]]
tdaydf['color'] = colors[4]
tdaycol= list(tdaydf['color'])
#Plot time vs. score
tall = plt.scatter(talldf['daysfrom'], talldf['score'], c=tallcol)
new = plt.scatter(newdf['daysfrom'], newdf['score'], c=newcol)
hot = plt.scatter(hotdf['daysfrom'], hotdf['score'], c=hotcol)
tweek = plt.scatter(tweekdf['daysfrom'], tweekdf['score'], c=tweekcol)
tday = plt.scatter(tdaydf['daysfrom'], tdaydf['score'], c=tdaycol)
plt.title("Post Date (in Days posted before November, 11 2013) versus Post Score")
plt.xlabel("Number of Days posted before last post date")
plt.ylabel("Score (Upvotes-Downvotes)")
plt.xlim(0, 2100)
plt.ylim(0, 9000)
plt.legend((tall, new, hot, tweek, tday),
('Top all', 'New', 'Hot', 'Top Weekly', 'Top Day'),
loc='upper right')
remove_border()
plt.show()
r_row, p_value = pearsonr(talldf['daysfrom'], talldf['score'])
print "Pearson r coefficient for top all is " + str(r_row) + " with a p-value of " + str(p_value)
Pearson r coefficient for top all is 0.35027311129 with a p-value of 0.0
We can see from the graph and the r score that the relationship between post date and post score is not a simple linear one. That said, this is still an interesting visualization of our dataset. Let's examine it in a few other ways.
'''
The following code will plot four scatterplots from different combinations
of this data and various axis limits to see if any patterns can be observed
'''
#exclude data older than 100 days from the plots
tall = plt.scatter(talldf['hoursfrom']/24.0, talldf['score'], c=tallcol)
new = plt.scatter(newdf['hoursfrom']/24.0, newdf['score'], c=newcol)
hot = plt.scatter(hotdf['hoursfrom']/24.0, hotdf['score'], c=hotcol)
tweek = plt.scatter(tweekdf['hoursfrom']/24.0, tweekdf['score'], c=tweekcol)
tday = plt.scatter(tdaydf['hoursfrom']/24.0, tdaydf['score'], c=tdaycol)
plt.title("Post Date (in Days posted before November, 11 2013) versus Post Score")
plt.xlabel("Number of Days posted before last post date")
plt.ylabel("Score (Upvotes-Downvotes)")
plt.xlim(0, 100)
plt.ylim(0, 5500)
plt.legend((tall, new, hot, tweek, tday),
('Top all', 'New', 'Hot', 'Top Weekly', 'Top Day'),
loc='upper right')
remove_border()
plt.show()
#leave out all and hot
new = plt.scatter(newdf['hoursfrom']/24.0, newdf['score'], c=newcol)
tweek = plt.scatter(tweekdf['hoursfrom']/24.0, tweekdf['score'], c=tweekcol)
tday = plt.scatter(tdaydf['hoursfrom']/24.0, tdaydf['score'], c=tdaycol)
plt.title("Post Date (in Days posted before November, 11 2013) versus Post Score")
plt.xlabel("Number of Days posted before last post date")
plt.ylabel("Score (Upvotes-Downvotes)")
plt.xlim(0, 100)
plt.ylim(0, 4200)
plt.legend((new, tweek, tday),
('New', 'Top Weekly', 'Top Day'),
loc='upper right')
remove_border()
plt.show()
#plot it with hot
new = plt.scatter(newdf['hoursfrom']/24.0, newdf['score'], c=newcol)
hot = plt.scatter(hotdf['hoursfrom']/24.0, hotdf['score'], c=hotcol)
tweek = plt.scatter(tweekdf['hoursfrom']/24.0, tweekdf['score'], c=tweekcol)
tday = plt.scatter(tdaydf['hoursfrom']/24.0, tdaydf['score'], c=tdaycol)
plt.title("Post Date (in Days posted before November, 11 2013) versus Post Score")
plt.xlabel("Number of Days posted before last post date")
plt.ylabel("Score (Upvotes-Downvotes)")
plt.xlim(0, 100)
plt.ylim(0, 4200)
plt.legend((new, hot, tweek, tday),
('New', 'Hot', 'Top Weekly', 'Top Day'),
loc='upper right')
remove_border()
plt.show()
#look only at day and week
tweek = plt.scatter(tweekdf['hoursfrom']/24.0, tweekdf['score'], c=tweekcol)
tday = plt.scatter(tdaydf['hoursfrom']/24.0, tdaydf['score'], c=tdaycol)
plt.title("Post Date (in days posted before November, 11 2013) versus Post Score")
plt.xlabel("Number of Days posted before last post date")
plt.ylabel("Score (Upvotes-Downvotes)")
plt.xlim(0, 8)
plt.ylim(0, 4200)
plt.legend((tweek, tday),
('Top Weekly', 'Top Day'),
loc='upper right')
remove_border()
plt.show()
del talldf, tallcol, newdf, newcol, hotdf, hotcol, tweekdf, tweekcol, tdaydf, tdaycol
A few observations jump out from these scatterplots:
To build our first "real" model we need to read in the data. We compute upvotes/downvotes to generate a measure of how controversial a post is; this may not be the most useful approach, because Reddit deliberately fuzzes the number of up- and downvotes it displays. We then try some simple models to test different approaches to predicting scores.
df = pd.read_csv('Data/full.csv', encoding='utf-8') # Top all is our training data set
print len(df)
df['up/down'] = df['upvotes'].astype(float)/df['downvotes'].astype(float) # Reddit fuzzes this so...
topcomments=float(max(df['comments']))
topsscore=float(max(df['score']))
leastcontro = max(df['up/down'])
# The following metric is something we invented for testing purposes
df['mymetric'] = (((df['comments'].astype(float)/topcomments)*0.10)+\
((df['score'].astype(float)/topsscore)*0.85)+\
((df['up/down']/leastcontro)*0.05))**(0.30)
df['nrmscore'] = (df['score'].astype(float)/topsscore)**(0.30)
bigdf = df
df = df[df['subreddit'] == 'AskReddit']
df2 = df[df['type'] == 'top_week']
print len(df2)
df = df[df['type'] == 'top_all']
print len(df)
44261
999
994
#It's important in cross validation that the sets are disjoint, so we are removing duplicates
dfids = list(df['id'])
df2ids = list(df2['id'])
dupids = []
for redditid in dfids:
if redditid in df2ids:
dupids.append(redditid)
#This part is slightly overengineered, but the motivation is that we didn't want to simply strip the duplicated
#posts out of one data set at will. Instead, we split the duplicates in half and randomly assign each half to be
#removed from one of the two data sets, to avoid any possible bias.
if len(dupids)%2 != 0:
a = len(dupids)/2
a = a+1
dup1 = dupids[0:a]
dup2 = dupids[a:]
else:
a = len(dupids)/2
dup1 = dupids[0:a]
dup2 = dupids[a:]
if np.random.randint(2) == 0:
df=df[df['id'].apply(lambda x: x in dup1) == False]
df2=df2[df2['id'].apply(lambda x: x in dup2) == False]
else:
df=df[df['id'].apply(lambda x: x in dup2) == False]
df2=df2[df2['id'].apply(lambda x: x in dup1) == False]
print len(df)
print len(df2)
982
987
vectorizer = CountVectorizer(min_df=0.001)
title = list(df['title']) + list(df2['title'])
vectorizer.fit(title)
def category(x, df, num=2):
size = len(df)
blocksize = size/num
for i in range(num):
blockmax = max(sorted(df['score'])[blocksize*i:blocksize*(i+1)])
if x < blockmax:
return i+1
return num
x_train = vectorizer.transform(df['title'])
x_test = vectorizer.transform(df2['title'])
score = [category(i, df2) for i in df['score']]
score2 = [category(i, df2) for i in df2['score']]
y_train = np.array(score)
y_test = np.array(score2)
vectorizer2 = CountVectorizer(min_df=0.001)
title2 = df2['title']
vectorizer2.fit(title2)
X2 = vectorizer2.transform(title2)
Y2 = np.array(df2['score'])
clf = MultinomialNB(alpha=1)
clf.fit(x_train, y_train)
print "Training accuracy is", clf.score(x_train, y_train)
print "Test accuracy is", clf.score(x_test, y_test)
Training accuracy is 1.0
Test accuracy is 0.517730496454
As we can see, there are some severe issues with this setup: the training accuracy is perfect because we are not doing a proper split of the training and testing data. Let's try that now and see if we can get anything more reasonable.
dftitles = df['title']
df2titles = df2['title']
vectorizer = CountVectorizer(min_df=0.001)
title = list(dftitles) + list(df2titles)
vectorizer.fit(title)
def category(x, df, num=2):
size = len(df)
blocksize = size/num
for i in range(num):
blockmax = max(sorted(df['score'])[blocksize*i:blocksize*(i+1)])
if x < blockmax:
return i+1
return num
#scores = [category(i) for i in df2['score']]
#print scores
#X = vectorizer.transform(title)
#Y = np.array(scores)
x_train = vectorizer.transform(dftitles)
x_test = vectorizer.transform(df2titles)
score = [1 if i > np.mean(df['score']) else 0 for i in df['score']]
score2 = [1 if i > np.mean(df2['score']) else 0 for i in df2['score']]
y_train = np.array(score)
y_test = np.array(score2)
clf = MultinomialNB(alpha=1)
clf.fit(x_train, y_train)
print "Training accuracy is", clf.score(x_train, y_train)
print "Test accuracy is", clf.score(x_test, y_test)
Training accuracy is 0.852342158859
Test accuracy is 0.755825734549
This looks great! An accuracy of 0.75 at this stage of the project would be a fantastic start. Unfortunately, the real reason for the high accuracy, in both training and testing, is the way the binning is designed. First, we tried a category split, but it yielded a test accuracy directly inverse to the number of categories (.5 for 2 categories, .25 for 4, etc.). We kept the category-split code out of interest and for possible future revisitation, but decided to bin on the mean score instead: a post above the mean in its category is considered "successful", and otherwise "unpopular". This yields the result above, but it is unfortunately an artifact of mismatched bin sizes. In cross-validating the top posts of the week against the top posts of all time, we ignored the incredible skew of top weekly posts: only a few have very high scores, and they pull the mean up so far that they become the only posts above it. Since over 90% of the posts thus fall below the mean, that bin is enormous and validates itself by sheer size, disregarding any mistakes made in the small "popular" bin.
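The imbalance is easy to reproduce on synthetic data. The sketch below uses a Pareto sample as a stand-in for post scores (the distribution and the numbers are illustrative, not our dataset): under heavy right skew, far fewer than half the values sit above the mean, so always guessing "below the mean" already scores well.

import numpy as np

np.random.seed(0)
fake_scores = np.random.pareto(1.0, 10000)  #heavy-tailed, like post scores
above = np.mean(fake_scores > fake_scores.mean())
print "Fraction of posts above the mean:", above  #far below 0.5
print "Accuracy of always guessing 'below the mean':", 1 - above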
For now, we leave this and try to see what we can do with our Alchemy class instead.
apikey = "e945cef59338f9e8e7bc962badde170e623fb7e5" #Please insert your own key here
p = MyAlchemy(apikey)
This is an overview of what Alchemy is able to do:
dftitles = list(df['title'])
df2titles = list(df2['title'])
print dftitles[5]
print p.run_method(dftitles[5], 'concepts')
print p.run_method(dftitles[5], 'keywords')
print p.run_method(dftitles[5], 'category')
#print p.run_method(dftitles[5], 'sentiment')
print p.run_method(dftitles[5], 'entities')
print len(df)
A ume all of world history is a movie What are the biggest plotholes
[(u'0.857188', u'History of the world'), (u'0.837988', u'Ibn Khaldun')]
[(u'0.930601', u'biggest plotholes'), (u'0.875235', u'ume'), (u'0.7165', u'world history'), (u'0.539622', u'movie')]
(u'arts_entertainment', u'english', u'0.570921', u'OK')
[(u'1', u'0.33', u'world history', u'FieldTerminology')]
982
#Concepts, keywords, category, sentiment, entities - all things Alchemy can provide
categories = []
concepts, concepts2 = [], []
for i in range(30):
conceptlist = p.run_method(dftitles[i], 'concepts')
for c in conceptlist:
concepts.append(c[1])
for i in range(30):
conceptlist = p.run_method(df2titles[i], 'concepts')
for c in conceptlist:
concepts2.append(c[1])
print concepts
print "--------"
print concepts2
[u'Gerontology', u'Ageing', u'Old age', u'2006 albums', u'Laptop', u'Internet', u'Problem solving', u'World Wide Web', u'Manchester City F.C.', u'Mobile phone', u'Wi-Fi', u'Man', u'Guy', u'2004 albums', u'History of the world', u'Ibn Khaldun', u'Alien abduction', u'Man', u'Gender', u'Leonard Cohen', u'Western culture', u'Puerto Rico', u'United States', u'Latin America', u'U.S. state', u'Territories of the United States', u'Native Americans in the United States', u'Christopher Columbus', u'Spanish language', u'Illegal drug trade', u'Drug', u'2007 albums', u'High school', u'College', u'Draw-A-Person Test', u'The Front Page', u'University', u'Education', u'Yolanda Adams', u'Knowledge', u'Ralph Waldo Emerson', u'Family', u'The Red Chord', u'Plane', u'Education', u'English-language films', u'2006 albums', u'Political terms', u'WALL-E', u'The Nature Conservancy', u'Statistics', u'Arnold Schwarzenegger', u'Taste', u'Internet pornography', u'Internet']
--------
[u'Vector space', u'John Carpenter', u'Old Testament', u'Pennsylvania', u'E-mail', u'Culture', u'Roswell UFO Incident', u'Theodore Roosevelt', u'Barack Obama', u'President of the United States', u'Hotel', u'United States', u'Country music', u'Old-time music', u'Rock and roll', u'Mother', u'Father', u'Parent', u'Family', u'Question', u'Adolescence', u'Pumpkin pie', u'Human sexuality', u'Life', u'Look', u'Norman Rockwell', u'Mika Nakashima', u'Stan Lee', u'2006 albums', u'Second language', u'Cognition', u'Homosexuality', u'Bisexuality', u'Sexual orientation', u'Heterosexuality', u'Street dance', u'2000s American television series', u'1999 in film', u'Time', u'Pixar', u'American films', u'Plus One', u'Christmas', u"St Hilda's College, Oxford"]
vectorizer = CountVectorizer(min_df=0.001)
vectorizer.fit(concepts)
X = vectorizer.transform(concepts)
Y = np.array(df['score'][0:55])
title2 = df2['title']
print len(Y)
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.5) #I added the train size parameter.
clf = MultinomialNB(alpha=1)
clf.fit(x_train, y_train)
print "Training accuracy is", clf.score(x_train, y_train)
print "Test accuracy is", clf.score(x_test, y_test)
55
Training accuracy is 0.962962962963
Test accuracy is 0.0
Unfortunately, Alchemy looks like a waste of time at this point. With a test accuracy of 0.0, the concepts generated by Alchemy show no correlation with score, and the other potential candidate features seem equally out of the question. Although it was worth testing to see what we have to work with, we will not get any predictive power out of this until we come up with a better use case.
Next, let's implement a Naive Bayes classifier on the entire dataset; it will serve as our baseline to improve upon. First, we need to remove duplicates across listing types for each subreddit, which, by visual inspection, are quite common - clearly due to reposting within the same subreddit or even cross-posting across several subreddits at once.
del big_table #don't need this any longer
df = pd.read_csv('Data/full.csv', encoding='utf-8') #using a fresh dataset for this
print "Original size of data set is", len(df)
df = df.drop_duplicates('id')
print "Size of data set with only unique posts is", len(df)
dfmean = np.mean(df['score'])
df = df.sort('score')
df = df.reset_index(level=0, drop=True)
median = len(df)/2
md = df['score'][median]
Original size of data set is 44261
Size of data set with only unique posts is 25992
To prevent large blocks of repeated code later, and since we will be testing our classifier and improving it with different techniques over time, we abstract out the function that creates the X and Y sets for our training and testing data. Let's test this abstraction now with just the titles and scores of the DataFrame.
def make_xy(titles, scores, vectorizer=None):
#Set default vectorizer
if not vectorizer:
vectorizer = CountVectorizer(min_df=0.001)
#Build the vocabulary by fitting the vectorizer to the list of quotes
vectorizer.fit(titles)
#Convert into a bag-of-words and use a sparse array to save memory
x = vectorizer.transform(titles)
x = x.tocsc()
#save into numpy array, and return everything
y = np.array(scores)
return x, y, vectorizer
X,Y,vectorizer = make_xy(list(df['title']), df['score'])
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.5)
clf = MultinomialNB(alpha=1)
clf.fit(x_train, y_train)
print "Training accuracy is", clf.score(x_train, y_train)
print "Test accuracy is", clf.score(x_test, y_test)
Training accuracy is 0.160126192675
Test accuracy is 0.0675592489997
Again, not a very promising result; clearly, we should explore some other options. But first, let's clarify exactly what we are doing here, since this code will be familiar only to those who have experience with Homework 3 of the Data Science class.
What we are doing here is constructing a "bag-of-words": every word that occurs in any title is added to a vocabulary, and for each title the algorithm counts how often each vocabulary word occurs. This is what CountVectorizer does.
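Concretely, on two toy titles (this small example is ours, not part of the dataset), the vocabulary and the count matrix look like this:

from sklearn.feature_extraction.text import CountVectorizer

toy = CountVectorizer()
bag = toy.fit_transform(["the cat sat", "the cat ran fast"])
print toy.get_feature_names()  #[u'cat', u'fast', u'ran', u'sat', u'the']
print bag.toarray()
#[[1 0 0 1 1]
# [1 1 1 0 1]]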
We then create our training and test sets to predict scores from post titles. As is clear from the numbers above, the model both overfits badly and fits the training set itself poorly, with a training accuracy of only about 16%. We will have to improve the model substantially to reach the accuracy we want for this project.
Maybe the reason for the low score is the wide range of possible scores a post can have. Let's bin the data and predict the bin a post falls into instead.
We can split the dataset into 5 equally sized bins to reduce the sparsity of the target, and see whether we get a better result overall.
sorteddf = df.sort('score')
sorteddf['category'] = df['score']
size = len(df)
num = 5
blocksize = size/num
blocks = [blocksize * i for i in range(num)]
blocks.append(size)
for i in range(num):
sorteddf['category'][blocks[i]:blocks[i+1]] = i+1
Xsort, Ysort, vectorizer2 = make_xy(list(sorteddf['title']), sorteddf['category'])
x_train3, x_test3, y_train3, y_test3 = train_test_split(Xsort, Ysort, train_size=0.5)
clf3 = MultinomialNB(alpha=1)
clf3.fit(x_train3, y_train3)
train_acc = clf3.score(x_train3, y_train3)
test_acc = clf3.score(x_test3, y_test3)
print "Training accuracy is", train_acc
print "Test accuracy is", test_acc
Training accuracy is 0.427516158818
Test accuracy is 0.327408433364
This is a lot better! Using bins works better than predicting the raw scores, which makes sense because there is simply less room for error. These accuracies are still unfortunately low, however. Next, we cross-validate on the number of bins to see whether it actually affects the accuracy; if it does, we will store the best binning method for later use.
sorteddf = df.sort('score')
sorteddf['category'] = df['score']
size = len(df)
best_test = 0
best_vect = None
best_Ysort = None
best_clf = None
for num in range(2, 11):
blocksize = size/num
blocks = [blocksize * i for i in range(num)]
blocks.append(size)
for i in range(num):
sorteddf['category'][blocks[i]:blocks[i+1]] = i+1
Xsort, Ysort, vectorizer2 = make_xy(list(sorteddf['title']), sorteddf['category'])
x_train3, x_test3, y_train3, y_test3 = train_test_split(Xsort, Ysort, train_size=0.5)
clf3 = MultinomialNB(alpha=1)
clf3.fit(x_train3, y_train3)
train_acc = clf3.score(x_train3, y_train3)
test_acc = clf3.score(x_test3, y_test3)
if best_test < test_acc:
best_test = test_acc
best_vect = copy.deepcopy(vectorizer2)
best_Ysort = copy.deepcopy(Ysort)
best_clf = copy.deepcopy(clf3)
print "For", num, "bins:"
print "Training accuracy is", train_acc
print "Test accuracy is", test_acc
print "---------------------------------"
For 2 bins:
Training accuracy is 0.657048322561
Test accuracy is 0.601261926747
---------------------------------
For 3 bins:
Training accuracy is 0.54593721145
Test accuracy is 0.47837796245
---------------------------------
For 4 bins:
Training accuracy is 0.476915974146
Test accuracy is 0.387426900585
---------------------------------
For 5 bins:
Training accuracy is 0.423437980917
Test accuracy is 0.331024930748
---------------------------------
For 6 bins:
Training accuracy is 0.404970760234
Test accuracy is 0.288935056941
---------------------------------
For 7 bins:
Training accuracy is 0.385887965528
Test accuracy is 0.256386580486
---------------------------------
For 8 bins:
Training accuracy is 0.374730686365
Test accuracy is 0.234379809172
---------------------------------
For 9 bins:
Training accuracy is 0.348491843644
Test accuracy is 0.213065558633
---------------------------------
For 10 bins:
Training accuracy is 0.336949830717
Test accuracy is 0.187519236688
---------------------------------
As expected, the more bins we use, the less accurate the classifier becomes on both the training and test sets. This seems logical, and one encouraging observation is that changing the number of bins does not make the classifier overfit or underfit more, which suggests there is genuine predictive value in this approach. This baseline multinomial naive Bayes model is therefore what we have to build upon. Interestingly, stemming the titles with the nltk package - keeping only the roots of words and dropping prefixes and suffixes - does not seem to have much of an effect on the results. For our stemming attempts and other experiments, please take a look at the notebook Statisticalmodelling.ipynb.
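For reference, the stemming we experimented with looks roughly like the sketch below (a Porter stemmer from nltk; the real runs live in Statisticalmodelling.ipynb):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_title(title):
    #reduce every word of a title to its Porter stem
    return " ".join(stemmer.stem(w) for w in title.split())

print stem_title("cats running happily")  #'cat run happili'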
This reaffirms the robustness of the original test: it holds up even when we strip extraneous parts of each word. Stemming won't be necessary in the future, though we may try it again out of interest.
Let's now try an n-gram analysis to improve our model. We will use the optimal Y calculated above as our y_train; see stackexchange and the documentation for more information.
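To make clear what the ngram_range parameter adds, here is the feature set a CountVectorizer extracts from a single toy title with unigrams through trigrams (an illustrative example, not our data):

from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer(ngram_range=[1, 3], analyzer='word')
demo.fit(["today we learned something"])
print demo.get_feature_names()
#[u'learned', u'learned something', u'something', u'today', u'today we',
# u'today we learned', u'we', u'we learned', u'we learned something']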
n_grams = CountVectorizer(ngram_range=[1, 5], analyzer='word')
n_grams.fit(list(sorteddf['title']))
Xngram = n_grams.transform(list(sorteddf['title']))
x_train4, x_test4, y_train4, y_test4 = train_test_split(Xngram, best_Ysort, train_size=0.5)
clf4 = MultinomialNB(alpha=1)
clf4.fit(x_train4, y_train4)
print "Training accuracy is", clf4.score(x_train4, y_train4)
print "Test accuracy is", clf4.score(x_test4, y_test4)
Training accuracy is 0.988996614343
Test accuracy is 0.60811018775
It seems n-grams really do improve the model, to an acceptable 61% accuracy! With this breakthrough, we can use this as the new basis for further improvements. Let's try the same thing with a TF-IDF (term frequency - inverse document frequency) matrix, which is usually a slight improvement over the plain count vectorizer.
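For reference, in the classic formulation the weight of a term $t$ in a title $d$ is $\mathrm{tf}(t,d) \cdot \log\frac{N}{\mathrm{df}(t)}$, where $N$ is the number of titles and $\mathrm{df}(t)$ is the number of titles containing $t$, so rare terms are weighted up and ubiquitous ones down. With sublinear_tf=True the raw count $\mathrm{tf}$ is replaced by $1 + \log(\mathrm{tf})$, so a word repeated many times in one title does not dominate. (scikit-learn's TfidfVectorizer additionally smooths the idf term, so its numbers differ slightly from this textbook version.)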
tfidf = TfidfVectorizer(ngram_range=[1, 5], sublinear_tf=True)
tfidf.fit(list(sorteddf['title']))
Xtfidf = tfidf.transform(list(sorteddf['title']))
x_train5, x_test5, y_train5, y_test5 = train_test_split(Xtfidf, best_Ysort, train_size=0.5)
clf5 = MultinomialNB(alpha=1)
clf5.fit(x_train5, y_train5)
print "Training accuracy is", clf5.score(x_train5, y_train5)
print "Test accuracy is", clf5.score(x_test5, y_test5)
Training accuracy is 0.993844259772
Test accuracy is 0.617651585103
Since this is only a small improvement, it is not strictly necessary to keep using the TF-IDF vectorizer; we may revisit it in the future. Now let's predict some word probabilities, just out of interest:
#Calculate the popular and unpopular word probabilities
#and create a new, sorted DataFrame for them
mywords = best_vect.get_feature_names()
print len(mywords)
diag = np.eye(len(mywords))
unpop, pop = zip(*best_clf.predict_proba(diag))
data = pd.DataFrame({'words': mywords, 'p_pop': pop, 'p_unpop': unpop})
sort = data.sort('p_pop', ascending=False).copy()
print 'Top 10 \"Best\" Words:'
print
for i in sort[:10].index:
print "The word",sort.words[i],"has probability", sort.p_pop[i], "of being popular"
print
print 'Top 10 \"Worst\" Words:'
print
for i in sort[:-11:-1].index:
print "The word",sort.words[i],"has probability", sort.p_unpop[i], "of being unpopular"
1320 Top 10 "Best" Words: The word elizabeth has probability 0.937900976218 of being popular The word sopa has probability 0.934274878848 of being popular The word pt has probability 0.923040261622 of being popular The word romney has probability 0.90318067632 of being popular The word warren has probability 0.90318067632 of being popular The word pizza has probability 0.894068767541 of being popular The word tent has probability 0.883063589381 of being popular The word fireworks has probability 0.883063589381 of being popular The word banks has probability 0.876656522686 of being popular The word boyfriend has probability 0.869506657706 of being popular Top 10 "Worst" Words: The word immigration has probability 0.952964319805 of being unpopular The word thanksgiving has probability 0.887375543378 of being unpopular The word friday has probability 0.869125830358 of being unpopular The word affect has probability 0.849122748133 of being unpopular The word holiday has probability 0.835122636099 of being unpopular The word christmas has probability 0.831207001264 of being unpopular The word related has probability 0.81825859606 of being unpopular The word role has probability 0.81825859606 of being unpopular The word eliwhat has probability 0.808463180497 of being unpopular The word opinion has probability 0.808463180497 of being unpopular
These lists are really interesting, and we spent a bit of time trying to work out why and how each word was categorized as "popular" or "unpopular". As we can see, words like immigration and opinion get fairly low scores, while sopa and banks score higher on average.
Now, we test whether splitting the data by subreddit and vectorizing each subreddit separately improves the model.
#let's get started with a new and clean data set once again
df = pd.read_csv('Data/full.csv', encoding='utf-8')
df = df.drop_duplicates('id')
df = df.sort('score')
df = df.reset_index(level=0, drop=True)
df = df.drop_duplicates()
subreddit_ngrams = {}
for subreddit in subreddits:
smalldf = df[df['subreddit'] == subreddit]
sortedsmalldf = smalldf.sort('score')
sortedsmalldf['category'] = smalldf['score']
size = len(smalldf)
num = 2
blocksize = size/num
blocks = [blocksize * i for i in range(num)]
blocks.append(size)
for i in range(num):
sortedsmalldf['category'][blocks[i]:blocks[i+1]] = i+1
n_grams = CountVectorizer(ngram_range=[1, 3])
n_grams.fit(list(sortedsmalldf['title']))
X = n_grams.transform(list(sortedsmalldf['title']))
Y = np.array(sortedsmalldf['category'])
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.5)
clf = MultinomialNB(alpha=50)
clf.fit(x_train, y_train)
subreddit_ngrams[subreddit] = [clf, n_grams]
train_acc = clf.score(x_train, y_train)
test_acc = clf.score(x_test, y_test)
print "For", subreddit, "subreddit:"
print "Training accuracy is", train_acc
print "Test accuracy is", test_acc
print "---------------------------------"
For atheism subreddit:
Training accuracy is 0.837711069418
Test accuracy is 0.552532833021
---------------------------------
For politics subreddit:
Training accuracy is 0.656279508971
Test accuracy is 0.592067988669
---------------------------------
For nosleep subreddit:
Training accuracy is 0.791089108911
Test accuracy is 0.595450049456
---------------------------------
For pettyrevenge subreddit:
Training accuracy is 0.864171122995
Test accuracy is 0.50641025641
---------------------------------
For jokes subreddit:
Training accuracy is 0.805899143673
Test accuracy is 0.557564224548
---------------------------------
For askhistorians subreddit:
Training accuracy is 0.735714285714
Test accuracy is 0.522448979592
---------------------------------
For TalesFromTechsupport subreddit:
Training accuracy is 0.898197242842
Test accuracy is 0.52704135737
---------------------------------
For AskReddit subreddit:
Training accuracy is 0.733142857143
Test accuracy is 0.565714285714
---------------------------------
For talesFromRetail subreddit:
Training accuracy is 0.889947089947
Test accuracy is 0.533333333333
---------------------------------
For askscience subreddit:
Training accuracy is 0.673509286413
Test accuracy is 0.525390625
---------------------------------
For tifu subreddit:
Training accuracy is 0.764887063655
Test accuracy is 0.530800821355
---------------------------------
For explainlikeimfive subreddit:
Training accuracy is 0.892686804452
Test accuracy is 0.583002382844
---------------------------------
Splitting by subreddit did not improve our model; in fact, it made it slightly worse.
All of our previous methods looked at a single title and its parts. Another angle is to compare titles with each other: for each title, find the titles most similar to it and check whether similar titles earn similar scores. A popular way to measure text similarity is cosine similarity. We first compute the similarity between each pair of posts and then run a method that finds the k nearest neighbors of a title. This gives us several values to check for correlation with the score: the score of the single closest title, and the max, min, and mean score of the k nearest ones. Now let's implement it.
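As a quick illustration before the full run, here is the similarity score on three made-up titles; a value near 1 means nearly identical wording, and 0 means no shared terms:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec = TfidfVectorizer()
M = vec.fit_transform(["reddit karma tips",
                       "tips for getting reddit karma",
                       "cute cat picture"])
#similarity of the first title to all three: 1 to itself,
#high for the second title, 0 for the third
print cosine_similarity(M[0], M)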
We iterate over the TF-IDF matrix; for each title, the resulting array holds the cosine similarity between that title and every title in the set.
# after http://stackoverflow.com/questions/12787650/finding-the-index-of-n-biggest-elements-in-python-array-list-efficiently
# we need this to efficiently get the largest cosine scores
def f(a):
    #a has shape (1, n): take the single row and reverse the sort order
    #so the indices run from the largest score to the smallest
    return np.argsort(a)[0][::-1]
def make_xy(titles, scores, vectorizer=None):
    #this one uses a tfidf vectorizer as opposed to the earlier version
    #set a default vectorizer
    if not vectorizer:
        vectorizer = TfidfVectorizer()
    #build the vocabulary by fitting the vectorizer to the list of titles
    #convert into a bag-of-words and use a sparse array to save memory
    x = vectorizer.fit_transform(titles)
    #x = x.tocsc()
    #save into numpy array, and return everything
    y = np.array(scores)
    return x, y, vectorizer
X,Y,vectorizer = make_xy(list(df['title']), df['score'])
#this calculates close to 2.5 billion scores and sorts each of the 50,000 lists of 50,000 entries -> might take a while
def make_closest():
    '''
    calculates the cosine similarity between each pair of titles and
    returns a dictionary mapping each title to its closest titles
    '''
    closest_title_scores = {}
    for a in X:
        vec = cosine_similarity(a, X)
        #sort the results from most to least similar
        sorted_vec = f(vec)
        num = 0
        already_printed = 0
        closest_list = []
        while already_printed < 10:
            #the dropped IDs leave holes in the index -> some entries may not exist
            try:
                curr = df['title'][sorted_vec[num]]
                sco = df['score'][sorted_vec[num]]
                closest_list.append((curr, sco, vec[0][sorted_vec[num]]))
                already_printed += 1
            except KeyError:
                pass
            num += 1
        #drop the first entry because it is the cosine of the title with itself (score 1)
        closest_title_scores[closest_list[0][0]] = closest_list[1:]
    return closest_title_scores

closest_title_scores = make_closest()
def knearest(title, k=7):
    """
    Given a title, return a sorted list of the k most similar titles
    (with their scores and cosine similarities) from the entire data set.
    """
    return closest_title_scores[title][:k]
#generate the new columns, initialized to 0
df['max_cosine'] = 0
df['avg_cosine'] = 0
df['min_cosine'] = 0
df['closest_cosine'] = 0
#fill in the values to the new columns
for key, value in df.iterrows():
    max_score = 0
    mean_score = 0
    min_score = 0
    closest_score = 0
    try:
        tuple_list = knearest(value['title'])
        closest_score = tuple_list[0][1]
        max_score = max(tuple_list, key=lambda item: item[1])[1]
        min_score = min(tuple_list, key=lambda item: item[1])[1]
        mean_score = np.mean([a[1] for a in tuple_list])
    except KeyError:
        #titles dropped during deduplication have no entry
        pass
    df['max_cosine'][key] = max_score
    df['avg_cosine'][key] = mean_score
    df['min_cosine'][key] = min_score
    df['closest_cosine'][key] = closest_score
#calculate the pearsonr
print "max cosine pearson"
r_row, p_value = pearsonr(df['max_cosine'], df['score'])
print "Pearson coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
print "avg cosine pearson"
r_row, p_value = pearsonr(df['avg_cosine'], df['score'])
print "Pearson coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
print "min cosine pearson"
r_row, p_value = pearsonr(df['min_cosine'], df['score'])
print "Pearson coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
print "closest cosine pearson"
r_row, p_value = pearsonr(df['closest_cosine'], df['score'])
print "Pearson coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
max cosine pearson
Pearson coefficient is 0.169645106629 with a p-value of 4.57076248584e-167
avg cosine pearson
Pearson coefficient is 0.312734742086 with a p-value of 0.0
min cosine pearson
Pearson coefficient is 0.08813460013 with a p-value of 5.46830455005e-46
closest cosine pearson
Pearson coefficient is 0.153899785414 with a p-value of 1.68783653801e-137
The result clearly, and disappointingly, indicates that the correlations are too small to be worth including in our model. It also suggests that similar-sounding titles have little to do with the success of a post.
Our next idea was to generate bigrams and trigrams from titles and run a Reddit search query on each of them. The result page shows several posts that we can scrape using the API; we can then store the scores of the search results and look for patterns in them. The investigation can be found in the notebook Title Bi-Trigram Analysis. Unfortunately, none of the methods we tried there worked.
Let's get back to our first approach, then. For each subreddit, we calculate the probability of a post being successful and add it to our dataframe so we can use it later. We use the same n-gram methods as in our most successful model.
spec_probs = []
for i in df.index:
    title = df.title[i]
    subreddit = df.subreddit[i]
    #look up the classifier and vectorizer fitted for this post's subreddit
    clf = subreddit_ngrams[subreddit][0]
    n_grams_spec = subreddit_ngrams[subreddit][1]
    prob_spec = clf.predict_proba(n_grams_spec.transform([title]))[0][1]
    spec_probs.append(prob_spec)
df['spec_probs'] = spec_probs
df.to_csv("Data/new_full.csv", index=False, encoding='utf-8')
def plot_spec_prob(table):
    '''
    plots a scatterplot of scores against the specific probability
    '''
    m_fit, b_fit = plt2.polyfit(table.spec_probs, table.score, 1)
    plt2.plot(table.spec_probs, table.score, 'yo',
              table.spec_probs, m_fit * table.spec_probs + b_fit, color='red', alpha=.9)
    #p1 = plt.scatter(table.spec_probs, table.score, color='red', alpha = 0.2)
    plt.title("Specific probability against score")
    plt.xlabel("Specific probability")
    plt.ylabel("Score")
    plt.ylim([0, 7000])
    remove_border()

plot_spec_prob(df)
Now we can write a prediction function based on the specific probability.
m, b, r, p, std = scipy.stats.linregress(np.array(df['spec_probs']), np.array(df['score']))
print "slope", m
print "slope intercept", b
print "squared correlation", r**2
print "probability", p      #p-value of the regression
print "standard deviation", std      #standard error of the estimated slope

def predict(title):
    #note: clf and n_grams_spec still hold the model of the last
    #subreddit fitted in the loop above
    x = clf.predict_proba(n_grams_spec.transform([title]))[0][1]
    y = m * x + b
    return y
slope 2130.34920343
slope intercept -646.136514001
squared correlation 0.185034187628
probability 0.0
standard deviation 27.7326496246
Using it, we can work through the sklearn library and test several of its linear classifiers.
We start with the Ridge classifier; its documentation can be found in the sklearn docs.
subreddit_svm = {}
for subreddit in subreddits:
    smalldf = df[df['subreddit'] == subreddit]
    sortedsmalldf = smalldf.sort('score')
    sortedsmalldf['category'] = smalldf['score']
    size = len(smalldf)
    num = 2
    blocksize = size / num
    blocks = [blocksize * i for i in range(num)]
    blocks.append(size)
    for i in range(num):
        sortedsmalldf['category'][blocks[i]:blocks[i+1]] = i + 1
    n_grams = CountVectorizer(ngram_range=(1, 3))
    n_grams.fit(list(sortedsmalldf['title']))
    X = n_grams.transform(list(sortedsmalldf['title']))
    Y = np.array(sortedsmalldf['category'])
    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.5)
    clf = RidgeClassifier(tol=1e-2, solver="lsqr")
    clf.fit(x_train, y_train)
    subreddit_svm[subreddit] = [clf, n_grams]
    train_acc = clf.score(x_train, y_train)
    test_acc = clf.score(x_test, y_test)
    print "For", subreddit, "subreddit:"
    print "Training accuracy is", train_acc
    print "Test accuracy is", test_acc
    print "---------------------------------"
Subreddit              Training accuracy   Test accuracy
atheism                0.997185741088      0.634146341463
politics               0.999055712937      0.707271010387
nosleep                1.0                 0.640949554896
pettyrevenge           0.99679144385       0.547008547009
jokes                  0.997145575642      0.572787821123
askhistorians          1.0                 0.588775510204
TalesFromTechsupport   0.998939554613      0.517497348887
AskReddit              0.999428571429      0.585714285714
talesFromRetail        1.0                 0.537566137566
askscience             1.0                 0.603515625
tifu                   0.998973305955      0.545174537988
explainlikeimfive      1.0                 0.559968228753
Now let's actually start using Alchemy. We want to use Alchemy's keywords function to extract keywords for each title; the notebook alchemy keywords shows how this is done. We added a column to the dataframe containing the result of the keywords function. Now let's use those keywords to predict the score.
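The heavy lifting goes through the MyAlchemy wrapper imported at the top of this notebook, whose internals live in the alchemy keywords notebook. Purely as an illustration of the underlying service, a raw request against AlchemyAPI's keyword-extraction endpoint looked roughly like the sketch below; the endpoint, parameters, and response shape here are recalled from the old public AlchemyAPI documentation rather than taken from our code, and the service has since been retired:

import requests

API_KEY = "YOUR_ALCHEMY_API_KEY"   #placeholder, not a real key
URL = "http://access.alchemyapi.com/calls/text/TextGetRankedKeywords"

def alchemy_keywords(text):
    #ask the service for ranked keywords in JSON form
    params = {"apikey": API_KEY, "text": text, "outputMode": "json"}
    resp = requests.get(URL, params=params).json()
    #each keyword entry carries its text and a relevance score
    return [kw["text"] for kw in resp.get("keywords", [])]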
subreddit_alchemy = {}
for subreddit in subreddits:
    smalldf = df[df['subreddit'] == subreddit]
    sortedsmalldf = smalldf.sort('score')
    sortedsmalldf['category'] = smalldf['score']
    size = len(smalldf)
    num = 2
    blocksize = size / num
    blocks = [blocksize * i for i in range(num)]
    blocks.append(size)
    for i in range(num):
        sortedsmalldf['category'][blocks[i]:blocks[i+1]] = i + 1
    #clean the stored alchemy keyword strings: strip brackets and parentheses,
    #keep only lowercase letters and spaces, collapse double spaces, and drop
    #the leading token before rejoining the keywords into one string
    alch_titles = []
    for title in list(sortedsmalldf['title']):
        titles = [lst.replace('(', '') for lst in sortedsmalldf[sortedsmalldf['title'] == title]['alchemy']]
        titles = [lst.replace(')', '') for lst in titles]
        titles = [lst.replace('[', '') for lst in titles]
        titles = [lst.replace(']', '') for lst in titles]
        titles = "".join(titles)
        titles = "".join(ch for ch in titles if ch in 'qwertyuiopasdfghjklzxcvbnm ')
        titles = titles.replace('  ', ' ')
        titles = titles.split(' ')
        alch_titles.append(" ".join(titles[1:]))
    n_grams = CountVectorizer(ngram_range=(1, 3))
    n_grams.fit(list(alch_titles))
    X = n_grams.transform(alch_titles)
    Y = np.array(sortedsmalldf['category'])
    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.5)
    clf = RidgeClassifier(tol=1e-2, solver="lsqr")
    clf.fit(x_train, y_train)
    subreddit_alchemy[subreddit] = [clf, n_grams]
    train_acc = clf.score(x_train, y_train)
    test_acc = clf.score(x_test, y_test)
    print "For", subreddit, "subreddit:"
    print "Training accuracy is", train_acc
    print "Test accuracy is", test_acc
    print "---------------------------------"
Subreddit              Training accuracy   Test accuracy
atheism                0.998123827392      0.69512195122
politics               1.0                 0.754485363551
nosleep                0.993069306931      0.839762611276
pettyrevenge           0.995721925134      0.662393162393
jokes                  0.997145575642      0.659372026641
askhistorians          0.995918367347      0.630612244898
TalesFromTechsupport   0.997879109226      0.662778366914
AskReddit              1.0                 0.714285714286
talesFromRetail        0.998941798942      0.695238095238
askscience             0.999022482893      0.642578125
tifu                   0.993839835729      0.70636550308
explainlikeimfive      0.99920508744       0.618745035743
As opposed to the first Ridge classifier, this one actually beats our baseline across the board! In the nosleep subreddit the test accuracy reaches almost 84%! That nosleep would be the most predictable subreddit was already suggested when we plotted the user success rate for each subreddit: nosleep had the highest user success score, which aligns with these results.
Now let's try the next classifier: the Perceptron. Since the Alchemy keywords always yielded better results, we only include those runs here; the versions using just the raw titles can be found in the notebook statisticalmodelling.
subreddit_alchemy = {}
for subreddit in subreddits:
    smalldf = df[df['subreddit'] == subreddit]
    sortedsmalldf = smalldf.sort('score')
    sortedsmalldf['category'] = smalldf['score']
    size = len(smalldf)
    num = 2
    blocksize = size / num
    blocks = [blocksize * i for i in range(num)]
    blocks.append(size)
    for i in range(num):
        sortedsmalldf['category'][blocks[i]:blocks[i+1]] = i + 1
    #clean the stored alchemy keyword strings, as before
    alch_titles = []
    for title in list(sortedsmalldf['title']):
        titles = [lst.replace('(', '') for lst in sortedsmalldf[sortedsmalldf['title'] == title]['alchemy']]
        titles = [lst.replace(')', '') for lst in titles]
        titles = [lst.replace('[', '') for lst in titles]
        titles = [lst.replace(']', '') for lst in titles]
        titles = "".join(titles)
        titles = "".join(ch for ch in titles if ch in 'qwertyuiopasdfghjklzxcvbnm ')
        titles = titles.replace('  ', ' ')
        titles = titles.split(' ')
        alch_titles.append(" ".join(titles[1:]))
    n_grams = CountVectorizer(ngram_range=(1, 3))
    n_grams.fit(list(alch_titles))
    X = n_grams.transform(alch_titles)
    Y = np.array(sortedsmalldf['category'])
    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.5)
    clf = Perceptron(n_iter=50)
    clf.fit(x_train, y_train)
    subreddit_alchemy[subreddit] = [clf, n_grams]
    train_acc = clf.score(x_train, y_train)
    test_acc = clf.score(x_test, y_test)
    print "For", subreddit, "subreddit:"
    print "Training accuracy is", train_acc
    print "Test accuracy is", test_acc
    print "---------------------------------"
Subreddit              Training accuracy   Test accuracy
atheism                0.996247654784      0.717636022514
politics               1.0                 0.774315391879
nosleep                0.99504950495       0.826904055391
pettyrevenge           0.988235294118      0.637820512821
jokes                  0.986679352997      0.701236917222
askhistorians          0.997959183673      0.641836734694
TalesFromTechsupport   0.995758218452      0.672322375398
AskReddit              0.999428571429      0.709714285714
talesFromRetail        0.998941798942      0.678306878307
askscience             1.0                 0.6689453125
tifu                   0.993839835729      0.673511293634
explainlikeimfive      0.998410174881      0.645750595711
This classifier seems even better than the Ridge classifier on most subreddits, with test accuracy on nosleep approaching 83%.
The next classifier we tried was the Passive-Aggressive classifier.
subreddit_alchemy = {}
for subreddit in subreddits:
    smalldf = df[df['subreddit'] == subreddit]
    sortedsmalldf = smalldf.sort('score')
    sortedsmalldf['category'] = smalldf['score']
    size = len(smalldf)
    num = 2
    blocksize = size / num
    blocks = [blocksize * i for i in range(num)]
    blocks.append(size)
    for i in range(num):
        sortedsmalldf['category'][blocks[i]:blocks[i+1]] = i + 1
    #clean the stored alchemy keyword strings, as before
    alch_titles = []
    for title in list(sortedsmalldf['title']):
        titles = [lst.replace('(', '') for lst in sortedsmalldf[sortedsmalldf['title'] == title]['alchemy']]
        titles = [lst.replace(')', '') for lst in titles]
        titles = [lst.replace('[', '') for lst in titles]
        titles = [lst.replace(']', '') for lst in titles]
        titles = "".join(titles)
        titles = "".join(ch for ch in titles if ch in 'qwertyuiopasdfghjklzxcvbnm ')
        titles = titles.replace('  ', ' ')
        titles = titles.split(' ')
        alch_titles.append(" ".join(titles[1:]))
    n_grams = CountVectorizer(ngram_range=(1, 3))
    n_grams.fit(list(alch_titles))
    X = n_grams.transform(alch_titles)
    Y = np.array(sortedsmalldf['category'])
    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.5)
    clf = PassiveAggressiveClassifier(n_iter=50)
    clf.fit(x_train, y_train)
    subreddit_alchemy[subreddit] = [clf, n_grams]
    train_acc = clf.score(x_train, y_train)
    test_acc = clf.score(x_test, y_test)
    print "For", subreddit, "subreddit:"
    print "Training accuracy is", train_acc
    print "Test accuracy is", test_acc
    print "---------------------------------"
Subreddit              Training accuracy   Test accuracy
atheism                0.999061913696      0.733583489681
politics               0.998111425873      0.792256846081
nosleep                0.992079207921      0.861523244313
pettyrevenge           0.994652406417      0.65811965812
jokes                  0.998097050428      0.725975261656
askhistorians          0.997959183673      0.671428571429
TalesFromTechsupport   0.996818663839      0.690349946978
AskReddit              1.0                 0.726857142857
talesFromRetail        0.996825396825      0.691005291005
askscience             1.0                 0.6708984375
tifu                   0.993839835729      0.717659137577
explainlikeimfive      0.99920508744       0.661636219222
Judging by the output above, this algorithm achieves the best maximum score so far (about 86% on nosleep), and the scores look better overall as well. Let's investigate which classifier is best.
The statisticalmodelling notebook also contains a k-nearest neighbors implementation, but it runs into memory constraints and its scores are lower than those of the other three methods.
#this needs a lot of memory! you might get a memory error running it
#note: the feature-set index gets its own name (feat_idx) so the inner
#loops no longer clobber it, which broke the alchemy check before
for feat_idx, d in enumerate(['Not alchemy', 'Alchemy']):
    for clf, name in (
            (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
            (Perceptron(n_iter=50), "Perceptron"),
            (PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive")):
        subreddit_svm = {}
        for subreddit in subreddits:
            smalldf = df[df['subreddit'] == subreddit]
            sortedsmalldf = smalldf.sort('score')
            sortedsmalldf['category'] = smalldf['score']
            size = len(smalldf)
            num = 2
            blocksize = size / num
            blocks = [blocksize * i for i in range(num)]
            blocks.append(size)
            for i in range(num):
                sortedsmalldf['category'][blocks[i]:blocks[i+1]] = i + 1
            titles = list(sortedsmalldf['title'])
            bins = list(sortedsmalldf['category'])
            if feat_idx == 1:
                #rebuild the features from the cleaned alchemy keywords instead
                alch_titles = []
                for title in list(sortedsmalldf['title']):
                    words = [lst.replace('(', '') for lst in sortedsmalldf[sortedsmalldf['title'] == title]['alchemy']]
                    words = [lst.replace(')', '') for lst in words]
                    words = [lst.replace('[', '') for lst in words]
                    words = [lst.replace(']', '') for lst in words]
                    words = "".join(words)
                    words = "".join(ch for ch in words if ch in 'qwertyuiopasdfghjklzxcvbnm ')
                    words = words.replace('  ', ' ')
                    words = words.split(' ')[1:]
                    alch_titles.append(words)
                #each keyword inherits the bin of the post it came from
                alch_bins = []
                categories = np.array(sortedsmalldf['category'])
                for j, lst in enumerate(alch_titles):
                    b = categories[j]
                    for _ in range(len(lst)):
                        alch_bins.append(b)
                alch_titles = [word for words in alch_titles for word in words]
                titles = alch_titles
                bins = alch_bins
            n_grams = CountVectorizer(ngram_range=(1, 3))
            n_grams.fit(titles)
            X = n_grams.transform(titles)
            Y = np.array(bins)
            x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.5)
            clf.fit(x_train, y_train)
            subreddit_svm[subreddit] = [clf, n_grams]
            train_acc = clf.score(x_train, y_train)
            test_acc = clf.score(x_test, y_test)
            print "For", d, "and", subreddit, "subreddit and", name, "classifier:"
            print "Training accuracy is", train_acc
            print "Test accuracy is", test_acc
            print "---------------------------------"
Features      Classifier           Subreddit              Training accuracy   Test accuracy
Not alchemy   Ridge Classifier     atheism                0.78222501656       0.70723820858
Not alchemy   Ridge Classifier     politics               0.830306477093      0.810881516588
Not alchemy   Ridge Classifier     nosleep                0.733062793678      0.651869697824
Not alchemy   Ridge Classifier     pettyrevenge           0.713601204737      0.615625770063
Not alchemy   Ridge Classifier     jokes                  0.859435898619      0.802333894514
Not alchemy   Ridge Classifier     askhistorians          0.760463532673      0.667846155717
Not alchemy   Ridge Classifier     TalesFromTechsupport   0.686583176424      0.580494373166
Not alchemy   Ridge Classifier     AskReddit              0.774272657099      0.704943677462
Not alchemy   Ridge Classifier     talesFromRetail        0.670623338578      0.566282577089
Not alchemy   Ridge Classifier     askscience             0.750661152211      0.668223917982
Not alchemy   Ridge Classifier     tifu                   0.735712672239      0.648790858358
Not alchemy   Ridge Classifier     explainlikeimfive      0.774277996224      0.69989257293
Not alchemy   Perceptron           atheism                0.732993521876      0.667796876579
Not alchemy   Perceptron           politics               0.811399684044      0.761605055292
Not alchemy   Perceptron           nosleep                0.678633062794      0.620891748041
Not alchemy   Perceptron           pettyrevenge           0.662495721815      0.591900993894
Not alchemy   Perceptron           jokes                  0.832614561472      0.772007103157
Not alchemy   Perceptron           askhistorians          0.722804542912      0.642797213001
Not alchemy   Perceptron           TalesFromTechsupport   0.638634671805      0.574331129308
Not alchemy   Perceptron           AskReddit              0.732295459358      0.675146356625
Not alchemy   Perceptron           talesFromRetail        0.618173927196      0.548890046221
Not alchemy   Perceptron           askscience             0.705481446962      0.641067163744
Not alchemy   Perceptron           tifu                   0.692568330698      0.628939675791
Not alchemy   Perceptron           explainlikeimfive      0.521584497127      0.439825196341
Not alchemy   Passive-Aggressive   atheism                0.743333819848      0.66278952273
Not alchemy   Passive-Aggressive   politics               0.821535545024      0.769642969984
Not alchemy   Passive-Aggressive   nosleep                0.654404100812      0.565189532768
Not alchemy   Passive-Aggressive   pettyrevenge           0.68188103224       0.597541275361
Not alchemy   Passive-Aggressive   jokes                  0.849772837342      0.789419063213
Not alchemy   Passive-Aggressive   askhistorians          0.728872297478      0.631294154841
Not alchemy   Passive-Aggressive   TalesFromTechsupport   0.638820773306      0.538358803696
Not alchemy   Passive-Aggressive   AskReddit              0.721092431808      0.647371045639
Not alchemy   Passive-Aggressive   talesFromRetail        0.632224814192      0.541088904802
Not alchemy   Passive-Aggressive   askscience             0.710642511796      0.629102347527
Not alchemy   Passive-Aggressive   tifu                   0.678908834824      0.591828328461
Not alchemy   Passive-Aggressive   explainlikeimfive      0.686156509768      0.604365919544
Alchemy       Ridge Classifier     atheism                0.780743019457      0.710235884539
Alchemy       Ridge Classifier     politics               0.827842022117      0.812423380727
Alchemy       Ridge Classifier     nosleep                0.73357539513       0.654031148815
Alchemy       Ridge Classifier     pettyrevenge           0.71154767609       0.614996029899
Alchemy       Ridge Classifier     jokes                  0.858029104495      0.800304421024
Alchemy       Ridge Classifier     askhistorians          0.76200783093       0.667992071888
Alchemy       Ridge Classifier     TalesFromTechsupport   0.687623155406      0.577626220607
Alchemy       Ridge Classifier     AskReddit              0.773794479093      0.705169104236
Alchemy       Ridge Classifier     talesFromRetail        0.669418976835      0.564557428987
Alchemy       Ridge Classifier     askscience             0.749286421623      0.668981184831
Alchemy       Ridge Classifier     tifu                   0.73605814587       0.648950305607
Alchemy       Ridge Classifier     explainlikeimfive      0.773381032343      0.702103693197
Alchemy       Perceptron           atheism                0.736406606114      0.670132144742
Alchemy       Perceptron           politics               0.426439178515      0.374723538705
Alchemy       Perceptron           nosleep                0.681272960273      0.619029312009
Alchemy       Perceptron           pettyrevenge           0.670572934492      0.599006105742
Alchemy       Perceptron           jokes                  0.470330481308      0.391181015198
Alchemy       Perceptron           askhistorians          0.528198642963      0.43919551551
Alchemy       Perceptron           TalesFromTechsupport   0.547609143057      0.469468406533
Alchemy       Perceptron           AskReddit              0.730539862968      0.673424915807
Alchemy       Perceptron           talesFromRetail        0.615418000326      0.554412690146
Alchemy       Perceptron           askscience             0.527686841032      0.454744568067
Alchemy       Perceptron           tifu                   0.518276883828      0.445522189742
Alchemy       Perceptron           explainlikeimfive      0.524191950271      0.447126065145
Alchemy       Passive-Aggressive   atheism                0.753730254073      0.679461990142
Alchemy       Passive-Aggressive   politics               0.828309636651      0.778906793049
Alchemy       Passive-Aggressive   nosleep                0.700119607006      0.630203928202
Alchemy       Passive-Aggressive   pettyrevenge           0.689369566705      0.601333406347
Alchemy       Passive-Aggressive   jokes                  0.8413320726        0.778741265192
Alchemy       Passive-Aggressive   askhistorians          0.725966098397      0.628083999076
Alchemy       Passive-Aggressive   TalesFromTechsupport   0.64608967903       0.547236940053
Alchemy       Passive-Aggressive   AskReddit              0.709663977485      0.635710333427
Alchemy       Passive-Aggressive   talesFromRetail        0.600770357511      0.493099407591
Alchemy       Passive-Aggressive   askscience             0.719613211394      0.640449700006
Alchemy       Passive-Aggressive   tifu                   0.690309464649      0.603321817699
Alchemy       Passive-Aggressive   explainlikeimfive      0.743374461561      0.672065832977
Discussion of Findings
The purpose of this project was to study the posts on the popular link and discussion sharing website Reddit, and try to utilize some advanced Data Science techniques to construct a statistical model to represent the posts based on a variety of metrics, with the potential of ultimately creating a predictor that would yield a score based on words in a title. Of course, Reddit titles aren’t usually very long and so there are few words to actually base predictions upon – however, with a powerful enough predictor constructed from all components of a post, including the comments and selftext, we hoped to be able to construct something viable.
We began by learning about our data set. Reddit has hundreds of thousands of subreddits, and we decided that for the purposes of our project it would be far more interesting to explore the several text-based subreddits listed here. Using the Reddit API, we mined all of the data for these subreddits, cleaned our post titles of digits, and ensured that our scores were only numerical. Though there were few of these weird glitches, they caused us many problems, so effectively cleaning the data set was crucial to our success in this project.
After merging all of our data, we brought in a useful text analysis library called Alchemy, which came in quite handy near the end of the project. For each post, we downloaded the top 200 (or fewer) comments, merged them, and fed them into Alchemy to produce keywords from our data. Although this may bias us towards posts with more keywords, the method was meant to surface the elements of a post that users were most likely to comment upon, serving as a proxy for the most important elements of the post.
Next, we made some plots comparing such things as number of comments and score, link karma and score, and post date and score. Unfortunately, as can be seen from the graphs in our visualization section, it was clear from the outset that though these showed some small trends, they would not be good enough to create predictions, considering parsimony. Furthermore, something like the number of comments is unlikely to have much predictive ability, even though it is probably correlated with score. Thus, we turned to performing some Multinomial Classification in the hope that a Naive Bayesian Classifier would work out. We experimented with many different things: reducing the number of bins to 2, stemming our words, removing very rare words using the min_df parameter, and attempting many different algorithms. Moreover, we used n-grams, which relax the word-independence assumption from problem set 3 by treating each contiguous sequence of up to n words in a title as a feature, hence "n"-grams (see the sketch below).
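As a quick illustration (the title here is made up), CountVectorizer with ngram_range=(1, 3) turns a single title into all of its 1-, 2-, and 3-word sequences:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 3))
vec.fit(["my cat did something amazing"])
#prints, in alphabetical order: 'amazing', 'cat', 'cat did',
#'cat did something', 'did', 'did something', ..., 'my cat did'
print vec.get_feature_names()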
How could we improve this model?
Well, as is often said in statistical reports, more data would have been useful in this process. Our data was certainly biased by the limitations Reddit's API forced on the categories we could collect. A more comprehensive approach would have been to analyze all of the posts in a given subreddit over time, tracking each post's score as a time series. Ultimately, though, given size and resource constraints, we still managed to draw some interesting conclusions from our limited data set.
Conclusions
In practice, for many subreddits, the hot and new sorting algorithms resulted in similar posts
It doesn’t seem to take very long for a given post to reach peak popularity (under 72 hours). This surprised us: we expected time after submission to have a larger effect than it actually does. Given that we are using just 2 bins, the amount of time needed for a post to be classified accurately (that is, placed in the right bin) is not as much of a concern as we initially thought
Moreover, running a simple linear regression model between days after time submitted and post score explained little of the variance in post scores
The popularity of a user appears to have little effect on whether or not a post will be popular. This may not hold for users designated as distinguished (mainly celebrities or moderators), whom we filtered out of our data set; for everyone else, the regression between link karma (or karma) and score is statistically significant but small
The various metrics we tried for calculating how controversial a post was were not particularly fruitful, as Reddit more or less fuzzes the number of upvotes/downvotes to prevent spammers, and they had low predictive ability anyway
Using term frequency-inverse document frequency (TF-IDF), which counts word frequencies for every post while discounting words that appear in many posts (like “the”, “a”, etc.), did not increase accuracy as much as we had expected (see the sketch below).
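A minimal sketch of that discounting at work, on made-up documents; the inverse document frequency (idf_) of a word that appears in every post, like "the", is driven toward the minimum:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog ran", "the cat ran home"]
vec = TfidfVectorizer()
vec.fit(docs)
#"the" occurs in all three documents and gets the smallest idf weight
for word, idx in sorted(vec.vocabulary_.items()):
    print word, vec.idf_[idx]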
One diversion we went down was to examine each title, remove stop words (commonly used words like “the” and “an” that carry little meaning), break it into n-grams, and then make an API search request for each n-gram. After making that search request within the subreddit the title came from, we looked at the top 100 (or fewer) search results for each n-gram, and then took the mean and standard deviation over all search results for all of the n-grams of a title.
We then tried both a simple linear regression model, using only the mean of all search results plotted against post score, and a cdf function similar to the one implemented in problem set 2 to predict how likely a post was to score above a given mean, similar to the binning method from before. Unfortunately, both methods turned out to be statistically significant but of low predictive value.
One important aspect to note is the drastic difference between cross-validation against the whole data set and testing accuracy for a post in a specific subreddit. Initially, we had planned to have a dropdown selector on the website to choose the subreddit against which to validate, but upon discovering that the overall classifier was so much more effective, we abandoned the idea.
A little bit about the classifiers that went into building our final model:
RidgeClassifier
The best linear unbiased estimator is not always the best estimator for the betas in our bag-of-words model: given the bias-variance tradeoff, we may prefer an estimator that reduces variance relative to standard OLS at the expense of some bias. In ridge regression, we minimize the sum of the squared residuals, as in OLS, plus a term proportional to the sum of the squared parameters. As said earlier, this gives up unbiased estimation of the true parameters, but in doing so reduces variance, which can increase the accuracy of our model considerably. In a Bayesian framework, the posterior is proportional to the likelihood multiplied by our prior; if that prior says the weights are Gaussian with mean zero, then the posterior mode, which is our most likely value, equals the ridge parameter estimate. This framework is used to classify the bag of words into the various bins.
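In symbols (a standard formulation, not copied from our notebooks), where OLS minimizes only the first term, ridge regression solves

$$\hat{\beta} = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \alpha\|\beta\|_2^2,$$

so a larger regularization weight $\alpha$ shrinks the coefficients toward zero, trading a little bias for lower variance.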
Perceptron
The Perceptron is an algorithm for supervised classification that sorts an input into one of a number of bins. It is a linear classifier that combines a set of weights with the feature vector describing the keywords of the title, using a rule that dynamically updates the weights while looking at elements in the training set one by one. The Perceptron's most important properties are that it does not require a set learning rate, that it is not regularized, and that it only updates its model when it makes mistakes. Because of this last characteristic the Perceptron is very fast, with the tradeoff that the resulting model may be sparser than that of other algorithms.
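A minimal sketch of that mistake-driven update (a dense toy version for intuition, not sklearn's actual implementation; labels are assumed to be -1/+1):

import numpy as np

def perceptron_train(X, y, n_iter=50):
    #w starts at zero; a pass over the data leaves it untouched
    #unless an example is misclassified
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   #a mistake: move w toward the example
                w += yi * xi
                b += yi
    return w, b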
Passive-Aggressive
The Passive-Aggressive Classifier we used from the sklearn library was, on average, the most effective at accurately predicting posts on subreddits. Much like the MultinomialNB, the Passive-Aggressive Classifier is a binary classifier that allowed us to construct a training model from which we could extrapolate to a prediction algorithm. Like the Perceptron we used previously, this classifier belongs to a family of algorithms useful for large-scale learning, none of which require a learning rate parameter. Unlike the Perceptron, however, these algorithms require a regularization parameter, for which we used the default of 1.0. It was perhaps this regularization that improved our model to such a degree that it outperformed all other classifiers we attempted.
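Roughly, the PA-I variant of these algorithms performs the update sketched below (again with -1/+1 labels; this is the textbook update rule, not sklearn's code). It stays passive when the hinge loss is zero and otherwise moves just far enough to fix the current example, with the step capped by the regularization parameter C:

import numpy as np

def pa_update(w, x, y, C=1.0):
    loss = max(0.0, 1.0 - y * np.dot(w, x))   #hinge loss of the current example
    if loss > 0.0:
        tau = min(C, loss / np.dot(x, x))     #step size, capped by C
        w = w + tau * y * x
    return w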
K-Nearest
The final classifier we tried was the K-Nearest Neighbors Classifier. This set of algorithms simply computes the k nearest neighbors of every post and uses the resulting neighborhood structure to reduce the sparsity of the dataset and train the classifier upon it. However, due to the massive amount of calculation required to compute the neighbors, this classifier was infeasible to run and consistently gave us a memory error when applied to the alchemy keywords. Considering the lack of effectiveness of our own k-nearest implementation using cosine similarity, and considering that this classifier did not particularly improve upon the regular, non-Alchemized dataset, it was safe to say that it would not ultimately outperform the Passive-Aggressive one, so we left that component out of the final Process Book to avoid unnecessary memory overloads.
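For completeness, a run like the one that kept failing would look roughly like this (a sketch reusing the x_train/y_train split from the cells above, not the notebook's exact code); brute-force k-NN keeps the entire training matrix around and computes all pairwise distances at prediction time, which is what made it so memory-hungry on ~50,000 titles:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7, algorithm='brute')
knn.fit(x_train, y_train)
print "Test accuracy is", knn.score(x_test, y_test)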
There were considerable differences between these classifiers, and the differences between them should not be overlooked, even though each one was applied to the same bag-of-words model.
Finally, after much testing and cross-validation, we settled on a combination of parsing out keywords from our text using Alchemy and running the Perceptron Classifier on these keywords and the post scores. This turned out to be far more effective than any of our other methods, consistently achieving near-perfect training accuracies and testing accuracies greater than 65%. From this model, we constructed a regression model, one that provided us with a powerful linear equation. Now we could use our classifier to calculate the probability of a post being successful and, using the regression, predict the approximate score the post would achieve upon stabilization with about 20% accuracy. If you would like to experiment with this predictor, please visit our website and try it out!
Ultimately, this project revealed what we had suspected: although a powerful classifier can serve as a reasonably good predictor, due to the nature of Reddit and the randomness with which posts become popular, it is impossible to pinpoint the exact words that ignite this popularity. Our challenges over the duration of this project, and our ultimately low R-squared values for the regression model, suggest that in spite of all of our textual analysis, the degree of randomness in Reddit is much too high for precise prediction. Still, it is an interesting concept, and with further research and more powerful technical capabilities for analyzing massive datasets, it may be possible to create a far more reliable prediction model.