This section covers the data collection and declares the variables used throughout the analysis. By changing these variables and re-running the code, you can repeat the complete analysis for a different event.
# these words will be used to look for hashtags
query_hashtags = ['sutherland springs' ,'sutherland spring', 'texas church shooting', 'texas shooting', 'texas church massacre', 'church shooting']
# add the concatenated versions of these strings for hashtags
query_hashtags += map(lambda s: s.replace(' ', ''), query_hashtags)
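For reference, after these two lines query_hashtags contains both the spaced and the concatenated forms:
print query_hashtags
# ['sutherland springs', 'sutherland spring', 'texas church shooting', 'texas shooting',
#  'texas church massacre', 'church shooting', 'sutherlandsprings', 'sutherlandspring',
#  'texaschurchshooting', 'texasshooting', 'texaschurchmassacre', 'churchshooting']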
We start by importing all of the data stored in our MongoDB Atlas cluster.
from pymongo import MongoClient
import twitter, pickle, sys
import pymongo
import numpy as np
import pandas as pd
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', -1)
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (15, 6)
# initialize mongo client on MongoDB Atlas
client = MongoClient("mongodb://socialgraphs:interactions@socialgraphs-shard-00-00-al7cj.mongodb.net:27017,socialgraphs-shard-00-01-al7cj.mongodb.net:27017,socialgraphs-shard-00-02-al7cj.mongodb.net:27017/test?ssl=true&replicaSet=SocialGraphs-shard-0&authSource=admin")
db = client.texas
# access tweet collection
# TODO: only select unique tweets
tweet_collection = db.tweetHistory
myTweets = tweet_collection.find()
Populating the interactive namespace from numpy and matplotlib
/anaconda/lib/python2.7/site-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['f', 'text']
`%matplotlib` prevents importing * from pylab and numpy
Let's start by figuring out how many tweets we have in total.
print 'We have a total of {} tweets.'.format(myTweets.count())
We have a total of 52903 tweets.
Let us figure out some very basic statistics about our dataset.
# define initial values
userSet = set()
totalNumberOfWords = 0.0
totalRetweets = 0.0
totalFavorites = 0.0
differentHashtags = set()
# loop over data
for tweet in myTweets:
userSet.add(tweet['username'])
differentHashtags = differentHashtags.union(set(tweet['hashtags'].split()))
totalNumberOfWords += len(tweet['text'].split())
totalRetweets += tweet['retweets']
totalFavorites += tweet['favorites']
# define means
averageLength = totalNumberOfWords / myTweets.count()
averageRetweets = totalRetweets / myTweets.count()
averageFavorites = totalFavorites / myTweets.count()
# print results
print 'There are {} different users.'.format(len(userSet))
print 'The average length of a tweet is {} words.'.format(averageLength)
print 'A total of {} different hashtags are used.'.format(len(differentHashtags))
print 'The average number of retweets: {}'.format(averageRetweets)
print 'The average number of favorites: {}'.format(averageFavorites)
There are 26289 different users.
The average length of a tweet is 19.2577169537 words.
A total of 7342 different hashtags are used.
The average number of retweets: 3.88087632081
The average number of favorites: 8.00228720488
What are the most favorited tweets, the most retweeted tweets and the respective users and times?
We start by building the queries, taking advantage of MongoDB's sorting and projections.
# create the sorted cursors and define the number of desired elements
get_top = 5
display_conditions = {"deepmoji": 0, "permalink":0, "id":0,"date":0, "query_criteria":0, "_id":0, "geo":0, "mentions":0, "hashtags":0}
db_by_retweets = tweet_collection.find({}, display_conditions).sort("retweets",pymongo.DESCENDING)[0:get_top]
db_by_favorites = tweet_collection.find({}, display_conditions).sort("favorites",pymongo.DESCENDING)[0:get_top]
# a function that takes a cursor and pretty prints it.
def print_result(database):
array = []
for t in database:
array.append(t)
pandas = pd.DataFrame(array)
display(pandas)
And we print the results.
print 'The most retweeted:'
print_result(db_by_retweets)
print 'The most favorited:'
print_result(db_by_favorites)
The most retweeted:
 | favorites | retweets | text | username |
---|---|---|---|---|
0 | 17054 | 6641 | I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. | shannonrwatts |
1 | 17054 | 6641 | I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. | shannonrwatts |
2 | 20126 | 5912 | The tragedy in Sutherland Springs happened a little over a week ago. Don’t let this fade into the next news cycle. We need gun safety reforms. Now. | KamalaHarris |
3 | 13447 | 5252 | It's been only one week since the Texas mass shooting . 42 days since the Vegas mass shooting . 53 days since Puerto Rico lost power and humanitarian crisis began. Time feels off with this much tragedy. | sarahkendzior |
4 | 2523 | 3922 | Anyone hear about this from media??? This happened early Saturday morning, before the #TexasChurchShooting Suspected ILLEGAL ALIEN shoots at cars on I-35 with AR style rifle, hits 7 yr old girl in the head, and 4 others. http://www. informationliberation.com/?id=57612 | ChristieC733 |
The most favorited:
 | favorites | retweets | text | username |
---|---|---|---|---|
0 | 20126 | 5912 | The tragedy in Sutherland Springs happened a little over a week ago. Don’t let this fade into the next news cycle. We need gun safety reforms. Now. | KamalaHarris |
1 | 17054 | 6641 | I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. | shannonrwatts |
2 | 17054 | 6641 | I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. | shannonrwatts |
3 | 13447 | 5252 | It's been only one week since the Texas mass shooting . 42 days since the Vegas mass shooting . 53 days since Puerto Rico lost power and humanitarian crisis began. Time feels off with this much tragedy. | sarahkendzior |
4 | 9842 | 3784 | NRA can confirm Stephen Willeford is a member & has been certified as a NRA firearms instructor. #SutherlandSprings http://www. 4029tv.com/article/man-wh o-shot-texas-church-gunman-shares-his-story/13437943 … | DLoesch |
It's also very interesting to understand the tweets in our database from a chronological point of view.
# get dates and truncate them to the hour for readability purposes
myTweets = tweet_collection.find()
dates = list(set([tweet['date'] for tweet in myTweets]))
no_seconds = [date.replace( minute=0, second=0, microsecond=0) for date in dates]
# count occurrences
counter = dict(Counter(no_seconds))
# prepare plot
x = []
y = []
for element in counter:
x.append(element)
y.append(counter[element])
# plot nicely
plt.title('Number of tweets per date')
plt.ylabel('Number of tweets')
plt.xlabel('Date')
plt.scatter(x, y, c=y, marker='.', s=y)
plt.xlim([min(x), max(x)])
plt.grid()
plt.show()
From the tweets we collected, we generate two different networks that we will use throughout the rest of the analysis. In both networks, the nodes are the users who have been tweeting about the event using one of the predefined hashtags.
For the first network, the edges are constructed from mentions in these tweets: when a tweet mentions another user that is also a node in the network, there will be an edge between these two nodes. We will refer to this network as mention_graph.
For the second network, we define an edge between two nodes if they share a common hashtag, not including the query hashtags. For example, if tweets from two different users both use the hashtag #GunSense, we create an edge between them. We will refer to this network as hashtag_graph.
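As a toy illustration of the two edge definitions (made-up usernames, not part of the dataset):
import networkx as nx
toy_mention = nx.Graph()
toy_mention.add_edge('alice', 'bob')  # alice mentioned @bob in a tweet and both are nodes
toy_hashtag = nx.Graph()
toy_hashtag.add_edge('alice', 'carol', attr='gunsense')  # both used the non-query hashtag #GunSense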
Below we will start creating the networks.
import networkx as nx
from collections import defaultdict
from itertools import combinations
# start by finding all unique usernames in the tweets that have either mentions or hashtags
where = {
'$or': [
{
'mentions': {
'$ne': ''
},
},{
'hashtags': {
'$ne': ''
}
}
]
}
usernames = tweet_collection.distinct('username', where)
# create two separate simple graphs, one for mention relations and one for shared hashtags
mention_graph = nx.Graph()
hashtag_graph = nx.Graph()
# add nodes from users that wrote tweets
mention_graph.add_nodes_from(usernames)
hashtag_graph.add_nodes_from(usernames)
# add edges to mention_graph between mentions in tweets
# get all tweets with their mentions
tweet_mentions = list(tweet_collection.find({'mentions': {'$ne' : '',}}, {'username': 1, 'mentions': 1}))
# define a default dictionary to store the unique mentions used per user as a set
mentions_dict = defaultdict(set)
# populate dict {username: set(mentions)}
for tweet in tweet_mentions:
# split mentions from string to list
mentions = map(lambda mention: mention[1:], tweet['mentions'].split(' '))
# update dict
mentions_dict[tweet['username']].update(mentions)
# add edges from mentions_dict
for user, mentions in mentions_dict.iteritems():
for to_node in mentions:
if mention_graph.has_node(to_node):
mention_graph.add_edge(user, to_node)
# add edges to the hashtag_graph
# get all tweets with hashtags
tweet_hashtags = tweet_collection.find({'entities.hashtags': {'$ne': ''}}, {'username': 1, 'hashtags': 1})
# initialize a defaultdict to track the unique hashtags
# and how often users are using them
hashtags_dict = defaultdict(lambda: defaultdict(int))
# populate the dict {hashtag: {username: count}}
for tweet in tweet_hashtags:
username = tweet['username']
# list of hashtags
hashtags = map(lambda tag: tag.replace('#', '').lower(), tweet['hashtags'].split(' '))
# remove the query_hashtags
new_tags = list(set(set(hashtags) - set(query_hashtags)))
if len(new_tags) > 0:
for tag in new_tags:
if tag:
hashtags_dict[tag][username] += 1
# add edges between all users with the same hashtag if they have used it more than twice
for tag, userdict in hashtags_dict.iteritems():
# find users who used the tag more than twice
usernames = [username for username, count in userdict.iteritems() if count > 2]
# create tuples of possible combinations of nodes
sets = combinations(usernames, 2)
# add edges
for combi in sets:
        hashtag_graph.add_edge(*combi, attr=tag)
print 'Mention Graph:'
print ' - Number of nodes:', len(mention_graph.nodes())
print ' - Number of edges:', len(mention_graph.edges())
print ' - Average degree:', float(sum(nx.degree(mention_graph).values())) / len(mention_graph.nodes())
print 'Hashtag Graph:'
print ' - Number of nodes:', len(hashtag_graph.nodes())
print ' - Number of edges:', len(hashtag_graph.edges())
print ' - Average degree:', float(sum(nx.degree(hashtag_graph).values())) / len(hashtag_graph.nodes())
Mention Graph:
 - Number of nodes: 15992
 - Number of edges: 1700
 - Average degree: 0.212606303152
Hashtag Graph:
 - Number of nodes: 15992
 - Number of edges: 29894
 - Average degree: 3.73861930965
plt.style.use('fivethirtyeight')
# get degree distributions
mention_degree = nx.degree(mention_graph)
hashtag_degree = nx.degree(hashtag_graph)
# get minimum and maximum degrees
min_mention_degree, max_mention_degree = min(mention_degree.values()), max(mention_degree.values())
min_hashtag_degree, max_hashtag_degree = min(hashtag_degree.values()), max(hashtag_degree.values())
# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.hist(sorted(mention_degree.values(),reverse=True), range(min_mention_degree, max_mention_degree + 1)) # degree sequence
plt.subplot(212)
plt.title('Hashtag graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
plt.hist(sorted(hashtag_degree.values(),reverse=True), range(min_hashtag_degree, max_hashtag_degree + 1)) # degree sequence
plt.show()
# get all the separate components
components_mention = sorted(nx.connected_component_subgraphs(mention_graph), key=len, reverse=True)
components_hashtag = sorted(nx.connected_component_subgraphs(hashtag_graph), key=len, reverse=True)
print 'The mention graph has {0} disconnected components'.format(len(components_mention))
print 'The hashtag graph has {0} disconnected components'.format(len(components_hashtag))
plt.figure()
plt.subplot(211)
mention_component_lengths = [len(c) for c in components_mention]
plt.yscale('log', nonposy='clip')
plt.title('Mention graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
max_mcl = max(mention_component_lengths)
plt.hist(mention_component_lengths, range(max_mcl + 1))
plt.subplot(212)
plt.yscale('log', nonposy='clip')
plt.title('Hashtag graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
hashtag_component_lengths = [len(c) for c in components_hashtag]
max_hcl = max(hashtag_component_lengths)
plt.hist(hashtag_component_lengths, range(max_hcl + 1))
plt.tight_layout()
The mention graph has 14501 disconnected components
The hashtag graph has 15245 disconnected components
Since the full graphs are so disconnected, we decide to work only with the giant connected component (GCC) of each graph. This allows us to perform a more in-depth analysis.
# get the giant connected component for both graphs
mention_gcc, hashtag_gcc = components_mention[0], components_hashtag[0]
print 'Mention GCC'
print ' - Number of nodes:', len(mention_gcc.nodes())
print ' - Number of edges:', len(mention_gcc.edges())
print ' - Average degree:', float(sum(nx.degree(mention_gcc).values())) / len(mention_gcc.nodes())
print 'Hashtag GCC:'
print ' - Number of nodes:', len(hashtag_gcc.nodes())
print ' - Number of edges:', len(hashtag_gcc.edges())
print ' - Average degree:', float(sum(nx.degree(hashtag_gcc).values())) / len(hashtag_gcc.nodes())
# draw the graphs
nx.draw_networkx(mention_gcc, pos=nx.spring_layout(mention_gcc), node_size=[v * 100 for v in mention_degree.values()], with_labels=False)
plt.title('Mention GCC')
plt.show()
nx.draw_networkx(hashtag_gcc, pos=nx.spring_layout(hashtag_gcc), node_size=[v * 0.1 for v in hashtag_degree.values()], with_labels=False)
plt.title('Hashtag GCC')
plt.show()
Mention GCC
 - Number of nodes: 1091
 - Number of edges: 1211
 - Average degree: 2.21998166819
Hashtag GCC:
 - Number of nodes: 718
 - Number of edges: 29828
 - Average degree: 83.0863509749
Since we are now only looking at the GCC of both graphs, we plot the degree distribution again. This time there are no nodes without edges. The shapes are, however, remarkably similar to those of the full graphs. The distributions are plotted on a logarithmic scale so we can easily see whether the degrees follow a power-law distribution. The mention_graph still looks logarithmic, so it seems to be an extreme distribution with a few highly connected nodes and many poorly connected nodes. The hashtag_graph seems to follow a more linear relation now that one axis is logarithmic.
mention_degree_gcc = nx.degree(mention_gcc)
hashtag_degree_gcc = nx.degree(hashtag_gcc)
# get minimum and maximum degrees
max_mention_gcc_degree = max(mention_degree_gcc.values())
max_hashtag_gcc_degree = max(hashtag_degree_gcc.values())
# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention GCC degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.hist(sorted(mention_degree_gcc.values(),reverse=True), range(max_mention_gcc_degree + 1)) # degree sequence
plt.subplot(212)
plt.title('Hashtag GCC degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
plt.hist(sorted(hashtag_degree_gcc.values(),reverse=True), range( max_hashtag_gcc_degree + 1)) # degree sequence
plt.tight_layout()
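To go beyond the visual impression from the plots above, the power-law hypothesis could be checked explicitly. The snippet below is only a sketch, not part of the original analysis; it assumes the third-party powerlaw package is installed and uses the GCC degree dictionaries computed in the previous cell.
import powerlaw  # third-party package: pip install powerlaw
# fit a discrete power law to each GCC degree sequence and compare it to an exponential fit
for name, degrees in [('Mention', mention_degree_gcc.values()), ('Hashtag', hashtag_degree_gcc.values())]:
    fit = powerlaw.Fit(degrees, discrete=True)
    R, p = fit.distribution_compare('power_law', 'exponential')
    print '{} GCC: alpha = {:.2f}, power law vs. exponential: R = {:.2f} (p = {:.3f})'.format(name, fit.power_law.alpha, R, p)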
The next step in our analysis is to detect communities in our networks and see what these communities revolve around. First, we will look into the sizes of the communities and the biggest accounts in the biggest communities, to get a sense of the kinds of accounts we find. Then, we will look into the most common hashtags used in every community of the mention graph, to get a feeling for the topics that live in each community.
We use the Louvain method [1] for community detection with the following implementation in Python.
import community
# use the python Louvain implementation to find communities in the networks
partition_mention = community.best_partition(mention_gcc)
partition_hashtag = community.best_partition(hashtag_gcc)
#drawing
mention_size = float(len(set(partition_mention.values())))
pos = nx.spring_layout(mention_gcc)
count = 0.
for com in set(partition_mention.values()) :
count = count + 1.
list_nodes = [nodes for nodes in partition_mention.keys()
if partition_mention[nodes] == com]
nx.draw_networkx_nodes(mention_gcc, pos, list_nodes, node_size = 20,
node_color = str(count / mention_size))
print 'For the mention GCC we have found {} communities'.format(int(mention_size))
nx.draw_networkx_edges(mention_gcc,pos, alpha=0.5)
plt.show()
hashtag_size = float(len(set(partition_hashtag.values())))
pos = nx.spring_layout(hashtag_gcc)
count = 0.
for com in set(partition_hashtag.values()) :
count = count + 1.
list_nodes = [nodes for nodes in partition_hashtag.keys()
if partition_hashtag[nodes] == com]
nx.draw_networkx_nodes(hashtag_gcc, pos, list_nodes, node_size = 20,
node_color = str(count / hashtag_size))
print 'For the hashtag GCC we have found {} communities'.format(int(hashtag_size))
nx.draw_networkx_edges(hashtag_gcc,pos, alpha=0.5)
plt.show()
For the mention GCC we have found 25 communities
For the hashtag GCC we have found 10 communities
We can see that for the mention graph, the communities have for the most part centred themselves around major news outlets. We see @FoxNews and @ABCNews, but also local news stations such as @dallasnews and their reporters, like @lmcgaughy. The communities of the hashtag graph seem a bit more random. We do, however, recognize the Twitter account of Sputnik News, a Russian state-controlled media outlet linked to fake news on multiple occasions, and marypatriotnews.com, which is a hyper-conservative outlet to say the least.
It is interesting to see that the highest-degree nodes in the hashtag_graph's partitions are not necessarily accounts with many followers. The links in this network are much more 'democratic': anyone who uses a lot of prevalent hashtags can become a well-connected node in the graph. This is different from the mention_graph, where a user only gets mentioned a lot if he is well known and thus likely to have many followers.
# look at the accounts with the highest degree in each partition
import json  # needed below to serialize the partition overviews
# twitter api credentials for lookup
CONSUMER_KEY='29JF8nm5HttFcbwyNXkIq8S5b'
CONSUMER_SECRET='szo1IuEuyHuHCnh93VjLLGb5xg9NcfDVqMsLtOt3DbL5hXxpbt'
OAUTH_TOKEN='575782797-w96NPIzKF07TpC3c78nEadEfACLclYvSusuOPl8z'
OAUTH_TOKEN_SECRET='h0oitwxLkDjOLSejSQl2AWSrcmNeUwBpEvSUWonYzZTNz'
# instantiate API object
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api= twitter.Twitter(auth=auth)
# auxiliary function: invert a partition dict {node: community} into {community: [nodes]}
def inverse_partition(partition):
components_inv = defaultdict(list)
for key, value in partition.iteritems():
components_inv[value].append(key)
return components_inv
# get top accounts by degree
def partition_top_accounts(partition, degree):
part_inv = inverse_partition(partition)
return {part: max(usernames, key=lambda user: degree[user]) for part, usernames in part_inv.iteritems()}
# get data on account
def twitter_account(username):
return twitter_api.users.lookup(screen_name=username)
# display data in dataframe
def pprint_partition_overview(partition, degree, outfile=None):
data = []
columns = ['Partition', 'Partition Size', 'Screen Name', 'Name', 'Url', 'Location', 'Followers', 'Degree']
top_accounts = partition_top_accounts(partition, degree)
for part_id, account in top_accounts.iteritems():
user = twitter_account(account)[0]
# print pprint( user)
url = ''
try:
url = user['entities']['url']['urls'][0]['display_url']
except:
pass
row = {
'Partition': part_id,
'Partition Size': len(inverse_partition(partition)[part_id]),
'Screen Name': account,
'Name': user['name'],
'Url': url,
'Location': user['location'],
'Followers': user['followers_count'],
'Degree': degree[account]
}
data.append(row)
data.sort(key=lambda row: row['Partition Size'], reverse=True)
df = pd.DataFrame(data)
df = df[columns]
display(df)
if outfile:
serialized = json.dumps(data)
with open('data/{}'.format(outfile), 'w') as ofile:
ofile.write(serialized);
print 'The mention graph partitions with an overview of the accounts with the highest degrees'
pprint_partition_overview(partition_mention, mention_degree_gcc, 'mention_partition_accounts.json')
pprint_partition_overview(partition_hashtag, hashtag_degree_gcc, 'hashtag_partition_accounts.json')
The mention graph partitions with an overview of the accounts with the highest degrees
 | Partition | Partition Size | Screen Name | Name | Url | Location | Followers | Degree |
---|---|---|---|---|---|---|---|---|
0 | 9 | 270 | FoxNews | Fox News | foxnews.com | U.S.A. | 16589039 | 288 |
1 | 3 | 86 | JohnCornyn | Senator JohnCornyn | cornyn.senate.gov | Austin, Texas | 120492 | 53 |
2 | 6 | 83 | ABC | ABC News | ABCNews.com | New York City / Worldwide | 12986622 | 58 |
3 | 5 | 69 | USATODAY | USA TODAY | usatoday.com | USA TODAY HQ, McLean, Va. | 3529784 | 54 |
4 | 4 | 56 | lmcgaughy | Lauren McGaughy | dallasnews.com/author/lauren-… | Austin, TX | 10613 | 35 |
5 | 10 | 56 | AP | The Associated Press | apnews.com | Global | 12058706 | 53 |
6 | 2 | 46 | usairforce | U.S. Air Force | af.mil | Air, Space and Cyberspace | 886224 | 46 |
7 | 13 | 46 | DLoesch | Dana Loesch | amazon.com/Flyover-Nation… | Texas, USA | 674971 | 14 |
8 | 8 | 45 | chelseahandler | Chelsea Handler | chelseahandler.com | Los Angeles, CA | 8343721 | 23 |
9 | 17 | 45 | Everytown | Everytown | Everytown.org | 114979 | 19 | |
10 | 15 | 43 | RealAlexJones | Alex Jones | infowars.com | Austin, TX | 754784 | 16 |
11 | 12 | 37 | scrowder | Steven Crowder | louderwithcrowder.com | Ghostlike | 491141 | 25 |
12 | 1 | 28 | ExpressNews | San Antonio E-N | ExpressNews.com | San Antonio, TX | 19660 | 8 |
13 | 7 | 23 | abcnews | ABC News | abc.net.au/news | Australia | 1414667 | 11 |
14 | 11 | 23 | RNS | Religion News Service | religionnews.com | DC, NYC, London, Rome | 73982 | 13 |
15 | 0 | 21 | KHOU | KHOU 11 News Houston | khou.com | Houston, TX | 659018 | 13 |
16 | 19 | 20 | KENS5 | KENS 5 | kens5.com | San Antonio, Texas | 130446 | 12 |
17 | 18 | 19 | YahooNews | Yahoo News | yahoo.com/news/ | New York City | 1130694 | 17 |
18 | 16 | 17 | israelnash | Israel Nash | twitter.com/israelnash | Dripping Springs, TX | 2453 | 8 |
19 | 20 | 12 | foxandfriends | FOX & friends | foxandfriends.com | New York City | 1084193 | 11 |
20 | 22 | 12 | NewsHour | PBS NewsHour | pbs.org/newshour/ | Arlington, VA | New York, NY | 983682 | 8 |
21 | 24 | 12 | NRO | National Review | NationalReview.com | New York | 271442 | 11 |
22 | 21 | 8 | ChrisCuomo | Christopher C. Cuomo | ChrisCuomo.com | In the Arena | 1271167 | 7 |
23 | 14 | 7 | InsideEdition | Inside Edition | insideedition.com | New York | 72604 | 4 |
24 | 23 | 7 | ABCWorldNews | World News Tonight | abcnews.com/wnt | New York | 1277571 | 6 |
 | Partition | Partition Size | Screen Name | Name | Url | Location | Followers | Degree |
---|---|---|---|---|---|---|---|---|
0 | 0 | 279 | BigGator5 | BigGator5 | biggator5.net/about/twitter-… | Lake County, Florida | 5876 | 343 |
1 | 1 | 192 | Adrian_Rafael | Adrian R. Morales | 1239 | 268 | ||
2 | 4 | 84 | MacChomhghaill | McChomhghaill | TrumpUnifies.tk | Northern California, USA | 4361 | 362 |
3 | 2 | 79 | TrendingNewsTV | Trending Newscast | tn.dvolatility.com | Metro Detroit, MI | 205 | 78 |
4 | 3 | 60 | Ms1Scs | #DeepStateSwampDrain | USA | 6760 | 93 | |
5 | 9 | 6 | Johnathin79 | Lock'm ALL Up! | 6851 | 7 | ||
6 | 5 | 5 | SputnikInt | Sputnik | sputniknews.com | 205091 | 11 | |
7 | 7 | 5 | Expose_The_Lies | ExposeTheLies | facebook.com/ExposeTheLies | 97 | 6 | |
8 | 8 | 5 | PrgrsvArchitect | ProgressiveArchitect | Tucson, AZ USA | 1015 | 7 | |
9 | 6 | 3 | MaryPatriotNews | Mary Budesheim | marypatriotnews.com | Glens Falls, NY | 13642 | 4 |
from collections import Counter
# community size histogram
hashtag_com_count = Counter(partition_hashtag.values())
mention_com_count = Counter(partition_mention.values())
plt.figure()
plt.subplot(211)
plt.title('Sizes of hashtag communities')
plt.xlabel('Community number')
plt.ylabel('Number of nodes')
plt.bar(hashtag_com_count.keys(), hashtag_com_count.values())
plt.subplot(212)
plt.title('Sizes of mention communities')
plt.xlabel('Community number')
plt.ylabel('Number of nodes')
plt.bar(mention_com_count.keys(), mention_com_count.values())
plt.tight_layout()
To drill down further on what goes on in the communities found in either network, we will look at the hashtags that are used in these communities and how they relate to each other. A heatmap shows the number of occurrences of the top hashtags in each community. The brighter the color, the more prevalent that hashtag is in the tweets from that community, and by extension among its users. This gives us an idea about the ideas or opinions of these users. The data is a little sparse for some communities, since they are small and there are not that many hashtags, but we can see some interesting patterns emerge in the ones that do have data.
import matplotlib.style
import matplotlib as mpl
mpl.style.use('classic')
def partition_hashtag_analysis(partition):
# inverse the partitioning to get a dict with { partitioning_id : [usernames]}
components_inv = inverse_partition(partition)
# get all hashtags used by users in combination with our query_hashtags
components_hashtags = defaultdict(list)
for part_id, usernames in components_inv.iteritems():
tweets = tweet_collection.find({
'username': {
'$in': usernames
},
'hashtags': {
'$ne': '',
'$nin': map(lambda s: '#' + s, query_hashtags)
},
},
{
'hashtags': 1
})
        # filter the query hashtags out of each tweet's hashtags
for row in tweets:
tags = [tag for tag in row['hashtags'].lower().replace('#', '').split(' ') if tag not in query_hashtags]
components_hashtags[part_id] += tags
part_tag_counts = {}
for part_id, tags in components_hashtags.iteritems():
counts = Counter(tags)
part_tag_counts[part_id] = counts
return part_tag_counts
mention_com_hashtags = partition_hashtag_analysis(partition_mention)
# Heatmap
# get most common hashtags in general
number_of_tags = 25
hashtags_count = Counter([tags for part_id in mention_com_hashtags.itervalues() for tags in part_id])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))
# from the 'most_common_hashtags', manually group hashtags together on political orientation
# against gun carry
against = ['gunsense', 'guncontrol', 'guncontrolnow','gunviolence', 'stopgunviolence','backgroundchecks']
# generally in favor of gun carry
in_favor = ['trump', 'maga', '2a','tcot', 'nra', 'msm']
# neutral
most_common_tags = against + in_favor + ['texas', 'sutherlandspringsshooting', 'lasvegasshooting', 'texasstrong', 'ksatnews', 'usatoday', 'kens5eyewitness', 'church', 'shooting', 'firstbaptistchurch', 'gun', 'airforce']
# create a matrix of the counts of the most common hashtags in the communities
heat_array = np.array([[counts[tag] for tag in most_common_tags] for counts in mention_com_hashtags.values()])
# plot heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(len(most_common_tags)), most_common_tags, rotation='vertical')
plt.yticks(range(int(mention_size)), mention_com_hashtags.keys())
rect = fig.patch
rect.set_facecolor('white')
plt.colorbar()
plt.show()
The heatmap above displays how often the most common hashtags appear in every community. The brighter the color, the more prevalent that hashtag is in the tweets from that community, and by extension among its users. This gives us an idea about the ideas or opinions of these users. The data is a little sparse for most communities, since they are small and there are not that many hashtags, but we can see some interesting patterns emerge in the ones that do have data. Communities 16 and 17 mention #guncontrol, #gunsense, #guncontrolnow, #gunviolence and #stopgunviolence, which are all hashtags related to the camp that wants to limit guns in America. Clusters 3, 5 and 19 mention #texas in combination with news outlets #kens5eyewitness, #ksatnews and #usatoday, and seem neutral. There are also a number of more Republican-oriented hashtags floating around, #trump, #maga (Make America Great Again), #2a (the 2nd Amendment, which protects gun owners), #tcot (top conservatives on Twitter), #msm (mainstream media) and #nra (National Rifle Association, a lobby group for gun carry rights), which seem a little more used by clusters 12, 13, 9, 10, 5 and 6.
Let's combine this data with the sentiment analysis we have derived from the deep-learning emoji analysis.
As we can see above, the use of a hashtag can be fairly ambiguous. People can use a certain hashtag and be either for or against it, or use the hashtag in a sarcastic or ironic way. To add a bit more context, we thought it would be interesting to look at what sentiments or thoughts are associated with the hashtags in each community. For this, we used DeepMoji [2] again. These researchers from MIT, among other universities, have constructed a way to train neural networks on text with emojis that lets them predict a series of emojis for a sentence. The project is freely available on GitHub, including the pretrained models, which can quite articulately describe the sentiment or feeling of a piece of text. We decided to correlate the hashtags used in every community with the emojis returned from DeepMoji to get a more refined image of the opinions that are prevalent in these communities.
Below, the results of this correlation are displayed in an emoji/hashtag matrix for the communities. The rows represent the different communities in the network and the columns the most common hashtags in these networks. In the cells, the most common emojis found through DeepMoji prediction are displayed. The smallest communities and some trivial hashtags have been left out for clarity.
# get emojis per community similarly as we did for the hashtags
def emoji_hastag_analysis(partition, graph_name, hashtags, threshold=25):
components_inv = inverse_partition(partition)
# store data as { part_id : { hashtag : Counter({ emoji : count})}}
components_emoji = defaultdict(lambda: defaultdict(Counter))
for part_id, usernames in components_inv.iteritems():
tweets = tweet_collection.find({
'username': {
'$in': usernames
},
'hashtags': {
'$ne': '',
'$nin': map(lambda s: '#' + s, query_hashtags)
},
'deepmoji': {
'$exists': True
}
},
{
'hashtags': 1,
'deepmoji': 1
})
        # filter the query hashtags out of each tweet's hashtags
for row in tweets:
tags = [tag for tag in row['hashtags'].lower().replace('#', '').split(' ') if tag not in query_hashtags]
# store emojis associated with hashtags
emojis = [emo for k, emo in row['deepmoji'].iteritems() if 'Emoji' in k]
for tag in tags:
components_emoji[part_id][tag].update(emojis)
# import emoji converting dictionary
import emoji
emoji_index = {}
with open('ressources/emoji.txt') as f:
counter = 0
for line in f: # for every line
contents = [x.strip() for x in line.split(',')] # split line into 2
emoji_index[counter] = contents # contents = [name of emoji, url to emoji photo]
counter += 1
emoji_matrix = []
for part in components_emoji:
# only consider larger communities
        if globals()['{}_com_count'.format(graph_name)][part] > threshold:  # community size must exceed the threshold
tag_emoji = {
'Partition': part,
'1 Partition size': mention_com_count[part]
}
            # only look at politically charged hashtags
for tag in hashtags:
emojis = []
for item in components_emoji[part][tag].most_common(5):
emojis.append(emoji.emojize(emoji_index[item[0]][0], use_aliases=True))
if len(emojis) > 0:
tag_emoji[tag] = ''.join(emojis)
emoji_matrix.append(tag_emoji)
df = pd.DataFrame(emoji_matrix)
df.set_index('Partition', inplace=True)
display(df)
emoji_hastag_analysis(partition_mention, 'mention', in_favor + against)
 | 1 Partition size | 2a | backgroundchecks | guncontrol | guncontrolnow | gunsense | gunviolence | maga | msm | nra | tcot | trump |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Partition | ||||||||||||
1 | 28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 46 | NaN | NaN | NaN | 👏💪👊♥👍 | NaN | 😡😣😢😞😓 | NaN | NaN | NaN | NaN | NaN |
3 | 86 | 👍🔫😄😡👏 | NaN | 👍♥😢💔💟 | 😡👍😢👏💔 | 👍🔫😎😡👏 | NaN | 👍😉😄😜🙏 | NaN | 🔫😄😡😜👍 | 😄👍😉😜😡 | NaN |
4 | 56 | 🔫♥❤💔💟 | NaN | 🔫♥👍❤🎶 | ❤♥🔫💔💟 | ❤♥🔫💔💟 | NaN | NaN | 👍😊👏😳😉 | ❤♥🔫💔💟 | NaN | NaN |
5 | 69 | 👍😈🔫😉😜 | NaN | 😐🔫😳😢😕 | NaN | NaN | NaN | ♥😢💔✌💟 | NaN | NaN | NaN | 😡😠😤🔫😈 |
6 | 83 | 👍👊😉😄💪 | NaN | 😡😪😤✋😣 | 💟♥👍❤💙 | NaN | NaN | 👍💟😜♥👊 | 😜👍💟♥👊 | 👍😉😜♥💟 | ♥😉🎶💔💟 | 👍😪👏😣😓 |
8 | 45 | ♥💪👊✌👍 | NaN | ✨😄💜👍😊 | NaN | NaN | NaN | 😡😒😑😤😠 | NaN | NaN | 😡♥👊😠💟 | NaN |
9 | 270 | 👍🔫😡💪👊 | NaN | 😡👍😠😤😢 | 😡👍😠😢😤 | 👍😡😢👏💔 | 👍😡😢💔💟 | 😡😢😠💔😷 | 😡😬👍😜👏 | 👍😡♥😢🔫 | NaN | 😡😠👍😈😑 |
10 | 56 | NaN | NaN | 😡😠👍🔫💔 | NaN | 😡🔫👍😑🙏 | NaN | 👍♥😡🔫🎶 | NaN | 😡🔫😠💔😑 | NaN | 👍🎶😎🎧♥ |
12 | 37 | ♥👍💔✨😡 | NaN | ♥👍😡💟😠 | NaN | ♥✨👍💔💟 | ♥👍😄✌💟 | 👏👍👊🔫😎 | 👍😉♥👏🙏 | 😡😠😢😔🔫 | ♥✨👍💔💟 | 👏👍👊🔫😎 |
13 | 46 | 😡😠😉👊👍 | NaN | 😡😠🔫💔😉 | 👏👍😉😡😠 | 😡🔫😑💔😠 | NaN | 🔫🙏♥✌😡 | NaN | 😂👍🙌👏💟 | NaN | NaN |
15 | 43 | NaN | NaN | NaN | NaN | NaN | NaN | 😡🙌🔫😢🙏 | ♥👍🙏💔💟 | 😐🔫😑😅😓 | NaN | 👍👏💟♥💪 |
17 | 45 | 🔫♥😡😢💔 | 😡👍😜😠🎶 | 😡😠👍😜♥ | 😜♥😡👍😠 | 😢🔫😡👍💔 | 👍👏💟♥😡 | 😡👍😠👊👏 | NaN | ♥😢🔫💔😜 | ♥🎶💟🙏😜 | 😳🔫😜😐😂 |
From the table above, it is clear that some communities have more unified feelings about certain topics than others. The most homogeneous is community 6, which has either negative or positive emojis for most hashtags. They are big fans of #2a (the Second Amendment) but not so much of #guncontrol. They display rather positive emotions for #maga (Make America Great Again), #msm (mainstream media, as used by the far right), #nra (the National Rifle Association) and #tcot (top conservatives on Twitter). They seem rather divided on #trump, with both clapping and crying emojis. It is interesting that the most connected users in this community are ABC News and CBS News.
Similar emojis appear for community 9, except that an angry emoji appears alongside all the others; maybe the language used in this community carries a lot of anger. Community 2 seems to condemn #gunviolence and to be in favor of #guncontrolnow.
Second, we will do the same thing for the hashtag graph.
This heatmap is a bit sparser, but some interesting things can still be found. Partition 0 has high occurrences of the hashtag #guncontrol, but in combination with #trump, #2a and #nra, which are all very much pro gun carrying. This could mean the hashtag is used in a whole different context, where the people in this community talk about gun control in a negative sense. They also talk about #mentalhealth, which could be a way to divert to a conversation where guns are not the problem, but mental health is. The other communities that display noticeable correlation seem to follow similar patterns or have very neutral hashtags.
# get the hashtags associated with the partitions
hashtag_partition_hashtags = partition_hashtag_analysis(partition_hashtag)
# Heatmap
# get most common hashtags in general
number_of_tags = 22
hashtags_count = Counter([tags for part_id in hashtag_partition_hashtags.itervalues() for tags in part_id])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))
# from the 'most_common_hashtags', manually group hashtags together on political orientation
most_common_tags = [tag for tag in most_common_tags if tag and tag not in ['texas']]
# create a matrix of the counts of the most common hashtags in the communities
heat_array = np.array([[float(counts[tag]) for tag in most_common_tags] for counts in hashtag_partition_hashtags.values()])
# plot heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(len(most_common_tags)), most_common_tags, rotation='vertical')
plt.yticks(range(int(hashtag_size)), hashtag_partition_hashtags.keys())
rect = fig.patch
rect.set_facecolor('white')
plt.colorbar()
plt.show()
Unfortunately, the emoji/hashtag matrix for the hashtag graph is very incoherent: all cells with emojis seem to express contradictory feelings. It is included for completeness. The hashtag network's communities are small, and there was less data for this event in the database to derive the hashtags from.
emoji_hastag_analysis(partition_mention, 'hashtag', most_common_tags, 10)
 | 1 Partition size | 2a | devinpatrickkelley | domesticviolence | fbi | guncontrol | guns | lasvegasmassacre | mentalhealth | nra | sutherlandspringsshooting | sutherlandspringstexas | texasstrong |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Partition | |||||||||||||
0 | 21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ❤♥💟💔✨ |
1 | 28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 😡👍🔫😖😢 | NaN | NaN |
2 | 46 | NaN | 🙈😳😕😬🙊 | 💔🙏❤💟♥ | 😡👀👊😈😠 | NaN | NaN | NaN | 😡😣😢😞😓 | NaN | ❤💔💟♥💙 | 🎧💪👊🎶🔫 | 👍👏🙏😎😉 |
3 | 86 | 👍🔫😄😡👏 | NaN | 😡😢🔫🙏😠 | 😡😢😑😕😠 | 👍♥😢💔💟 | 🔫😡👍👏😖 | ♥☺😉💔💟 | NaN | 🔫😄😡😜👍 | 😢💔♥👍💟 | 😄😡👍👏🔫 | 👍💟🙏♥😢 |
4 | 56 | 🔫♥❤💔💟 | NaN | NaN | NaN | 🔫♥👍❤🎶 | ♥😢😞🙏💔 | NaN | NaN | ❤♥🔫💔💟 | 🎧💪🔫🎶😈 | NaN | NaN |
# utility to export graphs into JSON format to use in d3js
import json, pprint
from networkx.readwrite import json_graph
def convert_network_json(network, directed, degree, name, partition=defaultdict(int)):
if not directed:
# remove double edges
network = nx.Graph(network);
print 'Serializing network with {} edges and {} nodes to {}'.format(len(network.edges()), len(network.nodes()), name)
nodes = [{'id': node, 'degree': degree[node], 'partition': partition[node]} for node in network.nodes()]
links = [{'source': edge[0], 'target': edge[1]} for edge in network.edges()]
serialized = {
'directed': directed,
'nodes': nodes,
'links': links,
'graph': {}
}
s = json.dumps(serialized)
with open('data/{}.json'.format(name), 'w') as outfile:
outfile.write(s)
print 'Done writing', name
convert_network_json(mention_gcc, False, mention_degree_gcc, 'mention_gcc')
convert_network_json(mention_graph, False, mention_degree, 'mention_graph')
convert_network_json(mention_gcc, False, mention_degree, 'mention_partition', partition_mention)
convert_network_json(hashtag_gcc, False, hashtag_degree_gcc, 'hashtag_gcc')
convert_network_json(hashtag_graph, False, hashtag_degree, 'hashtag_graph')
convert_network_json(hashtag_gcc, False, hashtag_degree_gcc, 'hashtag_partition', partition_hashtag)
Serializing network with 1211 edges and 1091 nodes to mention_gcc
Done writing mention_gcc
Serializing network with 1700 edges and 15992 nodes to mention_graph
Done writing mention_graph
Serializing network with 1211 edges and 1091 nodes to mention_partition
Done writing mention_partition
Serializing network with 29828 edges and 718 nodes to hashtag_gcc
Done writing hashtag_gcc
Serializing network with 29894 edges and 15992 nodes to hashtag_graph
Done writing hashtag_graph
Serializing network with 29828 edges and 718 nodes to hashtag_partition
Done writing hashtag_partition
The sentiment analysis part of the report will focus on three different techniques.
In the first part, using the sentiment analysis techniques taught in class, the tweet sentiment will be plotted over time.
In the second part, we will use the WordCloud library to understand the semantics behind the event.
In the third part of the analysis, the DeepMoji library will be used to visualize the sentiments; this will be a (hopefully) fun way of understanding what is happening over time. It is also interesting to see whether the results of these analyses differ in any way.
The first function that needs to be built is one that cleans a raw tweet. Tweets contain a lot of elements that, even though interesting, are not in the scope of a sentiment analysis.
Step 1: A function that cleans a tweet.
def clean_this(raw_tweet):
text = raw_tweet # extract text
text = text.split('http', 1)[0] # remove links
text = text.split('pic.', 1)[0] # remove pictures
text = text.lower() # lower case
text = re.sub(r'(\s)@\w+', r'\1', text) # remove mentions
text = re.sub(r'(\B)#\w+', r'\1', text) # remove hashtags
text = nltk.word_tokenize(text) # tokenize text
text = [token.lower() for token in text if token.isalpha()] # removes punctuation and numbers
text = [word for word in text if word not in stopwords.words('english')] # remove stopwords
text = list(set(text)) # only return unique tokens
return text
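A quick illustration on a made-up tweet (this assumes the imports in Step 3 below have been run and the NLTK punkt and stopwords corpora are available; the order of the returned tokens may vary because duplicates are removed through a set):
print clean_this('Praying for the victims in #SutherlandSprings https://t.co/xyz @CNN')
# something like ['victims', 'praying'] -- the link, hashtag, mention and stopwords are gone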
After cleaning a tweet, we need to calculate its happiness. For that, we will need a data file from MIT that was shown during the course, named Data_Set_S1.txt. But for now, let's assume this data comes in a happy_data matrix.
Step 2: A function that calculates the happiness of a tweet.
def how_happy(tokens):
happinness_counter = 0.0
happy_word_counter = 0
for word in tokens:
if word in happy_words:
happy_word_counter += 1
happinness_counter += happy_data[np.where(happy_words == word)[0][0], 2]
if happy_word_counter == 0:
return happinness_counter
else:
return happinness_counter/happy_word_counter
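And a small usage sketch of the two functions together (again a made-up tweet; it assumes the labMT file has already been loaded into happy_data and happy_words, which only happens in the steps below):
tokens = clean_this('Praying for the victims and their families')
print how_happy(tokens)
# the average labMT happiness score (1-9 scale) over the tokens that appear in the word list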
After this, we need to import all of our libraries, create a connection to our mongoDB database, extract the text file mentioned in Step 2, as well as some other boring stuff.
Step 3: Importing libraries
from pymongo import MongoClient
import pymongo
from nltk.corpus import stopwords
import nltk
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from pprint import pprint
import matplotlib.style as style
import emoji
import seaborn
from collections import Counter
%pylab inline
style.use('fivethirtyeight')
Populating the interactive namespace from numpy and matplotlib
Step 4: The mongoDB connection.
# creating a mongo connection
client = MongoClient("mongodb://socialgraphs:interactions@socialgraphs-shard-00-00-al7cj."
"mongodb.net:27017,socialgraphs-shard-00-01-al7cj.mongodb.net:27017,"
"socialgraphs-shard-00-02-al7cj"
".mongodb.net:27017/test?ssl=true"
"&replicaSet=SocialGraphs-shard-0&authSource=admin")
# Getting the sentiment analysis file and putting it on a matrix
path = 'ressources/Data_Set_S1.txt'
header = ['words', 'hap.rank', 'hap.avg', 'hap.std', 'tw.rank', 'goog.rank', 'nyt_rank', 'lyr_rank']
happy_data = pd.read_table(filepath_or_buffer=path, header=2).as_matrix()
happy_words = happy_data[:, 0]
# Boring database stuff, including fields to return.
db = client.texas
tweet_collection = db.tweetHistory
display_conditions = {"query_criteria": 0, "_id": 0,
"geo": 0, "mentions": 0,
"hashtags": 0, "favorites": 0,
"permalink": 0, "username": 0,
"id": 0}
Before analysing, let's explain some things. How can tweets be plotted over time?
Of course we could plot every tweet, but this would result in a very weird plot that is hard to understand. Our approach was to request the tweets from our database in an ordered fashion: the first tweets to be requested are the oldest ones (closest to the event), and we advance chronologically. Also, we decided to plot the average sentiment of all of the tweets for every hour.
Finally, it is interesting to understand whether the general sentiment is at all related to the most popular tweets. For this, we establish a threshold (only tweets with more than X retweets) and repeat the same procedure.
By doing this, the plot becomes both easier to understand and conceptualise.
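In other words, for every hourly window $h$ we compute (our notation; the popularity threshold used in the code below is pop_tweet = 100 retweets):
$$S_h = \frac{1}{N_h}\sum_{t \in T_h} \mathrm{how\_happy}(t), \qquad S_h^{\mathrm{pop}} = \frac{1}{N_h^{\mathrm{pop}}}\sum_{\substack{t \in T_h \\ \mathrm{retweets}(t) \ge 100}} \mathrm{how\_happy}(t)$$
where $T_h$ is the set of tweets posted in hour $h$, $N_h$ its size, and $N_h^{\mathrm{pop}}$ the number of popular tweets in that hour.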
Step 1: Importing the tweets in a chronological fashion.
db_my_tweets = list(tweet_collection.find({}, display_conditions).sort("date", pymongo.ASCENDING))
Step 2: Preparing the processing.
# create lists to store sentiment values and periods.
sentiment = []
pop_sentiment = []
periods = []
# define important variables for looping
day = 5
hour = 0
happiness = 0
pop_happiness = 0
pop_tweet_counter = 0
tweet_counter = 0
pop_tweet = 100
# create a string to store entire text
text = ''
Step 3: Looping over the tweets
# for every tweet
for idx, t in enumerate(db_my_tweets):
print 'Processing tweet {} / {} \r'.format(idx, len(db_my_tweets)),
tweet_counter += 1
tweet_text = t['text']
clean_text = clean_this(tweet_text)
    text += ' '.join(clean_text) + ' '  # trailing space so consecutive tweets do not merge into one token
happiness += how_happy(clean_text)
if t['retweets'] >= pop_tweet: # if the tweet is 'popular'
pop_tweet_counter += 1
pop_happiness += how_happy(clean_text)
if t['date'].hour != hour: # if the hour in the tweets that are incoming changes...
sentiment.append(happiness / tweet_counter)
if pop_tweet_counter != 0:
pop_sentiment.append(pop_happiness / pop_tweet_counter)
else: # if there are no popular tweets append 'nan'
pop_sentiment.append(float('nan'))
periods.append('{}/11 at {}'.format(day, hour))
# reset counters for next period
happiness = 0
pop_happiness = 0
tweet_counter = 0
pop_tweet_counter = 0
if hour == 0:
day += 1
hour = t['date'].hour
print 'We got a total of {} sentiment windows.'.format(len(sentiment))
We got a total of 315 sentiment windows.
Now that we have the sentiment vector, we can plot the sentiment over time. Note that the periods and some axis labels have disappeared; this is deliberate, in order to increase readability.
Step 4: Plotting the sentiment of all of the tweets and the sentiment of the popular tweets.
x = np.arange(len(sentiment))
style.use('ggplot')
# defining titles and axis names
plt.title('Sentiment Timeline', fontsize=20)
plt.xlabel('Hours after event', fontsize=17)
plt.ylabel('Normalized Sentiment Index', fontsize=17)
# some styling
plt.tick_params(axis='both', which='major', labelsize=12)
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)
# and finally, we plot.
plt.plot(x, sentiment, linewidth=2, label= 'General Sentiment', color='#50514f')
plt.scatter(x, pop_sentiment, linewidth=2, label= 'Popular Tweet Sentiment', color='#f25f5c')
plt.legend(loc=1, prop={'size': 15})
pylab.rcParams['figure.figsize'] = (20, 10)
plt.show()
A couple of things are interesting from this graph:
The main idea of this part of the sentiment analysis is to have a visual representation of the most used words throughout our database.
To accomplish this task, we will use the very handy WordCloud library.
Let's start by importing some much needed libraries
Step 1: Importing Libraries
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import urllib, cStringIO
Step 2: Getting all the text from the database
The idea in this part is to put all of the tweets into one long string called text. But while we do this, we also have to clean these tweets.
This string was built in step 2 of part 3.1.2.
We start by selecting a nice image for the word cloud contour, in this case an image of Texas, which can be found at the link below.
Step 1: Get a nice image
image_path = cStringIO.StringIO(urllib.urlopen('https://i.imgur.com/Bi09JtN.png').read())
texas_mask = np.array(Image.open(image_path))
When querying Twitter for tweets, we used some words related to the event we are analysing. These are obviously going to appear a lot in our database; therefore, using the STOPWORDS set from WordCloud, we can easily exclude them.
Step 2: Avoiding obvious words
stopwords = set(STOPWORDS)
stopwords.add('Sutherland')
stopwords.add('Texas')
stopwords.add('Shooting')
stopwords.add('Springs')
Now that we have all of the elements to plot it, let's finally do it.
Step 3: Plotting everything nicely.
# defining the wordcloud with stopwords and some edgy styling choices.
word_cloud = WordCloud(mask=texas_mask, width=800, height=400,background_color="white", collocations=False,colormap='inferno', stopwords=stopwords).generate(text)
# plot it.
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
A couple of interesting points worth mentioning:
The final part of the sentiment analysis is all about emoji. We started by using a very simple version of DeepMoji, where an "emoji" score was given to each tweet. In our database, each tweet now possesses the field "deepmoji", where we find the 5 most likely emoji that characterize that tweet, together with the "reliability" of each of them.
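As a rough sketch of what such a document looks like (the emoji values are indices into the emoji_index dictionary built in Step 1 below; the reliability key names are our assumption, since only the Emoji_* keys are used in this notebook):
# illustrative shape of the stored field, not actual data
# tweet['deepmoji'] == {
#     'Emoji_1': 45, 'Emoji_2': 12, 'Emoji_3': 3, 'Emoji_4': 60, 'Emoji_5': 31,  # top-5 emoji indices
#     'Pct_1': 0.34, 'Pct_2': 0.18, ...                                          # hypothetical reliability keys
# }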
Step 1: Creating a dictionary of emojis from a txt file.
emoji_index = {}
with open('ressources/emoji.txt') as f:
counter = 0
for line in f: # for every line
contents = [x.strip() for x in line.split(',')] # split line into 2
emoji_index[counter] = contents # contents = [name of emoji, url to emoji photo]
counter += 1
Step 2: A simple example.
# define what we will not need from Mongo
display_conditions = {"query_criteria": 0, "_id": 0,
"geo": 0, "mentions": 0,
"hashtags": 0, "favorites": 0,
"permalink": 0, "username": 0,
"retweets": 0, "id": 0}
tweets = tweet_collection.find({'deepmoji': {'$exists': True}},display_conditions)[43:45]
# for 2 tweets, extract the deepmoji field
for t in tweets:
emoji_list = [t['deepmoji']['Emoji_1'],
t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
print 'Tweet: ', t['text']
print 'Emojis:',
for emoji_number in emoji_list:
print emoji.emojize(emoji_index[emoji_number][0], use_aliases=True),
print '\n'
Tweet: FYI, this is the 2nd mass shooting against people praying in a church in under 3 years. Spoiler: It doesn't stop gun violence. #SutherlandSpring Do your job @SpeakerRyan @SenateMajLdr @POTUS
Emojis: 🔫 😡 🙏 😠 😈
Tweet: This is totally the turning point on #GunControl legislation right? #SutherlandSpring #Texas #OnceAgain #Enough
Emojis: 😡 👍 😠 😳 😬
We can note that DeepMoji makes a pretty accurate characterization of the sentences, not perfect, but accurate enough.
The first idea for the emoji/sentiment analysis is to visualize which emojis are used the most in the whole dataset.
Step 1: Count the occurrences of every emoji
# get emojis in a list
db_my_tweets = tweet_collection.find({'deepmoji': {'$exists': True}},
display_conditions).sort("date", pymongo.ASCENDING)
mega_list = []
for t in db_my_tweets:
emoji_list = [t['deepmoji']['Emoji_1'],
t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
mega_list += emoji_list
# get a counter of that list
counter_ = Counter(mega_list)
labels, values = zip(*counter_.items())
indexes = np.arange(len(labels))
Step 2: A histogram of all of the emojis
plt.barh(labels, values, color=['#50514f', '#f25f5c', '#ffe066', '#247ba0'])
plt.yticks(range(len(labels)),
[emoji_index[i][0][1:-1] for i in range(len(labels))], fontsize=14)
plt.xlabel('Emoji Frequency')
plt.ylabel('Emoji Name')
pylab.rcParams['figure.figsize'] = (20, 15)
plt.title('Most Used Emojis in DataSet')
plt.show()
Step 3: A simpler way to visualize.
# print the top 20 emojis and their frequency
top = 20
top_list = counter_.most_common(top)
print 'The top {} sentiments according to deepmoji:'.format(top)
for i in range(len(top_list)):
item_emoji = top_list[i][0]
item_frequency = top_list[i][1]
print i+1,'.',emoji.emojize(emoji_index[item_emoji][0], use_aliases=True), 'with', item_frequency, 'characterizations.'
The top 20 sentiments according to deepmoji:
1 . ♥ with 17093 characterizations.
2 . 🔫 with 16037 characterizations.
3 . 💔 with 14085 characterizations.
4 . 🙏 with 13855 characterizations.
5 . 👍 with 12353 characterizations.
6 . 💟 with 12049 characterizations.
7 . 😢 with 12015 characterizations.
8 . 😡 with 11016 characterizations.
9 . 😠 with 8095 characterizations.
10 . 😐 with 5371 characterizations.
11 . 😕 with 4528 characterizations.
12 . ✌ with 4367 characterizations.
13 . 😳 with 4258 characterizations.
14 . ❤ with 4069 characterizations.
15 . 😞 with 3591 characterizations.
16 . 😈 with 3469 characterizations.
17 . 😑 with 3354 characterizations.
18 . 💪 with 3152 characterizations.
19 . 👊 with 3115 characterizations.
20 . ✨ with 2952 characterizations.
Some interesting things to note:
Note: Some emojis are not displayed well by OS X; in the emoji_index dictionary you can consult the links for these emoji yourself. (DeepMoji is made with Twitter Emoji.)
The goal of this part of the analysis is to see how the emoji characterization evolves over time. For example, does the characterization of tweets by the 'gun' emoji change over time? If so, how?
Step 1: Call the tweets that we need.
db_my_tweets = tweet_collection.find({'deepmoji': {'$exists': True}},
display_conditions).sort("date", pymongo.ASCENDING)
The next step is kind of rough: basically, we want to build a matrix, called emoji_grid, where the rows are the various possible emoji (64) and the columns are periods of time. Therefore, the element in position [i, j] of the emoji_grid is the normalized number of characterizations by emoji i in period j.
In this case, we will look at the characterizations every 4 hours since the shooting and see how the classification evolves.
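Written out, with $T_j$ the set of tweets in the $j$-th 4-hour window (our notation):
$$\mathrm{emoji\_grid}[i, j] = \frac{\left|\{\text{top-5 DeepMoji predictions equal to emoji } i \text{ over the tweets in } T_j\}\right|}{|T_j|}$$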
Step 2: A matrix that stores the emoji frequency per time period
# start the matrix
emoji_grid = np.zeros((len(emoji_index.values()), 1))
column = np.zeros((len(emoji_index.values()), 1))
# define important variables before loop
absolute_hour = 0
tweets_in_period = 0
periods = []
hours_passed = 0
period_length = 4
# for every tweet
for t in db_my_tweets:
tweets_in_period += 1 # tweet counter for period of time
tweet_hour = t['date'].hour
tweet_day = t['date'].day
    # extract the deepmoji classification
tweet_emoji_list = [t['deepmoji']['Emoji_1'], t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'], t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
for emoji_number in tweet_emoji_list:
column[emoji_number, 0] += 1
# this counter counts the hours that have passed
if tweet_hour != absolute_hour:
hours_passed += 1
absolute_hour = tweet_hour
    # if X hours have passed, append these deepmoji classifications to the master emoji_grid
if hours_passed == period_length:
periods.append('{}/11 at {}'.format(tweet_day, tweet_hour))
emoji_grid = np.hstack((emoji_grid, column / tweets_in_period)) # here we normalize
column = np.zeros((len(emoji_index.values()), 1))
hours_passed = 0
tweets_in_period = 0
emoji_grid = np.delete(emoji_grid, 0, 1) # deletes the redundant first column.
Step 3: Plotting the emoji grid.
# define important variables.
plot_top = 5 # only the most frequent emojis are plotted for simplicity
counter = 0
colors = ['#50514f', '#f25f5c', '#ffe066', '#247ba0', '#70c1b3', '#50514f', '#f25f5c', '#ffe066', '#247ba0', '#70c1b3']
# for every emoji (row of the emoji_grid) that is on the top X, plot it over the periods.
for i in range(emoji_grid.shape[0]):
if i in np.argsort(np.sum(emoji_grid, axis=1))[::-1][:plot_top]:
s = plt.plot(range(emoji_grid.shape[1]), emoji_grid[i, :],
label=emoji_index[i][0][1:-1], linewidth=2, color=colors[counter])
counter += 1
# define titles and axis names
plt.title('Deepmoji Characterization Every {} Hours'.format(period_length), fontsize=20)
plt.xlabel('Time', fontsize=17)
plt.ylabel('Normalized Sentiment Frequency', fontsize=17)
plt.tick_params(axis='both', which='major', labelsize=12)
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)
plt.xticks(range(emoji_grid.shape[1]), periods, rotation='vertical')
pylab.rcParams['figure.figsize'] = (20, 10)
plt.legend()
plt.show()
This figure describes the tweets with emojis over time, using the DeepMoji model. Several things in this graph are interesting and worth describing; let's mention some of them: