This section covers the data collection and declares the variables used throughout the analysis. By changing these variables and re-running the code, you can repeat the complete analysis for a different event.
# these words will be used to look for hashtags
query_hashtags = ['sutherland springs' ,'sutherland spring', 'texas church shooting', 'texas shooting', 'texas church massacre', 'church shooting']
# add the concatenated versions of these strings for hashtags
query_hashtags += map(lambda s: s.replace(' ', ''), query_hashtags)
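For reference, after these two lines query_hashtags contains both the spaced and the concatenated forms:
print query_hashtags
# ['sutherland springs', 'sutherland spring', 'texas church shooting', 'texas shooting',
#  'texas church massacre', 'church shooting', 'sutherlandsprings', 'sutherlandspring',
#  'texaschurchshooting', 'texasshooting', 'texaschurchmassacre', 'churchshooting']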
We start by importing all of the data stored in our MongoDB Atlas cluster.
from pymongo import MongoClient
import twitter, pickle, sys
import pymongo
import numpy as np
import pandas as pd
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', -1)
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (15, 6)
# initialize mongo client on MongoDB Atlas
client = MongoClient("mongodb://socialgraphs:interactions@socialgraphs-shard-00-00-al7cj.mongodb.net:27017,socialgraphs-shard-00-01-al7cj.mongodb.net:27017,socialgraphs-shard-00-02-al7cj.mongodb.net:27017/test?ssl=true&replicaSet=SocialGraphs-shard-0&authSource=admin")
db = client.texas
# access tweet collection
# TODO: only select unique tweets
tweet_collection = db.tweetHistory
myTweets = tweet_collection.find()
Populating the interactive namespace from numpy and matplotlib
/anaconda/lib/python2.7/site-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['f', 'text']
`%matplotlib` prevents importing * from pylab and numpy
Let's start by figuring out how many tweets we have in total.
print 'We have a total of {} tweets.'.format(myTweets.count())
We have a total of 52903 tweets.
Let us figure out some very basic statistics about our dataset.
# define initial values
userSet = set()
totalNumberOfWords = 0.0
totalRetweets = 0.0
totalFavorites = 0.0
differentHashtags = set()
# loop over data
for tweet in myTweets:
userSet.add(tweet['username'])
differentHashtags = differentHashtags.union(set(tweet['hashtags'].split()))
totalNumberOfWords += len(tweet['text'].split())
totalRetweets += tweet['retweets']
totalFavorites += tweet['favorites']
# define means
averageLength = totalNumberOfWords / myTweets.count()
averageRetweets = totalRetweets / myTweets.count()
averageFavorites = totalFavorites / myTweets.count()
# print results
print 'There are {} different users.'.format(len(userSet))
print 'The average length of a tweet is {} words.'.format(averageLength)
print 'A total of {} different hashtags are used.'.format(len(differentHashtags))
print 'The average number of retweets: {}'.format(averageRetweets)
print 'The average number of favorites: {}'.format(averageFavorites)
There are 26289 different users.
The average length of a tweet is 19.2577169537 words.
A total of 7342 different hashtags are used.
The average number of retweets: 3.88087632081
The average number of favorites: 8.00228720488
What are the most favorited tweets, the most retweeted tweets and the respective users and times?
We start by building the queries, taking advantage of MongoDB's sorting and projections.
# create the sorted cursors and define the number of desired elements
get_top = 5
display_conditions = {"deepmoji": 0, "permalink":0, "id":0,"date":0, "query_criteria":0, "_id":0, "geo":0, "mentions":0, "hashtags":0}
db_by_retweets = tweet_collection.find({}, display_conditions).sort("retweets",pymongo.DESCENDING)[0:get_top]
db_by_favorites = tweet_collection.find({}, display_conditions).sort("favorites",pymongo.DESCENDING)[0:get_top]
# a function that takes a cursor and pretty prints it.
def print_result(database):
array = []
for t in database:
array.append(t)
pandas = pd.DataFrame(array)
display(pandas)
And we print the results.
print 'The most retweeted:'
print_result(db_by_retweets)
print 'The most favorited:'
print_result(db_by_favorites)
The most retweeted:
 | favorites | retweets | text | username |
---|---|---|---|---|
0 | 17054 | 6641 | I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. | shannonrwatts |
1 | 17054 | 6641 | I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. | shannonrwatts |
2 | 20126 | 5912 | The tragedy in Sutherland Springs happened a little over a week ago. Don’t let this fade into the next news cycle. We need gun safety reforms. Now. | KamalaHarris |
3 | 13447 | 5252 | It's been only one week since the Texas mass shooting . 42 days since the Vegas mass shooting . 53 days since Puerto Rico lost power and humanitarian crisis began. Time feels off with this much tragedy. | sarahkendzior |
4 | 2523 | 3922 | Anyone hear about this from media??? This happened early Saturday morning, before the #TexasChurchShooting Suspected ILLEGAL ALIEN shoots at cars on I-35 with AR style rifle, hits 7 yr old girl in the head, and 4 others. http://www. informationliberation.com/?id=57612 | ChristieC733 |
The most favorited:
 | favorites | retweets | text | username |
---|---|---|---|---|
0 | 20126 | 5912 | The tragedy in Sutherland Springs happened a little over a week ago. Don’t let this fade into the next news cycle. We need gun safety reforms. Now. | KamalaHarris |
1 | 17054 | 6641 | I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. | shannonrwatts |
2 | 17054 | 6641 | I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. | shannonrwatts |
3 | 13447 | 5252 | It's been only one week since the Texas mass shooting . 42 days since the Vegas mass shooting . 53 days since Puerto Rico lost power and humanitarian crisis began. Time feels off with this much tragedy. | sarahkendzior |
4 | 9842 | 3784 | NRA can confirm Stephen Willeford is a member & has been certified as a NRA firearms instructor. #SutherlandSprings http://www. 4029tv.com/article/man-wh o-shot-texas-church-gunman-shares-his-story/13437943 … | DLoesch |
It's also very interesting to understand the tweets in our database from a chronological point of view.
# get dates and truncate them to the hour for readability purposes
myTweets = tweet_collection.find()
dates = list(set([tweet['date'] for tweet in myTweets]))
no_seconds = [date.replace( minute=0, second=0, microsecond=0) for date in dates]
# count occurrences
counter = dict(Counter(no_seconds))
# prepare plot
x = []
y = []
for element in counter:
x.append(element)
y.append(counter[element])
# plot nicely
plt.title('Number of tweets per date')
plt.ylabel('Number of tweets')
plt.xlabel('Date')
plt.scatter(x, y, c=y, marker='.', s=y)
plt.xlim([min(x), max(x)])
plt.grid()
plt.show()
From the tweets we collected, we generate two different networks that we will use throughout the rest of the analysis. In both networks, the nodes are the users who have been tweeting about the event using one of the predefined hashtags.
For the first network, the edges are constructed from mentions in these tweets: when a tweet mentions another user that is also a node in the network, there will be an edge between these two nodes. We will refer to this network as mention_graph.
For the second network, we define an edge between two nodes if they share a common hashtag, not including the query hashtags. For example, if tweets from two different users both use the hashtag #GunSense, we create an edge between them. We will refer to this network as hashtag_graph.
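As a toy illustration of the two edge definitions (made-up usernames, not part of the dataset):
import networkx as nx
toy_mention = nx.Graph()
toy_mention.add_edge('alice', 'bob')  # alice mentioned @bob in a tweet and both are nodes
toy_hashtag = nx.Graph()
toy_hashtag.add_edge('alice', 'carol', attr='gunsense')  # both used the non-query hashtag #GunSense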
Below we will start creating the networks.
import networkx as nx
from collections import defaultdict
from itertools import combinations
# start by finding all unique usernames in the tweets that have either mentions or hashtags
where = {
'$or': [
{
'mentions': {
'$ne': ''
},
},{
'hashtags': {
'$ne': ''
}
}
]
}
usernames = tweet_collection.distinct('username', where)
# create two separate simple graphs, one for mention relations and one for shared hashtags
mention_graph = nx.Graph()
hashtag_graph = nx.Graph()
# add nodes from users that wrote tweets
mention_graph.add_nodes_from(usernames)
hashtag_graph.add_nodes_from(usernames)
# add edges to mention_graph between mentions in tweets
# get all tweets with their mentions
tweet_mentions = list(tweet_collection.find({'mentions': {'$ne' : '',}}, {'username': 1, 'mentions': 1}))
# define a default dictionary to store the unique mentions used per user as a set
mentions_dict = defaultdict(set)
# populate dict {username: set(mentions)}
for tweet in tweet_mentions:
# split mentions from string to list
mentions = map(lambda mention: mention[1:], tweet['mentions'].split(' '))
# update dict
mentions_dict[tweet['username']].update(mentions)
# add edges from mentions_dict
for user, mentions in mentions_dict.iteritems():
for to_node in mentions:
if mention_graph.has_node(to_node):
mention_graph.add_edge(user, to_node)
# add edges to the hashtag_graph
# get all tweets with hashtags
tweet_hashtags = tweet_collection.find({'entities.hashtags': {'$ne': ''}}, {'username': 1, 'hashtags': 1})
# initialize a defaultdict to track the unique hashtags
# and how often users are using them
hashtags_dict = defaultdict(lambda: defaultdict(int))
# populate the dict {hashtag: {username: count}}
for tweet in tweet_hashtags:
username = tweet['username']
# list of hashtags
hashtags = map(lambda tag: tag.replace('#', '').lower(), tweet['hashtags'].split(' '))
# remove the query_hashtags
new_tags = list(set(set(hashtags) - set(query_hashtags)))
if len(new_tags) > 0:
for tag in new_tags:
if tag:
hashtags_dict[tag][username] += 1
# add edges between all users with the same hashtag if they have used it more than twice
for tag, userdict in hashtags_dict.iteritems():
# find users who used the tag more than twice
usernames = [username for username, count in userdict.iteritems() if count > 2]
# create tuples of possible combinations of nodes
sets = combinations(usernames, 2)
# add edges
for combi in sets:
        hashtag_graph.add_edge(*combi, attr=tag)
print 'Mention Graph:'
print ' - Number of nodes:', len(mention_graph.nodes())
print ' - Number of edges:', len(mention_graph.edges())
print ' - Average degree:', float(sum(nx.degree(mention_graph).values())) / len(mention_graph.nodes())
print 'Hashtag Graph:'
print ' - Number of nodes:', len(hashtag_graph.nodes())
print ' - Number of edges:', len(hashtag_graph.edges())
print ' - Average degree:', float(sum(nx.degree(hashtag_graph).values())) / len(hashtag_graph.nodes())
Mention Graph:
 - Number of nodes: 15992
 - Number of edges: 1700
 - Average degree: 0.212606303152
Hashtag Graph:
 - Number of nodes: 15992
 - Number of edges: 29894
 - Average degree: 3.73861930965
plt.style.use('fivethirtyeight')
# get degree distributions
mention_degree = nx.degree(mention_graph)
hashtag_degree = nx.degree(hashtag_graph)
# get minimum and maximum degrees
min_mention_degree, max_mention_degree = min(mention_degree.values()), max(mention_degree.values())
min_hashtag_degree, max_hashtag_degree = min(hashtag_degree.values()), max(hashtag_degree.values())
# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.hist(sorted(mention_degree.values(),reverse=True), range(min_mention_degree, max_mention_degree + 1)) # degree sequence
plt.subplot(212)
plt.title('Hashtag graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
plt.hist(sorted(hashtag_degree.values(),reverse=True), range(min_hashtag_degree, max_hashtag_degree + 1)) # degree sequence
plt.show()
# get all the separate components
components_mention = sorted(nx.connected_component_subgraphs(mention_graph), key=len, reverse=True)
components_hashtag = sorted(nx.connected_component_subgraphs(hashtag_graph), key=len, reverse=True)
print 'The mention graph has {0} disconnected components'.format(len(components_mention))
print 'The hashtag graph has {0} disconnected components'.format(len(components_hashtag))
plt.figure()
plt.subplot(211)
mention_component_lengths = [len(c) for c in components_mention]
plt.yscale('log', nonposy='clip')
plt.title('Mention graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
max_mcl = max(mention_component_lengths)
plt.hist(mention_component_lengths, range(max_mcl + 1))
plt.subplot(212)
plt.yscale('log', nonposy='clip')
plt.title('Hashtag graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
hashtag_component_lengths = [len(c) for c in components_hashtag]
max_hcl = max(hashtag_component_lengths)
plt.hist(hashtag_component_lengths, range(max_hcl + 1))
plt.tight_layout()
The mention graph has 14501 disconnected components
The hashtag graph has 15245 disconnected components
Since the full graphs are so disconnected, we decide to work only with the giant connected component (GCC) of each graph. This allows us to perform a more in-depth analysis.
# get the giant connected component for both graphs
mention_gcc, hashtag_gcc = components_mention[0], components_hashtag[0]
print 'Mention GCC'
print ' - Number of nodes:', len(mention_gcc.nodes())
print ' - Number of edges:', len(mention_gcc.edges())
print ' - Average degree:', float(sum(nx.degree(mention_gcc).values())) / len(mention_gcc.nodes())
print 'Hashtag GCC:'
print ' - Number of nodes:', len(hashtag_gcc.nodes())
print ' - Number of edges:', len(hashtag_gcc.edges())
print ' - Average degree:', float(sum(nx.degree(hashtag_gcc).values())) / len(hashtag_gcc.nodes())
# draw the graphs
nx.draw_networkx(mention_gcc, pos=nx.spring_layout(mention_gcc), node_size=[v * 100 for v in mention_degree.values()], with_labels=False)
plt.title('Mention GCC')
plt.show()
nx.draw_networkx(hashtag_gcc, pos=nx.spring_layout(hashtag_gcc), node_size=[v * 0.1 for v in hashtag_degree.values()], with_labels=False)
plt.title('Hashtag GCC')
plt.show()
Mention GCC
 - Number of nodes: 1091
 - Number of edges: 1211
 - Average degree: 2.21998166819
Hashtag GCC:
 - Number of nodes: 718
 - Number of edges: 29828
 - Average degree: 83.0863509749
Since we are now only looking at the GCC of both graphs, we plot the degree distribution again. This time there are no nodes without edges. The shapes are, however, remarkably similar to those of the full graphs. The distributions are plotted on a logarithmic scale so we can easily see whether the degrees follow a power-law distribution. The mention_graph still looks logarithmic, so it seems to be an extreme distribution with a few highly connected nodes and many poorly connected nodes. The hashtag_graph seems to follow a more linear relation now that one axis is logarithmic.
mention_degree_gcc = nx.degree(mention_gcc)
hashtag_degree_gcc = nx.degree(hashtag_gcc)
# get minimum and maximum degrees
max_mention_gcc_degree = max(mention_degree_gcc.values())
max_hashtag_gcc_degree = max(hashtag_degree_gcc.values())
# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention GCC degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.hist(sorted(mention_degree_gcc.values(),reverse=True), range(max_mention_gcc_degree + 1)) # degree sequence
plt.subplot(212)
plt.title('Hashtag GCC degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
plt.hist(sorted(hashtag_degree_gcc.values(),reverse=True), range( max_hashtag_gcc_degree + 1)) # degree sequence
plt.tight_layout()
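To go beyond the visual impression from the plots above, the power-law hypothesis could be checked explicitly. The snippet below is only a sketch, not part of the original analysis; it assumes the third-party powerlaw package is installed and uses the GCC degree dictionaries computed in the previous cell.
import powerlaw  # third-party package: pip install powerlaw
# fit a discrete power law to each GCC degree sequence and compare it to an exponential fit
for name, degrees in [('Mention', mention_degree_gcc.values()), ('Hashtag', hashtag_degree_gcc.values())]:
    fit = powerlaw.Fit(degrees, discrete=True)
    R, p = fit.distribution_compare('power_law', 'exponential')
    print '{} GCC: alpha = {:.2f}, power law vs. exponential: R = {:.2f} (p = {:.3f})'.format(name, fit.power_law.alpha, R, p)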
The next step in our analysis is to detect communities in our networks and see what these communities revolve around. First, we will look into the sizes of the communities and the biggest accounts in the biggest communities, to get a sense of the kinds of accounts we find. Then, we will look into the most common hashtags used in every community of the mention graph, to get a feeling for the topics that live in each community.
We use the Louvain method [1] for community detection with the following implementation in Python.
import community
# use the python Louvain implementation to find communities in the networks
partition_mention = community.best_partition(mention_gcc)
partition_hashtag = community.best_partition(hashtag_gcc)
#drawing
mention_size = float(len(set(partition_mention.values())))
pos = nx.spring_layout(mention_gcc)
count = 0.
for com in set(partition_mention.values()) :
count = count + 1.
list_nodes = [nodes for nodes in partition_mention.keys()
if partition_mention[nodes] == com]
nx.draw_networkx_nodes(mention_gcc, pos, list_nodes, node_size = 20,
node_color = str(count / mention_size))
print 'For the mention GCC we have found {} communities'.format(int(mention_size))
nx.draw_networkx_edges(mention_gcc,pos, alpha=0.5)
plt.show()
hashtag_size = float(len(set(partition_hashtag.values())))
pos = nx.spring_layout(hashtag_gcc)
count = 0.
for com in set(partition_hashtag.values()) :
count = count + 1.
list_nodes = [nodes for nodes in partition_hashtag.keys()
if partition_hashtag[nodes] == com]
nx.draw_networkx_nodes(hashtag_gcc, pos, list_nodes, node_size = 20,
node_color = str(count / hashtag_size))
print 'For the hashtag GCC we have found {} communities'.format(int(hashtag_size))
nx.draw_networkx_edges(hashtag_gcc,pos, alpha=0.5)
plt.show()
For the mention GCC we have found 25 communities
For the hashtag GCC we have found 10 communities
We can see that for the mention graph, the communities have for the most part centred themselves around major news outlets. We see @FoxNews and @ABCNews, but also local news stations such as @dallasnews and their reporters, like @lmcgaughy. The communities of the hashtag graph seem a bit more random. We do, however, recognize the Twitter account of Sputnik News, a Russian state-controlled media outlet linked to fake news on multiple occasions, and marypatriotnews.com, which is a hyper-conservative outlet to say the least.
It is interesting to see that the highest-degree nodes in the hashtag_graph's partitions are not necessarily accounts with many followers. The links in this network are much more 'democratic': anyone who uses a lot of prevalent hashtags can become a well-connected node in the graph. This is different from the mention_graph, where a user only gets mentioned a lot if he is well known and thus likely to have many followers.
# look at the accounts with the highest degree in each partition
import json  # needed below to serialize the partition overviews
# twitter api credentials for lookup
CONSUMER_KEY='29JF8nm5HttFcbwyNXkIq8S5b'
CONSUMER_SECRET='szo1IuEuyHuHCnh93VjLLGb5xg9NcfDVqMsLtOt3DbL5hXxpbt'
OAUTH_TOKEN='575782797-w96NPIzKF07TpC3c78nEadEfACLclYvSusuOPl8z'
OAUTH_TOKEN_SECRET='h0oitwxLkDjOLSejSQl2AWSrcmNeUwBpEvSUWonYzZTNz'
# instantiate API object
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api= twitter.Twitter(auth=auth)
# auxiliary function: invert a partition dict {node: community} into {community: [nodes]}
def inverse_partition(partition):
components_inv = defaultdict(list)
for key, value in partition.iteritems():
components_inv[value].append(key)
return components_inv
# get top accounts by degree
def partition_top_accounts(partition, degree):
part_inv = inverse_partition(partition)
return {part: max(usernames, key=lambda user: degree[user]) for part, usernames in part_inv.iteritems()}
# get data on account
def twitter_account(username):
return twitter_api.users.lookup(screen_name=username)
# display data in dataframe
def pprint_partition_overview(partition, degree, outfile=None):
data = []
columns = ['Partition', 'Partition Size', 'Screen Name', 'Name', 'Url', 'Location', 'Followers', 'Degree']
top_accounts = partition_top_accounts(partition, degree)
for part_id, account in top_accounts.iteritems():
user = twitter_account(account)[0]
# print pprint( user)
url = ''
try:
url = user['entities']['url']['urls'][0]['display_url']
except:
pass
row = {
'Partition': part_id,
'Partition Size': len(inverse_partition(partition)[part_id]),
'Screen Name': account,
'Name': user['name'],
'Url': url,
'Location': user['location'],
'Followers': user['followers_count'],
'Degree': degree[account]
}
data.append(row)
data.sort(key=lambda row: row['Partition Size'], reverse=True)
df = pd.DataFrame(data)
df = df[columns]
display(df)
if outfile:
serialized = json.dumps(data)
with open('data/{}'.format(outfile), 'w') as ofile:
ofile.write(serialized);
print 'The mention graph partitions with an overview of the accounts with the highest degrees'
pprint_partition_overview(partition_mention, mention_degree_gcc, 'mention_partition_accounts.json')
pprint_partition_overview(partition_hashtag, hashtag_degree_gcc, 'hashtag_partition_accounts.json')
The mention graph partitions with an overview of the accounts with the highest degrees
 | Partition | Partition Size | Screen Name | Name | Url | Location | Followers | Degree |
---|---|---|---|---|---|---|---|---|
0 | 9 | 270 | FoxNews | Fox News | foxnews.com | U.S.A. | 16589039 | 288 |
1 | 3 | 86 | JohnCornyn | Senator JohnCornyn | cornyn.senate.gov | Austin, Texas | 120492 | 53 |
2 | 6 | 83 | ABC | ABC News | ABCNews.com | New York City / Worldwide | 12986622 | 58 |
3 | 5 | 69 | USATODAY | USA TODAY | usatoday.com | USA TODAY HQ, McLean, Va. | 3529784 | 54 |
4 | 4 | 56 | lmcgaughy | Lauren McGaughy | dallasnews.com/author/lauren-… | Austin, TX | 10613 | 35 |
5 | 10 | 56 | AP | The Associated Press | apnews.com | Global | 12058706 | 53 |
6 | 2 | 46 | usairforce | U.S. Air Force | af.mil | Air, Space and Cyberspace | 886224 | 46 |
7 | 13 | 46 | DLoesch | Dana Loesch | amazon.com/Flyover-Nation… | Texas, USA | 674971 | 14 |
8 | 8 | 45 | chelseahandler | Chelsea Handler | chelseahandler.com | Los Angeles, CA | 8343721 | 23 |
9 | 17 | 45 | Everytown | Everytown | Everytown.org | 114979 | 19 | |
10 | 15 | 43 | RealAlexJones | Alex Jones | infowars.com | Austin, TX | 754784 | 16 |
11 | 12 | 37 | scrowder | Steven Crowder | louderwithcrowder.com | Ghostlike | 491141 | 25 |
12 | 1 | 28 | ExpressNews | San Antonio E-N | ExpressNews.com | San Antonio, TX | 19660 | 8 |
13 | 7 | 23 | abcnews | ABC News | abc.net.au/news | Australia | 1414667 | 11 |
14 | 11 | 23 | RNS | Religion News Service | religionnews.com | DC, NYC, London, Rome | 73982 | 13 |
15 | 0 | 21 | KHOU | KHOU 11 News Houston | khou.com | Houston, TX | 659018 | 13 |
16 | 19 | 20 | KENS5 | KENS 5 | kens5.com | San Antonio, Texas | 130446 | 12 |
17 | 18 | 19 | YahooNews | Yahoo News | yahoo.com/news/ | New York City | 1130694 | 17 |
18 | 16 | 17 | israelnash | Israel Nash | twitter.com/israelnash | Dripping Springs, TX | 2453 | 8 |
19 | 20 | 12 | foxandfriends | FOX & friends | foxandfriends.com | New York City | 1084193 | 11 |
20 | 22 | 12 | NewsHour | PBS NewsHour | pbs.org/newshour/ | Arlington, VA | New York, NY | 983682 | 8 |
21 | 24 | 12 | NRO | National Review | NationalReview.com | New York | 271442 | 11 |
22 | 21 | 8 | ChrisCuomo | Christopher C. Cuomo | ChrisCuomo.com | In the Arena | 1271167 | 7 |
23 | 14 | 7 | InsideEdition | Inside Edition | insideedition.com | New York | 72604 | 4 |
24 | 23 | 7 | ABCWorldNews | World News Tonight | abcnews.com/wnt | New York | 1277571 | 6 |
 | Partition | Partition Size | Screen Name | Name | Url | Location | Followers | Degree |
---|---|---|---|---|---|---|---|---|
0 | 0 | 279 | BigGator5 | BigGator5 | biggator5.net/about/twitter-… | Lake County, Florida | 5876 | 343 |
1 | 1 | 192 | Adrian_Rafael | Adrian R. Morales | 1239 | 268 | ||
2 | 4 | 84 | MacChomhghaill | McChomhghaill | TrumpUnifies.tk | Northern California, USA | 4361 | 362 |
3 | 2 | 79 | TrendingNewsTV | Trending Newscast | tn.dvolatility.com | Metro Detroit, MI | 205 | 78 |
4 | 3 | 60 | Ms1Scs | #DeepStateSwampDrain | USA | 6760 | 93 | |
5 | 9 | 6 | Johnathin79 | Lock'm ALL Up! | 6851 | 7 | ||
6 | 5 | 5 | SputnikInt | Sputnik | sputniknews.com | 205091 | 11 | |
7 | 7 | 5 | Expose_The_Lies | ExposeTheLies | facebook.com/ExposeTheLies | 97 | 6 | |
8 | 8 | 5 | PrgrsvArchitect | ProgressiveArchitect | Tucson, AZ USA | 1015 | 7 | |
9 | 6 | 3 | MaryPatriotNews | Mary Budesheim | marypatriotnews.com | Glens Falls, NY | 13642 | 4 |
from collections import Counter
# community size histogram
hashtag_com_count = Counter(partition_hashtag.values())
mention_com_count = Counter(partition_mention.values())
plt.figure()
plt.subplot(211)
plt.title('Sizes of hashtag communities')
plt.xlabel('Community number')
plt.ylabel('Number of nodes')
plt.bar(hashtag_com_count.keys(), hashtag_com_count.values())
plt.subplot(212)
plt.title('Sizes of mention communities')
plt.xlabel('Community number')
plt.ylabel('Number of nodes')
plt.bar(mention_com_count.keys(), mention_com_count.values())
plt.tight_layout()
To drill down further on what goes on in the communities found in either network, we will look at the hashtags that are used in these communities and how they relate to each other. A heatmap shows the number of occurrences of the top hashtags in each community. The brighter the color, the more prevalent that hashtag is in the tweets from that community, and by extension among its users. This gives us an idea about the ideas or opinions of these users. The data is a little sparse for some communities, since they are small and there are not that many hashtags, but we can see some interesting patterns emerge in the ones that do have data.
import matplotlib.style
import matplotlib as mpl
mpl.style.use('classic')
def partition_hashtag_analysis(partition):
# inverse the partitioning to get a dict with { partitioning_id : [usernames]}
components_inv = inverse_partition(partition)
# get all hashtags used by users in combination with our query_hashtags
components_hashtags = defaultdict(list)
for part_id, usernames in components_inv.iteritems():
tweets = tweet_collection.find({
'username': {
'$in': usernames
},
'hashtags': {
'$ne': '',
'$nin': map(lambda s: '#' + s, query_hashtags)
},
},
{
'hashtags': 1
})
        # filter the query hashtags out of each tweet's hashtags
for row in tweets:
tags = [tag for tag in row['hashtags'].lower().replace('#', '').split(' ') if tag not in query_hashtags]
components_hashtags[part_id] += tags
part_tag_counts = {}
for part_id, tags in components_hashtags.iteritems():
counts = Counter(tags)
part_tag_counts[part_id] = counts
return part_tag_counts
mention_com_hashtags = partition_hashtag_analysis(partition_mention)
# Heatmap
# get most common hashtags in general
number_of_tags = 25
hashtags_count = Counter([tags for part_id in mention_com_hashtags.itervalues() for tags in part_id])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))
# from the 'most_common_hashtags', manually group hashtags together on political orientation
# against gun carry
against = ['gunsense', 'guncontrol', 'guncontrolnow','gunviolence', 'stopgunviolence','backgroundchecks']
# generally in favor of gun carry
in_favor = ['trump', 'maga', '2a','tcot', 'nra', 'msm']
# neutral
most_common_tags = against + in_favor + ['texas', 'sutherlandspringsshooting', 'lasvegasshooting', 'texasstrong', 'ksatnews', 'usatoday', 'kens5eyewitness', 'church', 'shooting', 'firstbaptistchurch', 'gun', 'airforce']
# create a matrix of the counts of the most common hashtags in the communities
heat_array = np.array([[counts[tag] for tag in most_common_tags] for counts in mention_com_hashtags.values()])
# plot heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(len(most_common_tags)), most_common_tags, rotation='vertical')
plt.yticks(range(int(mention_size)), mention_com_hashtags.keys())
rect = fig.patch
rect.set_facecolor('white')
plt.colorbar()
plt.show()
The heatmap above displays how often the most common hashtags appear in every community. The brighter the color, the more prevalent that hashtag is in the tweets from that community, and by extension among its users. This gives us an idea about the ideas or opinions of these users. The data is a little sparse for most communities, since they are small and there are not that many hashtags, but we can see some interesting patterns emerge in the ones that do have data. Communities 16 and 17 mention #guncontrol, #gunsense, #guncontrolnow, #gunviolence and #stopgunviolence, which are all hashtags related to the camp that wants to limit guns in America. Clusters 3, 5 and 19 mention #texas in combination with news outlets #kens5eyewitness, #ksatnews and #usatoday, and seem neutral. There are also a number of more Republican-oriented hashtags floating around, #trump, #maga (Make America Great Again), #2a (the 2nd Amendment, which protects gun owners), #tcot (top conservatives on Twitter), #msm (mainstream media) and #nra (National Rifle Association, a lobby group for gun carry rights), which seem a little more used by clusters 12, 13, 9, 10, 5 and 6.
Let's combine this data with the sentiment analysis we have derived from the deep-learning emoji analysis.
As we can see above, the use of a hashtag can be fairly ambiguous. People can use a certain hashtag and be either for or against it, or use the hashtag in a sarcastic or ironic way. To add a bit more context, we thought it would be interesting to look at what sentiments or thoughts are associated with the hashtags in each community. For this, we used DeepMoji [2] again. These researchers from MIT, among other universities, have constructed a way to train neural networks on text with emojis that lets them predict a series of emojis for a sentence. The project is freely available on GitHub, including the pretrained models, which can quite articulately describe the sentiment or feeling of a piece of text. We decided to correlate the hashtags used in every community with the emojis returned from DeepMoji to get a more refined image of the opinions that are prevalent in these communities.
Below, the results of this correlation are displayed in an emoji/hashtag matrix for the communities. The rows represent the different communities in the network and the columns the most common hashtags in these networks. In the cells, the most common emojis found through DeepMoji prediction are displayed. The smallest communities and some trivial hashtags have been left out for clarity.
# get emojis per community similarly as we did for the hashtags
def emoji_hastag_analysis(partition, graph_name, hashtags, threshold=25):
components_inv = inverse_partition(partition)
# store data as { part_id : { hashtag : Counter({ emoji : count})}}
components_emoji = defaultdict(lambda: defaultdict(Counter))
for part_id, usernames in components_inv.iteritems():
tweets = tweet_collection.find({
'username': {
'$in': usernames
},
'hashtags': {
'$ne': '',
'$nin': map(lambda s: '#' + s, query_hashtags)
},
'deepmoji': {
'$exists': True
}
},
{
'hashtags': 1,
'deepmoji': 1
})
        # filter the query hashtags out of each tweet's hashtags
for row in tweets:
tags = [tag for tag in row['hashtags'].lower().replace('#', '').split(' ') if tag not in query_hashtags]
# store emojis associated with hashtags
emojis = [emo for k, emo in row['deepmoji'].iteritems() if 'Emoji' in k]
for tag in tags:
components_emoji[part_id][tag].update(emojis)
# import emoji converting dictionary
import emoji
emoji_index = {}
with open('ressources/emoji.txt') as f:
counter = 0
for line in f: # for every line
contents = [x.strip() for x in line.split(',')] # split line into 2
emoji_index[counter] = contents # contents = [name of emoji, url to emoji photo]
counter += 1
emoji_matrix = []
for part in components_emoji:
# only consider larger communities
        if globals()['{}_com_count'.format(graph_name)][part] > threshold:  # community size must exceed the threshold
tag_emoji = {
'Partition': part,
'1 Partition size': mention_com_count[part]
}
            # only look at politically charged hashtags
for tag in hashtags:
emojis = []
for item in components_emoji[part][tag].most_common(5):
emojis.append(emoji.emojize(emoji_index[item[0]][0], use_aliases=True))
if len(emojis) > 0:
tag_emoji[tag] = ''.join(emojis)
emoji_matrix.append(tag_emoji)
df = pd.DataFrame(emoji_matrix)
df.set_index('Partition', inplace=True)
display(df)
emoji_hastag_analysis(partition_mention, 'mention', in_favor + against)
 | 1 Partition size | 2a | backgroundchecks | guncontrol | guncontrolnow | gunsense | gunviolence | maga | msm | nra | tcot | trump |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Partition | ||||||||||||
1 | 28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 46 | NaN | NaN | NaN | 👏💪👊♥👍 | NaN | 😡😣😢😞😓 | NaN | NaN | NaN | NaN | NaN |
3 | 86 | 👍🔫😄😡👏 | NaN | 👍♥😢💔💟 | 😡👍😢👏💔 | 👍🔫😎😡👏 | NaN | 👍😉😄😜🙏 | NaN | 🔫😄😡😜👍 | 😄👍😉😜😡 | NaN |
4 | 56 | 🔫♥❤💔💟 | NaN | 🔫♥👍❤🎶 | ❤♥🔫💔💟 | ❤♥🔫💔💟 | NaN | NaN | 👍😊👏😳😉 | ❤♥🔫💔💟 | NaN | NaN |
5 | 69 | 👍😈🔫😉😜 | NaN | 😐🔫😳😢😕 | NaN | NaN | NaN | ♥😢💔✌💟 | NaN | NaN | NaN | 😡😠😤🔫😈 |
6 | 83 | 👍👊😉😄💪 | NaN | 😡😪😤✋😣 | 💟♥👍❤💙 | NaN | NaN | 👍💟😜♥👊 | 😜👍💟♥👊 | 👍😉😜♥💟 | ♥😉🎶💔💟 | 👍😪👏😣😓 |
8 | 45 | ♥💪👊✌👍 | NaN | ✨😄💜👍😊 | NaN | NaN | NaN | 😡😒😑😤😠 | NaN | NaN | 😡♥👊😠💟 | NaN |
9 | 270 | 👍🔫😡💪👊 | NaN | 😡👍😠😤😢 | 😡👍😠😢😤 | 👍😡😢👏💔 | 👍😡😢💔💟 | 😡😢😠💔😷 | 😡😬👍😜👏 | 👍😡♥😢🔫 | NaN | 😡😠👍😈😑 |
10 | 56 | NaN | NaN | 😡😠👍🔫💔 | NaN | 😡🔫👍😑🙏 | NaN | 👍♥😡🔫🎶 | NaN | 😡🔫😠💔😑 | NaN | 👍🎶😎🎧♥ |
12 | 37 | ♥👍💔✨😡 | NaN | ♥👍😡💟😠 | NaN | ♥✨👍💔💟 | ♥👍😄✌💟 | 👏👍👊🔫😎 | 👍😉♥👏🙏 | 😡😠😢😔🔫 | ♥✨👍💔💟 | 👏👍👊🔫😎 |
13 | 46 | 😡😠😉👊👍 | NaN | 😡😠🔫💔😉 | 👏👍😉😡😠 | 😡🔫😑💔😠 | NaN | 🔫🙏♥✌😡 | NaN | 😂👍🙌👏💟 | NaN | NaN |
15 | 43 | NaN | NaN | NaN | NaN | NaN | NaN | 😡🙌🔫😢🙏 | ♥👍🙏💔💟 | 😐🔫😑😅😓 | NaN | 👍👏💟♥💪 |
17 | 45 | 🔫♥😡😢💔 | 😡👍😜😠🎶 | 😡😠👍😜♥ | 😜♥😡👍😠 | 😢🔫😡👍💔 | 👍👏💟♥😡 | 😡👍😠👊👏 | NaN | ♥😢🔫💔😜 | ♥🎶💟🙏😜 | 😳🔫😜😐😂 |
From the table above, it is clear that some communities have more unified feelings about certain topics than others. The most homogeneous is community 6, which has either negative or positive emojis for most hashtags. They are big fans of #2a (the Second Amendment) but not so much of #guncontrol. They display rather positive emotions for #maga (Make America Great Again), #msm (mainstream media, as used by the far right), #nra (the National Rifle Association) and #tcot (top conservatives on Twitter). They seem rather divided on #trump, with both clapping and crying emojis. It is interesting that the most connected users in this community are ABC News and CBS News.
Similar emojis appear for community 9, except that an angry emoji appears alongside all the others; maybe the language used in this community carries a lot of anger. Community 2 seems to condemn #gunviolence and to be in favor of #guncontrolnow.
Second, we will do the same thing for the hashtag graph.
This heatmap is a bit sparser, but some interesting things can still be found. Partition 0 has high occurrences of the hashtag #guncontrol, but in combination with #trump, #2a and #nra, which are all very much pro gun carrying. This could mean the hashtag is used in a whole different context, where the people in this community talk about gun control in a negative sense. They also talk about #mentalhealth, which could be a way to divert to a conversation where guns are not the problem, but mental health is. The other communities that display noticeable correlation seem to follow similar patterns or have very neutral hashtags.
# get the hashtags associated with the partitions
hashtag_partition_hashtags = partition_hashtag_analysis(partition_hashtag)
# Heatmap
# get most common hashtags in general
number_of_tags = 22
hashtags_count = Counter([tags for part_id in hashtag_partition_hashtags.itervalues() for tags in part_id])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))
# from the 'most_common_hashtags', manually group hashtags together on political orientation
most_common_tags = [tag for tag in most_common_tags if tag and tag not in ['texas']]
# create a matrix of the counts of the most common hashtags in the communities
heat_array = np.array([[float(counts[tag]) for tag in most_common_tags] for counts in hashtag_partition_hashtags.values()])
# plot heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(len(most_common_tags)), most_common_tags, rotation='vertical')
plt.yticks(range(int(hashtag_size)), hashtag_partition_hashtags.keys())
rect = fig.patch
rect.set_facecolor('white')
plt.colorbar()
plt.show()
Unfortunately, the emoji/hashtag matrix for the hashtag graph is very incoherent: all cells with emojis seem to express contradictory feelings. It is included for completeness. The hashtag network's communities are small, and there was less data for this event in the database to derive the hashtags from.
emoji_hastag_analysis(partition_mention, 'hashtag', most_common_tags, 10)
 | 1 Partition size | 2a | devinpatrickkelley | domesticviolence | fbi | guncontrol | guns | lasvegasmassacre | mentalhealth | nra | sutherlandspringsshooting | sutherlandspringstexas | texasstrong |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Partition | |||||||||||||
0 | 21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ❤♥💟💔✨ |
1 | 28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 😡👍🔫😖😢 | NaN | NaN |
2 | 46 | NaN | 🙈😳😕😬🙊 | 💔🙏❤💟♥ | 😡👀👊😈😠 | NaN | NaN | NaN | 😡😣😢😞😓 | NaN | ❤💔💟♥💙 | 🎧💪👊🎶🔫 | 👍👏🙏😎😉 |
3 | 86 | 👍🔫😄😡👏 | NaN | 😡😢🔫🙏😠 | 😡😢😑😕😠 | 👍♥😢💔💟 | 🔫😡👍👏😖 | ♥☺😉💔💟 | NaN | 🔫😄😡😜👍 | 😢💔♥👍💟 | 😄😡👍👏🔫 | 👍💟🙏♥😢 |
4 | 56 | 🔫♥❤💔💟 | NaN | NaN | NaN | 🔫♥👍❤🎶 | ♥😢😞🙏💔 | NaN | NaN | ❤♥🔫💔💟 | 🎧💪🔫🎶😈 | NaN | NaN |
# utility to export graphs into JSON format to use in d3js
import json, pprint
from networkx.readwrite import json_graph
def convert_network_json(network, directed, degree, name, partition=defaultdict(int)):
if not directed:
# remove double edges
network = nx.Graph(network);
print 'Serializing network with {} edges and {} nodes to {}'.format(len(network.edges()), len(network.nodes()), name)
nodes = [{'id': node, 'degree': degree[node], 'partition': partition[node]} for node in network.nodes()]
links = [{'source': edge[0], 'target': edge[1]} for edge in network.edges()]
serialized = {
'directed': directed,
'nodes': nodes,
'links': links,
'graph': {}
}
s = json.dumps(serialized)
with open('data/{}.json'.format(name), 'w') as outfile:
outfile.write(s)
print 'Done writing', name
convert_network_json(mention_gcc, False, mention_degree_gcc, 'mention_gcc')
convert_network_json(mention_graph, False, mention_degree, 'mention_graph')
convert_network_json(mention_gcc, False, mention_degree, 'mention_partition', partition_mention)
convert_network_json(hashtag_gcc, False, hashtag_degree_gcc, 'hashtag_gcc')
convert_network_json(hashtag_graph, False, hashtag_degree, 'hashtag_graph')
convert_network_json(hashtag_gcc, False, hashtag_degree_gcc, 'hashtag_partition', partition_hashtag)
Serializing network with 1211 edges and 1091 nodes to mention_gcc
Done writing mention_gcc
Serializing network with 1700 edges and 15992 nodes to mention_graph
Done writing mention_graph
Serializing network with 1211 edges and 1091 nodes to mention_partition
Done writing mention_partition
Serializing network with 29828 edges and 718 nodes to hashtag_gcc
Done writing hashtag_gcc
Serializing network with 29894 edges and 15992 nodes to hashtag_graph
Done writing hashtag_graph
Serializing network with 29828 edges and 718 nodes to hashtag_partition
Done writing hashtag_partition
The sentiment analysis part of the report will focus on three different techniques.
In the first part, using the sentiment analysis techniques taught in class, the tweet sentiment will be plotted over time.
In the second part, we will use the WordCloud library to understand the semantics behind the event.
In the third part of the analysis, the DeepMoji library will be used to visualize the sentiments; this will be a (hopefully) fun way of understanding what is happening over time. It is also interesting to see whether the results of these analyses differ in any way.
The first function that needs to be built is one that cleans a raw tweet. Tweets contain a lot of elements that, even though interesting, are not in the scope of a sentiment analysis.
Step 1: A function that cleans a tweet.
def clean_this(raw_tweet):
text = raw_tweet # extract text
text = text.split('http', 1)[0] # remove links
text = text.split('pic.', 1)[0] # remove pictures
text = text.lower() # lower case
text = re.sub(r'(\s)@\w+', r'\1', text) # remove mentions
text = re.sub(r'(\B)#\w+', r'\1', text) # remove hashtags
text = nltk.word_tokenize(text) # tokenize text
text = [token.lower() for token in text if token.isalpha()] # removes punctuation and numbers
text = [word for word in text if word not in stopwords.words('english')] # remove stopwords
text = list(set(text)) # only return unique tokens
return text
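A quick illustration on a made-up tweet (this assumes the imports in Step 3 below have been run and the NLTK punkt and stopwords corpora are available; the order of the returned tokens may vary because duplicates are removed through a set):
print clean_this('Praying for the victims in #SutherlandSprings https://t.co/xyz @CNN')
# something like ['victims', 'praying'] -- the link, hashtag, mention and stopwords are gone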
After cleaning a tweet, we need to calculate its happiness. For that, we will need a data file from MIT that was shown during the course, named Data_Set_S1.txt. But for now, let's assume this data comes in a happy_data matrix.
Step 2: A function that calculates the happiness of a tweet.
def how_happy(tokens):
happinness_counter = 0.0
happy_word_counter = 0
for word in tokens:
if word in happy_words:
happy_word_counter += 1
happinness_counter += happy_data[np.where(happy_words == word)[0][0], 2]
if happy_word_counter == 0:
return happinness_counter
else:
return happinness_counter/happy_word_counter
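And a small usage sketch of the two functions together (again a made-up tweet; it assumes the labMT file has already been loaded into happy_data and happy_words, which only happens in the steps below):
tokens = clean_this('Praying for the victims and their families')
print how_happy(tokens)
# the average labMT happiness score (1-9 scale) over the tokens that appear in the word list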
After this, we need to import all of our libraries, create a connection to our mongoDB database, extract the text file mentioned in Step 2, as well as some other boring stuff.
Step 3: Importing libraries
from pymongo import MongoClient
import pymongo
from nltk.corpus import stopwords
import nltk
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from pprint import pprint
import matplotlib.style as style
import emoji
import seaborn
from collections import Counter
%pylab inline
style.use('fivethirtyeight')
Populating the interactive namespace from numpy and matplotlib
Step 4: The mongoDB connection.
# creating a mongo connection
client = MongoClient("mongodb://socialgraphs:interactions@socialgraphs-shard-00-00-al7cj."
"mongodb.net:27017,socialgraphs-shard-00-01-al7cj.mongodb.net:27017,"
"socialgraphs-shard-00-02-al7cj"
".mongodb.net:27017/test?ssl=true"
"&replicaSet=SocialGraphs-shard-0&authSource=admin")
# Getting the sentiment analysis file and putting it on a matrix
path = 'ressources/Data_Set_S1.txt'
header = ['words', 'hap.rank', 'hap.avg', 'hap.std', 'tw.rank', 'goog.rank', 'nyt_rank', 'lyr_rank']
happy_data = pd.read_table(filepath_or_buffer=path, header=2).as_matrix()
happy_words = happy_data[:, 0]
# Boring database stuff, including fields to return.
db = client.texas
tweet_collection = db.tweetHistory
display_conditions = {"query_criteria": 0, "_id": 0,
"geo": 0, "mentions": 0,
"hashtags": 0, "favorites": 0,
"permalink": 0, "username": 0,
"id": 0}
Before analysing, let's explain some things. How can tweets be plotted over time?
Of course we could plot every tweet, but this would result in a very weird plot that is hard to understand. Our approach was to request the tweets from our database in an ordered fashion: the first tweets to be requested are the oldest ones (closest to the event), and we advance chronologically. Also, we decided to plot the average sentiment of all of the tweets for every hour.
Finally, it is interesting to understand whether the general sentiment is at all related to the most popular tweets. For this, we establish a threshold (only tweets with more than X retweets) and repeat the same procedure.
By doing this, the plot becomes both easier to understand and conceptualise.
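In other words, for every hourly window $h$ we compute (our notation; the popularity threshold used in the code below is pop_tweet = 100 retweets):
$$S_h = \frac{1}{N_h}\sum_{t \in T_h} \mathrm{how\_happy}(t), \qquad S_h^{\mathrm{pop}} = \frac{1}{N_h^{\mathrm{pop}}}\sum_{\substack{t \in T_h \\ \mathrm{retweets}(t) \ge 100}} \mathrm{how\_happy}(t)$$
where $T_h$ is the set of tweets posted in hour $h$, $N_h$ its size, and $N_h^{\mathrm{pop}}$ the number of popular tweets in that hour.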
Step 1: Importing the tweets in a chronological fashion.
db_my_tweets = list(tweet_collection.find({}, display_conditions).sort("date", pymongo.ASCENDING))
Step 2: Preparing the processing.
# create lists to store sentiment values and periods.
sentiment = []
pop_sentiment = []
periods = []
# define important variables for looping
day = 5
hour = 0
happiness = 0
pop_happiness = 0
pop_tweet_counter = 0
tweet_counter = 0
pop_tweet = 100
# create a string to store entire text
text = ''
Step 3: Looping over the tweets
# for every tweet
for idx, t in enumerate(db_my_tweets):
print 'Processing tweet {} / {} \r'.format(idx, len(db_my_tweets)),
tweet_counter += 1
tweet_text = t['text']
clean_text = clean_this(tweet_text)
    text += ' '.join(clean_text) + ' '  # trailing space so consecutive tweets do not merge into one token
happiness += how_happy(clean_text)
if t['retweets'] >= pop_tweet: # if the tweet is 'popular'
pop_tweet_counter += 1
pop_happiness += how_happy(clean_text)
if t['date'].hour != hour: # if the hour in the tweets that are incoming changes...
sentiment.append(happiness / tweet_counter)
if pop_tweet_counter != 0:
pop_sentiment.append(pop_happiness / pop_tweet_counter)
else: # if there are no popular tweets append 'nan'
pop_sentiment.append(float('nan'))
periods.append('{}/11 at {}'.format(day, hour))
# reset counters for next period
happiness = 0
pop_happiness = 0
tweet_counter = 0
pop_tweet_counter = 0
if hour == 0:
day += 1
hour = t['date'].hour
print 'We got a total of {} sentiment windows.'.format(len(sentiment))
We got a total of 315 sentiment windows.
Now that we have the sentiment vector, we can plot the sentiment over time. Note that the periods and some axis labels have disappeared; this is deliberate, in order to increase readability.
Step 4: Plotting the sentiment of all of the tweets and the sentiment of the popular tweets.
x = np.arange(len(sentiment))
style.use('ggplot')
# defining titles and axis names
plt.title('Sentiment Timeline', fontsize=20)
plt.xlabel('Hours after event', fontsize=17)
plt.ylabel('Normalized Sentiment Index', fontsize=17)
# some styling
plt.tick_params(axis='both', which='major', labelsize=12)
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)
# and finally, we plot.
plt.plot(x, sentiment, linewidth=2, label= 'General Sentiment', color='#50514f')
plt.scatter(x, pop_sentiment, linewidth=2, label= 'Popular Tweet Sentiment', color='#f25f5c')
plt.legend(loc=1, prop={'size': 15})
pylab.rcParams['figure.figsize'] = (20, 10)
plt.show()
A couple of things are interesting from this graph:
The main idea of this part of the sentiment analysis is to have a visual representation of the most used words throughout our database.
To accomplish this task, we will use the very handy WordCloud library.
Let's start by importing some much needed libraries
Step 1: Importing Libraries
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import urllib, cStringIO
Step 2: Getting all the text from the database
The idea in this part is to put all of the tweets into one long string called text. But while we do this, we also have to clean these tweets.
This string was built in step 2 of part 3.1.2.
We start by selecting a nice image for the word cloud contour, in this case an image of Texas, which can be found at the link below.
Step 1: Get a nice image
image_path = cStringIO.StringIO(urllib.urlopen('https://i.imgur.com/Bi09JtN.png').read())
texas_mask = np.array(Image.open(image_path))
When querying Twitter for tweets, we used some words related to the event we are analysing. These are obviously going to appear a lot in our database; therefore, using the STOPWORDS set from WordCloud, we can easily exclude them.
Step 2: Avoiding obvious words
stopwords = set(STOPWORDS)
stopwords.add('Sutherland')
stopwords.add('Texas')
stopwords.add('Shooting')
stopwords.add('Springs')
Now that we have all of the elements to plot it, let's finally do it.
Step 3: Plotting everything nicely.
# defining the wordcloud with stopwords and some edgy styling choices.
word_cloud = WordCloud(mask=texas_mask, width=800, height=400,background_color="white", collocations=False,colormap='inferno', stopwords=stopwords).generate(text)
# plot it.
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
A couple of interesting points worth mentioning:
The final part of the sentiment analysis is all about emoji. We started by using a very simple version of DeepMoji, where an "emoji" score was given to each tweet. In our database, each tweet now possesses the field "deepmoji", where we find the 5 most likely emoji that characterize that tweet, together with the "reliability" of each of them.
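As a rough sketch of what such a document looks like (the emoji values are indices into the emoji_index dictionary built in Step 1 below; the reliability key names are our assumption, since only the Emoji_* keys are used in this notebook):
# illustrative shape of the stored field, not actual data
# tweet['deepmoji'] == {
#     'Emoji_1': 45, 'Emoji_2': 12, 'Emoji_3': 3, 'Emoji_4': 60, 'Emoji_5': 31,  # top-5 emoji indices
#     'Pct_1': 0.34, 'Pct_2': 0.18, ...                                          # hypothetical reliability keys
# }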
Step 1: Creating a dictionary of emojis from a txt file.
emoji_index = {}
with open('ressources/emoji.txt') as f:
counter = 0
for line in f: # for every line
contents = [x.strip() for x in line.split(',')] # split line into 2
emoji_index[counter] = contents # contents = [name of emoji, url to emoji photo]
counter += 1
Step 2: A simple example.
# define what we will not need from Mongo
display_conditions = {"query_criteria": 0, "_id": 0,
"geo": 0, "mentions": 0,
"hashtags": 0, "favorites": 0,
"permalink": 0, "username": 0,
"retweets": 0, "id": 0}
tweets = tweet_collection.find({'deepmoji': {'$exists': True}},display_conditions)[43:45]
# for 2 tweets, extract the deepmoji field
for t in tweets:
emoji_list = [t['deepmoji']['Emoji_1'],
t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
print 'Tweet: ', t['text']
print 'Emojis:',
for emoji_number in emoji_list:
print emoji.emojize(emoji_index[emoji_number][0], use_aliases=True),
print '\n'
Tweet: FYI, this is the 2nd mass shooting against people praying in a church in under 3 years. Spoiler: It doesn't stop gun violence. #SutherlandSpring Do your job @SpeakerRyan @SenateMajLdr @POTUS
Emojis: 🔫 😡 🙏 😠 😈
Tweet: This is totally the turning point on #GunControl legislation right? #SutherlandSpring #Texas #OnceAgain #Enough
Emojis: 😡 👍 😠 😳 😬
We can note that DeepMoji makes a pretty accurate characterization of the sentences, not perfect, but accurate enough.
The first idea for the emoji/sentiment analysis is to visualize which emojis are used the most in the whole dataset.
Step 1: Count the occurrences of every emoji
# get emojis in a list
db_my_tweets = tweet_collection.find({'deepmoji': {'$exists': True}},
display_conditions).sort("date", pymongo.ASCENDING)
mega_list = []
for t in db_my_tweets:
emoji_list = [t['deepmoji']['Emoji_1'],
t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
mega_list += emoji_list
# get a counter of that list
counter_ = Counter(mega_list)
labels, values = zip(*counter_.items())
indexes = np.arange(len(labels))
Step 2: A histogram of all of the emojis
plt.barh(labels, values, color=['#50514f', '#f25f5c', '#ffe066', '#247ba0'])
plt.yticks(range(len(labels)),
[emoji_index[i][0][1:-1] for i in range(len(labels))], fontsize=14)
plt.xlabel('Emoji Frequency')
plt.ylabel('Emoji Name')
pylab.rcParams['figure.figsize'] = (20, 15)
plt.title('Most Used Emojis in DataSet')
plt.show()
Step 3: A simpler way to visualize.
# print the top 20 emojis and their frequency
top = 20
top_list = counter_.most_common(top)
print 'The top {} sentiments according to deepmoji:'.format(top)
for i in range(len(top_list)):
item_emoji = top_list[i][0]
item_frequency = top_list[i][1]
print i+1,'.',emoji.emojize(emoji_index[item_emoji][0], use_aliases=True), 'with', item_frequency, 'characterizations.'
The top 20 sentiments according to deepmoji:
1 . ♥ with 17093 characterizations.
2 . 🔫 with 16037 characterizations.
3 . 💔 with 14085 characterizations.
4 . 🙏 with 13855 characterizations.
5 . 👍 with 12353 characterizations.
6 . 💟 with 12049 characterizations.
7 . 😢 with 12015 characterizations.
8 . 😡 with 11016 characterizations.
9 . 😠 with 8095 characterizations.
10 . 😐 with 5371 characterizations.
11 . 😕 with 4528 characterizations.
12 . ✌ with 4367 characterizations.
13 . 😳 with 4258 characterizations.
14 . ❤ with 4069 characterizations.
15 . 😞 with 3591 characterizations.
16 . 😈 with 3469 characterizations.
17 . 😑 with 3354 characterizations.
18 . 💪 with 3152 characterizations.
19 . 👊 with 3115 characterizations.
20 . ✨ with 2952 characterizations.
Some interesting things to note:
Note: Some emojis are not displayed well by OS X; in the emoji_index dictionary you can consult the links for these emoji yourself. (DeepMoji is made with Twitter Emoji.)
The goal of this part of the analysis is to see how the emoji characterization evolves over time. For example, does the characterization of tweets by the 'gun' emoji change over time? If so, how?
Step 1: Call the tweets that we need.
db_my_tweets = tweet_collection.find({'deepmoji': {'$exists': True}},
display_conditions).sort("date", pymongo.ASCENDING)
The next step is kind of rough: basically, we want to build a matrix, called emoji_grid, where the rows are the various possible emoji (64) and the columns are periods of time. Therefore, the element in position [i, j] of the emoji_grid is the normalized number of characterizations by emoji i in period j.
In this case, we will look at the characterizations every 4 hours since the shooting and see how the classification evolves.
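Written out, with $T_j$ the set of tweets in the $j$-th 4-hour window (our notation):
$$\mathrm{emoji\_grid}[i, j] = \frac{\left|\{\text{top-5 DeepMoji predictions equal to emoji } i \text{ over the tweets in } T_j\}\right|}{|T_j|}$$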
Step 2: A matrix that stores the emoji frequency per time period
# start the matrix
emoji_grid = np.zeros((len(emoji_index.values()), 1))
column = np.zeros((len(emoji_index.values()), 1))
# define important variables before loop
absolute_hour = 0
tweets_in_period = 0
periods = []
hours_passed = 0
period_length = 4
# for every tweet
for t in db_my_tweets:
tweets_in_period += 1 # tweet counter for period of time
tweet_hour = t['date'].hour
tweet_day = t['date'].day
    # extract the deepmoji classification
tweet_emoji_list = [t['deepmoji']['Emoji_1'], t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'], t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
for emoji_number in tweet_emoji_list:
column[emoji_number, 0] += 1
# this counter counts the hours that have passed
if tweet_hour != absolute_hour:
hours_passed += 1
absolute_hour = tweet_hour
    # if X hours have passed, append these deepmoji classifications to the master emoji_grid
if hours_passed == period_length:
periods.append('{}/11 at {}'.format(tweet_day, tweet_hour))
emoji_grid = np.hstack((emoji_grid, column / tweets_in_period)) # here we normalize
column = np.zeros((len(emoji_index.values()), 1))
hours_passed = 0
tweets_in_period = 0
emoji_grid = np.delete(emoji_grid, 0, 1) # deletes the redundant first column.
Step 3: Plotting the emoji grid.
# define important variables.
plot_top = 5 # only the most frequent emojis are plotted for simplicity
counter = 0
colors = ['#50514f', '#f25f5c', '#ffe066', '#247ba0', '#70c1b3', '#50514f', '#f25f5c', '#ffe066', '#247ba0', '#70c1b3']
# for every emoji (row of the emoji_grid) that is on the top X, plot it over the periods.
for i in range(emoji_grid.shape[0]):
if i in np.argsort(np.sum(emoji_grid, axis=1))[::-1][:plot_top]:
s = plt.plot(range(emoji_grid.shape[1]), emoji_grid[i, :],
label=emoji_index[i][0][1:-1], linewidth=2, color=colors[counter])
counter += 1
# define titles and axis names
plt.title('Deepmoji Characterization Every {} Hours'.format(period_length), fontsize=20)
plt.xlabel('Time', fontsize=17)
plt.ylabel('Normalized Sentiment Frequency', fontsize=17)
plt.tick_params(axis='both', which='major', labelsize=12)
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)
plt.xticks(range(emoji_grid.shape[1]), periods, rotation='vertical')
pylab.rcParams['figure.figsize'] = (20, 10)
plt.legend()
plt.show()
This figure describes the tweets with emojis over time, using the DeepMoji model. Several things in this graph are interesting and worth describing; let's mention some of them: