EVENT ANLYZER TOOL (Texas Shooting)

Prologue

This part is showing the data collection and declaring of variables used throughout the analysis. By changing these variables and running all the code, you can run your own complete analysis of a different event.

In [115]:
# these words will be used to look for hashtags
query_hashtags = ['sutherland springs' ,'sutherland spring', 'texas church shooting', 'texas shooting', 'texas church massacre', 'church shooting']

# add the concatonated version of these strings for hashtags
query_hashtags += map(lambda s: s.replace(' ', ''), query_hashtags)

Part 1: Getting Started

1.1 Basic Statistics

We start by importing all of the data stored in our MongoDB Atlas Suite.

In [116]:
from pymongo import MongoClient
import twitter, pickle, sys
import pymongo
import numpy as np
import pandas as pd
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', -1)
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (15, 6)

# intialize mongo client on MongoDB Atlas
client = MongoClient("mongodb://socialgraphs:[email protected]:27017,socialgraphs-shard-00-01-al7cj.mongodb.net:27017,socialgraphs-shard-00-02-al7cj.mongodb.net:27017/test?ssl=true&replicaSet=SocialGraphs-shard-0&authSource=admin")
db = client.texas

# access tweet collection
# TODO: only select unique tweets
tweet_collection = db.tweetHistory
myTweets = tweet_collection.find()
Populating the interactive namespace from numpy and matplotlib
/anaconda/lib/python2.7/site-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['f', 'text']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"

Let's start by figuring out how many tweets we have in total.

In [117]:
print 'We have a total of {} tweets.'.format(myTweets.count())
We have a total of 52903 tweets.

Let us figure out some very basic statistics about our dataset.

In [118]:
# define initial values
userSet = set()
totalNumberOfWords = 0.0
totalRetweets = 0.0
totalFavorites = 0.0
differentHashtags = set()

# loop over data
for tweet in myTweets:
    userSet.add(tweet['username'])
    differentHashtags = differentHashtags.union(set(tweet['hashtags'].split()))
    totalNumberOfWords += len(tweet['text'].split())
    totalRetweets += tweet['retweets']
    totalFavorites += tweet['favorites']

# define means
averageLength = totalNumberOfWords /  myTweets.count()
averageRetweets = totalRetweets /  myTweets.count()
averageFavorites = totalFavorites /  myTweets.count()

# print results
print 'There are {} differrent users.'.format(len(userSet))
print 'The average lenght of a tweet is {} words.'.format(averageLength)
print 'A total of {} differrent hashtags are used.'.format(len(differentHashtags))
print 'The average number of retweets: {}'.format(averageRetweets)
print 'The average number of favorites: {}'.format(averageFavorites)
There are 26289 differrent users.
The average lenght of a tweet is 19.2577169537 words.
A total of 7342 differrent hashtags are used.
The average number of retweets: 3.88087632081
The average number of favorites: 8.00228720488

What are the most favorited tweets, the most retweeted tweets and the respective users and times?

We start by preparing our databases by taking advantage of mongo.

In [119]:
# create databases and define number of desired elements
get_top = 5
display_conditions = {"deepmoji": 0, "permalink":0, "id":0,"date":0, "query_criteria":0, "_id":0, "geo":0, "mentions":0, "hashtags":0}
db_by_retweets = tweet_collection.find({}, display_conditions).sort("retweets",pymongo.DESCENDING)[0:get_top]
db_by_favorites = tweet_collection.find({}, display_conditions).sort("favorites",pymongo.DESCENDING)[0:get_top]

            
# a function that takes a cursor and pretty prints it.  
def print_result(database):
    array = []
    for t in database: 
        array.append(t)
    pandas = pd.DataFrame(array)
    display(pandas)

And we print the results.

In [120]:
print 'The most retweeted:'
print_result(db_by_retweets)
print 'The most favorited:'
print_result(db_by_favorites)
The most retweeted:
favorites retweets text username
0 17054 6641 I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. shannonrwatts
1 17054 6641 I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. shannonrwatts
2 20126 5912 The tragedy in Sutherland Springs happened a little over a week ago. Don’t let this fade into the next news cycle. We need gun safety reforms. Now. KamalaHarris
3 13447 5252 It's been only one week since the Texas mass shooting . 42 days since the Vegas mass shooting . 53 days since Puerto Rico lost power and humanitarian crisis began. Time feels off with this much tragedy. sarahkendzior
4 2523 3922 Anyone hear about this from media??? This happened early Saturday morning, before the #TexasChurchShooting Suspected ILLEGAL ALIEN shoots at cars on I-35 with AR style rifle, hits 7 yr old girl in the head, and 4 others. http://www. informationliberation.com/?id=57612 ChristieC733
The most favorited:
favorites retweets text username
0 20126 5912 The tragedy in Sutherland Springs happened a little over a week ago. Don’t let this fade into the next news cycle. We need gun safety reforms. Now. KamalaHarris
1 17054 6641 I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. shannonrwatts
2 17054 6641 I know it’s too early to talk about the mass shooting of schoolchildren today in Northern California, but what about mass shooting of 600 Americans last month in Las Vegas. And what about the Texas church mass shooting ? Thanks in advance to whomever makes these decisions. shannonrwatts
3 13447 5252 It's been only one week since the Texas mass shooting . 42 days since the Vegas mass shooting . 53 days since Puerto Rico lost power and humanitarian crisis began. Time feels off with this much tragedy. sarahkendzior
4 9842 3784 NRA can confirm Stephen Willeford is a member & has been certified as a NRA firearms instructor. #SutherlandSprings http://www. 4029tv.com/article/man-wh o-shot-texas-church-gunman-shares-his-story/13437943 … DLoesch

1.3 Visualizing over time

It's also very interesting to understand the tweets in our database from a chronological point of view.

In [121]:
# get dates and remove seconds for readability purposes
myTweets = tweet_collection.find()
dates = list(set([tweet['date'] for tweet in myTweets]))
no_seconds = [date.replace( minute=0, second=0, microsecond=0) for date in dates] 

# count occurences
counter = dict(Counter(no_seconds))

# prepare plot
x = []
y = []
for element in counter:
    x.append(element)
    y.append(counter[element])

# plot nicely 
plt.title('Number of tweets per date')
plt.ylabel('Number of tweets')
plt.xlabel('Date')
plt.scatter(x, y, c=y, marker='.', s=y)
plt.xlim([min(x), max(x)])
plt.grid()
plt.show()

2 Generating the networks from tweets

From the tweets we collected we are going to generate a number of different networks that we will be using throughout the rest of the analysis. In either network, the nodes are going to be the users that have been tweeting about the event using one of the predefined hashtags.
For the first network, the edges will be constructed through mentions in these tweets. So, when a tweet mentions another user that is also a node in the network, there will be an edge between these two nodes. We will refer to this network as mention_graph.
For the second network, we define the edges between nodes if they share a common hashtag, not including the query hashtags. For example if two tweets from different nodes use the hashtag #GunSense, we will create an edge between them. We will refer to this network as hashtag_graph.

Below we will start creating the networks.

In [122]:
import networkx as nx
from collections import defaultdict
from itertools import combinations

# start by finding all unique usernames in the tweets that have either mentions or hashtags
where = {
    '$or': [
        {
            'mentions': {
                '$ne': ''
        
            },
        },{
            'hashtags': {
                '$ne': ''
            }
        }
    ]
}
usernames = tweet_collection.distinct('username', where)

# create two separate graphs for mention relations and hashtags, one simple, one multi
mention_graph = nx.Graph()
hashtag_graph = nx.Graph()

# add nodes from users that wrote tweets
mention_graph.add_nodes_from(usernames)
hashtag_graph.add_nodes_from(usernames)



# add edges to mention_graph between mentions in tweets 
# get all tweets with their mentions
tweet_mentions = list(tweet_collection.find({'mentions': {'$ne' : '',}}, {'username': 1, 'mentions': 1}))

# define a default dictionary to store the unique mentions used per user as a set
mentions_dict = defaultdict(set)

# populate dict {username: set(mentions)}
for tweet in tweet_mentions:
    # split mentions from string to list
    mentions = map(lambda mention: mention[1:], tweet['mentions'].split(' '))
    # update dict
    mentions_dict[tweet['username']].update(mentions)

# add edges from mentions_dict
for user, mentions in mentions_dict.iteritems():
    for to_node in mentions:
        if mention_graph.has_node(to_node):
            mention_graph.add_edge(user, to_node)
            
# add edges to the hashtag_graph
# get all tweets with hashtags
tweet_hashtags = tweet_collection.find({'entities.hashtags': {'$ne': ''}}, {'username': 1, 'hashtags': 1})

# intialize a defaultdict to track the unique hashtags 
# and how often users are using them
hashtags_dict = defaultdict(lambda: defaultdict(int))

# populate the dict {hashtags: set(usernames)}
for tweet in tweet_hashtags:
    username = tweet['username']
    # list of hashtags
    hashtags = map(lambda tag: tag.replace('#', '').lower(), tweet['hashtags'].split(' '))
    # remove the query_hashtags
    new_tags = list(set(set(hashtags) - set(query_hashtags)))
    if len(new_tags) > 0:
        for tag in new_tags:
            if tag:
                hashtags_dict[tag][username] += 1
                
# add edges between all users with same hashtag if they have used it more than once
for tag, userdict in hashtags_dict.iteritems():
    # find users who used the tag more than twice
    usernames = [username for username, count in userdict.iteritems() if count > 2]
    # create tuples of possible combinations of nodes
    sets = combinations(usernames, 2)
    # add edges
    for combi in sets:
        hashtag_graph.add_edge(*combi, atrr=tag)

2.1 Basic stats on the networks

In [123]:
print 'Mention Graph:'
print ' - Number of nodes:', len(mention_graph.nodes())
print ' - Number of edges:', len(mention_graph.edges())
print ' - Average degree:', float(sum(nx.degree(mention_graph).values())) / len(mention_graph.nodes())
print 'Hashtag Graph:'
print ' - Number of nodes:', len(hashtag_graph.nodes())
print ' - Number of edges:', len(hashtag_graph.edges())
print ' - Average degree:', float(sum(nx.degree(hashtag_graph).values())) / len(hashtag_graph.nodes())
Mention Graph:
 - Number of nodes: 15992
 - Number of edges: 1700
 - Average degree: 0.212606303152
Hashtag Graph:
 - Number of nodes: 15992
 - Number of edges: 29894
 - Average degree: 3.73861930965

2.2 Degree distribution

In [124]:
plt.style.use('fivethirtyeight')

# get degree distributions
mention_degree = nx.degree(mention_graph)
hashtag_degree = nx.degree(hashtag_graph)

# get minumum and maximum degrees
min_mention_degree, max_mention_degree = min(mention_degree.values()), max(mention_degree.values())
min_hashtag_degree, max_hashtag_degree = min(hashtag_degree.values()), max(hashtag_degree.values())

# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.hist(sorted(mention_degree.values(),reverse=True), range(min_mention_degree, max_mention_degree + 1)) # degree sequence

plt.subplot(212)
plt.title('Hahstag graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
plt.hist(sorted(hashtag_degree.values(),reverse=True), range(min_hashtag_degree, max_hashtag_degree + 1)) # degree sequence
plt.show()

2.3 The GCC

Plotting the size of components

In [125]:
# get all the separate components
components_mention = sorted(nx.connected_component_subgraphs(mention_graph), key=len, reverse=True)
components_hashtag = sorted(nx.connected_component_subgraphs(hashtag_graph), key=len, reverse=True)

print 'The mention graph has {0} disconnected components'.format(len(components_mention))
print 'The hashtag graph has {0} disconnected components'.format(len(components_hashtag))

plt.figure()

plt.subplot(211)
mention_component_lengths = [len(c) for c in components_mention]
plt.yscale('log', nonposy='clip')
plt.title('Mention graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
max_mcl = max(mention_component_lengths)
plt.hist(mention_component_lengths, range(max_mcl + 1))

plt.subplot(212)
plt.yscale('log', nonposy='clip')
plt.title('Hashtag graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
hashtag_component_lengths = [len(c) for c in components_hashtag]
max_hcl = max(hashtag_component_lengths)
plt.hist(hashtag_component_lengths, range(max_hcl + 1))

plt.tight_layout()
The mention graph has 14501 disconnected components
The hashtag graph has 15245 disconnected components

Examining the GCC

Since the main graph is so disconnecte, we decide to only work with the GCC of both graphs. This allows us to perform more in depth analysis.

In [126]:
# get the giant connected component for both graphs
mention_gcc, hashtag_gcc = components_mention[0], components_hashtag[0]

print 'Mention GCC'
print ' - Number of nodes:', len(mention_gcc.nodes())
print ' - Number of edges:', len(mention_gcc.edges())
print ' - Average degree:', float(sum(nx.degree(mention_gcc).values())) / len(mention_gcc.nodes())
print 'Hashtag GCC:'
print ' - Number of nodes:', len(hashtag_gcc.nodes())
print ' - Number of edges:', len(hashtag_gcc.edges())
print ' - Average degree:', float(sum(nx.degree(hashtag_gcc).values())) / len(hashtag_gcc.nodes())

# draw the graphs
nx.draw_networkx(mention_gcc, pos=nx.spring_layout(mention_gcc), node_size=[v * 100 for v in mention_degree.values()], with_labels=False)
plt.title('Mention GCC')
plt.show()

nx.draw_networkx(hashtag_gcc, pos=nx.spring_layout(hashtag_gcc), node_size=[v * 0.1 for v in hashtag_degree.values()], with_labels=False)
plt.title('Hashtag GCC')
plt.show()
Mention GCC
 - Number of nodes: 1091
 - Number of edges: 1211
 - Average degree: 2.21998166819
Hashtag GCC:
 - Number of nodes: 718
 - Number of edges: 29828
 - Average degree: 83.0863509749

GCC degree distribution

Since we are now only looking at the GCC of both graphs, we run the degree distribution again. This time we have no nodes without edges. The shapes are, however, remarkably similar the the full graphs. The distributions are plotted on a logarithmic scale so we can easily see whether the degrees follow a power law distribution. We can see that the mention_graph still looks logarithmic, so this seems to be an extreme distribution of a couple of highly connected nodes and a lot of poorly connected nodes. The hashtag_graph seems to follow a more linear relation now that one scale is logarithmic.

In [127]:
mention_degree_gcc = nx.degree(mention_gcc)
hashtag_degree_gcc = nx.degree(hashtag_gcc)

# get minumum and maximum degrees
max_mention_gcc_degree = max(mention_degree_gcc.values())
max_hashtag_gcc_degree = max(hashtag_degree_gcc.values())

# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention GCC degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.hist(sorted(mention_degree_gcc.values(),reverse=True), range(max_mention_gcc_degree + 1)) # degree sequence

plt.subplot(212)
plt.title('Hahstag GCC degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
plt.hist(sorted(hashtag_degree_gcc.values(),reverse=True), range( max_hashtag_gcc_degree + 1)) # degree sequence

plt.tight_layout()

3.1 Community detection

The next step in our analysis is to define communities in our network and see what these communities revolve around.First, we will look into the sizes of the communities and the biggest accounts in the biggest communities, to get a sense for the kind of accounts we find. Then, we will look into the most common hashtags used in every community in the mention graph, to get a feeling for the topics that live in every community.

We use the Louvain method [1] for community detection with the following implementation in Python.

In [128]:
import community

# use the python Louvain implementation to find communities in the networks
partition_mention = community.best_partition(mention_gcc)
partition_hashtag = community.best_partition(hashtag_gcc)


#drawing
mention_size = float(len(set(partition_mention.values())))
pos = nx.spring_layout(mention_gcc)
count = 0.
for com in set(partition_mention.values()) :
    count = count + 1.
    list_nodes = [nodes for nodes in partition_mention.keys()
                                if partition_mention[nodes] == com]
    nx.draw_networkx_nodes(mention_gcc, pos, list_nodes, node_size = 20,
                                node_color = str(count / mention_size))

print 'For the mention GCC we have found {} communities'.format(int(mention_size))
nx.draw_networkx_edges(mention_gcc,pos, alpha=0.5)
plt.show()

hashtag_size = float(len(set(partition_hashtag.values())))
pos = nx.spring_layout(hashtag_gcc)
count = 0.
for com in set(partition_hashtag.values()) :
    count = count + 1.
    list_nodes = [nodes for nodes in partition_hashtag.keys()
                                if partition_hashtag[nodes] == com]
    nx.draw_networkx_nodes(hashtag_gcc, pos, list_nodes, node_size = 20,
                                node_color = str(count / hashtag_size))

print 'For the hashtag GCC we have found {} communities'.format(int(hashtag_size))
nx.draw_networkx_edges(hashtag_gcc,pos, alpha=0.5)
plt.show()
For the mention GCC we have found 25 communities
For the hashtag GCC we have found 10 communities

3.1 Top Accounts in Communities

We can see that for the mention graph, the communities have for the biggest part centred themselves around major news outlets. We see @FoxNews and @ABCNews, but also local news stations as @dallasnews and their reporters, like @lmcgaughy. For the hashtag graph, seem a bit more random. We do, however recognize the twitter accounts of Sputnik News, a Russion state controlled media outlet, linked to fake news on multiple occasions, and marypatriotnews.com which is a hyper conservative outlet to say the least.

It is interesting to see that the highest degree nodes in the hashtag_graph's partitions are not necessarily accounts with many followers. The links in this network are much more 'democratic', where anyone who uses a lot of prevalent hashtags can becomes a well connected node in the graph. This is different from the mention_graph where a user only gets mentioned a lot if he is well known and thus likely to have many followers.

In [129]:
# look at accounts in each partition with highest degree

# twitter api credentials for lookup
CONSUMER_KEY='29JF8nm5HttFcbwyNXkIq8S5b'
CONSUMER_SECRET='szo1IuEuyHuHCnh93VjLLGb5xg9NcfDVqMsLtOt3DbL5hXxpbt'
OAUTH_TOKEN='575782797-w96NPIzKF07TpC3c78nEadEfACLclYvSusuOPl8z'
OAUTH_TOKEN_SECRET='h0oitwxLkDjOLSejSQl2AWSrcmNeUwBpEvSUWonYzZTNz'

# instantiate API object
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api= twitter.Twitter(auth=auth)

# uxilirary function
def inverse_partition(partition):
    components_inv = defaultdict(list)
    for key, value in partition.iteritems():
        components_inv[value].append(key)
    return components_inv

# get top accounts by degree
def partition_top_accounts(partition, degree):
    part_inv = inverse_partition(partition)
    return {part: max(usernames, key=lambda user: degree[user]) for part, usernames in part_inv.iteritems()}

# get data on account
def twitter_account(username):
    return twitter_api.users.lookup(screen_name=username)

    
# display data in dataframe
def pprint_partition_overview(partition, degree, outfile=None):
    data = []
    columns = ['Partition', 'Partition Size', 'Screen Name', 'Name', 'Url', 'Location', 'Followers', 'Degree']
    top_accounts = partition_top_accounts(partition, degree)
    for part_id, account in top_accounts.iteritems():
        user = twitter_account(account)[0]
#         print pprint( user)
        url = ''
        try:
            url = user['entities']['url']['urls'][0]['display_url']
        except:
            pass
        row = {
            'Partition': part_id,
            'Partition Size': len(inverse_partition(partition)[part_id]),
            'Screen Name': account,
            'Name': user['name'],
            'Url': url,
            'Location': user['location'],
            'Followers': user['followers_count'],
            'Degree': degree[account]
        }
        data.append(row)
    data.sort(key=lambda row: row['Partition Size'], reverse=True)
    df = pd.DataFrame(data)
    df = df[columns]
    display(df)
    if outfile:
        serialized = json.dumps(data)
        with open('data/{}'.format(outfile), 'w') as ofile:
            ofile.write(serialized);
    

print 'The mention graph partitons with an overview of the accounts with the highest degrees'
pprint_partition_overview(partition_mention, mention_degree_gcc, 'mention_partition_accounts.json')
pprint_partition_overview(partition_hashtag, hashtag_degree_gcc, 'hashtag_partition_accounts.json')
The mention graph partitons with an overview of the accounts with the highest degrees
Partition Partition Size Screen Name Name Url Location Followers Degree
0 9 270 FoxNews Fox News foxnews.com U.S.A. 16589039 288
1 3 86 JohnCornyn Senator JohnCornyn cornyn.senate.gov Austin, Texas 120492 53
2 6 83 ABC ABC News ABCNews.com New York City / Worldwide 12986622 58
3 5 69 USATODAY USA TODAY usatoday.com USA TODAY HQ, McLean, Va. 3529784 54
4 4 56 lmcgaughy Lauren McGaughy dallasnews.com/author/lauren-… Austin, TX 10613 35
5 10 56 AP The Associated Press apnews.com Global 12058706 53
6 2 46 usairforce U.S. Air Force af.mil Air, Space and Cyberspace 886224 46
7 13 46 DLoesch Dana Loesch amazon.com/Flyover-Nation… Texas, USA 674971 14
8 8 45 chelseahandler Chelsea Handler chelseahandler.com Los Angeles, CA 8343721 23
9 17 45 Everytown Everytown Everytown.org 114979 19
10 15 43 RealAlexJones Alex Jones infowars.com Austin, TX 754784 16
11 12 37 scrowder Steven Crowder louderwithcrowder.com Ghostlike 491141 25
12 1 28 ExpressNews San Antonio E-N ExpressNews.com San Antonio, TX 19660 8
13 7 23 abcnews ABC News abc.net.au/news Australia 1414667 11
14 11 23 RNS Religion News Service religionnews.com DC, NYC, London, Rome 73982 13
15 0 21 KHOU KHOU 11 News Houston khou.com Houston, TX 659018 13
16 19 20 KENS5 KENS 5 kens5.com San Antonio, Texas 130446 12
17 18 19 YahooNews Yahoo News yahoo.com/news/ New York City 1130694 17
18 16 17 israelnash Israel Nash twitter.com/israelnash Dripping Springs, TX 2453 8
19 20 12 foxandfriends FOX & friends foxandfriends.com New York City 1084193 11
20 22 12 NewsHour PBS NewsHour pbs.org/newshour/ Arlington, VA | New York, NY 983682 8
21 24 12 NRO National Review NationalReview.com New York 271442 11
22 21 8 ChrisCuomo Christopher C. Cuomo ChrisCuomo.com In the Arena 1271167 7
23 14 7 InsideEdition Inside Edition insideedition.com New York 72604 4
24 23 7 ABCWorldNews World News Tonight abcnews.com/wnt New York 1277571 6
Partition Partition Size Screen Name Name Url Location Followers Degree
0 0 279 BigGator5 BigGator5 biggator5.net/about/twitter-… Lake County, Florida 5876 343
1 1 192 Adrian_Rafael Adrian R. Morales 1239 268
2 4 84 MacChomhghaill McChomhghaill TrumpUnifies.tk Northern California, USA 4361 362
3 2 79 TrendingNewsTV Trending Newscast tn.dvolatility.com Metro Detroit, MI 205 78
4 3 60 Ms1Scs #DeepStateSwampDrain USA 6760 93
5 9 6 Johnathin79 Lock'm ALL Up! 6851 7
6 5 5 SputnikInt Sputnik sputniknews.com 205091 11
7 7 5 Expose_The_Lies ExposeTheLies facebook.com/ExposeTheLies 97 6
8 8 5 PrgrsvArchitect ProgressiveArchitect Tucson, AZ USA 1015 7
9 6 3 MaryPatriotNews Mary Budesheim marypatriotnews.com Glens Falls, NY 13642 4

Comunnity Sizes

In [130]:
from collections import Counter

# commnity histogram
hashtag_com_count = Counter(partition_hashtag.values())
mention_com_count = Counter(partition_mention.values())

plt.figure()

plt.subplot(211)
plt.title('Sizes of hashtag communities')
plt.xlabel('Community number')
plt.ylabel('Number of nodes')
plt.bar(hashtag_com_count.keys(), hashtag_com_count.values())

plt.subplot(212)
plt.title('Sizes of mention communities')
plt.xlabel('Community number')
plt.ylabel('Number of nodes')
plt.bar(mention_com_count.keys(), mention_com_count.values())

plt.tight_layout()

3.3 Community Hashtags for Mention Graph

To drill down further on what plays in the communities found in either network, we will look at the hashtags are used in these communities and how they relate to each other. A heatmap shows the number of occurrences of the top hashtags in each community. The brighter the color, the more prevalent that hashtag is in the tweets from that community and by extension from the users. This gives us an idea about the ideas or opinions of these users. The data is a little sparse for some communities since they are small and there are not that many hashtags, but we can see some interesting patterns emerge in the ones that do have data.

In [131]:
import matplotlib.style
import matplotlib as mpl
mpl.style.use('classic')



def partition_hashtag_analysis(partition):
    # inverse the partitioning to get a dict with { partitioning_id : [usernames]}
    components_inv = inverse_partition(partition)
    # get all hashtags used by users in combination with our query_hashtags
    components_hashtags = defaultdict(list)
    for part_id, usernames in components_inv.iteritems():
        tweets = tweet_collection.find({
            'username': {
                '$in': usernames
            }, 
            'hashtags': {
                '$ne': '', 
                '$nin': map(lambda s: '#' + s, query_hashtags)
            },
        },
        {
            'hashtags': 1
        })
        # filter query hashtags hashtags 
        for row in tweets:
            tags = [tag for tag in row['hashtags'].lower().replace('#', '').split(' ') if tag not in query_hashtags]
            components_hashtags[part_id] += tags

    part_tag_counts = {}
    for part_id, tags in components_hashtags.iteritems():
        counts = Counter(tags)
        part_tag_counts[part_id] = counts
    return part_tag_counts

mention_com_hashtags = partition_hashtag_analysis(partition_mention)

# Heatmap
# get most common hashtags in general
number_of_tags = 25
hashtags_count = Counter([tags for part_id in mention_com_hashtags.itervalues() for tags in part_id])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))

# from the 'most_common_hashtags', manually group hashtags together on political orientation
# against gun carry
against = ['gunsense', 'guncontrol', 'guncontrolnow','gunviolence', 'stopgunviolence','backgroundchecks']
# generally in favor of gun carry 
in_favor = ['trump', 'maga', '2a','tcot', 'nra', 'msm']
# neutral
most_common_tags = against + in_favor + ['texas', 'sutherlandspringsshooting', 'lasvegasshooting',  'texasstrong',  'ksatnews', 'usatoday', 'kens5eyewitness', 'church', 'shooting', 'firstbaptistchurch', 'gun', 'airforce']

# create a matrix of the count of use of the most commn hashatgs in the communities
heat_array = np.array([[counts[tag] for tag in most_common_tags] for counts in mention_com_hashtags.values()])

# plot heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(number_of_tags), most_common_tags, rotation='vertical')
plt.yticks(range(int(mention_size)), mention_com_hashtags.keys())

rect = fig.patch
rect.set_facecolor('white')
plt.colorbar()
plt.show()

The heatmap above displays how often the most common hashtags appear in every community. The brighter the color, the more prevalent that hashtag is in the tweets from that community and by extension from the users. This gives us an idea about the ideas or opinions of these users. The data is a little sparse for most communities since they are small and there are not that many hashtags, but we can see some interesting patterns emerge in the ones that do have data. We have communities 16 and 17 mentioning #guncontrol #gunsense #guncontrolnow #gunviolence and #stopgunviolence, which are all hashtags related to the camp that wants to limit guns in America. Clusters 3 and 5 and 19 mention in #texas combination with news outlets #kens5eyewitness, #ksatnews and #usatoday, seemingly neutral. There are a bunch of more republican oriented hashtags floating around, #trump, #maga (Make America Great Again), #2a (2nd amendment which protects gun owners), #tcot (top conservatives on twitter), #msm (mainstream media) and #nra (national rifle association - lobby group for gun carry rights), which seem a little more used by the clusters 12, 13, 9, 10, 5 and 6.

Let's combine this data with the sentiment analyis we have derived from the deep learning emoji analysis.

3.4 Hashtag Emoji Analysis for Mention Graph

As we can see above, the use of a hashtag can be fairly ambiguous. People can use a certain hashtag and be either pro or against it, or use the hastag in a sarcastic or ironic way. To add a bit more context we thought it would be interesting to look at what sentiments or thoughts are associated with the hashtags in each community. For this, we used DeepMoji [2] again. These researchers from MIT, among other universities, have constructed a way to train neural networks on text with emojis that let them predict a series of emojis from a sentence. The project is freely available on github including the pretrained models, which can quite articulately describe the sentiment or feeling of that piece of text. We decided to correlate the hashtags used in every community with the emojis returned from DeepMoji to get a more refined image of the opinions that are prevalent in these communities.

Below, the results of this correlation are displayed in a emoji / hashtag matrix for the communities. De rows represent the different communities in the networks and the columns the most common hashtags in these networks. In the cells the most common emojis found through DeepMoji prediction are displayed. The smallest communities and some trivial hashtags have been left out for clarity.

In [132]:
# get emojis per community similarly as we did for the hashtags

def emoji_hastag_analysis(partition, graph_name, hashtags, threshold=25):
    components_inv = inverse_partition(partition)
    # store data as { part_id : { hashtag : Counter({ emoji : count})}}
    components_emoji = defaultdict(lambda: defaultdict(Counter))
    for part_id, usernames in components_inv.iteritems():
        tweets = tweet_collection.find({
            'username': {
                '$in': usernames
            }, 
            'hashtags': {
                '$ne': '', 
                '$nin': map(lambda s: '#' + s, query_hashtags)
            },
            'deepmoji': {
                '$exists': True
            }
        },
        {
            'hashtags': 1,
            'deepmoji': 1
        })
        # filter query hashtags hashtags 
        for row in tweets:
            tags = [tag for tag in row['hashtags'].lower().replace('#', '').split(' ') if tag not in query_hashtags]
            # store emojis associated with hashtags
            emojis = [emo for k, emo in row['deepmoji'].iteritems() if 'Emoji' in k]
            for tag in tags:
                components_emoji[part_id][tag].update(emojis)

    # import emoji converting dictionary
    import emoji
    emoji_index = {}
    with open('ressources/emoji.txt') as f:
        counter = 0
        for line in f: # for every line
            contents = [x.strip() for x in line.split(',')] # split line into 2
            emoji_index[counter] = contents # contents = [name of emoji, url to emoji photo]
            counter += 1

    emoji_matrix = []

    for part in components_emoji: 
        # only consider larger communities
        if sum(globals()['{}_com_count'.format(graph_name)][part]) > threshold:
            tag_emoji = {
                'Partition': part,
                '1 Partition size': mention_com_count[part]
            }
            # only look at politically charges hastags         
            for tag in hashtags:
                emojis = []
                for item in components_emoji[part][tag].most_common(5):
                    emojis.append(emoji.emojize(emoji_index[item[0]][0], use_aliases=True))
                if len(emojis) > 0:
                    tag_emoji[tag] = ''.join(emojis)
            emoji_matrix.append(tag_emoji)

    df = pd.DataFrame(emoji_matrix)
    df.set_index('Partition', inplace=True)
    display(df)

emoji_hastag_analysis(partition_mention, 'mention', in_favor + against)
1 Partition size 2a backgroundchecks guncontrol guncontrolnow gunsense gunviolence maga msm nra tcot trump
Partition
1 28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 46 NaN NaN NaN 👏💪👊♥👍 NaN 😡😣😢😞😓 NaN NaN NaN NaN NaN
3 86 👍🔫😄😡👏 NaN 👍♥😢💔💟 😡👍😢👏💔 👍🔫😎😡👏 NaN 👍😉😄😜🙏 NaN 🔫😄😡😜👍 😄👍😉😜😡 NaN
4 56 🔫♥❤💔💟 NaN 🔫♥👍❤🎶 ❤♥🔫💔💟 ❤♥🔫💔💟 NaN NaN 👍😊👏😳😉 ❤♥🔫💔💟 NaN NaN
5 69 👍😈🔫😉😜 NaN 😐🔫😳😢😕 NaN NaN NaN ♥😢💔✌💟 NaN NaN NaN 😡😠😤🔫😈
6 83 👍👊😉😄💪 NaN 😡😪😤✋😣 💟♥👍❤💙 NaN NaN 👍💟😜♥👊 😜👍💟♥👊 👍😉😜♥💟 ♥😉🎶💔💟 👍😪👏😣😓
8 45 ♥💪👊✌👍 NaN ✨😄💜👍😊 NaN NaN NaN 😡😒😑😤😠 NaN NaN 😡♥👊😠💟 NaN
9 270 👍🔫😡💪👊 NaN 😡👍😠😤😢 😡👍😠😢😤 👍😡😢👏💔 👍😡😢💔💟 😡😢😠💔😷 😡😬👍😜👏 👍😡♥😢🔫 NaN 😡😠👍😈😑
10 56 NaN NaN 😡😠👍🔫💔 NaN 😡🔫👍😑🙏 NaN 👍♥😡🔫🎶 NaN 😡🔫😠💔😑 NaN 👍🎶😎🎧♥
12 37 ♥👍💔✨😡 NaN ♥👍😡💟😠 NaN ♥✨👍💔💟 ♥👍😄✌💟 👏👍👊🔫😎 👍😉♥👏🙏 😡😠😢😔🔫 ♥✨👍💔💟 👏👍👊🔫😎
13 46 😡😠😉👊👍 NaN 😡😠🔫💔😉 👏👍😉😡😠 😡🔫😑💔😠 NaN 🔫🙏♥✌😡 NaN 😂👍🙌👏💟 NaN NaN
15 43 NaN NaN NaN NaN NaN NaN 😡🙌🔫😢🙏 ♥👍🙏💔💟 😐🔫😑😅😓 NaN 👍👏💟♥💪
17 45 🔫♥😡😢💔 😡👍😜😠🎶 😡😠👍😜♥ 😜♥😡👍😠 😢🔫😡👍💔 👍👏💟♥😡 😡👍😠👊👏 NaN ♥😢🔫💔😜 ♥🎶💟🙏😜 😳🔫😜😐😂

From the table above, it is clear that some communities have more unified feelings about certain topics than others. Most homogenic would be community 6, which has either negative or positive emojis for most hashtags. They are big fans of #2a (second amendment) but not so much of #guncontrol. They display rather positive emotions for #maga (make america great again), #msm (mainstram media, but used by far right), #nra (the national rifle association) and #tcot (top conservative on twitter). They seem rather divided on #trump with both clapping and crying emojis. Interesting to see that the most connected users in this community are ABC News and CBS News.

Similar emojis appear for community 9, except that an angry emoji appears alongside all other ones. Maybe the language used in this community has a lot of anger in it. Community 2 seems to condone #gunviolence and in favor of #guncontrolnow.

3.5 Community Hastag Analysis for Hashtag Graph

Second, we will do the same thing for the hashtag graph.

This heatmap is bit sparser, but some interesting things can be found still. Partition 0 has high occurences of the hashtags #guncontrol, but in combination with #trump, #2a and #nra, which are all very much pro gun carrying. This could mean the hashtag is used within a whole different context, where the people in this community talk about gun control in a negative sense. Also, they talk about #mentalhealth, which could be a way to divert to a conversation where guns are not the problem, but mental health. The other communities that display noticable correlation seem to follow similar patterns or have very neutral hashtags.

In [133]:
# get the hashtags associated with the partitions
hashtag_partition_hashtags = partition_hashtag_analysis(partition_hashtag)

# Heatmap
# get most common hashtags in general
number_of_tags = 22
hashtags_count = Counter([tags for part_id in hashtag_partition_hashtags.itervalues() for tags in part_id])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))

# from the 'most_common_hashtags', manually group hashtags together on political orientation
most_common_tags = [tag for tag in most_common_tags if tag and tag not in ['texas']]

# create a matrix of the count of use of the most commn hashatgs in the communities
heat_array = np.array([[float(counts[tag])  for tag in most_common_tags] for counts in hashtag_partition_hashtags.values()])

# plot heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(len(most_common_tags)), most_common_tags, rotation='vertical')
plt.yticks(range(int(hashtag_size)), hashtag_partition_hashtags.keys())

rect = fig.patch
rect.set_facecolor('white')
plt.colorbar()
plt.show()

3.6 Hashtag Emoji Analysis for Hashtag Graph

Unfortunately, the emoji / hashtag matrix for the hashtag graph is very incoherent. All cells with emojis seem to express contradicting feelings. It is included for completeness. The hashtag network's communities are small and there was less data for this event in the data base to find the hashtags from.

In [134]:
emoji_hastag_analysis(partition_mention, 'hashtag', most_common_tags, 10)
1 Partition size 2a devinpatrickkelley domesticviolence fbi guncontrol guns lasvegasmassacre mentalhealth nra sutherlandspringsshooting sutherlandspringstexas texasstrong
Partition
0 21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ❤♥💟💔✨
1 28 NaN NaN NaN NaN NaN NaN NaN NaN NaN 😡👍🔫😖😢 NaN NaN
2 46 NaN 🙈😳😕😬🙊 💔🙏❤💟♥ 😡👀👊😈😠 NaN NaN NaN 😡😣😢😞😓 NaN ❤💔💟♥💙 🎧💪👊🎶🔫 👍👏🙏😎😉
3 86 👍🔫😄😡👏 NaN 😡😢🔫🙏😠 😡😢😑😕😠 👍♥😢💔💟 🔫😡👍👏😖 ♥☺😉💔💟 NaN 🔫😄😡😜👍 😢💔♥👍💟 😄😡👍👏🔫 👍💟🙏♥😢
4 56 🔫♥❤💔💟 NaN NaN NaN 🔫♥👍❤🎶 ♥😢😞🙏💔 NaN NaN ❤♥🔫💔💟 🎧💪🔫🎶😈 NaN NaN
In [ ]:
 

4 Exporting in format for d3.js

In [135]:
# utility to export graphs into JSON format to use in d3js

import json, pprint
from networkx.readwrite import json_graph

def convert_network_json(network, directed, degree, name, partition=defaultdict(int)):
    if not directed:
        # remove double edges
        network = nx.Graph(network);
    print 'Serializing network with {} edges and {} nodes to {}'.format(len(network.edges()), len(network.nodes()), name)
    nodes = [{'id': node, 'degree': degree[node], 'partition': partition[node]} for node in network.nodes()]
    links = [{'source': edge[0], 'target': edge[1]} for edge in network.edges()]
    serialized = {
        'directed': directed,
        'nodes': nodes,
        'links': links,
        'graph': {}
    }
    s = json.dumps(serialized)
    with open('data/{}.json'.format(name), 'w'<