In this project we have taken a systematic approach to analyzing the Las Vegas Shooting. This tragic event occurred on the night of October 1, 2017. A gunman opened fire on a crowd of concertgoers on the Las Vegas Strip in Nevada, leaving 58 people dead and 546 injured. To analyze this event, we collected publicly available Twitter data under the hashtag '#LasVegasShooting'. This hashtag was chosen specifically because it is the biggest hashtag on the topic and it is rather neutral (i.e. people of all viewpoints could easily use it). Hashtags such as #PrayForVegas were not collected for this exact reason.

The focus here was to analyze “alternative narratives” of crisis events. In major events of this kind, alternative media may be used to spread rumors. Conspiracy theories claiming either that the event didn’t happen or that it was perpetrated by someone other than the current suspects spread through these media. By analyzing the publicly available Twitter data, we attempted to gain some insights about the event and the ways media was used to spread information about it.

Data Collection

This part shows the data collection process and the data format. The actual data collection code is not included here. However, given similar data for another event, it is possible to rerun the analysis quite easily. This is why we could easily run a similar analysis for another event, such as the Sutherland Springs Church Shooting (Texas, U.S.A., November 5, 2017).

The official Twitter API imposes some time constraints; for example, it is impossible to get tweets older than a week. Thus, for data collection we used the 'Jefferson-Henrique/GetOldTweets-python' library found in a GitHub repository. This project is written in Python and retrieves old tweets by bypassing some limitations of the official Twitter API.

With this tool we collected all the publicly available tweets under the '#LasVegasShooting' hashtag between 30/09/2017 and 30/11/2017. Remember that the event occurred on the night of 1/10/2017, so the very first tweet we have is from the following day.

In [4]:
# those hashtags will be analyzed
query_hashtags = ['#lasvegasshooting']

Each tweet is stored with several fields of interest. Let's now look at the format of our data. First, let's import the necessary libraries and connect to the database.

In [5]:
from pymongo import MongoClient
import pickle, sys
import pymongo
import numpy as np
import pandas as pd
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', -1)
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (15, 6)

# intialize mongo client on MongoDB Atlas
client = MongoClient("mongodb://socialgraphs:[email protected]:27017,,")
db = client.lasvegas

# access tweet collection
tweet_collection = db.tweetHistory
Populating the interactive namespace from numpy and matplotlib

Below you can see an example tweet as it is stored in our database.

In [6]:
allTweets = list(tweet_collection.find())# A list containing all the tweets
In [7]:
exampleTweet = allTweets[603]#Just an interesting tweet with all fields filled
In [8]:
for field in exampleTweet:    
    print field,':',exampleTweet[field]
username : Luma923
permalink :
query_criteria : #LasVegasShooting
text : @cjgmartell Mmm... 3 yrs ago, 2014 #LasVegasShooting = #falseflag w/2 shooters Las Vegas Shooting Could B False Flag https://www. K85dc&sns=tw …
citations_urls : [u'']
hashtags : [u'#falseflag', u'#lasvegasshooting']
citations_domain_names : [u'']
retweets : 0
favorites : 0
mentions : [u'@cjgmartell']
date : 2017-10-03 02:39:13
deepmoji : {u'Emoji_4': 44, u'Emoji_5': 41, u'Emoji_1': 50, u'Emoji_2': 42, u'Emoji_3': 54, u'Top5%': 0.30533380806446075, u'Pct_5': u'0.0537824', u'Pct_4': u'0.0576868', u'Pct_1': u'0.0686643', u'Pct_3': u'0.062333', u'Pct_2': u'0.0628672'}
_id : 5a25fae5929c3244118029a4
geo : 
id : 914998232505307136

As you can see by following the permanent link (permalink field), this tweet belongs to the user 'Luma923'. An additional hashtag ('#falseflag') was used alongside the one we searched for ('#lasvegasshooting'). Also note that the user cited a website in the tweet. In this example it is a video sharing website, but on many occasions mainstream or alternative media are cited. This tweet instance has no retweets or favorites. The tweet mentions another user (@cjgmartell).

Moreover, the user tweeted on the '#falseflag' hashtag. When analyzing other tweets from the same user, we see that this user posted multiple times under this hashtag, citing alternative media websites. This shows that by analyzing the relation between users and the websites they cite (taking many other variables into account, of course), it is possible to analyze the media landscape around the event.

Part 1: Getting Started

So, let's get started by learning a bit more about our data.

1.1 Basic Statistics

How many tweets do we have?

In [9]:
print 'We have a total of {} tweets.'.format(len(allTweets))
We have a total of 169913 tweets.

Let us figure out some very basic statistics about our dataset.

In [10]:
# define initial values
allUsersList = [] #All the users
totalNumberOfWords = 0.0
totalRetweets = 0.0
totalFavorites = 0.0
allHashtagList = []
allCitedWebsite = [] # All the websites tweets have ever cited
userCited = dict() # A dictionary which shows which user cited which website

# loop over data
for tweet in allTweets:
    user_name = tweet['username']
    allUsersList.append(user_name)# Keep track of every tweet's author
    citedByThisUser = userCited.get(user_name,[])# Get the websites cited by this user
    citedByThisUser += tweet['citations_domain_names']# Add citations of this tweet
    userCited[user_name] = citedByThisUser
    allHashtagList += tweet['hashtags']
    allCitedWebsite += tweet['citations_domain_names']
    totalNumberOfWords += len(tweet['text'].split())# Add number of words used in this tweet
    totalRetweets += tweet['retweets']
    totalFavorites += tweet['favorites']

# Get averages
averageLength = totalNumberOfWords /  len(allTweets)
averageRetweets = totalRetweets /  len(allTweets)
averageFavorites = totalFavorites /  len(allTweets)

# print results
print 'There are {} different users.'.format(len(set(allUsersList)))
print 'A total of {} different hashtags are used.'.format(len(set(allHashtagList)))
print 'There are {} citations in total.'.format(len(allCitedWebsite))
print 'There are {} different websites cited by users.'.format(len(set(allCitedWebsite)))
print 'The average length of a tweet is {} words.'.format(round(averageLength,2))
print 'The average number of retweets: {}'.format(round(averageRetweets,2))
print 'The average number of favorites: {}'.format(round(averageFavorites,2))
There are 90721 different users.
A total of 22141 different hashtags are used.
There are 37077 citations in total.
There are 4730 different websites cited by users.
The average length of a tweet is 16.49 words.
The average number of retweets: 4.14
The average number of favorites: 7.71

Let's see who are the top 5 users who tweeted the most using the hashtag '#LasVegasShooting'.

In [11]:
for user, nOfTweets in sorted( Counter(allUsersList).iteritems(), key=lambda (user,n):n, reverse = True)[:5]:
    print user,nOfTweets
nooneishere51 725
reviewjournal 599
nativekittens 593
TrishaDishes 338
ConsciousOptix 275

We are already getting some insights about the event. If you search for these usernames, you can see that reviewjournal is a local newspaper published in Las Vegas. Also, 'nooneishere51' and 'ConsciousOptix' appear to be supporters of President Donald Trump.

Let's see which hashtags were the most common.

In [12]:
for hashTag, nOfTweets in sorted( Counter(allHashtagList).iteritems(), key=lambda (hashTag,n):n, reverse = True)[:5]:
    print hashTag,nOfTweets
#lasvegasshooting 169907
#lasvegas 18005
#guncontrolnow 7057
#guncontrol 6871
#mandalaybay 5389

The most frequent hashtags also give some insights about the event. The event occurred at the Mandalay Bay Resort Hotel, and afterwards there were several reactions concerning gun control policies. (Note that the most frequent hashtag is the one we queried for, as expected.)

Another interesting thing to look at in our database is the most popular tweets in terms of retweets and favorites.

In [13]:
# create databases and define number of desired elements
get_top = 10
display_conditions = {"deepmoji":0,"citations_urls":0,"citations_domain_names":0,  "id":0,"date":0, "query_criteria":0, "_id":0, "geo":0, "mentions":0, "hashtags":0}
db_by_retweets = tweet_collection.find({}, display_conditions).sort("retweets",pymongo.DESCENDING)[0:get_top]
db_by_favorites = tweet_collection.find({}, display_conditions).sort("favorites",pymongo.DESCENDING)[0:get_top]

# a function that takes a cursor and pretty prints it.  
def print_result(database):
    array = []
    for t in database: 
        array.append(t)# collect each document from the cursor
    pandas = pd.DataFrame(array)
    display(HTML(pandas.to_html()))
In [14]:
print 'The most retweeted:'
print_result(db_by_retweets)
print 'The most favorited:'
print_result(db_by_favorites)
The most retweeted:
favorites permalink retweets text username
0 29423 16048 My relative's friend posted this. Wypipo so desperate to make the #LasVegasShooting about brown people #LasVegas paopao619
1 9195 10862 BREAKING: Kymberley Suchomel, an eyewitness present at the concert who identified multiple shooters has been found dead. #LasVegasShooting MikeTokes
2 14901 6513 Prayer may provide comfort and consolation, but it is POLICY that provides protection and prevention. #LasVegasShooting CharlesMBlow
3 16475 6486 Millionaires typically don't go on mass murder sprees--Paddock had a motive & we need to know what it is #LasVegasShooting DineshDSouza
4 7126 6480 News orgs obtained Trump's talking points on guns following #LasVegasShooting . He lifted them from @NRA , so I debunked them. Pls share igorvolsky
5 18143 6007 #LasVegasShooting victim on meeting @POTUS : I don't care what anybody has to say to me - he cared. FoxNews
6 25434 5147 Our prayers and deepest condolences are with all those affected by the evil perpetrated in #lasvegas #lasvegasshooting DonaldJTrumpJr
7 12163 4572 The city Hall of Tel Aviv displays the American flag tonight, as we stand in solidarity w/ the American ppl & #LasVegasShooting victims Israel
8 19771 3893 Dear Media, Stop glorifying criminals by broadcasting their names and faces. Call the scum what they really are: COWARDS. #LasVegasShooting MatPatGT
9 2081 3834 Would Stronger Gun Control Laws Have Prevented the #LasVegasShooting ? kwilli1046
The most favorited:
favorites permalink retweets text username
0 29423 16048 My relative's friend posted this. Wypipo so desperate to make the #LasVegasShooting about brown people #LasVegas paopao619
1 25434 5147 Our prayers and deepest condolences are with all those affected by the evil perpetrated in #lasvegas #lasvegasshooting DonaldJTrumpJr
2 19771 3893 Dear Media, Stop glorifying criminals by broadcasting their names and faces. Call the scum what they really are: COWARDS. #LasVegasShooting MatPatGT
3 18143 6007 #LasVegasShooting victim on meeting @POTUS : I don't care what anybody has to say to me - he cared. FoxNews
4 16475 6486 Millionaires typically don't go on mass murder sprees--Paddock had a motive & we need to know what it is #LasVegasShooting DineshDSouza
5 14901 6513 Prayer may provide comfort and consolation, but it is POLICY that provides protection and prevention. #LasVegasShooting CharlesMBlow
6 12163 4572 The city Hall of Tel Aviv displays the American flag tonight, as we stand in solidarity w/ the American ppl & #LasVegasShooting victims Israel
7 12114 2959 Pray for the families of those killed and for the 100+ wounded in a shooting rampage last night in #LasVegas . #LasVegasShooting Franklin_Graham
8 9991 2474 “May God give us the grace of healing &...provide the grieving families w/strength to carry on.” - @POTUS Donald J. Trump #LasVegasShooting Franklin_Graham
9 9940 3168 Hundreds of people lined up to donate blood at a #LasVegas blood bank #lasvegasshooting FoxNews

As you can see, among the users with the most popular tweets there are famous politicians and political activists such as Donald Trump Jr., as well as major organizations such as Fox News. In general, the top tweets tend not to cite any news source but rather carry emotional content.

People spread news using references: they comment on an event based on an article and reference that article to spread the word. So it is quite interesting to look at the sources of information. Let's check the most commonly cited websites:

In [103]:
print 'There are {} different websites where people get information.'.format(len(set(allCitedWebsite)))
There are 4730 different websites where people get information.
In [16]:
for webSite, nOfCitation in sorted( Counter(allCitedWebsite).iteritems(), key=lambda (w,n):n, reverse = True)[:15]:
    print webSite,nOfCitation
8335 897 860 610 596 535 497 448 441 308 295 255 247 239 225

Here we can see some mainstream media such as The New York Times, Fox News, CNN, and The Washington Post. However, people also cite alternative media such as InfoWars and Intellihub.

1.4 Visualizing over time

It is also very interesting to look at the tweets in our database from a chronological point of view. So let's see their chronological distribution.

In [106]:
# get dates and truncate them to the hour for readability purposes
dates = [tweet['date'] for tweet in allTweets]
no_seconds = [date.replace( minute=0, second=0, microsecond=0) for date in dates] 

# count occurrences of each hour
counter = dict(Counter(no_seconds))

# prepare plot
x = []
y = []
for element in counter:
    x.append(element)
    y.append(counter[element])

# plot nicely 
plt.title('Number of tweets per date')
plt.ylabel('Number of tweets')
plt.scatter(x, y, c=y, marker='.', s=y)
plt.xlim([min(x), max(x)])

An interesting fact here is that the closer we are to the event, the more tweets there are (the counts actually decay roughly exponentially). Of course, in the first couple of hours few people knew about the event, so there are fewer tweets then. If we focus on the first two days, we see the following plot.
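The rough exponential-decay claim can be sanity-checked numerically: if the hourly counts follow A·exp(b·t), then the logarithm of the counts is linear in time, so a straight-line fit of log(count) against the hour index estimates the decay rate b. Below is a minimal sketch on a synthetic series (illustrative numbers, not our real counts):

```python
import numpy as np

# Synthetic hourly tweet counts decaying exponentially (illustrative only).
hours = np.arange(48)
counts = 5000.0 * np.exp(-0.1 * hours)

# If counts = A * exp(b * t), then log(counts) = log(A) + b * t, i.e. linear in t.
slope, intercept = np.polyfit(hours, np.log(counts), 1)
print('estimated decay rate per hour: %.3f' % slope)
```

On real data the fit is of course noisy; the size of the residuals indicates how well the exponential model actually holds.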

In [107]:
#Get first 2 days
firstTwoDays = sorted(x)[:48]
nOfTweetsInFirstTwoDays = [counter[hour] for hour in firstTwoDays]
# plot nicely 
plt.title('Number of tweets per date in first Two Days ')
plt.ylabel('Number of tweets')
plt.scatter(firstTwoDays, nOfTweetsInFirstTwoDays, c=nOfTweetsInFirstTwoDays, marker='.', s=nOfTweetsInFirstTwoDays)
plt.xlim([min(firstTwoDays), max(firstTwoDays)])


We have very rich and interesting data to analyze, with different networks hidden inside it.

From the tweets we collected, we are going to generate a number of different networks that will be used throughout the analysis.

Network 1: For the first network, the nodes are the users who have tweeted under the hashtag '#LasVegasShooting'. The edges will be constructed through mentions in these tweets. So, when a tweet mentions another user that is also a node in the network, there will be an edge between these two. We will refer to this network as mention_graph.

Network 2: For the second network, the nodes are still the users. We define an edge between two nodes if they share a common hashtag, not including the query hashtags. For example, if two tweets from different users use the hashtag #GunSense, we will create an edge between them. We will refer to this network as hashtag_graph.
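As a preview, the hashtag_graph construction could be sketched as follows with networkx; the user-to-hashtags data here is a hypothetical toy sample, and the real graph would be built from the tweets' hashtags fields in the same way.

```python
import networkx as nx
from collections import defaultdict
from itertools import combinations

# Hypothetical user -> hashtags sample; '#lasvegasshooting' is the query hashtag.
user_hashtags = {
    'userA': {'#lasvegasshooting', '#guncontrol'},
    'userB': {'#lasvegasshooting', '#guncontrol', '#mandalaybay'},
    'userC': {'#lasvegasshooting', '#mandalaybay'},
    'userD': {'#lasvegasshooting'},
}
query_hashtags = {'#lasvegasshooting'}

hashtag_graph = nx.Graph()
hashtag_graph.add_nodes_from(user_hashtags)

# Invert to hashtag -> users, ignoring the query hashtags themselves.
users_by_tag = defaultdict(set)
for user, tags in user_hashtags.items():
    for tag in tags - query_hashtags:
        users_by_tag[tag].add(user)

# Connect every pair of users who share a hashtag.
for tag, users in users_by_tag.items():
    for u, v in combinations(sorted(users), 2):
        hashtag_graph.add_edge(u, v)

print(sorted(hashtag_graph.edges()))
```

Note that userD only used the query hashtag, so it remains an isolated node.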

Network 3: Finally, for the third network, the nodes are sources of information, i.e. the websites users reference. We define an edge between two nodes if the same user shared an article from both websites. For example, if the user 'DonaldTrumpJr' shared articles from both 'Fox News' and 'CNN', there will be an edge between these nodes.

Below we will start creating the networks.

2.1 Mention Graph

Let's create the mention graph first.

In [19]:
import networkx as nx
from collections import defaultdict
from itertools import combinations

# start by finding all unique usernames in the tweets 
usernames = list(set(allUsersList))

# create a graph for mention relations 
mention_graph = nx.Graph()

# add nodes from users that wrote tweets
mention_graph.add_nodes_from(usernames)
print 'Number of nodes in mention_graph', len(mention_graph.nodes())

# add edges to mention_graph between mentions in tweets 
# get all tweets with their mentions
tweet_mentions = list(tweet_collection.find({'mentions': {'$ne' : [],}}, {'username': 1, 'mentions': 1}))

# define a default dictionary to store the unique mentions used per user as a set
mentions_dict = defaultdict(set)

# populate dict {username: set(mentions)}
for tweet in tweet_mentions:
    # get mentions from without @ (@DonaldTrumpJr -> DonaldTrumpJr)
    mentions = map(lambda mention: mention[1:], tweet['mentions'])
    # update dict
    mentions_dict[tweet['username']].update(mentions)

# add edges from mentions_dict
for user, mentions in mentions_dict.iteritems():
    for to_node in mentions:
        if mention_graph.has_node(to_node):
            mention_graph.add_edge(user, to_node)
print 'Number of edges in mention_graph', len(mention_graph.edges())

# get degree distributions
mention_degree = dict(mention_graph.degree())
Number of nodes in mention_graph 90721
Number of edges in mention_graph 6209

As you can see, this graph has many more nodes than edges. This is actually expected, because people on Twitter generally tend to mention other users only when the mentioned user has a direct relation to the event.

To further analyze this graph let's see some basic statistics about the graph.

Basic stats on Mention Graph

Degree distribution

Let's analyze the degree distribution to understand our graph.

In [20]:'fivethirtyeight')
# get minimum and maximum degrees
min_mention_degree, max_mention_degree = min(mention_degree.values()), max(mention_degree.values())

# plot the degree distributions
plt.yscale('log', nonposy='clip')
plt.title('Mention graph degree distribution')
plt.ylabel('Number of nodes')
d = sorted(mention_degree.values(),reverse=True)
r = range(min_mention_degree, max_mention_degree + 1)
_ = plt.hist(d,r) # degree sequence

Note that the histogram above is on a logarithmic scale. Let's also look at the distribution on a log-log scale.

In [21]:
c = Counter(d)
frequency = c.values()
degrees_values = c.keys()
plt.loglog(degrees_values, frequency, 'ro')
plt.title('LogLog Plot of Degree Distribution for Mention Graph')

As you can see, we can fairly safely say that the degree distribution follows a power law. Note that there are a lot of nodes without any connections; here it is wiser to look only at the GCC.
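One quick, hedged way to quantify the power-law claim: if the frequency of degree k is proportional to k^(-gamma), then log-frequency versus log-degree is linear with slope -gamma, so a straight-line fit in log-log space estimates the exponent. The sketch below uses an exact synthetic power-law histogram rather than our real degree sequence:

```python
import numpy as np

# Synthetic degree histogram following an exact power law: freq ~ k^-2.2.
degrees = np.arange(1, 101)
frequency = 10000.0 * degrees ** -2.2

# Fit a line to log(frequency) vs log(degree); the slope estimates -gamma.
slope, _ = np.polyfit(np.log(degrees), np.log(frequency), 1)
print('estimated exponent gamma = %.2f' % -slope)
```

On real, noisy degree data a least-squares fit on binned counts is only a rough indicator; maximum-likelihood estimators are generally considered more reliable.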


Plotting the size of components

Let's get all the connected components. The biggest one will be the GCC.

In [22]:
# components_mention = pickle.load(open('componentsMention','rb'))
components_mention = sorted(nx.connected_component_subgraphs(mention_graph), key=len, reverse=True)
print 'The mention graph has {0} disconnected components'.format(len(components_mention))
The mention graph has 85876 disconnected components

A lot of subgraphs! Let's try to understand their sizes.

In [23]:
mention_component_lengths = [len(c) for c in components_mention]
plt.yscale('log', nonposy='clip')
plt.title('Mention graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
max_mcl = max(mention_component_lengths)
_ = plt.hist(mention_component_lengths, range(max_mcl + 1))

So apparently most of the components are pretty small. Let's look at the sizes of the 5 biggest components.

In [24]:
print mention_component_lengths[:5]
[3832, 20, 11, 11, 10]

Here we can see that the GCC is big enough to give us good insight.

Examining the GCC

Since the main graph is so disconnected, we decided to work only with the GCC of the graph. This allows us to perform a more in-depth analysis.

In [25]:
# get the giant connected component of the mention graph
mention_gcc = components_mention[0]
mention_degree_gcc = dict(mention_gcc.degree())

# number of nodes and edges
print 'The GCC of the mention graph has {nodes} nodes and {edges} edges.'.format(nodes=len(mention_gcc.nodes()), edges=len(mention_gcc.edges()))
print ' - Average degree:', float(sum(mention_degree_gcc.values())) / len(mention_gcc.nodes())

# draw the graphs
nx.draw_networkx(mention_gcc, pos=nx.spring_layout(mention_gcc), node_size=[v * 100 for v in mention_degree_gcc.values()], with_labels=False)
plt.title('Mention GCC')
The GCC of the mention graph has 3832 nodes and 5078 edges.
 - Average degree: 2.65970772443

Here the size of each node depends on its degree. Clearly some nodes have a lot of connections. Let's see who they are.

In [26]:
mention_degree_gcc = dict(mention_gcc.degree())
usersWithMostDegree = sorted(mention_degree_gcc.items(), key = lambda x:x[1], reverse = True)[:5]
print usersWithMostDegree
[(u'CNN', 649), (u'FoxNews', 538), (u'LauraLoomer', 516), (u'TuckerCarlson', 195), (u'RealAlexJones', 160)]

Apparently people like to mention the media in their tweets. Here we can also see that right-wing political commentators in particular (Alex Jones, Tucker Carlson) were mentioned frequently by users. This might be because the event touches on gun restriction laws in the U.S.A.

GCC degree distribution

Since we are now only looking at the GCC, we run the degree distribution again. This time we have no nodes without edges.

In [27]:
# get the maximum degree
max_mention_gcc_degree = max(mention_degree_gcc.values())

# plot the degree distributions
plt.yscale('log', nonposy='clip')
plt.title('Mention GCC degree distribution')
plt.ylabel('Number of nodes')
_=plt.hist(sorted(mention_degree_gcc.values(),reverse=True), range(max_mention_gcc_degree + 1)) # degree sequence

#So let's also print the node in the gcc with the minimum degree to see if it has a link
print 'The user with the lowest degree:', min(mention_degree_gcc.items(), key= lambda x: x[1])
The user with the lowest degree: (u'SonoranRed', 1)

As you can see, in the GCC all nodes have degree at least one, and the distribution looks nicer.

Community detection

To further analyze our graph, let's now look at the communities, to understand whether some users especially mention each other.

In [28]:
import community

# use the python Louvain implementation to find communities in the networks
partition_mention = community.best_partition(mention_gcc)

mention_size = float(len(set(partition_mention.values())))
pos = nx.spring_layout(mention_gcc)
count = 0.
for com in set(partition_mention.values()) :
    count = count + 1.
    list_nodes = [nodes for nodes in partition_mention.keys()
                                if partition_mention[nodes] == com]
    nx.draw_networkx_nodes(mention_gcc, pos, list_nodes, node_size = 20,
                                node_color = str(count / mention_size))

print 'For the mention GCC we have found {} communities'.format(int(mention_size))
nx.draw_networkx_edges(mention_gcc,pos, alpha=0.4)
For the mention GCC we have found 42 communities

Let us dive a little deeper in what these communities are.

First, we will look into the sizes of the communities and the biggest accounts in the communities. This is to get a sense for the kind of accounts we find.

Then, we will look into the most common hashtags used in every community in the mention graph. This is to get a feeling for the topics that live in every community.

Community Analysis

Below you can see a table showing each community's size, its most cited source of information, and the biggest account in the community (in terms of degree).

In [29]:
import twitter
# look at accounts in each partition with highest degree

# twitter api credentials for lookup

# instantiate API object
twitter_api= twitter.Twitter(auth=auth)

# auxiliary function
def inverse_partition(partition):
    components_inv = defaultdict(list)
    for key, value in partition.iteritems():
        components_inv[value].append(key)
    return components_inv

# get top accounts by degree
def partition_top_accounts(partition, degree):
    part_inv = inverse_partition(partition)
    return {part: max(usernames, key=lambda user: degree[user]) for part, usernames in part_inv.iteritems()}

# get data on account
def twitter_account(username):
    return twitter_api.users.lookup(screen_name=username)

# Get the most commonly cited website
def getMostCommonWebsite(partition, userCited):
    inv = inverse_partition(partition)
    out = dict()
    for com in inv:
        l = []
        for user in inv[com]:
            l+= list(set(userCited[user]))#Add all the citations only ONCE  to a list in this community
        c = Counter(l)
        mostFreqSite = max(set(l), key=l.count)  
        out[com] = (mostFreqSite, round(c[mostFreqSite],2))
    return out
# display data in dataframe
def pprint_partition_overview(partition, degree, userCited):
    data = []
    columns = ['Community No', 'Most Cited Website in Community','Percentage of Users who cited', 'Community Size', 'Screen Name', 'Name', 'Url', 'Location', 'Followers', 'Degree']
    top_accounts = partition_top_accounts(partition, degree) 
    top_websites = getMostCommonWebsite(partition,userCited)
    for part_id, account in top_accounts.iteritems():
        user = twitter_account(account)[0]
#         print pprint( user)
        url = ''
        try:
            url = user['entities']['url']['urls'][0]['display_url']
        except KeyError:
            pass
        row = {
            'Community No': part_id,
            'Most Cited Website in Community': top_websites[part_id][0],
            'Percentage of Users who cited': float(top_websites[part_id][1])/len(inverse_partition(partition)[part_id]) * 100,
            'Community Size': len(inverse_partition(partition)[part_id]),
            'Screen Name': account,
            'Name': user['name'],
            'Url': url,
            'Location': user['location'],
            'Followers': user['followers_count'],
            'Degree': degree[account]
        }
        data.append(row)
    data.sort(key=lambda row: row['Percentage of Users who cited'], reverse=True)
    df = pd.DataFrame(data)
    df = df[columns]
    display(df)
In [30]:
print 'Overview of communities'
pprint_partition_overview(partition_mention, mention_degree_gcc,userCited)
Overview of communities
Community No Most Cited Website in Community Percentage of Users who cited Community Size Screen Name Name Url Location Followers Degree
0 5 83.333333 60 SmythRadio Kerry Smyth Cincinnati, OH 15694 32
1 40 75.000000 8 AntiMedia Anti-Media 45317 5
2 32 50.000000 8 vinarmani Ⓥin Ⓐrmani Las Vegas 10238 4
3 6 40.697674 86 intellihubnews Intellihub 26213 59
4 3 37.228261 368 RealAlexJones Alex Jones Austin, TX 754556 160
5 1 34.024896 482 LauraLoomer Laura Loomer New York, USA 112096 516
6 21 33.333333 6 boston25 Boston 25 News Boston, MA 298877 5
7 27 33.333333 6 TrumpCard555 Deplorable Kathy Pensilvânia, USA 6554 4
8 28 33.333333 6 CharlieDaniels Charlie Daniels Mt. Juliet, TN 597081 4
9 39 33.333333 9 France24_en FRANCE 24 English Paris, France 181614 4
10 38 28.571429 7 ABC7News ABC 7 News - WJLA Washington, DC 142811 6
11 7 24.137931 29 ralphlopez Ralph Lopez 82 20
12 0 24.050633 79 Timothytrippin Timothy Sullivan Central Fifth Dimension 1508 27
13 34 22.222222 9 cspanwj Washington Journal Washington, DC 57340 9
14 15 20.571429 175 reviewjournal Las Vegas RJ Las Vegas, NV 229333 113
15 24 20.370370 54 OANN One America News 155062 19
16 29 20.000000 5 RitaCosby Rita Cosby WABC RADIO 120595 3
17 36 20.000000 10 KSL5TV KSL 5 TV Salt Lake City, Utah 62235 6
18 16 19.672131 61 scrowder Steven Crowder Ghostlike 490897 20
19 2 17.647059 187 TuckerCarlson Tucker Carlson Washington, DC 1356392 195
20 8 17.647059 17 Lrihendry Lori Hendry 208607 6
21 31 16.666667 6 FOX5Atlanta FOX 5 Atlanta Atlanta 557489 5
22 41 16.666667 6 CBNNews CBN News D.C.-Nashville-Jerusalem-VA 68703 5
23 23 16.216216 37 LasVegasSun Las Vegas Sun Las Vegas, NV 224171 32
24 33 15.384615 13 yesgregyes Greg Morelli Chicago, Illinois 1029 12
25 20 15.000000 100 USATODAY USA TODAY USA TODAY HQ, McLean, Va. 3529342 40
26 35 14.285714 7 juliettekayyem Juliette Kayyem United States 72576 6
27 10 13.043478 69 ABC7 ABC7 Eyewitness News 1009448 20
28 26 11.764706 17 NEWS1130 NEWS 1130 Vancouver 211788 8
29 19 10.606061 66 NewsHour PBS NewsHour Arlington, VA | New York, NY 983640 16
30 14 10.344828 29 AliVelshi Ali Velshi New York/Philly/The World 294448 8
31 11 8.505747 435 FoxNews Fox News U.S.A. 16584348 538
32 30 8.333333 24 watchyourReps Watch Your Reps United States 848 8
33 18 7.920792 101 AC360 Anderson Cooper 360° New York, NY 1133762 44
34 22 7.352941 68 ATFHQ ATF HQ Washington, DC 47997 29
35 25 7.142857 14 CharlesMBlow Charles M. Blow… Brooklyn 414445 8
36 37 6.666667 15 GovAbbott Gov. Greg Abbott The Texas Capitol, Austin, TX 151756 9
37 4 6.609808 469 CNN CNN 38682912 649
38 13 5.434783 276 CityOfLasVegas City of Las Vegas Las Vegas, Nevada 223481 116
39 9 5.405405 111 SenJohnMcCain John McCain Phoenix, AZ / Washington, DC 3029886 36
40 17 5.405405 111 SkyNews Sky News London, UK 4423583 44
41 12 5.376344 186 MomsDemand Moms Demand Action USA 88921 87

This table is built in the following way:

  • Each row corresponds to a community found by the Louvain algorithm. In it you can see several characteristics, such as the community size, the most popular user (in terms of degree), and that user's number of followers.

  • You can also see the most cited domain for each of these communities, including the percentage of users in the community that cited the domain.

  • By most cited website we mean the website cited by the most users, counting each user once. For example, among all the citations of a community, if one person cited a site 50 times, we count it as 1.

Example: In the first row, community number 5, 83% of the 60 users cited the domain '' (Periscope). Furthermore, the most popular user in this community is Kerry Smyth from Cincinnati, OH, with 15K followers.

Some interesting things are worth noting:

  • Some communities relate to a certain source of information (domain) in a very strong way. Notice how in the first two rows, at least 75% of users cited a certain source.

  • When we look at alternative information sources such as '' and '', we can observe that their communities cite these sites at a quite high rate (intellihub: 40%, theantimedia: 75%).

  • You can also see that one of the most commonly cited sites is ''. This is expected, since people share video news about the event through this site.

  • Another interesting observation is that some of the communities in fact correspond to geographical communities. For example, community 23 is all about the Las Vegas area and its local media (the same holds for the Boston community, number 21).

A problem here is that some communities have YouTube as their most cited website, which does not tell us much about the community. Let's look at the 3 most cited websites in each community.

In [31]:
# Get the most commonly cited website
def getMostCommonWebsite3(partition, userCited):
    inv = inverse_partition(partition)
    out = dict()
    for com in inv:
        l = []
        for user in inv[com]:
            l+= list(set(userCited[user]))#Add all the citations only ONCE  to a list in this community
        c = Counter(l)
        mostFreqSites = sorted(set(l), key=l.count,reverse = True)[:3]
        percentage = dict()
        for i in mostFreqSites:
            percentage[i] = round(float(c[i]) / len(inv[com]) * 100)
        while len(mostFreqSites)<3:
            mostFreqSites.append('NaN')# Pad with placeholders when fewer than 3 sites were cited
            percentage['NaN'] = 0
        out[com] = (mostFreqSites, percentage)
    return out

def showIt(partition):
    columns = ['Community No','Community Size', 'M1', 'P1','M2','P2','M3','P3']
    inv = inverse_partition(partition)
    data = []
    top3_websites = getMostCommonWebsite3(partition,userCited)
    for part_id in inv:
        mostFreqSites = top3_websites[part_id][0]
        percentage = top3_websites[part_id][1]
        row = {
            'Community No':part_id,
            'Community Size': len(inv[part_id]),
            'M1': mostFreqSites[0],
            'P1': percentage[mostFreqSites[0]],
            'M2': mostFreqSites[1],
            'P2': percentage[mostFreqSites[1]],
            'M3': mostFreqSites[2],
            'P3': percentage[mostFreqSites[2]]
        }
        data.append(row)
    data.sort(key=lambda row: row['P1'], reverse=True)
    df = pd.DataFrame(data)
    df = df[columns]
Community No Community Size M1 P1 M2 P2 M3 P3
0 5 60 83.0 15.0 3.0
1 40 8 75.0 50.0 25.0
2 32 8 50.0 13.0 13.0
3 6 86 41.0 34.0 16.0
4 3 368 37.0 5.0 4.0
5 1 482 34.0 12.0 2.0
6 21 6 33.0 17.0 17.0
7 27 6 33.0 17.0 17.0
8 28 6 33.0 17.0 NaN 0.0
9 39 9 33.0 11.0 11.0
10 38 7 29.0 NaN 0.0 NaN 0.0
11 0 79 24.0 8.0 6.0
12 7 29 24.0 7.0 7.0
13 34 9 22.0 11.0 11.0
14 15 175 21.0 18.0 7.0
15 16 61 20.0 3.0 3.0
16 24 54 20.0 11.0 6.0
17 29 5 20.0 20.0 NaN 0.0
18 36 10 20.0 10.0 10.0
19 2 187 18.0 5.0 3.0
20 8 17 18.0 18.0 12.0
21 31 6 17.0 NaN 0.0 NaN 0.0
22 41 6 17.0 17.0 17.0
23 23 37 16.0 11.0 11.0
24 20 100 15.0 5.0 4.0
25 33 13 15.0 8.0 8.0
26 35 7 14.0 14.0 14.0
27 10 69 13.0 7.0 6.0
28 26 17 12.0 6.0 6.0
29 19 66 11.0 9.0 6.0
30 14 29 10.0 10.0 7.0
31 11 435 9.0 4.0 3.0
32 18 101 8.0 4.0 3.0
33 30 24 8.0 8.0 4.0
34 4 469 7.0 4.0 1.0
35 22 68 7.0 6.0 4.0
36 25 14 7.0 7.0 NaN 0.0
37 37 15 7.0 7.0 7.0
38 9 111 5.0 4.0 4.0
39 12 186 5.0 5.0 4.0
40 13 276 5.0 5.0 3.0
41 17 111 5.0 3.0 3.0

Here the percentages are the percentage of users in the community who cited the website at least once. So for community 40 (second line), 75.0% of the users cited the first site, 50% the second, and 25% the third.
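As a quick illustration of this computation, here is a toy sketch (the usernames and site names below are made up, not from the dataset): each user's citation list is deduplicated first, so a site counts at most once per user.

```python
from collections import Counter

# Hypothetical citation lists per user (not real data).
user_cited = {
    'alice': ['siteA', 'siteA', 'siteB'],
    'bob':   ['siteA'],
    'carol': ['siteB', 'siteC'],
    'dave':  [],
}

citations = []
for sites in user_cited.values():
    citations += sorted(set(sites))  # count each site once per user

counts = Counter(citations)
n_users = len(user_cited)
# percentage of users who cited each site at least once
percentages = {site: round(100.0 * c / n_users) for site, c in counts.items()}
print(percentages)  # siteA and siteB each cited by 2 of 4 users -> 50
```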

Looking at both tables above, we see some interesting results:

  • Especially when people spread rumors, they tend to share videos. For this event, people often claim that there were multiple shooters and share videos supporting this claim. We can see this in lines 1 and 3 (communities 40 and 6): most of the users who shared news from alternative media are sharing videos from YouTube.

  • There might be a relationship between '' and ''. (We actually saw that both are alternative media.)

  • In line fourteen we can see that there is a relation between the two sites listed there. This is actually true, since both of them are local Las Vegas media.

An interesting way to analyze communities is to find which hashtags they tend to use. For example, if a community tends to use the '#guncontrol' hashtag, we can infer that there is a relation between this community and the gun laws debate.

In [32]:
import matplotlib as mpl'classic')

def partition_hashtag_analysis(partition):
    # inverse the partitioning to get a dict with { partitioning_id : [usernames]}
    components_inv = inverse_partition(partition)
    # get all hashtags used by users in combination with our query_hashtags
    components_hashtags = defaultdict(list)
    for part_id, usernames in components_inv.iteritems():
        tweets = tweet_collection.find({
            'username': {
                '$in': usernames
            },
            'hashtags': {
                '$ne': [],
                '$nin': map(lambda s: '#' + s, query_hashtags)
            }
        }, {
            'hashtags': 1
        })
        # filter out the query hashtags
        for row in tweets:
            tags = [tag.replace('#', '') for tag in row['hashtags'] if tag not in query_hashtags]
            components_hashtags[part_id] += tags

    part_tag_counts = {}
    for part_id, tags in components_hashtags.iteritems():
        counts = Counter(tags)
        part_tag_counts[part_id] = counts
    return part_tag_counts

mention_com_hashtags = partition_hashtag_analysis(partition_mention)

# Heatmap
# get most common hashtags in general
number_of_tags = 20
hashtags_count = Counter([tags for part_id in mention_com_hashtags.itervalues() for tags in part_id])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))

# create a matrix of the count of use of the most common hashtags in the communities
heat_array = np.array([[counts[tag] for tag in most_common_tags] for counts in mention_com_hashtags.values()])

# plot heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(number_of_tags), most_common_tags, rotation='vertical')
plt.yticks(range(heat_array.shape[0]), mention_com_hashtags.keys())
# ax.set_facecolor('white')

The heatmap above displays how often the most common hashtags appear in every community. The brighter the color, the more prevalent that hashtag is in the tweets from that community and, by extension, from its users. This gives us an idea of the opinions and interests of these users. The data is a little sparse for most communities, since they are small and there are not that many hashtags, but we can see some interesting patterns emerge in the ones that do have data.

  • The most interesting observation is the strong correlation between community 1 and community 3. These are actually the alternative media communities (the AntiMedia and IntelliHub communities).

  • Only communities 0, 1 and 3 are using #falseflag. This shows that the alternative media communities tend to spread the claim that the event was fake.

2.2 Hashtag Graph

Another interesting graph is the hashtag graph. In this second network, the nodes are still the users. There is an edge between two nodes if both used the same hashtag more than a given limit. Here we have chosen this limit to be 10: if two users each have at least 10 tweets under the same hashtag, there will be an edge between them.
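This edge rule can be sketched on toy data (the usernames, hashtags and counts below are hypothetical, not from the dataset):

```python
from itertools import combinations

# Hypothetical per-hashtag tweet counts per user.
tag_user_counts = {
    'falseflag': {'u1': 12, 'u2': 15, 'u3': 2},
    'vegas':     {'u1': 11, 'u4': 30},
}
LIMIT = 10

edges = []
for tag, user_counts in tag_user_counts.items():
    # users who tweeted this hashtag at least LIMIT times
    heavy_users = sorted(u for u, c in user_counts.items() if c >= LIMIT)
    # one edge (labelled with the tag) per pair of such users
    edges += [(a, b, tag) for a, b in combinations(heavy_users, 2)]

print(edges)
```

Note that 'u3' gets no edge for 'falseflag' because it tweeted that hashtag only twice, below the limit.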

Let's create the graph.

In [33]:
usernames = tweet_collection.distinct('username')
# create two separate graphs for mention relations and hashtags, one simple, one multi
hashtag_graph = nx.MultiGraph()

# add nodes from users that wrote tweets
hashtag_graph.add_nodes_from(usernames)

print 'Number of nodes in hashtag_graph', len(hashtag_graph.nodes())

# add edges to the hashtag_graph
# get all tweets with hashtags
tweet_hashtags = tweet_collection.find({'entities.hashtags': {'$ne': []}}, {'username': 1, 'hashtags': 1})
Number of nodes in hashtag_graph 90721

Let's build a collection showing which user used which hashtag (of course eliminating the hashtags we queried to get the data).

In [34]:
# initialize a defaultdict to track the unique hashtags 
# and how often users are using them
hashtags_dict = defaultdict(lambda: defaultdict(int))

# populate the dict {hashtag: {username: count}}
for tweet in tweet_hashtags:
    username = tweet['username']
    # list of hashtags
    hashtags = map(lambda tag: tag.replace('#', '').lower(), tweet['hashtags'])
    # remove the query_hashtags
    new_tags = list(set(set(hashtags) - set(query_hashtags)))
    if len(new_tags) > 0:
        for tag in new_tags:
            if tag:
                hashtags_dict[tag][username] += 1

Now let's get the edges: if two users each have at least 10 tweets under the same hashtag, there is a link between them.

In [35]:
# add edges between all users with the same hashtag if each used it at least 10 times
for tag, userdict in hashtags_dict.iteritems():
    # find users who used the tag at least 10 times
    usernames = [username for username, count in userdict.iteritems() if count >= 10]
    # create tuples of possible combinations of nodes
    sets = combinations(usernames, 2)
    # add edges, labelled with the shared hashtag
    for combi in sets: 
        hashtag_graph.add_edge(*combi, attr=tag)
print 'Number of edges in hashtag_graph', len(hashtag_graph.edges())
Number of edges in hashtag_graph 1453345

2.1.1 Basic stats on Hashtag Graph

Degree distribution

Like we did before let's see some statistics about this graph.

In [36]:'fivethirtyeight')

# get degree distributions
hashtag_degree = dict(
# get minumum and maximum degrees
min_hashtag_degree, max_hashtag_degree = min(hashtag_degree.values()), max(hashtag_degree.values())

# plot the degree distributions
plt.title('Hashtag graph degree distribution')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
_= plt.hist(sorted(hashtag_degree.values(),reverse=True), range(min_hashtag_degree, max_hashtag_degree + 1)) # degree sequence

This is a rather interesting distribution. Apparently nodes either have no edges at all or they have a lot. We can thus suspect that there are around 2000 accounts that tweet a lot on similar topics. Let's print the nodes with the maximum and minimum degrees.

In [37]:
print sorted(hashtag_degree.items(),key = lambda x: x[1], reverse = True)[0:5]
print sorted(hashtag_degree.items(),key = lambda x: x[1], reverse = False)[0:5]
[(u'tiniskwerl', 1967), (u'PatJohnson_9', 1960), (u'PatJohnson_8', 1960), (u'PatJohnson_3', 1960), (u'PatJohnson_2', 1960)]
[(u'mamajazzyy', 0), (u'Fatal_Romantic', 0), (u'bayoucityy', 0), (u'Protest_Works', 0), (u'_Jlach', 0)]

So, as seen here, most of the nodes do not have any edges; thus here also we should work on the GCC.


In [38]:
# get all the separate components
components_hashtag = sorted(nx.connected_component_subgraphs(hashtag_graph), key=len, reverse=True)
# components_hashtag = pickle.load(open('comhtagponenetsHas','rb'))
print 'The hashtag graph has {0} disconnected components'.format(len(components_hashtag))
The hashtag graph has 89025 disconnected components

Let's see the biggest subgraphs.

In [39]:
hashtag_component_lengths = [len(c) for c in components_hashtag]
hashtag_component_lengths = sorted(hashtag_component_lengths, reverse = True)[:5]
print hashtag_component_lengths
[1697, 1, 1, 1, 1]

So apparently there is one giant component containing 1697 nodes, while the other nodes are isolated. It is best to work on this component.

In contrast to the mention graph, this graph includes too many edges, so plotting it is not very informative (we would only see a black blob). We therefore go directly to community detection.

Community detection

In [40]:
# get the giant connected component 
hashtag_gcc = components_hashtag[0]
hashtag_degree_gcc = dict(
# number of nodes and edges
print 'The GCC of the hashtag graph has {nodes} nodes and {edges} edges'.format(nodes=len(hashtag_gcc.nodes()), edges=len(hashtag_gcc.edges()))
The GCC of the hashtag graph has 1697 nodes and 1453345 edges
In [41]:
partition_hashtag = community.best_partition(hashtag_gcc)

So let's perform an analysis similar to the one we did for the mention graph.

In [42]:
print 'The hashtag graph partitions with an overview of the accounts with the highest degrees'
pprint_partition_overview(partition_hashtag, hashtag_degree_gcc, userCited)
The hashtag graph partitions with an overview of the accounts with the highest degrees
Community No Most Cited Website in Community Percentage of Users who cited Community Size Screen Name Name Url Location Followers Degree
0 1 38.461538 221 tiniskwerl tiniskwerl Northern California U.S.A. 1446 1967
1 0 29.146426 1441 GrandeFormaggio GrandeFormaggio 908 1725
2 2 22.857143 35 thomasj17431826 Thomas J 3032 1790

So only 3 communities exist in this graph. To further analyze it, let's see the 3 most cited web pages here as well.

In [43]:
Community No Community Size M1 P1 M2 P2 M3 P3
0 1 221 38.0 8.0 7.0
1 0 1441 29.0 6.0 4.0
2 2 35 23.0 11.0 11.0

Apparently this graph gives less insight into our data.

Source of Information Graph

The last graph that we think is interesting shows the relation between sources of information. Here, the nodes are websites, and there is an edge between two nodes if the same user cited both sites. We have already collected the data for this part, so let's recall some numbers and build the graph.
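A minimal sketch of this edge rule, assuming hypothetical users and site names: two websites become connected as soon as a single user has cited both.

```python
from itertools import combinations

# Hypothetical citation lists per user (not real data).
user_cited = {
    'alice': ['siteA', 'siteB', 'siteA'],
    'bob':   ['siteB', 'siteC'],
}

edges = set()
for sites in user_cited.values():
    # deduplicate, then connect every pair of sites this user cited
    for pair in combinations(sorted(set(sites)), 2):
        edges.add(pair)

print(sorted(edges))
```

Using a set of edges mirrors the simple `nx.Graph()` below: citing the same pair of sites repeatedly still yields a single edge.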

Let's create the graph

In [45]:
print 'There are {} citations in {} tweets in total'.format(len(allCitedWebsite), len(allTweets))
print 'There are {} different sources of information that users cite'.format(len(set(allCitedWebsite)))
There are 37077 citations in 169913 tweets in total
There are 4730 different sources of information that users cite
In [47]:
infoSource_graph = nx.Graph()

# add nodes: one per cited website
infoSource_graph.add_nodes_from(set(allCitedWebsite))

for user in userCited:
    citedByUser = list(set(userCited[user]))
    sets = combinations(citedByUser, 2)
    # add an edge between every pair of sites cited by the same user
    for combi in sets: 
        infoSource_graph.add_edge(*combi)
print 'The information sources graph has {} nodes'.format(len(infoSource_graph.nodes()))
print 'The information sources graph has {} edges'.format(len(infoSource_graph.edges()))
The information sources graph has 4730 nodes
The information sources graph has 15234 edges

Let's draw the graph

In [51]:
info_degree = dict(
# draw the graphs
nx.draw_networkx(infoSource_graph, pos=nx.spring_layout(infoSource_graph), node_size=[v * 5 for v in info_degree.values()], with_labels=False)
plt.title('Source of Information Graph')