In this project we have taken a systematic approach to analyzing the Las Vegas Shooting. This tragic event occurred on the night of October 1, 2017, when a gunman opened fire on a crowd of concertgoers on the Las Vegas Strip in Nevada, leaving 58 people dead and 546 injured. To analyze the event, we collected publicly available Twitter data under the hashtag '#LasVegasShooting'. This hashtag was chosen because it is the biggest hashtag on the topic and it is rather neutral (i.e. people of all viewpoints could easily use it). Hashtags such as #PrayForVegas were not collected for this exact reason.
The focus here was to analyze “alternative narratives” of crisis events. In events of this magnitude, alternative media may be used to spread rumors: conspiracy theories claiming either that the event did not happen or that it was perpetrated by someone other than the current suspects are spread through these outlets. By analyzing the publicly available Twitter data, we attempted to gain some insight into the event and into the ways media was used to spread information about it.
This section shows the data collection and its format. The actual data collection code is not included here; however, given similar data for other events, it is possible to rerun the analysis quite easily. This is why we could easily rerun a similar analysis for another event, such as the Sutherland Springs Church Shooting (Texas, U.S.A., November 5, 2017).
The official Twitter API imposes time constraints; for example, it is impossible to retrieve tweets older than a week. Thus, for data collection we used the 'Jefferson-Henrique/GetOldTweets-python' library found on GitHub. This project, written in Python to retrieve old tweets, bypasses some limitations of the official Twitter API.
With this tool we collected all the publicly available tweets under the '#LasVegasShooting' hashtag between 30/09/2017 and 30/11/2017. Remember that the event occurred on the night of 1/10/2017, so the very first tweet we have is from the following day.
# those hashtags will be analyzed
query_hashtags = ['#lasvegasshooting']
Each tweet is stored with some fields of interest. Let's now look at the format of our data. First, let's import the necessary libraries and connect to the database.
# import the libraries used throughout the project
from pymongo import MongoClient
import pickle, sys
import pymongo
import numpy as np
import pandas as pd
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', -1)
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (15, 6)
# initialize mongo client on MongoDB Atlas
client = MongoClient("mongodb://socialgraphs:interactions@socialgraphs-shard-00-00-al7cj.mongodb.net:27017,socialgraphs-shard-00-01-al7cj.mongodb.net:27017,socialgraphs-shard-00-02-al7cj.mongodb.net:27017/test?ssl=true&replicaSet=SocialGraphs-shard-0&authSource=admin")
db = client.lasvegas
# access tweet collection
tweet_collection = db.tweetHistory
Populating the interactive namespace from numpy and matplotlib
Below you can see an example tweet as it is stored in our database.
allTweets = list(tweet_collection.find())# A list containing all the tweets
exampleTweet = allTweets[603]#Just an interesting tweet with all fields filled
for field in exampleTweet:
    print field, ':', exampleTweet[field]
username : Luma923
permalink : https://twitter.com/Luma923/status/914998232505307136
query_criteria : #LasVegasShooting
text : @cjgmartell Mmm... 3 yrs ago, 2014 #LasVegasShooting = #falseflag w/2 shooters Las Vegas Shooting Could B False Flag https://www. youtube.com/watch?v=VG1TNd K85dc&sns=tw …
citations_urls : [u'https://www.youtube.com/watch?v=VG1TNdK85dc&sns=tw']
hashtags : [u'#falseflag', u'#lasvegasshooting']
citations_domain_names : [u'www.youtube.com']
retweets : 0
favorites : 0
mentions : [u'@cjgmartell']
date : 2017-10-03 02:39:13
deepmoji : {u'Emoji_4': 44, u'Emoji_5': 41, u'Emoji_1': 50, u'Emoji_2': 42, u'Emoji_3': 54, u'Top5%': 0.30533380806446075, u'Pct_5': u'0.0537824', u'Pct_4': u'0.0576868', u'Pct_1': u'0.0686643', u'Pct_3': u'0.062333', u'Pct_2': u'0.0628672'}
_id : 5a25fae5929c3244118029a4
geo : 
id : 914998232505307136
As you can see by following the permanent link (the permalink field), this tweet belongs to the user 'Luma923'. The text field holds the tweet's content. Two additional hashtags were used in addition to the one we searched for ('#falseflag', '#lasvegasshooting'). Also note that the user cited a website in the tweet. In this example it is a video-sharing website, but on many occasions mainstream or alternative media are cited. This tweet has no retweets or favorites, and it mentions another user (@cjgmartell).
Moreover, the user tweeted on the '#falseflag' hashtag. Looking at other tweets from the same user, we see that this user posted on this hashtag multiple times, citing alternative media websites. This shows that by analyzing the relations between users and the websites they cite (taking many other variables into account, of course), it is possible to analyze the media landscape.
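For reference, a field like citations_domain_names can be derived from citations_urls by parsing out each URL's network location. The notebook's own extraction code is not shown, so the snippet below is only an assumption about how such a field could be produced; it runs under both Python 2 and 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# the example tweet's citation URL
citations_urls = ['https://www.youtube.com/watch?v=VG1TNdK85dc&sns=tw']
# keep only the network location (domain) of each URL
citations_domain_names = [urlparse(url).netloc for url in citations_urls]
print(citations_domain_names)  # ['www.youtube.com']
```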
So, let's get started by learning a bit more about our data.
How many tweets do we have?
print 'We have a total of {} tweets.'.format(len(allTweets))
We have a total of 169913 tweets.
Let us figure out some very basic statistics about our dataset.
# define initial values
allUsersList = [] #All the users
totalNumberOfWords = 0.0
totalRetweets = 0.0
totalFavorites = 0.0
allHashtagList = []
allCitedWebsite = [] # All the websites tweets have ever cited
userCited = dict() # A dictionary which shows which user cited which website
# loop over data
for tweet in allTweets:
    user_name = tweet['username']
    allUsersList.append(user_name)
    citedByThisUser = userCited.get(user_name, [])  # get the websites cited by this user
    citedByThisUser += tweet['citations_domain_names']  # add citations of this tweet
    userCited[user_name] = citedByThisUser
    allHashtagList += tweet['hashtags']
    allCitedWebsite += tweet['citations_domain_names']
    totalNumberOfWords += len(tweet['text'].split())  # add number of words used in this tweet
    totalRetweets += tweet['retweets']
    totalFavorites += tweet['favorites']
# Get averages
averageLength = totalNumberOfWords / len(allTweets)
averageRetweets = totalRetweets / len(allTweets)
averageFavorites = totalFavorites / len(allTweets)
# print results
print 'There are {} different users.'.format(len(set(allUsersList)))
print 'A total of {} different hashtags are used.'.format(len(set(allHashtagList)))
print 'There are {} citations in total'.format(len(allCitedWebsite))
print 'There are {} different websites cited by users'.format(len(set(allCitedWebsite)))
print 'The average length of a tweet is {} words.'.format(round(averageLength,2))
print 'The average number of retweets: {}'.format(round(averageRetweets,2))
print 'The average number of favorites: {}'.format(round(averageFavorites,2))
There are 90721 different users.
A total of 22141 different hashtags are used.
There are 37077 citations in total
There are 4730 different websites cited by users
The average length of a tweet is 16.49 words.
The average number of retweets: 4.14
The average number of favorites: 7.71
Let's see who are the top 5 users who tweeted the most using the hashtag '#LasVegasShooting'.
for user, nOfTweets in sorted(Counter(allUsersList).iteritems(), key=lambda (user, n): n, reverse=True)[:5]:
    print user, nOfTweets
nooneishere51 725
reviewjournal 599
nativekittens 593
TrishaDishes 338
ConsciousOptix 275
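As an aside, the `sorted(Counter(...).iteritems(), ...)` idiom above can be written more compactly with `Counter.most_common`, which returns (item, count) pairs in descending order of count. A self-contained toy example (the user list is made up; the code runs under both Python 2 and 3):

```python
from collections import Counter

# toy list standing in for allUsersList
users = ['alice', 'bob', 'alice', 'carol', 'alice', 'bob']
# most_common(k) returns the k most frequent items with their counts
for user, n_tweets in Counter(users).most_common(2):
    print(user, n_tweets)
# alice 3
# bob 2
```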
We are already getting some insights about the event. If you search for the usernames, you can see that reviewjournal is a local newspaper published in Las Vegas, while 'nooneishere51' and 'ConsciousOptix' appear to be supporters of President Donald Trump.
Let's see which hashtags were the most common.
for hashTag, nOfTweets in sorted(Counter(allHashtagList).iteritems(), key=lambda (hashTag, n): n, reverse=True)[:5]:
    print hashTag, nOfTweets
#lasvegasshooting 169907
#lasvegas 18005
#guncontrolnow 7057
#guncontrol 6871
#mandalaybay 5389
The most frequent hashtags also give some insight into the event. The shooting took place at the Mandalay Bay Resort Hotel, and afterwards there were many reactions concerning gun-control policies. (Note that the most frequent hashtag is the one we queried for, as expected.)
Another interesting thing to look at in our database is the most popular tweets in terms of retweets and favorites.
# create cursors and define number of desired elements
get_top = 10
display_conditions = {"deepmoji":0,"citations_urls":0,"citations_domain_names":0, "id":0,"date":0, "query_criteria":0, "_id":0, "geo":0, "mentions":0, "hashtags":0}
db_by_retweets = tweet_collection.find({}, display_conditions).sort("retweets",pymongo.DESCENDING)[0:get_top]
db_by_favorites = tweet_collection.find({}, display_conditions).sort("favorites",pymongo.DESCENDING)[0:get_top]
# a function that takes a cursor and pretty-prints its documents
def print_result(cursor):
    rows = []
    for t in cursor:
        rows.append(t)
    display(pd.DataFrame(rows))
print 'The most retweeted:'
print_result(db_by_retweets)
print 'The most favorited:'
print_result(db_by_favorites)
The most retweeted:
favorites | permalink | retweets | text | username | |
---|---|---|---|---|---|
0 | 29423 | https://twitter.com/paopao619/status/916809797923569665 | 16048 | My relative's friend posted this. Wypipo so desperate to make the #LasVegasShooting about brown people #LasVegas pic.twitter.com/InEMbcMxEa | paopao619 |
1 | 9195 | https://twitter.com/MikeTokes/status/919325090277244929 | 10862 | BREAKING: Kymberley Suchomel, an eyewitness present at the concert who identified multiple shooters has been found dead. #LasVegasShooting pic.twitter.com/N7IKVBkFkv | MikeTokes |
2 | 14901 | https://twitter.com/CharlesMBlow/status/914807607554183168 | 6513 | Prayer may provide comfort and consolation, but it is POLICY that provides protection and prevention. #LasVegasShooting | CharlesMBlow |
3 | 16475 | https://twitter.com/DineshDSouza/status/915412752910491649 | 6486 | Millionaires typically don't go on mass murder sprees--Paddock had a motive & we need to know what it is #LasVegasShooting | DineshDSouza |
4 | 7126 | https://twitter.com/igorvolsky/status/915361057073451009 | 6480 | News orgs obtained Trump's talking points on guns following #LasVegasShooting . He lifted them from @NRA , so I debunked them. Pls share pic.twitter.com/Gq7JXEMlTV | igorvolsky |
5 | 18143 | https://twitter.com/FoxNews/status/916119532778921984 | 6007 | #LasVegasShooting victim on meeting @POTUS : I don't care what anybody has to say to me - he cared. pic.twitter.com/YCaHhPHGdk | FoxNews |
6 | 25434 | https://twitter.com/DonaldJTrumpJr/status/914819406445862912 | 5147 | Our prayers and deepest condolences are with all those affected by the evil perpetrated in #lasvegas #lasvegasshooting | DonaldJTrumpJr |
7 | 12163 | https://twitter.com/Israel/status/914946705933705216 | 4572 | The city Hall of Tel Aviv displays the American flag tonight, as we stand in solidarity w/ the American ppl & #LasVegasShooting victims pic.twitter.com/uoFEVw1Ngo | Israel |
8 | 19771 | https://twitter.com/MatPatGT/status/914980392905478144 | 3893 | Dear Media, Stop glorifying criminals by broadcasting their names and faces. Call the scum what they really are: COWARDS. #LasVegasShooting | MatPatGT |
9 | 2081 | https://twitter.com/kwilli1046/status/914923080195166208 | 3834 | Would Stronger Gun Control Laws Have Prevented the #LasVegasShooting ? | kwilli1046 |
The most favorited:
favorites | permalink | retweets | text | username | |
---|---|---|---|---|---|
0 | 29423 | https://twitter.com/paopao619/status/916809797923569665 | 16048 | My relative's friend posted this. Wypipo so desperate to make the #LasVegasShooting about brown people #LasVegas pic.twitter.com/InEMbcMxEa | paopao619 |
1 | 25434 | https://twitter.com/DonaldJTrumpJr/status/914819406445862912 | 5147 | Our prayers and deepest condolences are with all those affected by the evil perpetrated in #lasvegas #lasvegasshooting | DonaldJTrumpJr |
2 | 19771 | https://twitter.com/MatPatGT/status/914980392905478144 | 3893 | Dear Media, Stop glorifying criminals by broadcasting their names and faces. Call the scum what they really are: COWARDS. #LasVegasShooting | MatPatGT |
3 | 18143 | https://twitter.com/FoxNews/status/916119532778921984 | 6007 | #LasVegasShooting victim on meeting @POTUS : I don't care what anybody has to say to me - he cared. pic.twitter.com/YCaHhPHGdk | FoxNews |
4 | 16475 | https://twitter.com/DineshDSouza/status/915412752910491649 | 6486 | Millionaires typically don't go on mass murder sprees--Paddock had a motive & we need to know what it is #LasVegasShooting | DineshDSouza |
5 | 14901 | https://twitter.com/CharlesMBlow/status/914807607554183168 | 6513 | Prayer may provide comfort and consolation, but it is POLICY that provides protection and prevention. #LasVegasShooting | CharlesMBlow |
6 | 12163 | https://twitter.com/Israel/status/914946705933705216 | 4572 | The city Hall of Tel Aviv displays the American flag tonight, as we stand in solidarity w/ the American ppl & #LasVegasShooting victims pic.twitter.com/uoFEVw1Ngo | Israel |
7 | 12114 | https://twitter.com/Franklin_Graham/status/914783253722263552 | 2959 | Pray for the families of those killed and for the 100+ wounded in a shooting rampage last night in #LasVegas . #LasVegasShooting | Franklin_Graham |
8 | 9991 | https://twitter.com/Franklin_Graham/status/914952619969454080 | 2474 | “May God give us the grace of healing &...provide the grieving families w/strength to carry on.” - @POTUS Donald J. Trump #LasVegasShooting | Franklin_Graham |
9 | 9940 | https://twitter.com/FoxNews/status/914928212974510080 | 3168 | Hundreds of people lined up to donate blood at a #LasVegas blood bank #lasvegasshooting pic.twitter.com/QNW6LW94Hk | FoxNews |
As you can see, among the users with the most popular tweets there are famous politicians and political activists such as Donald Trump Jr., as well as major organizations such as Fox News. In general, the top tweets tend not to cite any news source but to carry emotional content instead.
People spread news using references: they comment on an event based on an article and reference that article to spread the word. So it is quite interesting to look at the sources of information. Let's check the most commonly cited websites:
print 'There are {} different websites where people get information.'.format(len(set(allCitedWebsite)))
There are 4730 different websites where people get information.
for webSite, nOfCitation in sorted(Counter(allCitedWebsite).iteritems(), key=lambda (w, n): n, reverse=True)[:15]:
    print webSite, nOfCitation
www.youtube.com 8335
www.reviewjournal.com 897
www.facebook.com 860
www.nytimes.com 610
www.pscp.tv 596
www.instagram.com 535
www.gofundme.com 497
www.foxnews.com 448
www.washingtonpost.com 441
www.theguardian.com 308
www.intellihub.com 295
www.dailymail.co.uk 255
abcnews.go.com 247
www.cnn.com 239
www.infowars.com 225
Here we can see some mainstream media such as The New York Times, Fox News, CNN, and The Washington Post. However, people also cite alternative media such as InfoWars and IntelliHub.
It is also very interesting to look at the tweets in our database from a chronological point of view, so let's see their distribution over time.
# get tweet dates and truncate timestamps to the hour
# (note: no set() here -- deduplicating timestamps would undercount tweets that share a timestamp)
dates = [tweet['date'] for tweet in allTweets]
no_seconds = [date.replace(minute=0, second=0, microsecond=0) for date in dates]
# count occurences
counter = dict(Counter(no_seconds))
# prepare plot
x = []
y = []
for element in counter:
    x.append(element)
    y.append(counter[element])
# plot nicely
plt.title('Number of tweets per date')
plt.ylabel('Number of tweets')
plt.xlabel('Date')
plt.scatter(x, y, c=y, marker='.', s=y)
plt.xlim([min(x), max(x)])
plt.ylim([min(y),max(y)])
plt.grid()
plt.show()
An interesting observation is that the closer we are to the event, the more tweets there are (the decay actually looks exponential). Of course, in the first couple of hours few people knew about the event, so there are fewer tweets. If we focus on the first two days we get the following plot.
#Get first 2 days
firstTwoDays = sorted(x)[:48]
nOfTweetsInFirstTwoDays = [counter[hour] for hour in firstTwoDays]
# plot nicely
plt.title('Number of tweets per date in first Two Days ')
plt.ylabel('Number of tweets')
plt.xlabel('Date')
plt.scatter(firstTwoDays, nOfTweetsInFirstTwoDays, c=nOfTweetsInFirstTwoDays, marker='.', s=nOfTweetsInFirstTwoDays)
plt.xlim([min(firstTwoDays), max(firstTwoDays)])
plt.grid()
plt.show()
We have very rich and interesting data to analyze. There are different networks hidden in our data.
From the tweets we collected, we are going to generate a number of different networks that will be used throughout the analysis.
Network 1:
For the first network, the nodes are the users who tweeted under the hashtag '#LasVegasShooting'. The edges are constructed through mentions in these tweets: when a tweet mentions another user that is also a node in the network, there is an edge between the two. We will refer to this network as mention_graph.
Network 2:
For the second network, the nodes are still the users. We define an edge between two nodes if they share a common hashtag, not counting the query hashtags. For example, if two tweets from different nodes use the hashtag #GunSense, we create an edge between them. We will refer to this network as hashtag_graph.
Network 3: Finally, for the third network, the nodes are sources of information, i.e. the websites users reference. We define an edge between two nodes if the same user shared an article from both websites. For example, if the user 'DonaldTrumpJr' shared articles from both 'Fox News' and 'CNN', there will be an edge between these nodes.
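The co-citation rule behind network 3 can be sketched without any graph library: deduplicate each user's cited domains and connect every pair. This is a toy illustration with a made-up stand-in for the userCited dictionary built earlier, not the project's actual construction code; it runs under both Python 2 and 3.

```python
from itertools import combinations

# hypothetical stand-in for the userCited dict built earlier
user_cited_example = {
    'userA': ['foxnews.com', 'cnn.com'],
    'userB': ['cnn.com', 'infowars.com', 'cnn.com'],  # duplicates count once
}
edges = set()
for user, sites in user_cited_example.items():
    # every pair of distinct sites cited by the same user becomes an edge
    for a, b in combinations(sorted(set(sites)), 2):
        edges.add((a, b))
print(sorted(edges))
# [('cnn.com', 'foxnews.com'), ('cnn.com', 'infowars.com')]
```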
Below we will start creating the networks.
Let's create the mention graph first.
import networkx as nx
from collections import defaultdict
from itertools import combinations
# start by finding all unique usernames in the tweets
usernames = list(set(allUsersList))
# create a graph for mention relations
mention_graph = nx.Graph()
# add nodes from users that wrote tweets
mention_graph.add_nodes_from(usernames)
print 'Number of nodes in mention_graph', len(mention_graph.nodes())
# add edges to mention_graph between mentions in tweets
# get all tweets with their mentions
tweet_mentions = list(tweet_collection.find({'mentions': {'$ne' : [],}}, {'username': 1, 'mentions': 1}))
# define a default dictionary to store the unique mentions used per user as a set
mentions_dict = defaultdict(set)
# populate dict {username: set(mentions)}
for tweet in tweet_mentions:
    # strip the leading @ from mentions (@DonaldTrumpJr -> DonaldTrumpJr)
    mentions = map(lambda mention: mention[1:], tweet['mentions'])
    # update dict
    mentions_dict[tweet['username']].update(mentions)
# add edges from mentions_dict
for user, mentions in mentions_dict.iteritems():
    for to_node in mentions:
        if mention_graph.has_node(to_node):
            mention_graph.add_edge(user, to_node)
print 'Number of edges in mention_graph', len(mention_graph.edges())
# get degree distributions
mention_degree = dict(mention_graph.degree())
Number of nodes in mention_graph 90721
Number of edges in mention_graph 6209
As you can see, this graph has many more nodes than edges. This is actually expected, because people on Twitter generally only mention other users who are directly related to the event.
To further analyze this graph, let's look at some basic statistics, starting with the degree distribution.
plt.style.use('fivethirtyeight')
# get minimum and maximum degrees
min_mention_degree, max_mention_degree = min(mention_degree.values()), max(mention_degree.values())
# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
d = sorted(mention_degree.values(),reverse=True)
r = range(min_mention_degree, max_mention_degree + 1)
_ = plt.hist(d,r) # degree sequence
Note that the histogram above uses a logarithmic scale. Let's also see the distribution on a log-log scale.
c = Counter(d)
frequency = c.values()
degrees_values = c.keys()
plt.loglog(degrees_values, frequency, 'ro')
plt.xlabel('k')
plt.ylabel('count')
plt.title('LogLog Plot of Degree Distribution for Mention Graph')
plt.show()
From this plot, the degree distribution appears to follow a power law. Note that there are a lot of nodes without any connection, so it is wiser to look only at the giant connected component (GCC).
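One rough way to quantify the power-law impression is to fit a line to log(count) against log(degree); its slope estimates the exponent. This is only a crude check (dedicated tools such as the `powerlaw` package do this more rigorously), shown here on synthetic counts that roughly follow k^-2 rather than on our real data:

```python
import math

# synthetic degree -> count pairs, roughly proportional to k**-2
counts = {1: 1000, 2: 250, 4: 62, 8: 16}
xs = [math.log(k) for k in counts]
ys = [math.log(c) for c in counts.values()]
n = float(len(xs))
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# least-squares slope of log(count) vs log(degree)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
print(round(slope, 2))  # close to -2, as expected for k**-2 data
```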
Let's get all the connected components; the biggest one will be the GCC.
components_mention = sorted(nx.connected_component_subgraphs(mention_graph), key=len, reverse=True)
print 'The mention graph has {0} disconnected components'.format(len(components_mention))
The mention graph has 85876 disconnected components
A lot of subgraphs! Let's try to understand their sizes.
plt.figure()
plt.subplot(211)
mention_component_lengths = [len(c) for c in components_mention]
plt.yscale('log', nonposy='clip')
plt.title('Mention graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
max_mcl = max(mention_component_lengths)
_ = plt.hist(mention_component_lengths, range(max_mcl + 1))
So apparently most of the components are pretty small. Let's look at the sizes of the five biggest components.
mention_component_lengths[:5]
[3832, 20, 11, 11, 10]
Here we can see that the GCC is big enough to give us good insight.
Since the full graph is so disconnected, we decided to work only with the GCC. This allows us to perform a more in-depth analysis.
# get the giant connected component of the mention graph
mention_gcc = components_mention[0]
mention_degree_gcc = dict(nx.degree(mention_gcc))
# number of nodes and edges
print 'The GCC of the mention graph has {nodes} nodes and {edges} edges.'.format(nodes=len(mention_gcc.nodes()), edges=len(mention_gcc.edges()))
print ' - Average degree:', float(sum(mention_degree_gcc.values())) / len(mention_gcc.nodes())
# draw the graphs
nx.draw_networkx(mention_gcc, pos=nx.spring_layout(mention_gcc), node_size=[mention_degree_gcc[v] * 100 for v in mention_gcc.nodes()], with_labels=False)
plt.title('Mention GCC')
plt.show()
The GCC of the mention graph has 3832 nodes and 5078 edges.
 - Average degree: 2.65970772443
Here the size of each node depends on its degree. Clearly there are some nodes with a lot of connections; let's see who they are.
mention_degree_gcc = dict(nx.degree(mention_gcc))
usersWithMostDegree = sorted(mention_degree_gcc.items(), key = lambda x:x[1], reverse = True)[:5]
print usersWithMostDegree
[(u'CNN', 649), (u'FoxNews', 538), (u'LauraLoomer', 516), (u'TuckerCarlson', 195), (u'RealAlexJones', 160)]
Apparently people like to mention the media in their tweets. We can also see that right-wing political commentators (Alex Jones, Tucker Carlson) were mentioned frequently by users. This might be because the event touches on gun-control legislation in the U.S.
Since we are now only looking at the GCC, let's plot the degree distribution again. This time there are no nodes without edges.
# get the maximum degree
max_mention_gcc_degree = max(mention_degree_gcc.values())
# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention GCC degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
_=plt.hist(sorted(mention_degree_gcc.values(),reverse=True), range(max_mention_gcc_degree + 1)) # degree sequence
#So let's also print the node in the gcc with the minimum degree to see if it has a link
print 'The user with the lowest degree:', min(mention_degree_gcc.items(), key= lambda x: x[1])
The user with the lowest degree: (u'SonoranRed', 1)
As you can see, in the GCC all nodes have degree at least one and the distribution looks nicer.
To analyze our graph further, let's look at its communities to understand whether some groups of users especially mention each other.
import community
# use the python Louvain implementation to find communities in the networks
partition_mention = community.best_partition(mention_gcc)
#drawing
mention_size = float(len(set(partition_mention.values())))
pos = nx.spring_layout(mention_gcc)
count = 0.
for com in set(partition_mention.values()):
    count = count + 1.
    list_nodes = [nodes for nodes in partition_mention.keys()
                  if partition_mention[nodes] == com]
    nx.draw_networkx_nodes(mention_gcc, pos, list_nodes, node_size=20,
                           node_color=str(count / mention_size))
print 'For the mention GCC we have found {} communities'.format(int(mention_size))
nx.draw_networkx_edges(mention_gcc,pos, alpha=0.4)
plt.show()
For the mention GCC we have found 42 communities
Let us dive a little deeper in what these communities are.
First, we will look into the sizes of the communities and the biggest accounts in the communities. This is to get a sense for the kind of accounts we find.
Then, we will look into the most common hashtags used in every community in the mention graph. This is to get a feeling for the topics that live in every community.
Below you can see a table showing each community's size, its most cited source of information, and the biggest account in the community (in terms of degree).
import twitter
# look at accounts in each partition with highest degree
# twitter api credentials for lookup (redacted; substitute your own keys)
CONSUMER_KEY='XXXXXXXXXXXXXXXXXXXXXXXXX'
CONSUMER_SECRET='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
OAUTH_TOKEN='XXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
OAUTH_TOKEN_SECRET='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
# instantiate API object
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api= twitter.Twitter(auth=auth)
# auxiliary function: invert {node: community} into {community: [nodes]}
def inverse_partition(partition):
    components_inv = defaultdict(list)
    for key, value in partition.iteritems():
        components_inv[value].append(key)
    return components_inv
# get top accounts by degree
def partition_top_accounts(partition, degree):
    part_inv = inverse_partition(partition)
    return {part: max(usernames, key=lambda user: degree[user]) for part, usernames in part_inv.iteritems()}
# get data on account
def twitter_account(username):
    return twitter_api.users.lookup(screen_name=username)
# Get the most commonly cited website
def getMostCommonWebsite(partition, userCited):
    inv = inverse_partition(partition)
    out = dict()
    for com in inv:
        l = []
        for user in inv[com]:
            l += list(set(userCited[user]))  # add each of this user's cited sites only ONCE
        c = Counter(l)
        mostFreqSite = max(set(l), key=l.count)
        out[com] = (mostFreqSite, c[mostFreqSite])
    return out
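As a quick sanity check, inverse_partition simply groups node names by community id. A standalone toy run is shown below (the function is re-declared here with `.items()` so the snippet runs under both Python 2 and 3; the usernames are made up):

```python
from collections import defaultdict

def inverse_partition(partition):
    # invert {node: community} into {community: [nodes]}
    components_inv = defaultdict(list)
    for node, com in partition.items():
        components_inv[com].append(node)
    return components_inv

inv = inverse_partition({'alice': 0, 'bob': 1, 'carol': 0})
print(sorted(inv[0]), inv[1])  # ['alice', 'carol'] ['bob']
```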
# display data in dataframe
def pprint_partition_overview(partition, degree, userCited):
    data = []
    columns = ['Community No', 'Most Cited Website in Community', 'Percentage of Users who cited', 'Community Size', 'Screen Name', 'Name', 'Url', 'Location', 'Followers', 'Degree']
    top_accounts = partition_top_accounts(partition, degree)
    top_websites = getMostCommonWebsite(partition, userCited)
    for part_id, account in top_accounts.iteritems():
        user = twitter_account(account)[0]
        url = ''
        try:
            url = user['entities']['url']['urls'][0]['display_url']
        except (KeyError, IndexError):
            pass
        row = {
            'Community No': part_id,
            'Most Cited Website in Community': top_websites[part_id][0],
            'Percentage of Users who cited': float(top_websites[part_id][1]) / len(inverse_partition(partition)[part_id]) * 100,
            'Community Size': len(inverse_partition(partition)[part_id]),
            'Screen Name': account,
            'Name': user['name'],
            'Url': url,
            'Location': user['location'],
            'Followers': user['followers_count'],
            'Degree': degree[account]
        }
        data.append(row)
    data.sort(key=lambda row: row['Percentage of Users who cited'], reverse=True)
    df = pd.DataFrame(data)
    df = df[columns]
    display(df)
print 'Overview of communities'
pprint_partition_overview(partition_mention, mention_degree_gcc,userCited)
Overview of communities
Community No | Most Cited Website in Community | Percentage of Users who cited | Community Size | Screen Name | Name | Url | Location | Followers | Degree | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | www.pscp.tv | 83.333333 | 60 | SmythRadio | Kerry Smyth | smythradio.info | Cincinnati, OH | 15694 | 32 |
1 | 40 | theantimedia.org | 75.000000 | 8 | AntiMedia | Anti-Media | TheAntiMedia.org | 45317 | 5 | |
2 | 32 | www.youtube.com | 50.000000 | 8 | vinarmani | Ⓥin Ⓐrmani | SelfOwnership.me | Las Vegas | 10238 | 4 |
3 | 6 | www.intellihub.com | 40.697674 | 86 | intellihubnews | Intellihub | intellihub.com | 26213 | 59 | |
4 | 3 | www.youtube.com | 37.228261 | 368 | RealAlexJones | Alex Jones | infowars.com | Austin, TX | 754556 | 160 |
5 | 1 | www.youtube.com | 34.024896 | 482 | LauraLoomer | Laura Loomer | lauraloomer.us | New York, USA | 112096 | 516 |
6 | 21 | boston25.com | 33.333333 | 6 | boston25 | Boston 25 News | Boston25News.com | Boston, MA | 298877 | 5 |
7 | 27 | soundcloud.com | 33.333333 | 6 | TrumpCard555 | Deplorable Kathy | Pensilvânia, USA | 6554 | 4 | |
8 | 28 | www.cbc.ca | 33.333333 | 6 | CharlieDaniels | Charlie Daniels | charliedaniels.com | Mt. Juliet, TN | 597081 | 4 |
9 | 39 | www.youtube.com | 33.333333 | 9 | France24_en | FRANCE 24 English | france24.com/en/ | Paris, France | 181614 | 4 |
10 | 38 | wjla.com | 28.571429 | 7 | ABC7News | ABC 7 News - WJLA | wjla.com | Washington, DC | 142811 | 6 |
11 | 7 | www.youtube.com | 24.137931 | 29 | ralphlopez | Ralph Lopez | 82 | 20 | ||
12 | 0 | www.youtube.com | 24.050633 | 79 | Timothytrippin | Timothy Sullivan | Central Fifth Dimension | 1508 | 27 | |
13 | 34 | www.c-span.org | 22.222222 | 9 | cspanwj | Washington Journal | c-span.org/WJ | Washington, DC | 57340 | 9 |
14 | 15 | www.reviewjournal.com | 20.571429 | 175 | reviewjournal | Las Vegas RJ | reviewjournal.com | Las Vegas, NV | 229333 | 113 |
15 | 24 | www.youtube.com | 20.370370 | 54 | OANN | One America News | oann.com | 155062 | 19 | |
16 | 29 | www.newsmaxtv.com | 20.000000 | 5 | RitaCosby | Rita Cosby | RITACOSBY.com | WABC RADIO | 120595 | 3 |
17 | 36 | www.ksl.com | 20.000000 | 10 | KSL5TV | KSL 5 TV | ksl.com/index.php | Salt Lake City, Utah | 62235 | 6 |
18 | 16 | www.youtube.com | 19.672131 | 61 | scrowder | Steven Crowder | louderwithcrowder.com | Ghostlike | 490897 | 20 |
19 | 2 | www.youtube.com | 17.647059 | 187 | TuckerCarlson | Tucker Carlson | Washington, DC | 1356392 | 195 | |
20 | 8 | truthfeednews.com | 17.647059 | 17 | Lrihendry | Lori Hendry | RedWhiteandLori.com | 208607 | 6 | |
21 | 31 | www.fox5atlanta.com | 16.666667 | 6 | FOX5Atlanta | FOX 5 Atlanta | fox5atlanta.com | Atlanta | 557489 | 5 |
22 | 41 | www1.cbn.com | 16.666667 | 6 | CBNNews | CBN News | cbnnews.com | D.C.-Nashville-Jerusalem-VA | 68703 | 5 |
23 | 23 | lasvegassun.com | 16.216216 | 37 | LasVegasSun | Las Vegas Sun | lasvegassun.com | Las Vegas, NV | 224171 | 32 |
24 | 33 | www.youtube.com | 15.384615 | 13 | yesgregyes | Greg Morelli | yesgregyes.com | Chicago, Illinois | 1029 | 12 |
25 | 20 | www.usatoday.com | 15.000000 | 100 | USATODAY | USA TODAY | usatoday.com | USA TODAY HQ, McLean, Va. | 3529342 | 40 |
26 | 35 | www.politico.com | 14.285714 | 7 | juliettekayyem | Juliette Kayyem | juliettekayyem.com | United States | 72576 | 6 |
27 | 10 | abc7.com | 13.043478 | 69 | ABC7 | ABC7 Eyewitness News | abc7.com | 1009448 | 20 | |
28 | 26 | www.news1130.com | 11.764706 | 17 | NEWS1130 | NEWS 1130 | news1130.com | Vancouver | 211788 | 8 |
29 | 19 | www.pscp.tv | 10.606061 | 66 | NewsHour | PBS NewsHour | pbs.org/newshour/ | Arlington, VA | New York, NY | 983640 | 16 |
30 | 14 | thehill.com | 10.344828 | 29 | AliVelshi | Ali Velshi | thevx.com | New York/Philly/The World | 294448 | 8 |
31 | 11 | www.youtube.com | 8.505747 | 435 | FoxNews | Fox News | foxnews.com | U.S.A. | 16584348 | 538 |
32 | 30 | www.youtube.com | 8.333333 | 24 | watchyourReps | Watch Your Reps | United States | 848 | 8 | |
33 | 18 | www.youtube.com | 7.920792 | 101 | AC360 | Anderson Cooper 360° | ac360.com | New York, NY | 1133762 | 44 |
34 | 22 | www.foxnews.com | 7.352941 | 68 | ATFHQ | ATF HQ | atf.gov | Washington, DC | 47997 | 29 |
35 | 25 | globalnews.ca | 7.142857 | 14 | CharlesMBlow | Charles M. Blow | topics.nytimes.com/top/opinion/ed… | Brooklyn | 414445 | 8 |
36 | 37 | www.ajc.com | 6.666667 | 15 | GovAbbott | Gov. Greg Abbott | bit.ly/1RQpY3j | The Texas Capitol, Austin, TX | 151756 | 9 |
37 | 4 | edition.cnn.com | 6.609808 | 469 | CNN | CNN | cnn.com | 38682912 | 649 | |
38 | 13 | www.youtube.com | 5.434783 | 276 | CityOfLasVegas | City of Las Vegas | lasvegasnevada.gov | Las Vegas, Nevada | 223481 | 116 |
39 | 9 | www.youtube.com | 5.405405 | 111 | SenJohnMcCain | John McCain | Phoenix, AZ / Washington, DC | 3029886 | 36 | |
40 | 17 | www.iol.co.za | 5.405405 | 111 | SkyNews | Sky News | news.sky.com | London, UK | 4423583 | 44 |
41 | 12 | www.youtube.com | 5.376344 | 186 | MomsDemand | Moms Demand Action | momsdemandaction.org | USA | 88921 | 87 |
This table is built in the following way:
Each row corresponds to a community found by the Louvain algorithm. For each community you can see several characteristics, such as the partition size, the most popular user (in terms of degree), and the number of followers of this user.
You can also see the most cited domain for each of these communities, including the percentage of users in the community who cited that domain.
By "most cited website" we mean the website cited by the largest number of users: among all the citations of a community, if one person cited a site 50 times, we count it as 1.
Example: In the first row, community number 5, 83% of the 60 users cited the domain 'www.pscp.tv' (Periscope). Furthermore, the most popular user in this community is Kerry Smith from Cincinnati, OH with 15K followers.
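This counting rule can be sketched in isolation (with made-up users and citation lists, not the notebook's data):

```python
from collections import Counter

# hypothetical community: username -> list of cited domains (with repetitions)
community_citations = {
    'alice': ['www.pscp.tv'] * 50,                  # 50 citations still count as 1 user
    'bob':   ['www.pscp.tv', 'www.youtube.com'],
    'carol': ['www.youtube.com'],
}

# deduplicate per user, then count how many users cited each domain
user_counts = Counter(d for cites in community_citations.values() for d in set(cites))
n_users = len(community_citations)

for domain, n in user_counts.most_common():
    print('{} cited by {:.0f}% of users'.format(domain, 100.0 * n / n_users))
```

Both domains come out at 67% here: what matters is the number of distinct users, not the raw citation volume.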
Some interesting things are worth noting:
Some communities relate to a certain source of information (domain) in a very strong way. Notice how in the first two rows, at least 75% of the users cited a single source.
When we look at alternative information sources such as 'intellihub.com' and 'theantimedia.org', we observe that their communities cite these sites at a very high rate (intellihub: 40%, theantimedia: 75%).
You can also see that one of the most common sites is 'youtube.com'. This is expected, since people share video news about the event through this site.
Another interesting observation is that some of the communities in fact correspond to geographical communities. For example, community 23 is all about the Las Vegas area and its local media (the same holds for the Boston community, 21).
A problem here is that several communities have YouTube as their most cited website, which does not tell much about the community. Let's look at the first 3 most cited websites in each community instead.
# Get the three most commonly cited websites per community
def getMostCommonWebsite3(partition, userCited):
    inv = inverse_partition(partition)
    out = dict()
    for com in inv:
        l = []
        for user in inv[com]:
            l += list(set(userCited[user]))  # add each citation only ONCE per user
        c = Counter(l)
        mostFreqSites = sorted(set(l), key=l.count, reverse=True)[:3]
        percentage = dict()
        for i in mostFreqSites:
            percentage[i] = round(float(c[i]) / len(inv[com]) * 100)
        while len(mostFreqSites) < 3:
            mostFreqSites.append('NaN')
            percentage['NaN'] = 0
        out[com] = (mostFreqSites, percentage)
    return out
def showIt(partition):
    columns = ['Community No', 'Community Size', 'M1', 'P1', 'M2', 'P2', 'M3', 'P3']
    inv = inverse_partition(partition)
    data = []
    top3_websites = getMostCommonWebsite3(partition, userCited)
    for part_id in inv:
        mostFreqSites = top3_websites[part_id][0]
        percentage = top3_websites[part_id][1]
        row = {
            'Community No': part_id,
            'Community Size': len(inv[part_id]),
            'M1': mostFreqSites[0],
            'P1': percentage[mostFreqSites[0]],
            'M2': mostFreqSites[1],
            'P2': percentage[mostFreqSites[1]],
            'M3': mostFreqSites[2],
            'P3': percentage[mostFreqSites[2]]
        }
        data.append(row)
    data.sort(key=lambda row: row['P1'], reverse=True)
    df = pd.DataFrame(data)
    df = df[columns]
    display(df)

showIt(partition_mention)
Community No | Community Size | M1 | P1 | M2 | P2 | M3 | P3 | |
---|---|---|---|---|---|---|---|---|
0 | 5 | 60 | www.pscp.tv | 83.0 | www.youtube.com | 15.0 | yournewswire.com | 3.0 |
1 | 40 | 8 | theantimedia.org | 75.0 | www.youtube.com | 50.0 | www.presstv.com | 25.0 |
2 | 32 | 8 | www.youtube.com | 50.0 | www.npr.org | 13.0 | www.lasvegasadvisor.com | 13.0 |
3 | 6 | 86 | www.intellihub.com | 41.0 | www.youtube.com | 34.0 | yournewswire.com | 16.0 |
4 | 3 | 368 | www.youtube.com | 37.0 | www.infowars.com | 5.0 | www.intellihub.com | 4.0 |
5 | 1 | 482 | www.youtube.com | 34.0 | www.pscp.tv | 12.0 | www.zerohedge.com | 2.0 |
6 | 21 | 6 | boston25.com | 33.0 | www.latimes.com | 17.0 | edition.cnn.com | 17.0 |
7 | 27 | 6 | soundcloud.com | 33.0 | www.thecommonsenseshow.com | 17.0 | gsiexchange.com | 17.0 |
8 | 28 | 6 | www.cbc.ca | 33.0 | www.charliedaniels.com | 17.0 | NaN | 0.0 |
9 | 39 | 9 | www.youtube.com | 33.0 | www.thedoctorstv.com | 11.0 | www.c-span.org | 11.0 |
10 | 38 | 7 | wjla.com | 29.0 | NaN | 0.0 | NaN | 0.0 |
11 | 0 | 79 | www.youtube.com | 24.0 | www.facebook.com | 8.0 | www.bitchute.com | 6.0 |
12 | 7 | 29 | www.youtube.com | 24.0 | www.reviewjournal.com | 7.0 | truthfeednews.com | 7.0 |
13 | 34 | 9 | www.c-span.org | 22.0 | dailycaller.com | 11.0 | www.gofundme.com | 11.0 |
14 | 15 | 175 | www.reviewjournal.com | 21.0 | lvrj.com | 18.0 | www.youtube.com | 7.0 |
15 | 16 | 61 | www.youtube.com | 20.0 | www.infowars.com | 3.0 | www.facebook.com | 3.0 |
16 | 24 | 54 | www.youtube.com | 20.0 | legalinsurrection.com | 11.0 | www.nationalreview.com | 6.0 |
17 | 29 | 5 | www.newsmaxtv.com | 20.0 | nypost.com | 20.0 | NaN | 0.0 |
18 | 36 | 10 | www.ksl.com | 20.0 | www.intellihub.com | 10.0 | www.facebook.com | 10.0 |
19 | 2 | 187 | www.youtube.com | 18.0 | www.foxnews.com | 5.0 | www.reviewjournal.com | 3.0 |
20 | 8 | 17 | truthfeednews.com | 18.0 | www.youtube.com | 18.0 | freedomdaily.com | 12.0 |
21 | 31 | 6 | www.fox5atlanta.com | 17.0 | NaN | 0.0 | NaN | 0.0 |
22 | 41 | 6 | www1.cbn.com | 17.0 | campaign.r20.constantcontact.com | 17.0 | www.youtube.com | 17.0 |
23 | 23 | 37 | lasvegassun.com | 16.0 | www.youtube.com | 11.0 | www.facebook.com | 11.0 |
24 | 20 | 100 | www.usatoday.com | 15.0 | www.youtube.com | 5.0 | www.reviewjournal.com | 4.0 |
25 | 33 | 13 | www.youtube.com | 15.0 | dennismichaellynch.com | 8.0 | www.facebook.com | 8.0 |
26 | 35 | 7 | www.politico.com | 14.0 | news.wgbh.org | 14.0 | itunes.apple.com | 14.0 |
27 | 10 | 69 | abc7.com | 13.0 | losangeles.cbslocal.com | 7.0 | www.gofundme.com | 6.0 |
28 | 26 | 17 | www.news1130.com | 12.0 | worldview.stratfor.com | 6.0 | www.kvue.com | 6.0 |
29 | 19 | 66 | www.pscp.tv | 11.0 | www.facebook.com | 9.0 | www.youtube.com | 6.0 |
30 | 14 | 29 | thehill.com | 10.0 | www.facebook.com | 10.0 | www.newsweek.com | 7.0 |
31 | 11 | 435 | www.youtube.com | 9.0 | www.pscp.tv | 4.0 | www.foxnews.com | 3.0 |
32 | 18 | 101 | www.youtube.com | 8.0 | variety.com | 4.0 | www.snappytv.com | 3.0 |
33 | 30 | 24 | www.youtube.com | 8.0 | www.tennessean.com | 8.0 | www.jacksonsun.com | 4.0 |
34 | 4 | 469 | edition.cnn.com | 7.0 | www.youtube.com | 4.0 | www.nbcnews.com | 1.0 |
35 | 22 | 68 | www.foxnews.com | 7.0 | www.facebook.com | 6.0 | www.youtube.com | 4.0 |
36 | 25 | 14 | globalnews.ca | 7.0 | international.la-croix.com | 7.0 | NaN | 0.0 |
37 | 37 | 15 | www.ajc.com | 7.0 | therivardreport.com | 7.0 | www.salon.com | 7.0 |
38 | 9 | 111 | www.youtube.com | 5.0 | www.nytimes.com | 4.0 | www.washingtonpost.com | 4.0 |
39 | 12 | 186 | www.youtube.com | 5.0 | www.facebook.com | 5.0 | www.nytimes.com | 4.0 |
40 | 13 | 276 | www.youtube.com | 5.0 | www.facebook.com | 5.0 | www.pscp.tv | 3.0 |
41 | 17 | 111 | www.iol.co.za | 5.0 | www.facebook.com | 3.0 | www.youtube.com | 3.0 |
Here the percentages are the percentage of users who cited the website at least once. So for community 40 (second line), 75.0% of the users cited theantimedia.org and 50% cited youtube.com.
Considering both tables above, we see some interesting results:
Especially when people spread rumors, they tend to share videos. For example, for this event people generally claim that there were multiple shooters, and they share videos related to this claim. We can see this in lines 1 and 3 (communities 40 and 6): most of the users who shared news from alternative media are sharing videos from YouTube.
There might be a relationship between 'www.intellihub.com' and 'yournewswire.com'. (Indeed, we saw that both are alternative media.)
In line fourteen we can see that there is a relation between the sites www.reviewjournal.com and lvrj.com. This is actually true, since both are local media of Las Vegas.
An interesting way to analyze communities is to find which hashtags they tend to use. For example, if a community tends to use the '#guncontrol' hashtag, we can infer that there is a relation between this community and the gun control debate.
import matplotlib.style
import matplotlib as mpl
mpl.style.use('classic')
def partition_hashtag_analysis(partition):
    # invert the partitioning to get a dict {partition_id: [usernames]}
    components_inv = inverse_partition(partition)
    # get all hashtags used by users alongside our query hashtags
    components_hashtags = defaultdict(list)
    for part_id, usernames in components_inv.iteritems():
        tweets = tweet_collection.find({
            'username': {'$in': usernames},
            'hashtags': {
                '$ne': [],
                '$nin': map(lambda s: '#' + s, query_hashtags)
            },
        }, {'hashtags': 1})
        # filter out the query hashtags
        for row in tweets:
            tags = [tag.replace('#', '') for tag in row['hashtags'] if tag not in query_hashtags]
            components_hashtags[part_id] += tags
    part_tag_counts = {}
    for part_id, tags in components_hashtags.iteritems():
        counts = Counter(tags)
        part_tag_counts[part_id] = counts
    return part_tag_counts

mention_com_hashtags = partition_hashtag_analysis(partition_mention)
# Heatmap
# get the most common hashtags in general
number_of_tags = 20
hashtags_count = Counter([tag for counts in mention_com_hashtags.itervalues() for tag in counts])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))
# create a matrix of the usage counts of the most common hashtags in the communities
heat_array = np.array([[counts[tag] for tag in most_common_tags] for counts in mention_com_hashtags.values()])
# plot the heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(number_of_tags), most_common_tags, rotation='vertical')
plt.yticks(range(heat_array.shape[0]), mention_com_hashtags.keys())
plt.colorbar()
plt.clim(100, 300)
plt.show()
The heatmap above displays how often the most common hashtags appear in each community. The brighter the color, the more prevalent that hashtag is in the tweets from that community, and by extension from its users. This gives us an idea about the opinions of these users. The data is a little sparse for most communities, since they are small and there are not that many hashtags, but some interesting patterns emerge in the ones that do have data.
The most interesting observation is the strong correlation between community 1 and community 3. These are in fact the alternative media communities (the AntiMedia and IntelliHub communities).
Only communities 0, 1, and 3 use #falseflag. This suggests that alternative media tend to spread the claim that the event was staged.
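A claim like this can be verified directly from the per-community hashtag counters. A minimal sketch, assuming a dict shaped like the output of partition_hashtag_analysis (the counts below are made up):

```python
from collections import Counter

# hypothetical {community_id: Counter of hashtags} in the shape of mention_com_hashtags
toy_part_tag_counts = {
    0: Counter({'falseflag': 12, 'guncontrol': 3}),
    1: Counter({'falseflag': 40, 'wakeupamerica': 7}),
    2: Counter({'prayforvegas': 25}),
    3: Counter({'falseflag': 9}),
}

def communities_using(tag, part_tag_counts):
    """Ids of the communities whose users used `tag` at least once."""
    return sorted(com for com, counts in part_tag_counts.items() if counts[tag] > 0)

print(communities_using('falseflag', toy_part_tag_counts))  # -> [0, 1, 3]
```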
Another interesting graph is the hashtag graph. For this second network, the nodes are still the users. There is an edge between two nodes if both used the same hashtag more than a threshold number of times. Here we chose this threshold to be 10: if two users each have at least 10 tweets under the same hashtag, there is an edge between them.
Let's create the graph.
usernames = tweet_collection.distinct('username')
# create two separate graphs for mention relations and hashtags, one simple, one multi
hashtag_graph = nx.MultiGraph()
# add nodes from users that wrote tweets
hashtag_graph.add_nodes_from(usernames)
print 'Number of nodes in hashtag_graph', len(hashtag_graph.nodes())
# add edges to the hashtag_graph
# get all tweets with hashtags
tweet_hashtags = tweet_collection.find({'entities.hashtags': {'$ne': []}}, {'username': 1, 'hashtags': 1})
Number of nodes in hashtag_graph 90721
Let's build a collection showing which user used which hashtag (of course eliminating the hashtags we queried to get the data).
# initialize a defaultdict to track the unique hashtags
# and how often users are using them
hashtags_dict = defaultdict(lambda: defaultdict(int))
# populate the dict {hashtag: {username: count}}
for tweet in tweet_hashtags:
    username = tweet['username']
    # list of hashtags
    hashtags = map(lambda tag: tag.replace('#', '').lower(), tweet['hashtags'])
    # remove the query_hashtags
    new_tags = list(set(hashtags) - set(query_hashtags))
    if len(new_tags) > 0:
        for tag in new_tags:
            if tag:
                hashtags_dict[tag][username] += 1
Get the edges: if two users each have at least 10 tweets under the same hashtag, there is a link between them.
# add edges between all users who used the same hashtag at least 10 times
for tag, userdict in hashtags_dict.iteritems():
    # find users who used the tag at least 10 times
    usernames = [username for username, count in userdict.iteritems() if count >= 10]
    # create tuples of possible combinations of nodes
    sets = combinations(usernames, 2)
    # add edges
    for combi in sets:
        hashtag_graph.add_edge(*combi, attr=tag)
print 'Number of edges in hashtag_graph', len(hashtag_graph.edges())
Number of edges in hashtag_graph 1453345
Like we did before, let's look at some statistics about this graph.
plt.style.use('fivethirtyeight')
# get degree distributions
hashtag_degree = dict(nx.degree(hashtag_graph))
# get minimum and maximum degrees
min_hashtag_degree, max_hashtag_degree = min(hashtag_degree.values()), max(hashtag_degree.values())
# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.title('Hashtag graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
_= plt.hist(sorted(hashtag_degree.values(),reverse=True), range(min_hashtag_degree, max_hashtag_degree + 1)) # degree sequence
This is a rather interesting distribution. Apparently nodes either have no edges at all or they have a lot. We can thus infer that there are around 2000 accounts that tweet heavily on similar topics. Let's print the nodes with the maximum and minimum degrees.
print sorted(hashtag_degree.items(),key = lambda x: x[1], reverse = True)[0:5]
print sorted(hashtag_degree.items(),key = lambda x: x[1], reverse = False)[0:5]
[(u'tiniskwerl', 1967), (u'PatJohnson_9', 1960), (u'PatJohnson_8', 1960), (u'PatJohnson_3', 1960), (u'PatJohnson_2', 1960)] [(u'mamajazzyy', 0), (u'Fatal_Romantic', 0), (u'bayoucityy', 0), (u'Protest_Works', 0), (u'_Jlach', 0)]
So as seen here, most of the nodes do not have any edges; here too we should work on the GCC.
# get all the separate components
components_hashtag = sorted(nx.connected_component_subgraphs(hashtag_graph), key=len, reverse=True)
# components_hashtag = pickle.load(open('comhtagponenetsHas','rb'))
print 'The hashtag graph has {0} disconnected components'.format(len(components_hashtag))
The hashtag graph has 89025 disconnected components
Let's see the biggest subgraphs.
hashtag_component_lengths = [len(c) for c in components_hashtag]
hashtag_component_lengths = sorted(hashtag_component_lengths, reverse = True)[:5]
print hashtag_component_lengths
[1697, 1, 1, 1, 1]
So apparently there is a giant component with 1697 nodes, while the remaining nodes are mostly isolated. It is best to work on this component.
In contrast to the mention graph, this graph has far too many edges, so drawing it is not very informative (we would only see a black blob). We therefore go directly to community detection.
# get the giant connected component
hashtag_gcc = components_hashtag[0]
hashtag_degree_gcc = nx.degree(hashtag_gcc)
# number of nodes and edges
print 'The GCC of the hashtag graph has {nodes} nodes and {edges} edges'.format(nodes=len(hashtag_gcc.nodes()), edges=len(hashtag_gcc.edges()))
The GCC of the hashtag graph has 1697 nodes and 1453345 edges
partition_hashtag = community.best_partition(hashtag_gcc)
So let's perform an analysis similar to the one we did for the mention graph.
print 'The hashtag graph partitons with an overview of the accounts with the highest degrees'
pprint_partition_overview(partition_hashtag, hashtag_degree_gcc,userCited)
The hashtag graph partitons with an overview of the accounts with the highest degrees
Community No | Most Cited Website in Community | Percentage of Users who cited | Community Size | Screen Name | Name | Url | Location | Followers | Degree | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | www.youtube.com | 38.461538 | 221 | tiniskwerl | tiniskwerl | Northern California U.S.A. | 1446 | 1967 | |
1 | 0 | www.youtube.com | 29.146426 | 1441 | GrandeFormaggio | GrandeFormaggio | 908 | 1725 | ||
2 | 2 | www.youtube.com | 22.857143 | 35 | thomasj17431826 | Thomas J | 3032 | 1790 |
So only 3 communities exist in this graph. To analyze it further, let's look at the 3 most cited web pages in each.
showIt(partition_hashtag)
Community No | Community Size | M1 | P1 | M2 | P2 | M3 | P3 | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 221 | www.youtube.com | 38.0 | abcnews.go.com | 8.0 | www.foxnews.com | 7.0 |
1 | 0 | 1441 | www.youtube.com | 29.0 | www.foxnews.com | 6.0 | www.facebook.com | 4.0 |
2 | 2 | 35 | www.youtube.com | 23.0 | www.washingtonpost.com | 11.0 | www.dailymail.co.uk | 11.0 |
Apparently this graph gives less insight into our data.
The last graph we find interesting shows the relation between sources of information. Here, the nodes are websites, and there is an edge between two nodes if the same user cited both sites. We already collected the data for this part, so let's recall some numbers and build the graph.
Let's create the graph
print 'There are {} citations in a total of {} tweets'.format(len(allCitedWebsite), len(allTweets))
print 'There are {} different sources of information that users cite'.format(len(set(allCitedWebsite)))
There are 37077 citations in a total of 169913 tweets There are 4730 different sources of information that users cite
infoSource_graph = nx.Graph()
# add nodes
infoSource_graph.add_nodes_from(set(allCitedWebsite))
for user in userCited:
    citedByUser = list(set(userCited[user]))
    sets = combinations(citedByUser, 2)
    # add edges
    for combi in sets:
        infoSource_graph.add_edge(*combi)
print 'The information sources graph has {} nodes'.format(len(infoSource_graph.nodes()))
print 'The information sources graph has {} edges'.format(len(infoSource_graph.edges()))
The information sources graph has 4730 nodes The information sources graph has 15234 edges
Let's draw the graph
info_degree = dict(infoSource_graph.degree())
# draw the graphs
nx.draw_networkx(infoSource_graph, pos=nx.spring_layout(infoSource_graph), node_size=[v * 5 for v in info_degree.values()], with_labels=False)
plt.title('Source of Information Graph')
plt.show()
Again, let's check the size of the GCC; if it dominates the graph, we should use only it.
components_info = sorted(nx.connected_component_subgraphs(infoSource_graph), key=len, reverse=True)
info_component_lengths = [len(c) for c in components_info]
info_component_lengths = sorted(info_component_lengths, reverse = True)[:5]
print info_component_lengths
[1602, 4, 3, 3, 3]
We should indeed use the GCC, since it is much bigger than the second-largest component.
info_gcc = components_info[0]
info_gcc_degree = dict(info_gcc.degree())
# draw the graphs
nx.draw_networkx(info_gcc, pos=nx.spring_layout(info_gcc), node_size=[v * 6 for v in info_gcc_degree.values()], with_labels=False)
plt.title('Sources of Information Graph GCC')
plt.show()
Let's look at some sites that are commonly cited together. For this, let's print the nodes with the highest degrees.
data = []
columns = ['Website', 'Degree']
for w, n in sorted(info_gcc_degree.items(), key=lambda x: x[1], reverse=True)[:20]:
    row = {'Website': w, 'Degree': n}
    data.append(row)
df = pd.DataFrame(data)
df = df[columns]
display(df)
Website | Degree | |
---|---|---|
0 | www.youtube.com | 791 |
1 | www.nytimes.com | 351 |
2 | www.foxnews.com | 348 |
3 | www.dailymail.co.uk | 332 |
4 | www.reviewjournal.com | 318 |
5 | abcnews.go.com | 303 |
6 | www.washingtonpost.com | 300 |
7 | www.facebook.com | 299 |
8 | www.latimes.com | 297 |
9 | www.intellihub.com | 257 |
10 | www.thegatewaypundit.com | 252 |
11 | www.zerohedge.com | 238 |
12 | nypost.com | 233 |
13 | yournewswire.com | 233 |
14 | www.cbsnews.com | 226 |
15 | www.cnn.com | 218 |
16 | truepundit.com | 215 |
17 | www.infowars.com | 215 |
18 | www.gofundme.com | 212 |
19 | www.newsweek.com | 211 |
Here there are some interesting insights.
Let's do community analysis here
partition_info = community.best_partition(info_gcc)
inverse_partition_info = inverse_partition(partition_info)
print 'There are {} communities'.format(len(inverse_partition_info))
There are 14 communities
Let's analyze those communities a bit further like we always do.
data = []
columns = ['Partition', 'Partition Size', 'Node with biggest degree', 'Degree', 'FirstFiveBiggestSites']
for part_id, websitesInCom in inverse_partition_info.iteritems():
    degrees = [(w, info_gcc_degree[w]) for w in websitesInCom]
    maxDegreeSite = max(degrees, key=lambda x: x[1])
    firstFive = sorted(degrees, key=lambda x: x[1], reverse=True)[0:4]
    row = {
        'Partition': part_id,
        'Partition Size': len(websitesInCom),
        'Node with biggest degree': maxDegreeSite[0],
        'Degree': maxDegreeSite[1],
        'FirstFiveBiggestSites': firstFive
    }
    data.append(row)
df = pd.DataFrame(data)
df = df[columns]
display(df)
Partition | Partition Size | Node with biggest degree | Degree | FirstFiveBiggestSites | |
---|---|---|---|---|---|
0 | 0 | 146 | www.reviewjournal.com | 318 | [(www.reviewjournal.com, 318), (vimeo.com, 120), (news3lv.com, 108), (m.youtube.com, 95)] |
1 | 1 | 238 | www.youtube.com | 791 | [(www.youtube.com, 791), (soundcloud.com, 52), (www.veteranstoday.com, 24), (www.wnyc.org, 24)] |
2 | 2 | 360 | www.nytimes.com | 351 | [(www.nytimes.com, 351), (www.washingtonpost.com, 300), (www.cnn.com, 218), (www.newsweek.com, 211)] |
3 | 3 | 108 | www.facebook.com | 299 | [(www.facebook.com, 299), (www.msn.com, 22), (squawker.org, 15), (www.worldstarhiphop.com, 11)] |
4 | 4 | 110 | abcnews.go.com | 303 | [(abcnews.go.com, 303), (www.latimes.com, 297), (vid.me, 51), (nordic.businessinsider.com, 47)] |
5 | 5 | 102 | www.dailymail.co.uk | 332 | [(www.dailymail.co.uk, 332), (www.naturalnews.com, 150), (www.reddit.com, 72), (thenewyorknewsday.com, 69)] |
6 | 6 | 182 | www.intellihub.com | 257 | [(www.intellihub.com, 257), (yournewswire.com, 233), (www.cbsnews.com, 226), (truepundit.com, 215)] |
7 | 7 | 168 | www.thegatewaypundit.com | 252 | [(www.thegatewaypundit.com, 252), (www.zerohedge.com, 238), (nypost.com, 233), (www.gofundme.com, 212)] |
8 | 8 | 94 | www.foxnews.com | 348 | [(www.foxnews.com, 348), (imgur.com, 28), (www.ibtimes.co.uk, 26), (chicago.suntimes.com, 22)] |
9 | 9 | 46 | www.businessinsider.com | 82 | [(www.businessinsider.com, 82), (insider.foxnews.com, 65), (www.americanthinker.com, 52), (www.circa.com, 51)] |
10 | 10 | 8 | www.redcross.org | 5 | [(www.redcross.org, 5), (www.redcrossblood.org, 4), (insidefirstaid.com, 3), (cpr.heart.org, 3)] |
11 | 11 | 28 | www.google.com | 84 | [(www.google.com, 84), (www.ctvnews.ca, 17), (pbs.twimg.com, 14), (wjhl.com, 8)] |
12 | 12 | 3 | abc27.com | 3 | [(abc27.com, 3), (s.pennlive.com, 2), (www.pahouse.com, 2)] |
13 | 13 | 9 | elpais.com | 16 | [(elpais.com, 16), (www.letraslibres.com, 4), (www.sexenio.com.mx, 4), (canaln.pe, 4)] |
This table gives clear insight into which websites are similar to each other: similar sites are generally grouped with one another.
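Since community.best_partition returns a {node: community_id} dict, looking up which sites are grouped with a given site is straightforward. A sketch with a hypothetical toy partition (toy names are used so the notebook's partition_info is untouched):

```python
# hypothetical {website: community_id} dict in the shape of partition_info
toy_partition = {
    'www.nytimes.com': 2,
    'www.washingtonpost.com': 2,
    'www.intellihub.com': 6,
    'yournewswire.com': 6,
}

def grouped_with(site, partition):
    """Other sites placed in the same community as `site`."""
    com = partition[site]
    return sorted(s for s, c in partition.items() if c == com and s != site)

print(grouped_with('www.intellihub.com', toy_partition))  # -> ['yournewswire.com']
```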
Finally, we can say that the analysis performed here makes it possible to categorize media, and in particular to detect the alternative media used to spread fake news.
The sentiment analysis part of the report will focus on two different visualization techniques. In the first part, using the sentiment analysis techniques taught in class, the tweet sentiment will be plotted over time.
In the second part of the analysis, the DeepMoji library will be used to visualize the sentiments; this will be a (hopefully) fun way of understanding what is happening over time. It is also interesting to see whether the results of these two analyses differ in any way.
The first function that needs to be built is one that cleans a raw tweet. Tweets contain a lot of elements that, even though interesting, are not in the scope of a sentiment analysis.
Step 1: A function that cleans a tweet.
def clean_this(raw_tweet):
    text = raw_tweet                          # extract text
    text = text.split('http', 1)[0]           # remove links
    text = text.split('pic.', 1)[0]           # remove pictures
    text = text.lower()                       # lower case
    text = re.sub(r'(\s)@\w+', r'\1', text)   # remove mentions
    text = re.sub(r'(\B)#\w+', r'\1', text)   # remove hashtags
    text = nltk.word_tokenize(text)           # tokenize text
    text = [token.lower() for token in text if token.isalpha()]  # remove punctuation and numbers
    text = [word for word in text if word not in stopwords.words('english')]  # remove stopwords
    text = list(set(text))                    # only return unique tokens
    return text
After cleaning a tweet, we will need to calculate its happiness. For that, we will need the word-happiness data file shown during the course, named Data_Set_S1.txt. For now, let's assume this data comes in a happy_data matrix.
Step 2: A function that calculates the happiness of a tweet.
def how_happy(tokens):
    happiness_counter = 0.0
    happy_word_counter = 0
    for word in tokens:
        if word in happy_words:
            happy_word_counter += 1
            happiness_counter += happy_data[np.where(happy_words == word)[0][0], 2]
    if happy_word_counter == 0:
        return happiness_counter
    else:
        return happiness_counter / happy_word_counter
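A standalone sanity check of the same scoring logic, with a toy lexicon (the happiness values below are invented, not taken from Data_Set_S1.txt; toy_ names are used so the notebook's happy_words / happy_data are untouched):

```python
import numpy as np

# toy stand-ins: column 2 of the data matrix holds the average happiness score
toy_happy_words = np.array(['love', 'sad', 'vegas'])
toy_happy_data = np.array([
    ['love', 1, 8.42],
    ['sad', 2, 2.38],
    ['vegas', 3, 6.00],
], dtype=object)

def how_happy_toy(tokens):
    happiness_counter = 0.0
    happy_word_counter = 0
    for word in tokens:
        if word in toy_happy_words:
            happy_word_counter += 1
            happiness_counter += toy_happy_data[np.where(toy_happy_words == word)[0][0], 2]
    return happiness_counter / happy_word_counter if happy_word_counter else 0.0

print(how_happy_toy(['love', 'sad']))  # (8.42 + 2.38) / 2 = 5.4
print(how_happy_toy(['qwerty']))       # no lexicon hit -> 0.0
```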
After this, we need to import all of our libraries, create a connection to our mongoDB
database, extract our text file mentioned in step 2, as well as some other boring stuff.
Step 3: Importing libraries
from pymongo import MongoClient
import pymongo
from nltk.corpus import stopwords
import nltk
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from pprint import pprint
import matplotlib.style as style
import emoji
import seaborn
from collections import Counter
%pylab inline
style.use('fivethirtyeight')
Populating the interactive namespace from numpy and matplotlib
# Getting the sentiment analysis file and putting it on a matrix
path = 'ressources/Data_Set_S1.txt'
header = ['words', 'hap.rank', 'hap.avg', 'hap.std', 'tw.rank', 'goog.rank', 'nyt_rank', 'lyr_rank']
happy_data = pd.read_table(filepath_or_buffer=path, header=2).as_matrix()
happy_words = happy_data[:, 0]
# Boring database stuff, including fields to return.
display_conditions = {"query_criteria": 0, "_id": 0,
"geo": 0, "mentions": 0,
"hashtags": 0, "favorites": 0,
"permalink": 0, "username": 0,
"id": 0}
Before analysing, let's explain some things. How can tweets be plotted over time?
Of course we could plot every tweet, but this would produce a very weird plot that is hard to understand. Our approach was instead to sort our database chronologically and plot the average sentiment of all tweets for every hour.
Finally, it is interesting to see whether the general sentiment is related to the most popular tweets. For this, we establish a threshold (only keep tweets with at least X retweets) and apply the same procedure.
By doing this, the plot becomes both easier to understand and conceptualise.
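The hourly averaging step can be sketched in isolation (with made-up timestamps and scores, not the real tweet stream):

```python
import itertools
from datetime import datetime

# hypothetical (timestamp, sentiment) pairs, already sorted chronologically
scored = [
    (datetime(2017, 10, 2, 8, 40), 5.2),
    (datetime(2017, 10, 2, 8, 55), 5.8),
    (datetime(2017, 10, 2, 9, 10), 6.0),
]

# group by (day, hour) and average the scores inside each bucket
hourly = []
for (day, hour), grp in itertools.groupby(scored, key=lambda t: (t[0].date(), t[0].hour)):
    scores = [s for _, s in grp]
    hourly.append(((day, hour), sum(scores) / len(scores)))

for (day, hour), avg in hourly:
    print('{} {:02d}:00 -> {:.2f}'.format(day, hour, avg))
```

Note that itertools.groupby only groups consecutive items, which is why the chronological sort comes first.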
Step 1: Importing the tweets in a chronological fashion.
db_my_tweets = sorted(list(tweet_collection.find({}, display_conditions)), key = lambda x: x['date'])
Lastly, let's group our tweets by day and hour.
import itertools
tweetsByDay = [list(g) for k, g in itertools.groupby(db_my_tweets, key=lambda t: t['date'].date())]
tweetsByHourByDay = [[list(g) for k, g in itertools.groupby(day, key=lambda t: t['date'].hour)] for day in tweetsByDay]
The output should now be a sorted list of lists, where each inner list contains tweets from the same hour. Let's check this by printing 5 tweets from the first hour of the first day and 5 from the first hour of the second day.
for t in tweetsByHourByDay[0][0][:5]:
    print t['date']
print '--------------------------'
for t in tweetsByHourByDay[1][0][:5]:
    print t['date']
2017-10-02 08:40:22 2017-10-02 08:47:44 2017-10-02 08:51:33 2017-10-02 08:53:16 2017-10-02 08:55:08 -------------------------- 2017-10-03 00:00:01 2017-10-03 00:00:01 2017-10-03 00:00:02 2017-10-03 00:00:02 2017-10-03 00:00:02
It seems to be working (note that the earliest tweet we obtained is from 2017-10-02 08:40:22).
Step 2: Preparing the processing.
sentiment = []  # here the sentiment is stored
pop_sentiment = []  # here the sentiment of popular tweets is stored
periods = []  # here the period is stored; e.g. 11 November at 23:00 is stored as '11/11 at 23'
happinessInHour = 0.0  # total happiness in the hour
happinessInHourPopular = 0.0
retweet_threshold = 100
pop_tweet_counter = 0
text = ''
index = 0
for day in tweetsByHourByDay:
    for hidx, hour in enumerate(day):
        for tweet in hour:
            clean_text = clean_this(tweet['text'])
            happinessInHour += how_happy(clean_text)
            text += ' '.join(word for word in clean_text)
            if tweet['retweets'] >= retweet_threshold:  # if the tweet is 'popular'
                pop_tweet_counter += 1
                happinessInHourPopular += how_happy(clean_text)
            index += 1
            print 'Processing tweet {} / {} \r'.format(index, len(db_my_tweets)),
        averageHourHappiness = happinessInHour / len(hour)
        sentiment.append(averageHourHappiness)
        if pop_tweet_counter > 0:
            averageHourPopularHappiness = happinessInHourPopular / pop_tweet_counter
            pop_sentiment.append(averageHourPopularHappiness)
        else:
            pop_sentiment.append(float('nan'))
        periods.append('{}/{} at {}'.format(tweet['date'].day, tweet['date'].month, tweet['date'].hour))
        # reset values
        happinessInHour = 0.0
        happinessInHourPopular = 0.0
        pop_tweet_counter = 0
print 'We got a total of {} sentiment windows.'.format(len(sentiment))
We got a total of 1338 sentiment windows.
Now that we have the sentiment vector, we can plot the sentiment over time. Note that the periods and some axis labels are omitted; this is deliberate, in order to increase readability.
Step 3: Plotting the sentiment of all of the tweets and the sentiment of the popular tweets.
# allHours = [hour for hour in day for day in tweetsByHourByDay]
x = np.arange(len(sentiment))
style.use('ggplot')
# defining titles and axis names
plt.title('Sentiment Timeline', fontsize=20)
plt.xlabel('Hours after event', fontsize=17)
plt.ylabel('Normalized Sentiment Index', fontsize=17)
# some styling
plt.tick_params(axis='both', which='major', labelsize=12)
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)
# and finally, we plot.
plt.plot(x, sentiment, linewidth=2, label= 'General Sentiment', color='#50514f')
plt.scatter(x, pop_sentiment, linewidth=2, label= 'Popular Tweet Sentiment', color='#f25f5c')
plt.legend(loc=1, prop={'size': 15})
plt.xlim(min(x),max(x))
pylab.rcParams['figure.figsize'] = (30, 10)
plt.show()
A couple of things are worth noting:
In the first hours after the shooting there is a large number of "popular" tweets about the event, which is expected given the topic's visibility. Their number decreases significantly over time.
Surprisingly, the variance of the sentiment of the general tweets increases over time. This may be because the number of tweets per hour is much smaller once the topic is no longer as hot.
Popular tweets have, on average, a higher normalized sentiment index. This can be due to a number of reasons; one of them might be that popular users are usually more "expressive".
Some popular tweets sit exactly at zero. Inspecting them shows that these are tweets that only shared a link or an image.
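The link-or-image-only case can be illustrated with a minimal cleaner in the spirit of clean_this (the notebook's actual cleaner is not shown; this toy version and its regexes are assumptions):

```python
import re

def toy_clean_tweet(text):
    """Toy sketch of a clean_this-style cleaner: drop URLs, pic.twitter
    links, @mentions and #hashtags, and keep only plain lowercase words."""
    text = re.sub(r'https?://\S+|pic\.twitter\.com/\S+', '', text)
    text = re.sub(r'[@#]\w+', '', text)
    return [w.lower() for w in re.findall(r"[a-zA-Z']+", text)]

print(toy_clean_tweet('Cartoon for @chronicleherald #Trump pic.twitter.com/JhfxxUrkj2'))
# -> ['cartoon', 'for']
print(toy_clean_tweet('pic.twitter.com/abc'))
# -> [] : a link-only tweet leaves nothing for the sentiment analyzer
```

After cleaning, a link-only tweet contributes no scorable words, so its sentiment defaults to zero.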
Let's see the popular tweet with the greatest normalized index.
allHours = []  # all the hours, serialized
for day in tweetsByHourByDay:
    for hour in day:
        allHours.append(hour)
pop_sentiment_nanreplace = [i if str(i) != 'nan' else -1 for i in pop_sentiment]  # replace the NaNs with -1 to get the max
ind = pop_sentiment.index(max(pop_sentiment_nanreplace))  # index of the hour with the maximum sentiment index
for tweet in allHours[ind]:  # look at all the popular tweets in that hour
    if tweet['retweets'] >= retweet_threshold:
        print tweet['text']
Cartoon for @chronicleherald #Trump #birthcontrol #guns #LasVegasShooting pic.twitter.com/JhfxxUrkj2
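The NaN-replacement trick used in the cell above can be demonstrated in isolation (toy values and names, not the notebook's variables):

```python
# Swap NaNs for -1 so max() is not poisoned by NaN comparisons,
# then look the winning value up in the original list.
vals = [0.2, float('nan'), 0.9, float('nan'), 0.4]
vals_nanreplace = [i if str(i) != 'nan' else -1 for i in vals]
best = vals.index(max(vals_nanreplace))
print(best)  # -> 2, the position of 0.9
```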
As this example shows, the sentiment analysis tool used in this part, while correct, is limited. Since hashtags and usernames are excluded from the analysis, in this particular case only the word 'Cartoon' reached the analyzer, which gives it a high score. There are, however, many popular tweets, so the overall insights remain valid.
The main idea of this part of the Sentiment Analysis is to have a visual representation of the most used words throughout our database.
To accomplish this task, we will use the very handy WordCloud library.
Let's start by importing some much needed libraries
Step 1: Importing Libraries
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import urllib, cStringIO
Step 2: Getting all the text from the database
The idea in this part is to put all of the tweets into one long string called text. While doing so, we also have to clean these tweets.
This string was already built in step 2 of part 3.1.2.
We start by selecting a nice image for the word cloud contour, in this case an image of Texas, which can be found at the link below.
Step 1: Get a nice image
image_path = cStringIO.StringIO(urllib.urlopen('https://i.imgur.com/kftApxC.png').read())
texas_mask = np.array(Image.open(image_path))
Step 2: Avoiding obvious words
stopwords = set(STOPWORDS)
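In practice one might also extend the base stopword set with domain words (for instance the hashtag tokens themselves), which would otherwise dominate the cloud. A toy sketch with made-up names and stand-in words, not the notebook's variables:

```python
# Stand-in for wordcloud's STOPWORDS set; the extra words are
# hypothetical additions for this dataset.
toy_stopwords = {'the', 'a', 'and', 'to', 'of'}
toy_stopwords.update({'lasvegasshooting', 'https', 'co'})

words = 'the shooter and the victims of lasvegasshooting https co'.split()
kept = [w for w in words if w not in toy_stopwords]
print(kept)  # -> ['shooter', 'victims']
```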
Now that we have all of the elements to plot it, let's finally do it.
Step 3: Plotting everything nicely.
# defining the wordcloud with stopwords and some edgy styling choices.
word_cloud = WordCloud(mask=texas_mask, width=800, height=400,background_color="white", collocations=False,colormap='inferno', stopwords=stopwords, normalize_plurals=False).generate(text)
# plot it.
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
A couple of interesting points worth mentioning:
Most of the words were expected: 'Vegas', 'Guns', 'Victims' and 'Shooter'.
From these key words we can estimate some of the common content of the tweets. For example, the word 'prayer' conveys a clear sentiment.
There are also some less expected words: terms such as "trump" and "white" seem to carry a strong political connotation.
The final part of the sentiment analysis is all about emoji. We started by using a very simple version of DeepMoji, where an "emoji" score was given to each tweet. In our database, each tweet now possesses the field "deepmoji", which holds the 5 most likely emoji characterizing that tweet, along with the "reliability" of each.
Step 1: Creating a dictionary of emojis from a txt file.
emoji_index = {}
with open('ressources/emoji.txt') as f:
    counter = 0
    for line in f:  # for every line
        contents = [x.strip() for x in line.split(',')]  # split line into 2
        emoji_index[counter] = contents  # contents = [name of emoji, url to emoji photo]
        counter += 1
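The same parsing loop, run on two in-memory sample lines in the format the comments describe (":name:, url") — the real emoji.txt is not shown, and these sample lines are made up; toy names are used to avoid clashing with emoji_index:

```python
# Two made-up lines mimicking the assumed "name, url" format of emoji.txt.
sample_lines = [':broken_heart:, https://example.com/broken_heart.png',
                ':gun:, https://example.com/gun.png']
toy_index = {}
for counter, line in enumerate(sample_lines):
    toy_index[counter] = [x.strip() for x in line.split(',')]

print(toy_index[1][0])  # -> ':gun:'
```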
Step 2: A simple example
# define what we will not need from Mongo
display_conditions = {"query_criteria": 0, "_id": 0,
"geo": 0, "mentions": 0,
"hashtags": 0, "favorites": 0,
"permalink": 0, "username": 0,
"retweets": 0, "id": 0}
tweets = tweet_collection.find({'deepmoji': {'$exists': True}},display_conditions)[43:45]
# for 2 tweets, extract the deepmoji field
for t in tweets:
    emoji_list = [t['deepmoji']['Emoji_1'],
                  t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
                  t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
    print 'Tweet: ', t['text']
    print 'Emojis:',
    for emoji_number in emoji_list:
        print emoji.emojize(emoji_index[emoji_number][0], use_aliases=True),
    print '\n'
Tweet: I heard ISIS took credit for #LasVegasShooting #ISIS also knows how Jack died #ThisIsUs https:// twitter.com/theintercept/s tatus/914991897722212353 … Emojis: 💔 😢 😓 😞 😕 Tweet: Just shut up. You've lost the gun control argument when gun deaths in Chicago surpassed Afghanistan #LasVegasShooting #GunControlNow Emojis: 🔫 😡 😠 😤 👊
We can note that DeepMoji makes a pretty accurate characterization of the sentences, not perfect, but accurate enough.
The first idea for the emoji/sentiment analysis will be to visualize what are the emojis that are the most used in the whole dataset.
Step 1: Count the occurrences of every emoji
# get emojis in a list
db_my_tweets = sorted(list(tweet_collection.find({}, display_conditions)), key = lambda x: x['date'])
mega_list = []
for t in db_my_tweets:
    emoji_list = [t['deepmoji']['Emoji_1'],
                  t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
                  t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
    mega_list += emoji_list
# get a counter of that list
counter_ = Counter(mega_list)
labels, values = zip(*counter_.items())
indexes = np.arange(len(labels))
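The Counter/zip pattern above can be demoed on toy data (toy names, not the notebook's variables, so as not to clobber counter_, labels and values):

```python
from collections import Counter

# Made-up emoji ids; Counter tallies them, zip splits (id, count) pairs
# into parallel tuples for plotting.
toy_counter = Counter([1, 1, 2, 3, 3, 3])
toy_labels, toy_values = zip(*sorted(toy_counter.items()))
print(toy_labels)  # -> (1, 2, 3)
print(toy_values)  # -> (2, 1, 3)
```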
Step 2: A histogram of all of the emojis
plt.barh(indexes, values, color=['#50514f', '#f25f5c', '#ffe066', '#247ba0'])
plt.yticks(indexes,
           [emoji_index[l][0][1:-1] for l in labels], fontsize=14)
plt.xlabel('Emoji Frequency')
plt.ylabel('Emoji Name')
pylab.rcParams['figure.figsize'] = (20, 15)
plt.title('Most Used Emojis in DataSet')
plt.show()
Step 3: A simpler way to visualize.
# print the top 20 emojis and their frequency
top = 20
top_list = counter_.most_common(top)
print 'The top {} sentiments according to deepmoji:'.format(top)
for i in range(len(top_list)):
    item_emoji = top_list[i][0]
    item_frequency = top_list[i][1]
    print i+1, '.', emoji.emojize(emoji_index[item_emoji][0], use_aliases=True), 'with', item_frequency, 'characterizations.'
The top 20 sentiments according to deepmoji: 1 . ♥ with 63981 characterizations. 2 . 💔 with 61907 characterizations. 3 . 😢 with 49634 characterizations. 4 . 👍 with 48096 characterizations. 5 . 😡 with 45714 characterizations. 6 . 💟 with 45350 characterizations. 7 . 😠 with 39212 characterizations. 8 . 🔫 with 35748 characterizations. 9 . 🙏 with 33609 characterizations. 10 . ✌ with 26823 characterizations. 11 . 👊 with 22429 characterizations. 12 . 😜 with 19677 characterizations. 13 . 💪 with 19229 characterizations. 14 . 😕 with 17983 characterizations. 15 . 😳 with 17507 characterizations. 16 . 😞 with 16833 characterizations. 17 . ❤ with 16286 characterizations. 18 . 😐 with 16250 characterizations. 19 . 💙 with 15493 characterizations. 20 . ☺ with 14886 characterizations.
Above you can see the emojis that describe the tweets the most. Some interesting things to note:
Overall, this list gives a good idea of the sentiment around the event.
The goal of this part of the analysis is to see how the emoji characterization evolves over time. For example, does the characterization of a tweet by 'gun' change over time? If so, how?
Step 1: Call the tweets that we need.
db_my_tweets = sorted(list(tweet_collection.find({}, display_conditions)), key = lambda x: x['date'])
The next step is somewhat rough: we want to build a matrix, called emoji_grid, whose rows are the 64 possible emoji and whose columns are periods of time. The element at position [i, j] of emoji_grid is then the normalized number of characterizations by emoji i in period j.
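The layout and normalization just described can be illustrated on a tiny grid (toy names, smaller dimensions than the real 64-row grid):

```python
import numpy as np

# 3 possible emoji as rows, 2 periods as columns: raw characterization
# counts divided by the number of tweets in each period.
toy_counts = np.array([[4., 1.],
                       [0., 2.],
                       [6., 3.]])
tweets_per_period = np.array([10., 5.])
toy_grid = toy_counts / tweets_per_period  # broadcasts over columns
print(toy_grid[2, 1])  # emoji 2 in period 1: 3 / 5 = 0.6
```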
In this case, we look at the characterizations in 12-hour windows since the shooting (period_length below), and see how the classification evolves.
Step 2: A matrix that stores the emoji frequency per time period
# start the matrix
emoji_grid = np.zeros((len(emoji_index.values()), 1))
column = np.zeros((len(emoji_index.values()), 1))
# define important variables before loop
absolute_hour = 0
tweets_in_period = 0
periods = []
hours_passed = 0
period_length = 12
# for every tweet
for t in db_my_tweets:
    tweets_in_period += 1  # tweet counter for the period of time
    tweet_hour = t['date'].hour
    tweet_day = t['date'].day
    tweet_month = t['date'].month
    # extract the deepmoji classification
    tweet_emoji_list = [t['deepmoji']['Emoji_1'], t['deepmoji']['Emoji_2'],
                        t['deepmoji']['Emoji_3'], t['deepmoji']['Emoji_4'],
                        t['deepmoji']['Emoji_5']]
    for emoji_number in tweet_emoji_list:
        column[emoji_number, 0] += 1
    # this counter counts the hours that have passed
    if tweet_hour != absolute_hour:
        hours_passed += 1
        absolute_hour = tweet_hour
    # if X hours have passed, append this period's deepmoji counts to the master emoji_grid
    if hours_passed == period_length:
        periods.append('{}/{} at {}'.format(tweet_day, tweet_month, tweet_hour))
        emoji_grid = np.hstack((emoji_grid, column / tweets_in_period))  # here we normalize
        column = np.zeros((len(emoji_index.values()), 1))
        hours_passed = 0
        tweets_in_period = 0
emoji_grid = np.delete(emoji_grid, 0, 1)  # deletes the redundant first column
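The grow-then-trim idiom used for emoji_grid above (seed with a zero column, hstack one column per period, delete the seed at the end) can be sketched on toy data:

```python
import numpy as np

grid = np.zeros((3, 1))  # seed column, removed at the end
for col in ([1., 2., 3.], [4., 5., 6.]):
    grid = np.hstack((grid, np.array(col).reshape(3, 1)))
grid = np.delete(grid, 0, 1)  # drop the redundant seed column
print(grid.shape)  # -> (3, 2)
```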
Step 3: Plotting the emoji grid.
# define important variables.
plot_top = 5 # only the most frequent emojis are plotted for simplicity
counter = 0
colors = ['#50514f', '#f25f5c', '#ffe066', '#247ba0', '#70c1b3', '#50514f', '#f25f5c', '#ffe066', '#247ba0', '#70c1b3']
# for every emoji (row of the emoji_grid) that is on the top X, plot it over the periods.
for i in range(emoji_grid.shape[0]):
    if i in np.argsort(np.sum(emoji_grid, axis=1))[::-1][:plot_top]:
        s = plt.plot(range(emoji_grid.shape[1]), emoji_grid[i, :],
                     label=emoji_index[i][0][1:-1], linewidth=2, color=colors[counter])
        counter += 1
# define titles and axis names
plt.title('Deepmoji Characterization Every {} Hours'.format(period_length), fontsize=15)
plt.xlabel('Time', fontsize=12)
plt.ylabel('Normalized Sentiment Frequency', fontsize=12)
# some styling and sizing
plt.tick_params(axis='both', which='major', labelsize=12)
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)
plt.xticks(range(emoji_grid.shape[1]), periods, rotation='vertical')
plt.legend()
plt.grid()
plt.show()
This figure describes the tweets' emoji characterization over time, using the DeepMoji model. Several things in this graph are interesting and worth describing; let's mention some of them:
In this project, we have analyzed the Las Vegas shooting event. By working on different graphs, we have shown that it is possible to categorize media. By analyzing different communities, we have shown