#!/usr/bin/env python
# coding: utf-8

# # THE FAKE NEWS DETECTOR - LAS VEGAS

# ## Table of Contents:
#
# - [Introduction](#intro)
# - [Data](#data)
# - [Part 1: Getting Started](#gettingStarted)
#   - [1.1 Basic Statistics](#basicStatistics1)
#   - [1.2 The Most Popular Tweets](#popularTweets)
#   - [1.3 Most Popular Sources of Information](#websites)
#   - [1.4 Chronological Visualization of the Data](#visualizeByTime)
# - [Part 2: NETWORK ANALYSIS](#networkAnalysis)
#   - [2.1 Mention Graph](#mentionGraph)
#   - [2.2 Hashtag Graph](#hashGraph)
#   - [2.3 Sources of Information Graph](#infoGraph)
# - [Part 3: SENTIMENT ANALYSIS](#sentimentAnalysis)
#   - [3.1 Sentiment Over Time](#sentimentTime)
#   - [3.2 WordCloud](#wordcloud)
#   - [3.3 Emoji Analysis](#emoji)
# - [Conclusion](#conclusion)

# # Introduction
#
# In this project we take a systematic approach to analyzing the Las Vegas Shooting. This sad event occurred on the night of October 1, 2017: a gunman opened fire on a crowd of concertgoers on the Las Vegas Strip in Nevada, leaving 58 people dead and 546 injured. To analyze this event, we collected publicly available Twitter data under the hashtag '#LasVegasShooting'. This hashtag was chosen because it is the biggest hashtag on the topic and it is rather neutral (i.e. people holding very different views could easily use it). Hashtags such as #PrayForVegas were not collected for this exact reason.
#
# The focus here is to analyze "alternative narratives" of crisis events. In big events of this type, alternative media might be used to spread rumors: conspiracy theories claiming either that the event did not happen or that it was perpetrated by someone other than the current suspects are spread through such media. By analyzing the publicly available Twitter data, we attempt to get some insight into the event and into the ways media was used to spread information about it.

# # Data Collection
#
# This part shows how the data was collected and its format. The actual data collection code is not included here; however, given similar data for another event, it is possible to rerun the analysis quite easily. This is why we could rerun a similar analysis for another event, such as the Sutherland Springs Church Shooting (Texas, U.S.A., November 5, 2017).
#
# The official Twitter API sets some limitations in terms of time: for example, it is impossible to get tweets older than a week. Thus, for data collection we used the 'Jefferson-Henrique/GetOldTweets-python' library found on GitHub. This project was written in Python to retrieve old tweets; it bypasses some limitations of the official Twitter API.
#
# With this tool we collected all the publicly available tweets under the '#LasVegasShooting' hashtag between the dates 30/09/2017 - 30/11/2017. Remember that the event occurred on the night of 1/10/2017; thus the very first tweet we have is from the next day.

# In[4]:

# those hashtags will be analyzed
query_hashtags = ['#lasvegasshooting']

# Each tweet is stored with some fields of interest. Let's now see the format of our data. First, let's import the necessary libraries and connect to the database.
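# (Before that, here is a rough sketch of how the collection step itself could look with the GetOldTweets-python library mentioned above. This cell is illustrative only and is not part of the original pipeline; the criteria and method names follow that repository's README and should be double-checked against the installed version.)

# In[ ]:

# Illustrative only: fetch a tiny sample of old tweets for the queried hashtag.
# Method names follow the GetOldTweets-python README (Python 2 version).
import got

tweetCriteria = (got.manager.TweetCriteria()
                 .setQuerySearch(query_hashtags[0])
                 .setSince("2017-09-30")
                 .setUntil("2017-11-30")
                 .setMaxTweets(10))  # keep the sample tiny for this example

for t in got.manager.TweetManager.getTweets(tweetCriteria):
    print t.date, t.username, t.text[:80]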
# In[5]:

# IMPORT SOME LIBRARIES NECESSARY FOR THE REST OF THE PROJECT
from pymongo import MongoClient
import pickle, sys
import pymongo
import numpy as np
import pandas as pd
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', -1)
import matplotlib.pyplot as plt
from collections import Counter
get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('pylab', 'inline')
pylab.rcParams['figure.figsize'] = (15, 6)

# initialize mongo client on MongoDB Atlas
client = MongoClient("mongodb://socialgraphs:interactions@socialgraphs-shard-00-00-al7cj.mongodb.net:27017,socialgraphs-shard-00-01-al7cj.mongodb.net:27017,socialgraphs-shard-00-02-al7cj.mongodb.net:27017/test?ssl=true&replicaSet=SocialGraphs-shard-0&authSource=admin")
db = client.lasvegas
# access tweet collection
tweet_collection = db.tweetHistory

# Below you can see an example tweet as it is stored in our database.

# In[6]:

allTweets = list(tweet_collection.find())  # A list containing all the tweets

# In[7]:

exampleTweet = allTweets[603]  # Just an interesting tweet with all fields filled

# In[8]:

for field in exampleTweet:
    print field, ':', exampleTweet[field]

# As you can see by following the permanent link (permalink field), this tweet belongs to the user 'Luma923'. The user has some text in the tweet, and other hashtags were used alongside the one we searched for ('#falseflag', '#lasvegasshooting'). Also note that the user cited a website in the tweet. In this example it is a video-sharing website; however, on many occasions mainstream or alternative media are cited. This tweet instance has no retweets nor favorites. The tweet mentions another user (@cjgmartell).
# Moreover, the user tweeted on the '#falseflag' hashtag. When analyzing other tweets from the same user, we see that this user posted multiple times on this hashtag, citing alternative media websites. This shows that by analyzing the relation between users and the websites they cite (of course taking many other variables into account), it is possible to analyze **media**.

# # Part 1: Getting Started
# So, let's get started by learning a bit more about our data.

# ## 1.1 Basic Statistics
# How many tweets do we have?

# In[9]:

print 'We have a total of {} tweets.'.format(len(allTweets))

# Let us figure out some very basic statistics about our dataset.
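# (As a side note, many of the counts computed in the next cell with a plain Python loop could also be obtained server-side with a MongoDB aggregation pipeline. The sketch below is not part of the original analysis; it only assumes the `tweetHistory` collection opened above.)

# In[ ]:

# Illustration: top 5 most active users, counted by MongoDB itself
pipeline = [
    {'$group': {'_id': '$username', 'nTweets': {'$sum': 1}}},
    {'$sort': {'nTweets': -1}},
    {'$limit': 5},
]
for doc in tweet_collection.aggregate(pipeline):
    print doc['_id'], doc['nTweets']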
# In[10]:

# define initial values
allUsersList = []  # All the users
totalNumberOfWords = 0.0
totalRetweets = 0.0
totalFavorites = 0.0
allHashtagList = []
allCitedWebsite = []  # All the websites tweets have ever cited
userCited = dict()  # A dictionary which shows which user cited which website

# loop over data
for tweet in allTweets:
    user_name = tweet['username']
    allUsersList.append(user_name)
    citedByThisUser = userCited.get(user_name, [])  # Get the websites cited by this user
    citedByThisUser += tweet['citations_domain_names']  # Add citations of this tweet
    userCited[user_name] = citedByThisUser
    allHashtagList += tweet['hashtags']
    allCitedWebsite += tweet['citations_domain_names']
    totalNumberOfWords += len(tweet['text'].split())  # Add number of words used in this tweet
    totalRetweets += tweet['retweets']
    totalFavorites += tweet['favorites']

# Get averages
averageLength = totalNumberOfWords / len(allTweets)
averageRetweets = totalRetweets / len(allTweets)
averageFavorites = totalFavorites / len(allTweets)

# print results
print 'There are {} different users.'.format(len(set(allUsersList)))
print 'A total of {} different hashtags are used.'.format(len(set(allHashtagList)))
print 'There are {} citations in total'.format(len(allCitedWebsite))
print 'There are {} different websites cited by users'.format(len(set(allCitedWebsite)))
print 'The average length of a tweet is {} words.'.format(round(averageLength, 2))
print 'The average number of retweets: {}'.format(round(averageRetweets, 2))
print 'The average number of favorites: {}'.format(round(averageFavorites, 2))

# Let's see who the top 5 users are who tweeted the most using the hashtag '#LasVegasShooting'.

# In[11]:

for user, nOfTweets in sorted(Counter(allUsersList).iteritems(), key=lambda (user, n): n, reverse=True)[:5]:
    print user, nOfTweets

# We are already getting some insights about the event. If you search for the usernames, you can see that reviewjournal is a local newspaper published in Las Vegas. Also, 'nooneishere' and 'ConsciousOptix' appear to be supporters of President Donald Trump.
# Let's see which hashtags were the most common.

# In[12]:

for hashTag, nOfTweets in sorted(Counter(allHashtagList).iteritems(), key=lambda (hashTag, n): n, reverse=True)[:5]:
    print hashTag, nOfTweets

# The most frequent hashtags also give some insights about the event. The event occurred at the Mandalay Bay Resort Hotel, and afterwards there were several reactions concerning gun-control policies. (Note that the most frequent hashtag is the one we queried for, as expected.)

# ## 1.2 The most popular tweets
# An interesting thing to look at in our database is the most popular tweets in terms of retweets and favorites.

# In[13]:

# create cursors and define the number of desired elements
get_top = 10
display_conditions = {"deepmoji": 0, "citations_urls": 0, "citations_domain_names": 0,
                      "id": 0, "date": 0, "query_criteria": 0, "_id": 0, "geo": 0,
                      "mentions": 0, "hashtags": 0}
db_by_retweets = tweet_collection.find({}, display_conditions).sort("retweets", pymongo.DESCENDING)[0:get_top]
db_by_favorites = tweet_collection.find({}, display_conditions).sort("favorites", pymongo.DESCENDING)[0:get_top]

# a function that takes a cursor and pretty-prints it
def print_result(database):
    array = []
    for t in database:
        array.append(t)
    pandas = pd.DataFrame(array)
    display(pandas)

# In[14]:

print 'The most retweeted:'
print_result(db_by_retweets)
print 'The most favorited:'
print_result(db_by_favorites)

# As you can see, among the users with the most popular tweets there are famous politicians and political activists such as Donald Trump Jr., as well as some important organizations such as Fox News. Generally, the top tweets do not tend to cite any news source but rather carry emotional content, such as this [one](https://twitter.com/paopao619/status/916809797923569665).

# ## 1.3 Most Popular Sources Of Information
# People spread news using references: they comment on an event based on an article and reference that article to spread the word. So it is quite interesting to look at the sources of information. Let's check the most commonly cited websites:

# In[103]:

print 'There are {} different websites where people get information.'.format(len(set(allCitedWebsite)))

# In[16]:

for webSite, nOfCitation in sorted(Counter(allCitedWebsite).iteritems(), key=lambda (w, n): n, reverse=True)[:15]:
    print webSite, nOfCitation

# Here we can see some mainstream media such as the New York Times, Fox News, CNN, and the Washington Post. However, people also cite some alternative media such as InfoWars and IntelliHub.

# ## 1.4 Visualizing over time
# It is also very interesting to look at the tweets in our database from a chronological point of view, so let's see the chronological distribution.

# In[106]:

# get dates and truncate them to the hour for readability purposes
dates = list(set([tweet['date'] for tweet in allTweets]))
no_seconds = [date.replace(minute=0, second=0, microsecond=0) for date in dates]

# count occurrences
counter = dict(Counter(no_seconds))

# prepare plot
x = []
y = []
for element in counter:
    x.append(element)
    y.append(counter[element])

# plot nicely
plt.title('Number of tweets per date')
plt.ylabel('Number of tweets')
plt.xlabel('Date')
plt.scatter(x, y, c=y, marker='.', s=y)
plt.xlim([min(x), max(x)])
plt.ylim([min(y), max(y)])
plt.grid()
plt.show()

# An interesting fact we see here is that the closer we are to the event, the more tweets we have (the decay over time actually looks roughly exponential). Of course, in the first couple of hours few people knew about the event, so there are fewer tweets. If we focus on the first two days, we see the following plot.

# In[107]:

# Get the first 2 days
firstTwoDays = sorted(x)[:48]
nOfTweetsInFirstTwoDays = [counter[hour] for hour in firstTwoDays]

# plot nicely
plt.title('Number of tweets per date in the first two days')
plt.ylabel('Number of tweets')
plt.xlabel('Date')
plt.scatter(firstTwoDays, nOfTweetsInFirstTwoDays, c=nOfTweetsInFirstTwoDays, marker='.', s=nOfTweetsInFirstTwoDays)
plt.xlim([min(firstTwoDays), max(firstTwoDays)])
plt.grid()
plt.show()

# # PART 2: NETWORK ANALYSIS
#
# We have very rich and interesting data to analyze, and there are different networks hidden in it.
#
# From the tweets we collected we are going to generate a number of different networks that will be used throughout the analysis.
#
# **Network 1:**
# For the first network, the *nodes* are the users who have tweeted under the hashtag '#LasVegasShooting'. The *edges* are constructed through mentions in these tweets: when a tweet mentions another user that is also a *node* in the network, there will be an *edge* between the two. We will refer to this network as `mention_graph`.
#
# **Network 2:**
# For the second network, the *nodes* are still the users.
# We define an *edge* between two *nodes* if they share a common hashtag, not including the query hashtags. For example, if two tweets from different *nodes* use the hashtag **#GunSense**, we will create an *edge* between them. We will refer to this network as `hashtag_graph`.
#
# **Network 3:**
# Finally, for the third network, the *nodes* are sources of information, i.e. the websites that users reference. We define an *edge* between two *nodes* if the same user shared an article from both websites. For example, if the user 'DonaldTrumpJr' shared articles from both 'Fox News' and 'CNN', there will be an edge between these nodes.
#
# Below we start creating the networks.

# ## 2.1 Mention Graph
# Let's create the mention graph first.

# In[19]:

import networkx as nx
from collections import defaultdict
from itertools import combinations

# start by finding all unique usernames in the tweets
usernames = list(set(allUsersList))

# create a graph for mention relations
mention_graph = nx.Graph()

# add nodes from users that wrote tweets
mention_graph.add_nodes_from(usernames)
print 'Number of nodes in mention_graph', len(mention_graph.nodes())

# add edges to mention_graph between mentions in tweets
# get all tweets with their mentions
tweet_mentions = list(tweet_collection.find({'mentions': {'$ne': []}}, {'username': 1, 'mentions': 1}))

# define a default dictionary to store the unique mentions used per user as a set
mentions_dict = defaultdict(set)

# populate dict {username: set(mentions)}
for tweet in tweet_mentions:
    # get mentions without the @ (@DonaldTrumpJr -> DonaldTrumpJr)
    mentions = map(lambda mention: mention[1:], tweet['mentions'])
    # update dict
    mentions_dict[tweet['username']].update(mentions)

# add edges from mentions_dict
for user, mentions in mentions_dict.iteritems():
    for to_node in mentions:
        if mention_graph.has_node(to_node):
            mention_graph.add_edge(user, to_node)
print 'Number of edges in mention_graph', len(mention_graph.edges())

# get degree distributions
mention_degree = dict(mention_graph.degree())

# As you can see, this graph has many more nodes than edges. This is expected, because people on Twitter generally only mention another user if that user has a direct relation to the event.
# To analyze this graph further, let's look at some basic statistics about it.

# ### Basic stats on Mention Graph

# ### Degree distribution
#
# Let's analyze the degree distribution to understand our graph.

# In[20]:

plt.style.use('fivethirtyeight')

# get minimum and maximum degrees
min_mention_degree, max_mention_degree = min(mention_degree.values()), max(mention_degree.values())

# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
d = sorted(mention_degree.values(), reverse=True)
r = range(min_mention_degree, max_mention_degree + 1)
_ = plt.hist(d, r)  # degree sequence

# Note that the histogram above uses a logarithmic scale; let's also look at the distribution on a log-log scale.

# In[21]:

c = Counter(d)
frequency = c.values()
degrees_values = c.keys()
plt.loglog(degrees_values, frequency, 'ro')
plt.xlabel('k')
plt.ylabel('count')
plt.title('LogLog Plot of Degree Distribution for Mention Graph')
plt.show()

# As you can see, we can safely say that the degree distribution follows a power law. Note that there are a lot of nodes without any connections; here it is wiser to look only at the GCC (giant connected component).
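# (To back up this power-law impression a bit more quantitatively, a quick and admittedly rough check is the standard maximum-likelihood estimate of the exponent, $\alpha \approx 1 + n \left[\sum_i \ln(k_i / k_\mathrm{min})\right]^{-1}$, computed over the $n$ nodes with degree at least $k_\mathrm{min}$. The cell below is only a sanity-check sketch and was not part of the original analysis; for a rigorous fit one would use a dedicated package such as `powerlaw`.)

# In[ ]:

import math

# rough MLE for the power-law exponent (continuous approximation);
# a sanity check only, not a rigorous fit
k_min = 2  # ignore the many degree-0 and degree-1 nodes
ks = [k for k in mention_degree.values() if k >= k_min]
alpha = 1.0 + len(ks) / sum(math.log(float(k) / k_min) for k in ks)
print 'Estimated power-law exponent (k_min = {}): {:.2f}'.format(k_min, alpha)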
# ### The GCC

# ### Plotting the size of components
#
# Let's get all the connected components. The biggest one will be the GCC.

# In[22]:

# components_mention = pickle.load(open('componentsMention','rb'))
components_mention = sorted(nx.connected_component_subgraphs(mention_graph), key=len, reverse=True)
print 'The mention graph has {0} disconnected components'.format(len(components_mention))

# A lot of subgraphs! Let's try to understand their sizes.

# In[23]:

plt.figure()
plt.subplot(211)
mention_component_lengths = [len(c) for c in components_mention]
plt.yscale('log', nonposy='clip')
plt.title('Mention graph components')
plt.ylabel('Number of components')
plt.xlabel('Number of nodes')
max_mcl = max(mention_component_lengths)
_ = plt.hist(mention_component_lengths, range(max_mcl + 1))

# So apparently most of the components are pretty small. Let's look at the 5 biggest components.

# In[24]:

mention_component_lengths[:5]

# Here we can see that the GCC is big enough to give us good insight.

# ### Examining the GCC
#
# Since the main graph is so disconnected, we decided to work only with the GCC of the graph. This allows us to perform a more in-depth analysis.

# In[25]:

# get the giant connected component
mention_gcc = components_mention[0]
mention_degree_gcc = dict(nx.degree(mention_gcc))

# number of nodes and edges
print 'The GCC of the mention graph has {nodes} nodes and {edges} edges.'.format(nodes=len(mention_gcc.nodes()), edges=len(mention_gcc.edges()))
print ' - Average degree:', float(sum(mention_degree_gcc.values())) / len(mention_gcc.nodes())

# draw the graph (node sizes scale with the GCC degrees)
nx.draw_networkx(mention_gcc, pos=nx.spring_layout(mention_gcc),
                 node_size=[v * 100 for v in mention_degree_gcc.values()], with_labels=False)
plt.title('Mention GCC')
plt.show()

# Here the size of a node depends on its degree. Obviously some users have a lot of connections; let's see who they are.

# In[26]:

mention_degree_gcc = dict(nx.degree(mention_gcc))
usersWithMostDegree = sorted(mention_degree_gcc.items(), key=lambda x: x[1], reverse=True)[:5]
print usersWithMostDegree

# Apparently people like to mention the media in their tweets. We can also see that right-wing political commentators (Alex Jones, TuckerCarlson) were mentioned frequently by users. This might be because the event touches on gun-restriction laws in the U.S.A.

# ### GCC degree distribution
#
# Since we are now only looking at the *GCC*, we run the *degree distribution* again. This time we have no *nodes* without *edges*.

# In[27]:

# get the maximum degree
max_mention_gcc_degree = max(mention_degree_gcc.values())

# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.yscale('log', nonposy='clip')
plt.title('Mention GCC degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
_ = plt.hist(sorted(mention_degree_gcc.values(), reverse=True), range(max_mention_gcc_degree + 1))  # degree sequence

# So let's also print the node in the GCC with the minimum degree to see if it has a link
print 'The user with the lowest degree:', min(mention_degree_gcc.items(), key=lambda x: x[1])

# As you can see, in the GCC all nodes now have a degree of at least one and the distribution is nicer.

# ### Community detection
# To analyze our graph further, let's now look at the communities, to understand whether some groups of users especially mention each other.
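# As background for the next cell: the Louvain method finds communities by greedily maximizing the *modularity* of the partition. For a graph with adjacency matrix $A$, $m$ edges, node degrees $k_i$ and community assignments $c_i$, modularity is defined as
#
# $$ Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), $$
#
# i.e. the fraction of edges that fall inside communities minus the fraction expected if edges were rewired at random while preserving the node degrees. This is the standard textbook definition; it is stated here only to make the output of `community.best_partition` easier to interpret.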
# In[28]:

import community

# use the python Louvain implementation to find communities in the network
partition_mention = community.best_partition(mention_gcc)

# drawing
mention_size = float(len(set(partition_mention.values())))
pos = nx.spring_layout(mention_gcc)
count = 0.
for com in set(partition_mention.values()):
    count = count + 1.
    list_nodes = [nodes for nodes in partition_mention.keys() if partition_mention[nodes] == com]
    nx.draw_networkx_nodes(mention_gcc, pos, list_nodes, node_size=20, node_color=str(count / mention_size))

print 'For the mention GCC we have found {} communities'.format(int(mention_size))
nx.draw_networkx_edges(mention_gcc, pos, alpha=0.4)
plt.show()

# Let us dive a little deeper into what these communities are.
#
# First, we will look into the sizes of the communities and the biggest accounts in them. This is to get a sense of the kind of accounts we find.
#
# Then, we will look into the most common hashtags used in every community in the mention graph. This is to get a feeling for the topics that live in every community.

# ### Community Analysis
# Below you can see a table showing each community's size, its most cited source of information, and the biggest account in the community (in terms of degree).

# In[29]:

import twitter

# look at the accounts with the highest degree in each partition

# twitter api credentials for lookup
CONSUMER_KEY = '29JF8nm5HttFcbwyNXkIq8S5b'
CONSUMER_SECRET = 'szo1IuEuyHuHCnh93VjLLGb5xg9NcfDVqMsLtOt3DbL5hXxpbt'
OAUTH_TOKEN = '575782797-w96NPIzKF07TpC3c78nEadEfACLclYvSusuOPl8z'
OAUTH_TOKEN_SECRET = 'h0oitwxLkDjOLSejSQl2AWSrcmNeUwBpEvSUWonYzZTNz'

# instantiate API object
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)

# auxiliary function: invert a partition to {community_id: [usernames]}
def inverse_partition(partition):
    components_inv = defaultdict(list)
    for key, value in partition.iteritems():
        components_inv[value].append(key)
    return components_inv

# get top accounts by degree
def partition_top_accounts(partition, degree):
    part_inv = inverse_partition(partition)
    return {part: max(usernames, key=lambda user: degree[user]) for part, usernames in part_inv.iteritems()}

# get data on an account
def twitter_account(username):
    return twitter_api.users.lookup(screen_name=username)

# Get the most commonly cited website per community
def getMostCommonWebsite(partition, userCited):
    inv = inverse_partition(partition)
    out = dict()
    for com in inv:
        l = []
        for user in inv[com]:
            l += list(set(userCited[user]))  # Add all the citations of this user only ONCE to the community list
        c = Counter(l)
        mostFreqSite = max(set(l), key=l.count)
        out[com] = (mostFreqSite, round(c[mostFreqSite], 2))
    return out

# display data in a dataframe
def pprint_partition_overview(partition, degree, userCited):
    data = []
    columns = ['Community No', 'Most Cited Website in Community', 'Percentage of Users who cited',
               'Community Size', 'Screen Name', 'Name', 'Url', 'Location', 'Followers', 'Degree']
    top_accounts = partition_top_accounts(partition, degree)
    top_websites = getMostCommonWebsite(partition, userCited)
    for part_id, account in top_accounts.iteritems():
        user = twitter_account(account)[0]
        # print pprint( user)
        url = ''
        try:
            url = user['entities']['url']['urls'][0]['display_url']
        except:
            pass
        row = {
            'Community No': part_id,
            'Most Cited Website in Community': top_websites[part_id][0],
            'Percentage of Users who cited': float(top_websites[part_id][1]) / len(inverse_partition(partition)[part_id]) * 100,
            'Community Size': len(inverse_partition(partition)[part_id]),
            'Screen Name': account,
            'Name': user['name'],
            'Url': url,
            'Location': user['location'],
            'Followers': user['followers_count'],
            'Degree': degree[account]
        }
        data.append(row)
    data.sort(key=lambda row: row['Percentage of Users who cited'], reverse=True)
    df = pd.DataFrame(data)
    df = df[columns]
    display(df)

# In[30]:

print 'Overview of communities'
pprint_partition_overview(partition_mention, mention_degree_gcc, userCited)

# This table is built in the following way:
#
# + Each row corresponds to a community found by the Louvain algorithm. In it you can see several characteristics, such as the community size, the most popular user (in terms of degree), and the number of followers of this user.
#
# + You can also see the most cited domain for each of these communities, including the percentage of users in the community that cited the domain.
#
# + Here, by "most cited website" we mean the one cited by the largest number of distinct users: among all the citations of a community, if one person cited a site 50 times, we count it as 1.
#
# **Example**: In the first row, community number 5, 83% of the 60 users cited the domain 'www.pscp.tv' (Periscope). Furthermore, the most popular user in this community is Kerry Smith from Cincinnati, OH, with 15K followers.
#
# Some interesting things are worth noting:
#
# + Some communities relate to a certain source of information (domain) in a very strong way. Notice how in the first two rows, at least 75% of users cited a certain source.
#
# + When we look at alternative information sources such as 'intellihub.com' and 'theantimedia.org', we can observe that the communities built around them cite these sites at quite high rates (intellihub: 40%, theantimedia: 75%).
#
# + You can also see that one of the most common sites is 'youtube.com'. This is expected, since people share video news about the event through this site.
#
# + Another interesting thing is that some of the communities in fact correspond to geographical communities. For example, community 23 is all about the Las Vegas area and its local media (the same holds for the Boston community, number 21).
#
# A problem we see here is that some communities have YouTube as their most cited website, which does not tell us much about those communities. Let's look at the first 3 most cited websites in each community.
# In[31]:

# Get the 3 most commonly cited websites per community
def getMostCommonWebsite3(partition, userCited):
    inv = inverse_partition(partition)
    out = dict()
    for com in inv:
        l = []
        for user in inv[com]:
            l += list(set(userCited[user]))  # Add all the citations of this user only ONCE to the community list
        c = Counter(l)
        mostFreqSites = sorted(set(l), key=l.count, reverse=True)[:3]
        percentage = dict()
        for i in mostFreqSites:
            percentage[i] = round(float(c[i]) / len(inv[com]) * 100)
        while len(mostFreqSites) < 3:
            mostFreqSites.append('NaN')
            percentage['NaN'] = 0
        out[com] = (mostFreqSites, percentage)
    return out

def showIt(partition):
    columns = ['Community No', 'Community Size', 'M1', 'P1', 'M2', 'P2', 'M3', 'P3']
    inv = inverse_partition(partition)
    data = []
    top3_websites = getMostCommonWebsite3(partition, userCited)
    for part_id in inv:
        mostFreqSites = top3_websites[part_id][0]
        percentage = top3_websites[part_id][1]
        row = {
            'Community No': part_id,
            'Community Size': len(inv[part_id]),
            'M1': mostFreqSites[0],
            'P1': percentage[mostFreqSites[0]],
            'M2': mostFreqSites[1],
            'P2': percentage[mostFreqSites[1]],
            'M3': mostFreqSites[2],
            'P3': percentage[mostFreqSites[2]]
        }
        data.append(row)
    data.sort(key=lambda row: row['P1'], reverse=True)
    df = pd.DataFrame(data)
    df = df[columns]
    display(df)

showIt(partition_mention)

# Here the percentages are the percentage of users who cited the website at least once. So for community 40 (second line), 75.0% of the users cited theantimedia.org and 50% cited youtube.com.
# Considering both tables above, we see some interesting results:
#
# + Especially when people are spreading rumors, they tend to share videos. For this event, people generally claim that there were multiple shooters, and they share videos related to this claim. We can see this in lines 1 and 3 (communities 40 and 6): most of the users who shared news from alternative media are sharing videos from YouTube.
#
# + There might be a relationship between 'www.intellihub.com' and 'yournewswire.com' (we saw that both are alternative media).
#
# + In line fourteen we can see that there is a relation between the sites www.reviewjournal.com and lvrj.com. This makes sense, since both of them are local Las Vegas media.

# An interesting way to analyze communities is to find which hashtags they tend to use. For example, if a community tends to use the '#guncontrol' hashtag, we can infer that there is a relation between this community and the gun-law debate.
# In[32]:

import matplotlib.style
import matplotlib as mpl
mpl.style.use('classic')

def partition_hashtag_analysis(partition):
    # inverse the partitioning to get a dict with {partition_id: [usernames]}
    components_inv = inverse_partition(partition)

    # get all hashtags used by users in combination with our query_hashtags
    components_hashtags = defaultdict(list)
    for part_id, usernames in components_inv.iteritems():
        tweets = tweet_collection.find({
            'username': {'$in': usernames},
            'hashtags': {'$ne': [], '$nin': map(lambda s: '#' + s, query_hashtags)},
        }, {'hashtags': 1})
        # filter out the query hashtags
        for row in tweets:
            tags = [tag.replace('#', '') for tag in row['hashtags'] if tag not in query_hashtags]
            components_hashtags[part_id] += tags

    part_tag_counts = {}
    for part_id, tags in components_hashtags.iteritems():
        counts = Counter(tags)
        part_tag_counts[part_id] = counts
    return part_tag_counts

mention_com_hashtags = partition_hashtag_analysis(partition_mention)

# Heatmap
# get the most common hashtags in general
number_of_tags = 20
hashtags_count = Counter([tags for part_id in mention_com_hashtags.itervalues() for tags in part_id])
most_common_tags = map(lambda tup: str(tup[0]), hashtags_count.most_common(number_of_tags))

# create a matrix of the count of use of the most common hashtags in the communities
heat_array = np.array([[counts[tag] for tag in most_common_tags] for counts in mention_com_hashtags.values()])

# plot heatmap
fig = plt.figure(figsize=(10, 10))
plt.imshow(heat_array, interpolation='nearest')
plt.xticks(range(number_of_tags), most_common_tags, rotation='vertical')
plt.yticks(range(heat_array.shape[0]), mention_com_hashtags.keys())
# ax.set_facecolor('white')
plt.colorbar()
plt.clim(100, 300)
plt.show()

# The *heatmap* above displays how often the most common hashtags appear in every community. The brighter the color, the more prevalent that hashtag is in the tweets from that community, and by extension from its users. This gives us an idea about the opinions of these users. The data is a little sparse for most communities, since they are small and there are not that many hashtags, but we can see some interesting patterns emerge in the ones that do have data.
#
# + The most interesting fact is that there is a strong correlation between community 1 and community 3. Those are the alternative media communities (the AntiMedia and IntelliHub communities).
#
# + Only communities 0, 1 and 3 use #falseflag. This suggests that the alternative media communities tend to spread the claim that the event was staged.

# ## 2.2 Hashtag Graph
#
# Another interesting graph is the hashtag graph. For this second network, the *nodes* are still the users. There is an edge between two nodes if they both used the same hashtag at least a threshold number of times; here we have chosen this threshold to be 10. Thus, if two users each have at least 10 tweets under the same hashtag, there will be an edge between them (a toy illustration of this rule is sketched right below).
# Let's create the graph.
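# (To make the edge rule concrete, here is a small toy illustration. It is not part of the actual pipeline and the usernames and counts are made up; it only mimics the thresholding logic used in the real construction below.)

# In[ ]:

from itertools import combinations

# toy data: hashtag -> {username: number of tweets using it}
toy_hashtags = {'guncontrol': {'alice': 12, 'bob': 10, 'carol': 3},
                'falseflag': {'alice': 11, 'dave': 15}}

threshold = 10
toy_edges = set()
for tag, userdict in toy_hashtags.iteritems():
    heavy_users = [u for u, c in userdict.iteritems() if c >= threshold]
    # every pair of heavy users of this hashtag gets an edge
    toy_edges.update(combinations(sorted(heavy_users), 2))

print sorted(toy_edges)  # expected: [('alice', 'bob'), ('alice', 'dave')]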
# In[33]:

usernames = tweet_collection.distinct('username')

# create a multigraph for the hashtag relations
hashtag_graph = nx.MultiGraph()

# add nodes from users that wrote tweets
hashtag_graph.add_nodes_from(usernames)
print 'Number of nodes in hashtag_graph', len(hashtag_graph.nodes())

# add edges to the hashtag_graph
# get all tweets with hashtags
tweet_hashtags = tweet_collection.find({'entities.hashtags': {'$ne': []}}, {'username': 1, 'hashtags': 1})

# Let's build a collection showing which user used which hashtag (of course, we eliminate the hashtag we queried to get the data).

# In[34]:

# initialize a defaultdict to track the unique hashtags
# and how often users are using them
hashtags_dict = defaultdict(lambda: defaultdict(int))

# populate the dict {hashtag: {username: count}}
for tweet in tweet_hashtags:
    username = tweet['username']
    # list of hashtags
    hashtags = map(lambda tag: tag.replace('#', '').lower(), tweet['hashtags'])
    # remove the query_hashtags
    new_tags = list(set(set(hashtags) - set(query_hashtags)))
    if len(new_tags) > 0:
        for tag in new_tags:
            if tag:
                hashtags_dict[tag][username] += 1

# Get the edges: if two users each have at least 10 tweets under the same hashtag, there is a link between them.

# In[35]:

i = 0
# add edges between all users who used the same hashtag at least 10 times
for tag, userdict in hashtags_dict.iteritems():
    # find users who used the tag at least 10 times
    usernames = [username for username, count in userdict.iteritems() if count >= 10]
    # create tuples of possible combinations of nodes
    sets = combinations(usernames, 2)
    # add edges
    for combi in sets:
        hashtag_graph.add_edge(*combi, attr=tag)
print 'Number of edges in hashtag_graph', len(hashtag_graph.edges())

# ### 2.2.1 Basic stats on Hashtag Graph

# ### Degree distribution
#
# As we did before, let's look at some statistics for this graph.

# In[36]:

plt.style.use('fivethirtyeight')

# get degree distributions
hashtag_degree = dict(nx.degree(hashtag_graph))

# get minimum and maximum degrees
min_hashtag_degree, max_hashtag_degree = min(hashtag_degree.values()), max(hashtag_degree.values())

# plot the degree distributions
plt.figure()
plt.subplot(211)
plt.title('Hashtag graph degree distribution')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.yscale('log', nonposy='clip')
_ = plt.hist(sorted(hashtag_degree.values(), reverse=True), range(min_hashtag_degree, max_hashtag_degree + 1))  # degree sequence

# This is a rather interesting distribution: apparently nodes either have no edges at all or have a lot of them. We can thus think that there are around 2,000 accounts that tweet a lot on similar topics. Let's print the nodes with the maximum and minimum degrees.

# In[37]:

print sorted(hashtag_degree.items(), key=lambda x: x[1], reverse=True)[0:5]
print sorted(hashtag_degree.items(), key=lambda x: x[1], reverse=False)[0:5]

# So, as seen here, most of the nodes do not have any edges; thus here we should also use the GCC.

# ### The GCC

# In[38]:

# get all the separate components
components_hashtag = sorted(nx.connected_component_subgraphs(hashtag_graph), key=len, reverse=True)
# components_hashtag = pickle.load(open('comhtagponenetsHas','rb'))
print 'The hashtag graph has {0} disconnected components'.format(len(components_hashtag))

# Let's see the biggest subgraphs.
# In[39]:

hashtag_component_lengths = [len(c) for c in components_hashtag]
hashtag_component_lengths = sorted(hashtag_component_lengths, reverse=True)[:5]
print hashtag_component_lengths

# So apparently there is one big component with 1697 nodes, while the other nodes are mostly floating around; it is best to work on this component.
# In contrast to the mention graph, this graph has a very large number of edges, so drawing it is not very informative (if we drew it, you would only see a black hairball). We therefore go directly to community detection.

# ### Community detection

# In[40]:

# get the giant connected component
hashtag_gcc = components_hashtag[0]
hashtag_degree_gcc = nx.degree(hashtag_gcc)

# number of nodes and edges
print 'The GCC of the hashtag graph has {nodes} nodes and {edges} edges'.format(nodes=len(hashtag_gcc.nodes()), edges=len(hashtag_gcc.edges()))

# In[41]:

partition_hashtag = community.best_partition(hashtag_gcc)

# So let's perform an analysis similar to the one we did for the mention graph.

# In[42]:

print 'The hashtag graph partitions with an overview of the accounts with the highest degrees'
pprint_partition_overview(partition_hashtag, hashtag_degree_gcc, userCited)

# Only 3 communities exist in this graph. To analyze it a bit further, let's look at the 3 most cited websites in each community.

# In[43]:

showIt(partition_hashtag)

# Apparently this graph gives us less insight into our data.

# ## 2.3 Sources of Information Graph
# The last graph that we think is interesting shows the relation between sources of information. Here, the nodes are websites, and there will be an edge between two nodes if the same user cited both sites. We have already collected the data for this part, so let's recall some numbers and build the graph.

# In[45]:

print 'There are {} citations in a total of {} tweets'.format(len(allCitedWebsite), len(allTweets))
print 'There are {} different sources of information that users cite'.format(len(set(allCitedWebsite)))

# In[47]:

infoSource_graph = nx.Graph()

# add nodes
infoSource_graph.add_nodes_from(set(allCitedWebsite))

for user in userCited:
    citedByUser = list(set(userCited[user]))
    sets = combinations(citedByUser, 2)
    # add edges
    for combi in sets:
        infoSource_graph.add_edge(*combi)

print 'The information sources graph has {} nodes'.format(len(infoSource_graph.nodes()))
print 'The information sources graph has {} edges'.format(len(infoSource_graph.edges()))

# Let's draw the graph.

# In[51]:

info_degree = dict(infoSource_graph.degree())

# draw the graph
nx.draw_networkx(infoSource_graph, pos=nx.spring_layout(infoSource_graph),
                 node_size=[v * 5 for v in info_degree.values()], with_labels=False)
plt.title('Source of Information Graph')
plt.show()

# Again, let's check the size of the GCC; if it dominates the graph, we should use only it.

# In[52]:

components_info = sorted(nx.connected_component_subgraphs(infoSource_graph), key=len, reverse=True)

# In[53]:

info_component_lengths = [len(c) for c in components_info]
info_component_lengths = sorted(info_component_lengths, reverse=True)[:5]
print info_component_lengths

# We should indeed use the GCC, because it is much bigger than the second-largest component.

# In[54]:

info_gcc = components_info[0]
info_gcc_degree = dict(info_gcc.degree())

# draw the graph
nx.draw_networkx(info_gcc, pos=nx.spring_layout(info_gcc),
                 node_size=[v * 6 for v in info_gcc_degree.values()], with_labels=False)
plt.title('Sources of Information Graph GCC')
plt.show()

# So let's see which sites are commonly co-cited.
# For this, let's print the nodes with the highest degrees.

# In[59]:

data = []
columns = ['Website', 'Degree']
for w, n in sorted(info_gcc_degree.items(), key=lambda x: x[1], reverse=True)[:20]:
    row = {'Website': w, 'Degree': n}
    data.append(row)
df = pd.DataFrame(data)
df = df[columns]
display(df)

# Here there are some interesting insights:
#
# + Apparently people like to share videos about the event; they also like to share comments on Facebook.
# + There are many mainstream information sources: people share a lot of articles from Fox News, CNN and the Daily Mail. This does not, however, tell us *how* they reference them: when we analyzed the data, we saw examples of users citing an article from mainstream media in order to oppose it.
# + Alternative media is also common: people share lots of articles from 'intellihub.com', 'zerohedge.com' and 'infowars.com'.
# + Lastly, gofundme.com is a crowdfunding site used to raise money for the victims.

# Let's do a community analysis here as well.

# In[61]:

partition_info = community.best_partition(info_gcc)
inverse_partition_info = inverse_partition(partition_info)
print 'There are {} communities'.format(len(inverse_partition_info))

# Let's analyze those communities a bit further, like we always do.

# In[47]:

data = []
columns = ['Partition', 'Partition Size', 'Node with biggest degree', 'Degree', 'FirstFiveBiggestSites']
for part_id, websitesInCom in inverse_partition_info.iteritems():
    degrees = [(w, info_gcc_degree[w]) for w in websitesInCom]
    maxDegreeSite = max(degrees, key=lambda x: x[1])
    firstFive = sorted(degrees, key=lambda x: x[1], reverse=True)[0:5]
    row = {
        'Partition': part_id,
        'Partition Size': len(websitesInCom),
        'Node with biggest degree': maxDegreeSite[0],
        'Degree': maxDegreeSite[1],
        'FirstFiveBiggestSites': firstFive
    }
    data.append(row)
# data.sort(key=lambda row: row['Percentage of Users who cited'], reverse=True)
df = pd.DataFrame(data)
df = df[columns]
display(df)

# This table gives clear insight into which websites are similar to which: similar sites are generally grouped together.
#
# + We can directly see that community 13 contains Spanish-language websites.
# + In community 10, there are websites about first aid.
# + Communities 6 and 7 are alternative media.
# + Community 2 is mainstream media.
# + Community 1 is general visual media (Vimeo, YouTube).
#
# Finally, we can say that thanks to the analysis performed here we can categorize media; in particular, it becomes possible to detect alternative media outlets that are used to spread fake news.

# # PART 3: Sentiment Analysis
# The sentiment analysis part of the report focuses on two different visualization techniques. In the first part, using the sentiment analysis techniques taught in class, the tweet sentiment will be plotted over time.
#
# In the second part of the analysis, the DeepMoji library will be used to visualize the sentiments; this will be a (hopefully) fun way of understanding what is happening over time. It is also interesting to see whether the results of these two analyses differ in any way.

# ## 3.1. The sentiment over time

# ### 3.1.1. Preparing
# The first function that needs to be built is one that cleans a raw tweet. Tweets contain a lot of elements that, even though interesting, are not in the scope of a sentiment analysis.
#
# **Step 1:** A function that cleans a tweet.
# In[62]:

def clean_this(raw_tweet):
    text = raw_tweet  # extract text
    text = text.split('http', 1)[0]  # remove links
    text = text.split('pic.', 1)[0]  # remove pictures
    text = text.lower()  # lower case
    text = re.sub(r'(\s)@\w+', r'\1', text)  # remove mentions
    text = re.sub(r'(\B)#\w+', r'\1', text)  # remove hashtags
    text = nltk.word_tokenize(text)  # tokenize text
    text = [token.lower() for token in text if token.isalpha()]  # remove punctuation and numbers
    text = [word for word in text if word not in stopwords.words('english')]  # remove stopwords
    text = list(set(text))  # only return unique tokens
    return text

# After cleaning a tweet, we need to calculate its happiness. For that, we will use a word-happiness data file that was shown during the course, named `Data_Set_S1.txt`. For now, let's assume this data comes in a `happy_data` matrix.
#
# **Step 2:** A function that calculates the happiness of a tweet.

# In[63]:

def how_happy(tokens):
    happinness_counter = 0.0
    happy_word_counter = 0
    for word in tokens:
        if word in happy_words:
            happy_word_counter += 1
            happinness_counter += happy_data[np.where(happy_words == word)[0][0], 2]
    if happy_word_counter == 0:
        return happinness_counter
    else:
        return happinness_counter / happy_word_counter

# After this, we need to import all of our libraries, create a connection to our `mongoDB` database, extract the text file mentioned in **Step 2**, and take care of some other boring stuff.
#
# **Step 3:** Importing libraries

# In[66]:

from pymongo import MongoClient
import pymongo
from nltk.corpus import stopwords
import nltk
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from pprint import pprint
import matplotlib.style as style
import emoji
import seaborn
from collections import Counter
get_ipython().run_line_magic('pylab', 'inline')
style.use('fivethirtyeight')

# In[67]:

# Getting the sentiment analysis file and putting it into a matrix
path = 'ressources/Data_Set_S1.txt'
header = ['words', 'hap.rank', 'hap.avg', 'hap.std', 'tw.rank', 'goog.rank', 'nyt_rank', 'lyr_rank']
happy_data = pd.read_table(filepath_or_buffer=path, header=2).as_matrix()
happy_words = happy_data[:, 0]

# Database housekeeping: fields to exclude from the query results
display_conditions = {"query_criteria": 0, "_id": 0, "geo": 0, "mentions": 0, "hashtags": 0,
                      "favorites": 0, "permalink": 0, "username": 0, "id": 0}

# ### 3.1.2. Extracting the sentiment
# Before analyzing, let's explain a few things. How can tweet sentiment be plotted over time?
#
# Of course we could plot every tweet, but this would produce a very cluttered plot that is hard to understand. Our approach was to sort our database chronologically and to **plot the average sentiment of all of the tweets for every hour**.
#
# Finally, it is interesting to see whether the general sentiment is at all related to the most popular tweets; for this, we establish a threshold (only keep tweets with more than X retweets) and apply the same procedure.
#
# By doing this, the plot becomes both easier to understand and to conceptualise.
#
# **Step 1:** Importing the tweets in chronological order.
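# (For reference, the quantity computed below for every hour-long bin $t$ containing $N_t$ tweets is simply the mean word-level happiness
#
# $$ S_t = \frac{1}{N_t} \sum_{i=1}^{N_t} h(\mathrm{tweet}_i), $$
#
# where $h(\cdot)$ is the score returned by `how_happy` above. The popular-tweet curve is the same average restricted to tweets with at least `retweet_threshold` retweets. This only restates the averaging procedure described above in symbols.)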
# In[68]:

db_my_tweets = sorted(list(tweet_collection.find({}, display_conditions)), key=lambda x: x['date'])

# Lastly, let's group our tweets by day and hour.

# In[70]:

import itertools
tweetsByDay = [list(g) for k, g in itertools.groupby(db_my_tweets, key=lambda t: t['date'].date())]
tweetsByHourByDay = [[list(g) for k, g in itertools.groupby(day, key=lambda t: t['date'].hour)] for day in tweetsByDay]

# The output should be a sorted list of lists, where each inner list contains tweets from the same hour. Let's check this by printing 5 tweets from the first hour of the first day and from the first hour of the second day.

# In[71]:

for t in tweetsByHourByDay[0][0][:5]:
    print t['date']
print '--------------------------'
for t in tweetsByHourByDay[1][0][:5]:
    print t['date']

# It seems to be working (note that the earliest tweet we obtained is from 2017-10-02 08:40:22).

# **Step 2:** Preparing the processing.

# In[72]:

sentiment = []  # here the sentiment is stored
pop_sentiment = []  # here the sentiment of popular tweets is stored
periods = []  # here the period is stored; for example, 11 November at 23:00 will be stored as '11/11 at 23'
happinessInHour = 0.0  # here the total happiness in the hour is stored
happinessInHourPopular = 0.0
retweet_threshold = 100
pop_tweet_counter = 0
text = ''

# In[74]:

# sentiment = pickle.load(open('sentiment','rb'))
# periods = pickle.load(open('periods','rb'))
# pop_sentiment = pickle.load(open('pop_sentiment','rb'))
# text = pickle.load(open('allText','rb'))

tweet_counter = 0
index = 0
for day in tweetsByHourByDay:
    for hidx, hour in enumerate(day):
        for tweet in hour:
            clean_text = clean_this(tweet['text'])
            happinessInHour += how_happy(clean_text)
            text += ' '.join(word for word in clean_text)
            if tweet['retweets'] >= retweet_threshold:  # if the tweet is 'popular'
                pop_tweet_counter += 1
                happinessInHourPopular += how_happy(clean_text)
            index += 1
            print 'Processing tweet {} / {} \r'.format(index, len(db_my_tweets)),
        averageHourHappiness = happinessInHour / len(hour)
        sentiment.append(averageHourHappiness)
        if pop_tweet_counter > 0:
            averageHourPopularHappiness = happinessInHourPopular / pop_tweet_counter
            pop_sentiment.append(averageHourPopularHappiness)
        else:
            pop_sentiment.append(float('nan'))
        periods.append('{}/{} at {}'.format(tweet['date'].day, tweet['date'].month, tweet['date'].hour))
        # Reset values
        happinessInHour = 0.0
        happinessInHourPopular = 0.0
        pop_tweet_counter = 0

print 'We got a total of {} sentiment windows.'.format(len(sentiment))

# Now that we have the sentiment vector, we can plot the sentiment over time. Note that the period labels and some axis labels were removed; this is deliberate, in order to increase readability.
#
# **Step 3:** Plotting the sentiment of all of the tweets and the sentiment of the popular tweets.

# In[84]:

# allHours = [hour for hour in day for day in tweetsByHourByDay]
x = np.arange(len(sentiment))
style.use('ggplot')

# defining titles and axis names
plt.title('Sentiment Timeline', fontsize=20)
plt.xlabel('Hours after event', fontsize=17)
plt.ylabel('Normalized Sentiment Index', fontsize=17)

# some styling
plt.tick_params(axis='both', which='major', labelsize=12)
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)

# and finally, we plot
plt.plot(x, sentiment, linewidth=2, label='General Sentiment', color='#50514f')
plt.scatter(x, pop_sentiment, linewidth=2, label='Popular Tweet Sentiment', color='#f25f5c')
plt.legend(loc=1, prop={'size': 15})
plt.xlim(min(x), max(x))
pylab.rcParams['figure.figsize'] = (30, 10)
plt.show()

# A couple of things are worth noting:
#
# + In the first hours after the shooting there is a large number of "popular" tweets about the event. This is expected due to the topic's popularity; moreover, their number decreases significantly over time.
#
# + Surprisingly, the variance in the happiness of the general tweets increases over time. This may be because the number of tweets in a given hour is much smaller once the topic is no longer as hot.
#
# + Popular tweets have, on average, a higher normalized sentiment index. This can be due to a number of reasons; one of them might be that popular users are usually more "expressive".
#
# + Among the popular tweets we see some points on the zero axis. When these tweets are inspected, we see that they are the ones which only shared a link or an image.

# Let's look at the popular tweet with the greatest normalized index.

# In[85]:

allHours = []  # All the hours, serialized
for day in tweetsByHourByDay:
    for hour in day:
        allHours.append(hour)

pop_sentiment_nanreplace = [i if str(i) != 'nan' else -1 for i in pop_sentiment]  # replace the nans with -1 to get the max
ind = pop_sentiment.index(max(pop_sentiment_nanreplace))  # get the index of the hour with the maximum sentiment index
for tweet in allHours[ind]:  # Look at all the popular tweets in that hour
    if tweet['retweets'] >= retweet_threshold:
        print tweet['text']

# As can be seen from this example, the sentiment analysis tool used in this part, even though correct, is limited. Since we do not analyze hashtags and usernames, in this particular case only the word 'Cartoon' was seen by the analyzer, which gives the tweet a high score. However, there are a lot of popular tweets, so the overall insights still hold.

# ## 3.2. WordCloud
# The main idea of this part of the sentiment analysis is to get a visual representation of the most used words throughout our database.
#
# To accomplish this task, we will use the very handy **WordCloud** library.

# ### 3.2.1. Preparing
# Let's start by importing some much-needed libraries.
#
# **Step 1:** Importing Libraries

# In[86]:

from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import urllib, cStringIO

# **Step 2:** Getting all the text from the database
# The idea in this part is to put all of the tweets into one long string called `text`, cleaning them along the way. This string was already built in Step 2 of part 3.1.2.

# ### 3.2.2. Creating the wordcloud
# We start by selecting a nice image for the wordcloud contour, in this case an image of Texas, which can be found at the link below.
#
# **Step 1:** Get a nice image

# In[87]:

image_path = cStringIO.StringIO(urllib.urlopen('https://i.imgur.com/kftApxC.png').read())
texas_mask = np.array(Image.open(image_path))

# **Step 2:** Avoiding obvious words

# In[88]:

stopwords = set(STOPWORDS)

# Now that we have all of the elements to plot it, let's finally do it.
#
# **Step 3:** Plotting everything nicely.

# In[89]:

# defining the wordcloud with stopwords and some edgy styling choices
word_cloud = WordCloud(mask=texas_mask, width=800, height=400, background_color="white",
                       collocations=False, colormap='inferno', stopwords=stopwords,
                       normalize_plurals=False).generate(text)

# plot it
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# A couple of interesting points worth mentioning:
#
# + Most of the words were expected: 'Vegas', 'Guns', 'Victims' and 'Shooter'.
#
# + From these keywords we can estimate some of the common content of the tweets. For example, the word 'prayer' conveys a sentiment.
#
# + There are also some **less expected** words: terms such as "trump" and "white" seem to carry a strong political connotation.

# ## 3.3. Emoji Analysis

# ### 3.3.1. Preparing
# The final part of the sentiment analysis is all about emoji. We started by using a very simple version of DeepMoji, where an "emoji" score was given to each tweet. In our database, each tweet now possesses the field "deepmoji", where we find the 5 most likely emoji that characterize that tweet, along with the "reliability" of each one of them.
#
# **Step 1:** Creating a dictionary of emojis from a txt file.

# In[90]:

emoji_index = {}
with open('ressources/emoji.txt') as f:
    counter = 0
    for line in f:  # for every line
        contents = [x.strip() for x in line.split(',')]  # split line into 2
        emoji_index[counter] = contents  # contents = [name of emoji, url to emoji photo]
        counter += 1

# **Step 2:** A simple example

# In[91]:

# define what we will not need from Mongo
display_conditions = {"query_criteria": 0, "_id": 0, "geo": 0, "mentions": 0, "hashtags": 0,
                      "favorites": 0, "permalink": 0, "username": 0, "retweets": 0, "id": 0}
tweets = tweet_collection.find({'deepmoji': {'$exists': True}}, display_conditions)[43:45]

# for 2 tweets, extract the deepmoji field
for t in tweets:
    emoji_list = [t['deepmoji']['Emoji_1'], t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
                  t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
    print 'Tweet: ', t['text']
    print 'Emojis:',
    for emoji_number in emoji_list:
        print emoji.emojize(emoji_index[emoji_number][0], use_aliases=True),
    print '\n'

# We can note that DeepMoji gives a pretty accurate characterization of the sentences; not perfect, but accurate enough.

# ### 3.3.2. The most frequent emoji in the whole dataset.
# The first idea for the emoji/sentiment analysis is to visualize which emojis are used the most across the whole dataset.
#
# **Step 1:** Count the occurrences of every emoji

# In[92]:

# get emojis in a list
db_my_tweets = sorted(list(tweet_collection.find({}, display_conditions)), key=lambda x: x['date'])
mega_list = []
for t in db_my_tweets:
    emoji_list = [t['deepmoji']['Emoji_1'], t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
                  t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
    mega_list += emoji_list

# get a counter of that list
counter_ = Counter(mega_list)
labels, values = zip(*counter_.items())
indexes = np.arange(len(labels))

# **Step 2:** A histogram of all of the emojis

# In[94]:

plt.barh(labels, values, color=['#50514f', '#f25f5c', '#ffe066', '#247ba0'])
plt.yticks(range(len(labels)), [emoji_index[i][0][1:-1] for i in range(len(labels))], fontsize=14)
plt.xlabel('Emoji Frequency')
plt.ylabel('Emoji Name')
pylab.rcParams['figure.figsize'] = (20, 15)
plt.title('Most Used Emojis in DataSet')
plt.show()

# **Step 3:** A simpler way to visualize.
# In[82]:

# print the top 20 emojis and their frequency
top = 20
top_list = counter_.most_common(top)
print 'The top {} sentiments according to deepmoji:'.format(top)
for i in range(len(top_list)):
    item_emoji = top_list[i][0]
    item_frequency = top_list[i][1]
    print i + 1, '.', emoji.emojize(emoji_index[item_emoji][0], use_aliases=True), 'with', item_frequency, 'characterizations.'

# Above you can see the emojis which describe the tweets best. Some interesting things to note:
#
# + As expected, **heart** is the most important, because people are generally tweeting their condolences.
# + The 3rd most frequent is the crying face; again, this reflects the feeling about the event.
# + 8th place goes to the gun emoji, which occurs when a tweet is gun-related.
#
# Overall, this list gives an idea of the sentiment around the event.

# ### 3.3.3. The sentiment over time characterized by emoji.
# The goal of this part of the analysis is to see how the emoji characterization evolves over time. For example, does the characterization of a tweet by 'gun' change over time? If yes, how?

# **Step 1:** Call the tweets that we need.

# In[95]:

db_my_tweets = sorted(list(tweet_collection.find({}, display_conditions)), key=lambda x: x['date'])

# The next step is a bit rough: basically, we want to build a matrix called `emoji_grid`, whose rows are the 64 possible emojis and whose columns are periods of time. The element in position [i, j] of `emoji_grid` is therefore the normalized number of characterizations by emoji `i` in period `j`.
#
# In this case, we will look at the characterizations in 12-hour periods since the shooting and see how the classification evolves.
#
# **Step 2:** A matrix that stores the emoji frequency per time period

# In[96]:

# start the matrix
emoji_grid = np.zeros((len(emoji_index.values()), 1))
column = np.zeros((len(emoji_index.values()), 1))

# define important variables before the loop
absolute_hour = 0
tweets_in_period = 0
periods = []
hours_passed = 0
period_length = 12

# for every tweet
for t in db_my_tweets:
    tweets_in_period += 1  # tweet counter for the period of time
    tweet_hour = t['date'].hour
    tweet_day = t['date'].day
    tweet_month = t['date'].month

    # extract the deepmoji classification
    tweet_emoji_list = [t['deepmoji']['Emoji_1'], t['deepmoji']['Emoji_2'], t['deepmoji']['Emoji_3'],
                        t['deepmoji']['Emoji_4'], t['deepmoji']['Emoji_5']]
    for emoji_number in tweet_emoji_list:
        column[emoji_number, 0] += 1

    # this counter counts the hours that have passed
    if tweet_hour != absolute_hour:
        hours_passed += 1
        absolute_hour = tweet_hour

    # if X hours have passed, append those deepmoji classifications to the master emoji_grid
    if hours_passed == period_length:
        periods.append('{}/{} at {}'.format(tweet_day, tweet_month, tweet_hour))
        emoji_grid = np.hstack((emoji_grid, column / tweets_in_period))  # here we normalize
        column = np.zeros((len(emoji_index.values()), 1))
        hours_passed = 0
        tweets_in_period = 0

emoji_grid = np.delete(emoji_grid, 0, 1)  # deletes the redundant first column

# **Step 3:** Plotting the emoji grid.

# In[102]:

# define important variables
plot_top = 5  # only the most frequent emojis are plotted, for simplicity
counter = 0
colors = ['#50514f', '#f25f5c', '#ffe066', '#247ba0', '#70c1b3',
          '#50514f', '#f25f5c', '#ffe066', '#247ba0', '#70c1b3']

# for every emoji (row of the emoji_grid) that is in the top X, plot it over the periods
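# (the argsort expression in the loop below, np.argsort(np.sum(emoji_grid, axis=1))[::-1][:plot_top],
#  selects the indices of the plot_top emojis with the largest total frequency summed over all periods)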
for i in range(emoji_grid.shape[0]):
    if i in np.argsort(np.sum(emoji_grid, axis=1))[::-1][:plot_top]:
        s = plt.plot(range(emoji_grid.shape[1]), emoji_grid[i, :], label=emoji_index[i][0][1:-1],
                     linewidth=2, color=colors[counter])
        counter += 1

# define titles and axis names
plt.title('Deepmoji Characterization Every {} Hours'.format(period_length), fontsize=15)
plt.xlabel('Time', fontsize=12)
plt.ylabel('Normalized Sentiment Frequency', fontsize=12)

# some styling and sizing
plt.tick_params(axis='both', which='major', labelsize=12)
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)
plt.xticks(range(emoji_grid.shape[1]), periods, rotation='vertical')
plt.legend()
plt.grid()
plt.show()

# This figure describes the tweets with emojis over time, using the DeepMoji model. Several things in this graph are interesting and worth describing; let's mention some of them:
#
# + At the beginning of the event, there is a great rise in tweets characterized by the **heart** emoji. At the same point, the use of **broken heart** goes down dramatically.
# + Also interesting is the fact that over almost all 12-hour periods, the **heart** emoji is the most frequent in characterizing tweets.

# # Conclusion
# In this project, we have analyzed the Las Vegas shooting. By working on different graphs, we have shown that it is possible to categorize media: by analyzing the different communities, we have shown that users cluster around particular sources of information, which makes it possible to identify the alternative media outlets through which rumors about the event were spread. Finally, the sentiment analysis gave an overview of how people felt about the event and how those feelings evolved over time.

# In[ ]: