by Jude Pineda
First, let's import pandas, along with the json module for parsing the raw tweets.
import json
import pandas as pd
# Enable inline plotting
%matplotlib inline
This function takes a file with one JSON-encoded tweet per line, in our case stoonTweets.json, and fills a pandas dataframe with it.
def pop_tweets(inputFile):
    # Project proposal outlines these columns:
    # text, author, timestamp, hashtags, retweet count, location (for geotagged tweets), and source
    # Declare a new data frame with pandas, with some specific column names
    tweets = pd.DataFrame(columns=[
        'userHandle', 'text', 'timestamp', 'location', 'retweet count', 'source'
    ])
    # Open the text file that contains the tweets we collected (closed automatically on exit)
    with open(inputFile, "r") as tweets_file:
        # Read the text file line by line
        for line in tweets_file:
            # Load the JSON information
            tweet = json.loads(line)
            # If the tweet isn't empty, add it to the data frame
            if 'text' in tweet:
                tweets.loc[len(tweets)] = [
                    tweet['user']['screen_name'], tweet['text'],
                    tweet['created_at'], tweet['place']['full_name'],
                    tweet['retweet_count'], tweet['source'],
                ]
    return tweets
# Populate the pandas dataframe with our JSON file
yxe_tweets = pop_tweets('stoonTweets.json')
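Appending to a dataframe one row at a time with .loc gets slow as the file grows, and a tweet whose place field is null would raise an error in the function above. A minimal sketch of an alternative, assuming the same one-tweet-per-line input: collect plain dictionaries first and build the dataframe once at the end. The name parse_tweet_lines, the COLUMNS constant, and the choice to store None for a missing place are assumptions for illustration, not part of the original project.

```python
import json

COLUMNS = ['userHandle', 'text', 'timestamp', 'location', 'retweet count', 'source']

def parse_tweet_lines(lines):
    """Turn an iterable of JSON lines into a list of row dictionaries."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        if 'text' not in tweet:
            continue  # skip records that carry no tweet text
        place = tweet.get('place') or {}  # 'place' may be missing or null
        rows.append({
            'userHandle': tweet['user']['screen_name'],
            'text': tweet['text'],
            'timestamp': tweet['created_at'],
            'location': place.get('full_name'),
            'retweet count': tweet['retweet_count'],
            'source': tweet['source'],
        })
    return rows

# Two made-up records for illustration: one geotagged, one without a place
sample = [
    json.dumps({'user': {'screen_name': 'example_user'}, 'text': 'hello #yxe',
                'created_at': 'Mon Dec 04 21:13:31 +0000 2017',
                'place': {'full_name': 'Saskatoon, Saskatchewan'},
                'retweet_count': 0, 'source': 'Twitter for iPhone'}),
    json.dumps({'user': {'screen_name': 'other_user'}, 'text': 'no place here',
                'created_at': 'Mon Dec 04 21:11:45 +0000 2017',
                'place': None, 'retweet_count': 2, 'source': 'Twitter Web Client'}),
    '',
]
rows = parse_tweet_lines(sample)
print(len(rows), rows[1]['location'])
```

With a real file, the rows could then be passed to pandas in one step, for example pd.DataFrame(parse_tweet_lines(open('stoonTweets.json')), columns=COLUMNS).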
Let's do some cleaning up on the dataframe. We remove the <a href=...> tags from the source column, and some extraneous information in the timestamp column. Luckily, this is easy to do using pandas:
# Not really. Here's a list of all the regular expressions I tried to properly strip the input.
# yxe_tweets['source'] = yxe_tweets.source.str.replace('\<a href="?"\>,?' , '')
# FULL REGEX: (\<a href\=.+\>)(.+)(\<\/a\>)
# df.sport.str.replace(r'(^.*ball.*$)', 'ball sport')
# df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
# yxe_tweets['source'] = yxe_tweets['source'].str.lstrip('\<a href\=\".+\" rel=\".+\"\>').str.rstrip('\<\/a\>')
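# Raw strings avoid invalid escape sequences; regex=True is explicit because newer pandas defaults to literal matching
yxe_tweets['source'] = yxe_tweets['source'].str.replace(r'<a href=".+"\s*rel="nofollow">', '', regex=True)
yxe_tweets['source'] = yxe_tweets['source'].str.replace('</a>', '', regex=False)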
# yxe_tweets['timestamp'] = yxe_tweets.timestamp.str.replace("\+0000 2017", '')
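The two replacements above can also be expressed as one pattern that captures the link text and discards the surrounding tag. A small sketch with the standard re module; the function name strip_anchor and the sample string are made up for illustration (the sample only imitates the shape of the source field).

```python
import re

# Capture whatever sits between the opening <a ...> tag and the closing </a>
ANCHOR = re.compile(r'<a\s[^>]*>(.*?)</a>')

def strip_anchor(source):
    """Return the link text of an <a> element, or the input unchanged if there is none."""
    match = ANCHOR.search(source)
    return match.group(1) if match else source

sample = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(strip_anchor(sample))
print(strip_anchor('Untappd'))
```

In pandas, the same pattern should work on the whole column with a backreference: yxe_tweets['source'].str.replace(r'<a\s[^>]*>(.*?)</a>', r'\1', regex=True).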
Now let's see our dataframe content.
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_rows', 50)
yxe_tweets
timestamp (index) | userHandle | text | timestamp | location | retweet count | source |
---|---|---|---|---|---|---|
2017-12-04 03:13:31-06:00 | TheCandyShow | Gord Downie Was Celebrated For Championing Indigenous Rights. Now That He's Gone, Do People Still Care?… https://t.co/eITxoT6skw | 2017-12-04 21:13:31 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 03:11:45-06:00 | LuxquisiteStyle | Sunday Dec 10th 12-4pm! Our last pop-up before Christmas! 🎄 Check out the link for details. https://t.co/07HHHTBGsa https://t.co/gSvHJadsxT | 2017-12-04 21:11:45 | Luxquisite Clothing | 0 | Twitter for iPhone |
2017-12-04 03:08:20-06:00 | notsogoodal | Met a person who would not stop hitting me in the arm with the back of her hand during our entire conversation. Why… https://t.co/1woJhOOkwW | 2017-12-04 21:08:20 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 03:08:06-06:00 | snowded | @arifbobat @Mark_Nilsen_ @thisagileguy @tobiasmayer @alshalloway @RonJeffries Human systems tend to order - it may… https://t.co/6JmX7aH2kr | 2017-12-04 21:08:06 | Saskatoon, Saskatchewan | 0 | Tweetbot for Mac |
2017-12-04 03:08:00-06:00 | Skaboomatude | Drinking an Angus Stout by @9milelegacy @ The Rook & Raven — https://t.co/6c9gWxexPj | 2017-12-04 21:08:00 | Saskatoon, Saskatchewan | 0 | Untappd |
2017-12-04 03:04:57-06:00 | TheCandyShow | Look at what @aircanada hooked me up with for my birthday!! #skyqueen #birthdaygirl @ Sheraton… https://t.co/U92W59yRGM | 2017-12-04 21:04:57 | Saskatoon, Saskatchewan | 0 | |
2017-12-04 03:01:11-06:00 | Stoon_Slar | @BarristerSecret Kinda like the quivalent of, “The dog wrote my homework.” | 2017-12-04 21:01:11 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 03:00:01-06:00 | WASHDUDE | True. Just sayin if you get stuck. It’s hard to dig out with pot cap limit there https://t.co/noDY19TkD4 | 2017-12-04 21:00:01 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:53:11-06:00 | snowded | @Mark_Nilsen_ @arifbobat @thisagileguy @tobiasmayer @alshalloway @RonJeffries Yep - made it more explicit with the… https://t.co/iCk4fB1uUu | 2017-12-04 20:53:11 | Saskatoon, Saskatchewan | 0 | Tweetbot for Mac |
2017-12-04 02:50:31-06:00 | snowded | @alshalloway @Mark_Nilsen_ @arifbobat @thisagileguy @tobiasmayer @RonJeffries And your flex site is good old cybern… https://t.co/PLJERgKk3B | 2017-12-04 20:50:31 | Saskatoon, Saskatchewan | 0 | Tweetbot for Mac |
2017-12-04 02:49:15-06:00 | DarrenUlmerCHS | @SPSTraffic if I get a ticket for failing to stop at a red light from an officer, and a Red-light camera ticket for… https://t.co/cV7TTK9ilZ | 2017-12-04 20:49:15 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:49:13-06:00 | WASHDUDE | Not that I know of. Probably closest is great falls. You know Montana has pot caps on right ? https://t.co/ZRAgcAvj9i | 2017-12-04 20:49:13 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:47:38-06:00 | snowded | @alshalloway @Mark_Nilsen_ @arifbobat @thisagileguy @tobiasmayer @RonJeffries Total nonsense -if you increase const… https://t.co/xl8EFgbkQ2 | 2017-12-04 20:47:38 | Saskatoon, Saskatchewan | 0 | Tweetbot for Mac |
2017-12-04 02:44:18-06:00 | SaskRushLAX | .@TheWHL Teddy Bear Toss games are upon us! 🐻\n\nIn the spirit of the holiday season, we've got a gift for our #Rush… https://t.co/q6QE87THyF | 2017-12-04 20:44:18 | Saskatoon, Saskatchewan | 1 | Twitter Web Client |
2017-12-04 02:42:51-06:00 | sibbsniel | Perhaps I would understand the statements better if you explain the statements. https://t.co/rodwEyAvhs | 2017-12-04 20:42:51 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:27:19-06:00 | WASHDUDE | Lol probably https://t.co/xnqPsSUhE2 | 2017-12-04 20:27:19 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:26:22-06:00 | rjosegodoy | @LifeSite It is the devil’s plan to destroy humanity. First it was abortion, then the non-creative same sex marriag… https://t.co/rHTRiyVIk9 | 2017-12-04 20:26:22 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:23:25-06:00 | yew_jan | ❤️ ditto 🙌🏾 https://t.co/wlTxfbcbJJ | 2017-12-04 20:23:25 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:19:07-06:00 | graison | #GrowYXE https://t.co/jyMkTTUjmG | 2017-12-04 20:19:07 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:19:04-06:00 | sibbsniel | Do you guys in the MDC use cheap cars? Let’s start there? https://t.co/S1ScqX6NEa | 2017-12-04 20:19:04 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:14:30-06:00 | sibbsniel | Please whilst this is true. It’s partial in the sense that during the same time a lot of people were killed by some… https://t.co/jjGkm5PmMh | 2017-12-04 20:14:30 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:12:34-06:00 | rgchernick | @RealAlexD @Rgcoppens @Jenngurl84 @jonathanwhudson @DebStrickland65 @shuds16 @Tcoppens @VincentRule @StricklandJess… https://t.co/eVdMJ6n0eB | 2017-12-04 20:12:34 | Regina, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:09:24-06:00 | Thunderhowl | @PoMoAnachro It makes me a little angry, to be honest. Like, I’m a huge man...have you forgotten that I’m literally… https://t.co/Q5y7TVYXg5 | 2017-12-04 20:09:24 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-12-04 02:04:28-06:00 | sibbsniel | These are quite worrying stats. That factors contributed to these patterns https://t.co/kox9JpuliB | 2017-12-04 20:04:28 | Saskatoon, Saskatchewan | 1 | Twitter for iPhone |
2017-12-04 01:54:08-06:00 | cambird | For those who question why Trump pulled out of the Paris Climate Accord https://t.co/SuVowJQt6w | 2017-12-04 19:54:08 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
... | ... | ... | ... | ... | ... | ... |
2017-11-24 05:38:50-06:00 | lollipopdragon | The 1960s were a simpler time when a good looking Scotsman in short shorts could just grab a maid's key ring withou… https://t.co/gRnaT2WI6P | 2017-11-24 23:38:50 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 05:38:08-06:00 | chefOReilly | Burger Life....\n•\n•\n•\n•\n📷 simonworobec \n•\n•\n•\n•\n#plantbased #plantbasednutrition… https://t.co/N7NBnxthcB | 2017-11-24 23:38:08 | Saskatoon, Saskatchewan | 0 | |
2017-11-24 05:31:50-06:00 | toddintune | I like the bass. | 2017-11-24 23:31:50 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-11-24 05:30:08-06:00 | lollipopdragon | I can't believe they named the eye candy masseuse "Dink" #Goldfinger https://t.co/fDPJnFXrvD | 2017-11-24 23:30:08 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 05:25:24-06:00 | hjnaidu | Totally forgot it is stupid Black Friday today and made a haircut appointment at the mall. Parking is insane. | 2017-11-24 23:25:24 | Saskatoon, Saskatchewan | 0 | Tweetbot for iΟS |
2017-11-24 05:14:09-06:00 | lollipopdragon | This movie is mase by Albert R Broccoli and Harry Saltzman. No word on if Sgt. Pepper was involved. #Goldfinger https://t.co/nFTCifFAKi | 2017-11-24 23:14:09 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 05:13:29-06:00 | UnaSaskatoon | Take advantage of the nice weather and come in and try our feature buffalo chicken pizza!!😋👌🏼🍕 | 2017-11-24 23:13:29 | Saskatoon, Saskatchewan | 1 | Twitter for iPhone |
2017-11-24 05:09:38-06:00 | MattYoungCTV | WATCH! Were recapping the week in politics! Will be LIVE with @seanlesliectv at 5:20 @ctvsaskatoon #yxe | 2017-11-24 23:09:38 | Saskatoon, Saskatchewan | 1 | Twitter for iPhone |
2017-11-24 05:07:36-06:00 | lollipopdragon | James has ditched his wet suit, revealing a perfectly pressed tux, gotten to safety, found a bar, & made eye contac… https://t.co/PVyLGevqYo | 2017-11-24 23:07:36 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 05:06:17-06:00 | Livil | @ryanmeili It's good to see you are strong pro-choice | 2017-11-24 23:06:17 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 05:05:14-06:00 | PaulSeesequasis | ᓂᑕᐋᐧᐦᑖᐤ [nitawâhtâw] - s/he searches for something in an optimistic way | 2017-11-24 23:05:14 | Saskatoon, Saskatchewan | 28 | Twitter for iPhone |
2017-11-24 04:59:57-06:00 | lollipopdragon | Drums marked NITRO, a silly snake tube of plastic explosives, a clock attached to a 9v battery. What could possibly… https://t.co/vGi0OCz5zb | 2017-11-24 22:59:57 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 04:54:36-06:00 | YXEcarpenter | Now when people tell me they want a "wood tone" I have a visual aid to help them get more specific. https://t.co/dmZVIePFjX | 2017-11-24 22:54:36 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-11-24 04:53:05-06:00 | Whiteboy_Slim | The first solo and acoustic Whiteboy Slim show since 2012. At Prairie Ink Restaurant (McNally… https://t.co/JPsrBua6vU | 2017-11-24 22:53:05 | Saskatoon, Saskatchewan | 0 | |
2017-11-24 04:47:01-06:00 | ZedandBreakfast | Did Hellebuyck just headbutt the puck?? #nhljets | 2017-11-24 22:47:01 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 04:45:25-06:00 | lollipopdragon | OMG JAMES BOND IS WEARING A DUCK SNORKEL DISGUISE IN THE FIRST SCENE #007 #Goldfinger | 2017-11-24 22:45:25 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 04:44:29-06:00 | ingloriusbutter | @BobWeeksTSN Lavar Ball | 2017-11-24 22:44:29 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-11-24 04:36:34-06:00 | askmommafran | @fakeGAINER Good luck to you gopher! | 2017-11-24 22:36:34 | Saskatoon, Saskatchewan | 1 | Twitter for iPhone |
2017-11-24 04:34:49-06:00 | lollipopdragon | Starting my late shift off with a viewing of Goldfinger. #007 https://t.co/wE7ruC0Yhb | 2017-11-24 22:34:49 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 04:16:29-06:00 | AprilSora | Runnin’ back to Saskatoon! #BatOutOfHell #Itwasahotsummernight… https://t.co/Nmxkj4IZeA | 2017-11-24 22:16:29 | Toronto, Ontario | 0 | |
2017-11-24 04:12:11-06:00 | MiEnergy_ | Good article on some local #solar projects. Great job @skenvsociety and the solar co-op. https://t.co/tmnT86E3Vh | 2017-11-24 22:12:11 | Saskatoon, Saskatchewan | 3 | Twitter for iPhone |
2017-11-24 04:09:58-06:00 | CarmenBochek | @Shaw_CFL @Simoni_Lawrence @BSinopoli @GreyCupFestival @CFL Love it!! | 2017-11-24 22:09:58 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-11-24 04:09:47-06:00 | Kurxsuki | Welp, we passed out parking spot and when we made a u turn some one fucking took it. How is your Black Friday going? 😤 | 2017-11-24 22:09:47 | Saskatoon, Saskatchewan | 0 | Twitter for iPhone |
2017-11-24 04:08:10-06:00 | crazy_orca_lady | Said goodbye today to my wonderful dog of almost 14 years. The hardest decision I've ever had to make. He was surro… https://t.co/DghWFIXIx0 | 2017-11-24 22:08:10 | Saskatoon, Saskatchewan | 0 | Twitter for Android |
2017-11-24 04:07:12-06:00 | smpchev | Get a FREE $500 Gift Card to Midtown Plaza with any qualifying purchase! Offer is for this… https://t.co/TxiaJkuaNj | 2017-11-24 22:07:12 | Saskatoon, Saskatchewan | 1 | |
2690 rows × 6 columns
Let's start off our analysis by finding the most retweeted tweet during the week. This is easily obtained by sorting the dataframe on the retweet count column in descending order. Let's check the top 10 most retweeted tweets of the past week:
most_retweets = yxe_tweets.sort_values('retweet count', ascending=False)
most_retweets.head(10)[['userHandle', 'text', 'retweet count']]
index | userHandle | text | retweet count |
---|---|---|---|
2075 | sibbsniel | The Police Commissioner Chihuri & his top brace must GO for all the corrupt ills& brutality of ZRP. If you agree wi… https://t.co/8k09dHHwtY | 268 |
1557 | Bruiser_17 | CONTEST TIME! 👏🏼\n2 SEASON TICKETS FOR @SaskRushLAX HOME OPENER ON SAT. DEC. 23RD! \nALL YOU HAVE TO DO IS RETWEET &… https://t.co/JBymQS054j | 95 |
385 | sibbsniel | President ED, you can no longer keep the incompetent Dr Parirenyatwa in cabinet whilst the people’s voice is a big… https://t.co/ocyPf1iM3I | 91 |
729 | SophieIsZeus | It be ya own sibling #sabbathedition https://t.co/h16SxQticu | 85 |
1841 | emily_complido | Ikaw Lang ang aking mahal ang pag ibig mo'y aking kailangan Guys, pa trend natin pleeeaseee\n\nInigoPascual LiveAtSho… https://t.co/w2cB0iHJrP | 59 |
2349 | PaulSeesequasis | ᓈᑲᑌᔨᒥᓱᐃᐧᐣ [nâkateyimisowin] - The act of looking after oneself; the act of nurturing oneself; self-control. | 57 |
1831 | SaskRushLAX | Time for a friendly competition #RushNation! Let's take a vote to see who you think grew it better? \n\n❤️ for… https://t.co/NYKCO8lpU8 | 57 |
1830 | JarisSwidrovich | Saskatoon folks (and visitors to Saskatoon):\n \nPlease familiarize yourself with this list of warm-up locations avai… https://t.co/OK9qLG3oKu | 55 |
1493 | SkWanderer | I’m not going to lie, I had no idea that #Saskatchewan was home to a company that produces and ships ambulances all… https://t.co/bygXTpmLm5 | 43 |
1880 | PaulSeesequasis | Saskatoon tomorrow for Marlene Bird https://t.co/9BIa2zCwuX | 33 |
Let's find the user with the most tweets during this week, and take the top 10 tweeters:
yxe_tweets['userHandle'].value_counts().head(10)
sibbsniel          75
prxdspxxky         58
SophieIsZeus       57
MattYoungCTV       47
basementgalaxy     46
Denise13F          45
OmayraIssa         41
PaulSeesequasis    39
PocketFullOFish    37
lollipopdragon     34
Name: userHandle, dtype: int64
In a similar fashion, we'll take the most-used platform for publishing tweets:
yxe_tweets['source'].value_counts()
Twitter for iPhone     1575
Twitter for Android     669
Instagram               247
Twitter Web Client       90
Tweetbot for iΟS         35
Twitter for iPad         24
Tariox                   11
Tweetbot for Mac          9
Foursquare                9
TweetMyJOBS               8
circlepix                 5
Untappd                   3
I Heart Locations         2
Hootsuite                 1
dlvr.it                   1
New Foodpages             1
Name: source, dtype: int64
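Raw counts are easier to compare as shares of the total. The sketch below recomputes percentages from the counts printed above using only the standard library; in pandas, value_counts(normalize=True) gives the same proportions directly. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above
source_counts = {
    'Twitter for iPhone': 1575, 'Twitter for Android': 669, 'Instagram': 247,
    'Twitter Web Client': 90, 'Tweetbot for iΟS': 35, 'Twitter for iPad': 24,
    'Tariox': 11, 'Tweetbot for Mac': 9, 'Foursquare': 9, 'TweetMyJOBS': 8,
    'circlepix': 5, 'Untappd': 3, 'I Heart Locations': 2, 'Hootsuite': 1,
    'dlvr.it': 1, 'New Foodpages': 1,
}

total = sum(source_counts.values())

# Percentage of all tweets published from each platform
shares = {name: 100 * count / total for name, count in source_counts.items()}

# Show the three largest platforms
top_three = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:3]
for name, share in top_three:
    print(f'{name:<22}{share:5.1f}%')
```

The counts sum to 2690, which matches the number of rows in the dataframe.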
For this data insight, we'll use the Natural Language Toolkit (NLTK) to determine which words are used most frequently in tweets:
# from collections import Counter
# import nltk
# import matplotlib
# import matplotlib.pyplot as plt
# top_N = 30
# df['text'] = yxe_tweets['text']
# nltk.download('stopwords')
# stopwords = nltk.corpus.stopwords.words('english')
# # TODO: Filter out punctuation
# # RegEx for stopwords
# RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# # replace '|'-->' ' and drop all stopwords
# words = (df.text
# .str.lower()
# .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
# .str.cat(sep=' ')
# .split()
# )
# # generate DF out of Counter
# rslt = pd.DataFrame(Counter(words).most_common(top_N),
# columns=['Word', 'Frequency']).set_index('Word')
# print(rslt)
# # plot
# # rslt.plot.bar(rot=0, figsize=(16,10), width=0.8)
############################################################
import nltk
from nltk.corpus import stopwords
# fetch the stopword list if it isn't already available locally
nltk.download('stopwords')
# get english stopwords (a set makes membership checks fast)
stop = set(stopwords.words('english'))
texts = yxe_tweets['text']
yxe_tweets['timestamp'] = pd.to_datetime(pd.Series(yxe_tweets['timestamp']))
tokens = []
# lowercase each word and strip punctuation marks
for text in texts.values:
    tokens.extend([word.lower().strip(':,."-') for word in text.split()])
# drop stopwords and any tokens left empty after stripping
filtered_tokens = [word for word in tokens if word and word not in stop]
freq_dist = nltk.FreqDist(filtered_tokens)
# plot the 25 most frequent tokens
freq_dist.plot(25)
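Splitting on whitespace leaves links and @mentions in the token list, and those tend to crowd the counts without saying much about content. A rough sketch of a stricter tokenizer, using only the standard library: the regular expressions, the tiny stopword set (a stand-in for NLTK's English list), and the function name tokenize are all assumptions for illustration. The sample texts are taken from the tweets shown earlier.

```python
import re
from collections import Counter

# Stand-in for stopwords.words('english'); deliberately tiny for the example
STOP = {'the', 'a', 'an', 'to', 'of', 'and', 'in', 'for', 'is', 'at', 'i'}

URL_OR_MENTION = re.compile(r'https?://\S+|@\w+')
WORD = re.compile(r"#?\w+(?:'\w+)?")

def tokenize(text):
    """Lowercase a tweet, drop links and mentions, and return word/hashtag tokens."""
    text = URL_OR_MENTION.sub(' ', text.lower())
    return [tok for tok in WORD.findall(text) if tok not in STOP]

sample_texts = [
    'Starting my late shift off with a viewing of Goldfinger. #007 https://t.co/wE7ruC0Yhb',
    'I like the bass.',
    '@fakeGAINER Good luck to you gopher!',
]

counts = Counter(tok for text in sample_texts for tok in tokenize(text))
print(counts.most_common(5))
```

The resulting list can be handed to nltk.FreqDist in place of filtered_tokens; whether hashtags should keep their leading # is a judgment call that depends on the analysis.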
In this data insight, we'll plot a histogram of when tweets are posted over the course of a day. Plotting the locations where the Saskatoon tweets come from is still to be done (TODO).
import vincent
yxe_tweets['timestamp'] = pd.to_datetime(pd.Series(yxe_tweets['timestamp']))
# set index to 'timestamp'
yxe_tweets.set_index('timestamp', drop=False, inplace=True)
yxe_tweets.index = yxe_tweets.index.tz_localize('GMT').tz_convert('America/Regina')
# shift the index back 12 hours (this offsets the times themselves; it does not change how they are displayed)
yxe_tweets.index = yxe_tweets.index - pd.DateOffset(hours = 12)
# count tweets in 5-minute bins
yxe_tweets_pm = yxe_tweets['timestamp'].resample('5min').count()
# create time series graph via Vincent
vincent.core.initialize_notebook()
area = vincent.Area(yxe_tweets_pm)
area.colors(brew='Spectral')
area.display()
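To make the resampling step concrete, here is a standard-library sketch of the same idea: parse created_at strings, convert from UTC to Saskatoon time, and count tweets per 5-minute bin. It assumes a fixed UTC-6 offset (Saskatoon stays on Central Standard Time all year) instead of the America/Regina zone entry, and the helper name bin_start is made up for the example. The sample times come from the first rows of the table shown earlier, written in the created_at format.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-6 offset stands in for the America/Regina zone
SASK = timezone(timedelta(hours=-6))
CREATED_AT_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def bin_start(created_at, minutes=5):
    """Parse a created_at string and floor it to the start of its local-time bin."""
    local = datetime.strptime(created_at, CREATED_AT_FORMAT).astimezone(SASK)
    floored_minute = local.minute - local.minute % minutes
    return local.replace(minute=floored_minute, second=0, microsecond=0)

samples = [
    'Mon Dec 04 21:13:31 +0000 2017',
    'Mon Dec 04 21:11:45 +0000 2017',
    'Mon Dec 04 21:08:20 +0000 2017',
    'Mon Dec 04 21:08:06 +0000 2017',
]

# Number of tweets that fall into each 5-minute bin
per_bin = Counter(bin_start(s) for s in samples)
for start, count in sorted(per_bin.items()):
    print(start.strftime('%Y-%m-%d %H:%M'), count)
```

Each UTC time lands six hours earlier on the local clock, so the tweet sent at 21:13:31 UTC falls in the 15:10 bin.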
TODO