%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
import pandas as pd
import twitter
A basic twitter grab and do something.
First, we need access to the twitter api, which one gets over at twitter's dev site. Sign up as a dev, then go to the twitter apps site and click create a new app. This gives you four, yes four thingamjigs u need to access the API. Why four? why can't it just one thing?
Now this notebook is in github, so step 1 is to put all four of the secret codes in a file which doesn't get uploaded to github. Twitter has a built in module called configparser which parses config files, so I have a config.ini txt file which looks like:
[twitter]
c_key = this_is_a_fake_to_be_replaced_by_real_thingamajig
c_secret = this_is_a_fake_to_be_replaced_by_real_thingamajig
a_token = this_is_a_fake_to_be_replaced_by_real_thingamajig
a_secret = this_is_a_fake_to_be_replaced_by_real_thingamajig
# api keys are in config.ini to keep them outside of this public notebook
import configparser
config = configparser.ConfigParser()
config.read('config.ini')
print(f'The config file has the following sections: {config.sections()}')
if "twitter" in config:
twit = config['twitter']
# check to see if we got all the keys needed to access the twitter api
[key for key in twit]
The config file has the following sections: ['twitter']
['c_key', 'c_secret', 'a_token', 'a_secret']
Now, there are many twitter api libraries but I'm using the python-twitter module, just cause it seems popular and is the first one listed under python libraries.
## define the necessary keys
cKey = twit["c_key"]
cSecret = twit["c_secret"]
aKey = twit["a_token"]
aSecret = twit["a_secret"]
## create the api object with the twitter-python library
api = twitter.Api(consumer_key=cKey,
consumer_secret=cSecret,
access_token_key=aKey,
access_token_secret=aSecret)
api.VerifyCredentials()
User(ID=7914, ScreenName=KO)
All right! we have a succesful api connection to twitter!
this grabs the tweets alongs with a bunch of metadata for each tweet:
## get the user timeline with screen_name = 'KO'
statuses = api.GetUserTimeline(screen_name = 'KO')
print(f"so we got {len(statuses)} statuses, printing the first:")
status = [s for s in statuses][0]
status
so we got 20 statuses, printing the first:
Status(ID=895177279470489601, ScreenName=KO, Created=Wed Aug 09 06:57:49 +0000 2017, Text='RT @Pinboard: This letter to Google from a potential recruit is a stand on principle, but I’m stuck on the first paragraph. Damn. https://t…')
So each status is an object holding all the info about a tweet.
Now, the status object can be resturned as a dictionary, which is handy since we can use that to build a pandas dataframe:
## create a data frame
## first get a list of panda Series
tweets = [t.AsDict() for t in statuses]
## then create the data frame
data = pd.DataFrame(tweets)
data.head()
created_at | favorite_count | favorited | hashtags | id | id_str | in_reply_to_screen_name | in_reply_to_user_id | lang | media | ... | quoted_status_id | quoted_status_id_str | retweet_count | retweeted | retweeted_status | source | text | urls | user | user_mentions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wed Aug 09 06:57:49 +0000 2017 | NaN | NaN | [] | 895177279470489601 | 895177279470489601 | NaN | NaN | en | NaN | ... | 8.946695e+17 | 894669466675621889 | 15.0 | True | {'created_at': 'Wed Aug 09 06:15:01 +0000 2017... | <a href="http://twitter.com/#!/download/ipad" ... | RT @Pinboard: This letter to Google from a pot... | [] | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 55525953, 'name': 'Pinboard', 'screen_... |
1 | Wed Aug 09 06:57:20 +0000 2017 | NaN | NaN | [] | 895177159039430656 | 895177159039430656 | NaN | NaN | en | NaN | ... | NaN | NaN | 4.0 | True | {'created_at': 'Wed Aug 09 06:28:50 +0000 2017... | <a href="http://twitter.com/#!/download/ipad" ... | RT @glcarlstrom: .@TheEconomist scenario of nu... | [] | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 14346260, 'name': 'Gregg Carlstrom', '... |
2 | Wed Aug 09 06:55:08 +0000 2017 | NaN | NaN | [] | 895176604950855680 | 895176604950855680 | NaN | NaN | en | NaN | ... | NaN | NaN | 73.0 | True | {'created_at': 'Tue Aug 08 22:22:25 +0000 2017... | <a href="http://twitter.com/#!/download/ipad" ... | RT @jonathanshainin: I'm biased, but this is o... | [] | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 46073276, 'name': 'Jonathan Shainin', ... |
3 | Wed Aug 09 06:53:36 +0000 2017 | NaN | NaN | [] | 895176215631462400 | 895176215631462400 | NaN | NaN | en | NaN | ... | NaN | NaN | 50.0 | True | {'created_at': 'Wed Aug 09 03:56:50 +0000 2017... | <a href="http://twitter.com/#!/download/ipad" ... | RT @Pinboard: Unpopular but correct opinion: t... | [] | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 55525953, 'name': 'Pinboard', 'screen_... |
4 | Wed Aug 09 06:36:08 +0000 2017 | NaN | NaN | [] | 895171819946356736 | 895171819946356736 | WorkingCopyApp | 7.993167e+17 | en | NaN | ... | NaN | NaN | NaN | NaN | NaN | <a href="http://twitter.com" rel="nofollow">Tw... | @WorkingCopyApp can the app display jupyter no... | [{'expanded_url': 'http://nbviewer.jupyter.org... | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 799316732280274944, 'name': 'Working C... |
5 rows × 21 columns
Now, there is a bunch of columns, most of which we probably won't need, so for analysis can probably drop some of them:
data.columns
Index(['created_at', 'favorite_count', 'favorited', 'hashtags', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_user_id', 'lang', 'media', 'quoted_status', 'quoted_status_id', 'quoted_status_id_str', 'retweet_count', 'retweeted', 'retweeted_status', 'source', 'text', 'urls', 'user', 'user_mentions'], dtype='object')
See twitter timeline doc - this says you can grab at most 200 tweets in one request, for a max of 3,200 tweets altogether.
Now we only grabbed the first 20 tweets with the above, so we need a function which keeps making requests for tweets until we hit twitters 3,200 tweet limit:
def get_tweets(user="KO", limit=20):
# initial batch of tweets
statuses = api.GetUserTimeline(screen_name = user, count=limit)
## create a data frame
## first get a list of panda Series
pdSeriesList = [t.AsDict() for t in statuses]
## then create the data frame
tweets = pd.DataFrame(pdSeriesList)
# now to grab the older ones
while len(statuses) >= 20:
# get the last tweet id and subtract one to make sure we don't get a duplicate tweet
last_tweet_id = tweets.tail(1)["id"].values[0] -1
statuses = api.GetUserTimeline(screen_name = 'KO', max_id=last_tweet_id, count=limit)
pdSeriesList = [t.AsDict() for t in statuses]
tweets = tweets.append(pdSeriesList, ignore_index=True)
return tweets
tweets = get_tweets()
print(tweets.shape)
tweets.head()
(499, 23)
created_at | favorite_count | favorited | hashtags | id | id_str | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_user_id | lang | ... | quoted_status_id_str | retweet_count | retweeted | retweeted_status | source | text | truncated | urls | user | user_mentions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wed Aug 09 06:57:49 +0000 2017 | NaN | NaN | [] | 895177279470489601 | 895177279470489601 | NaN | NaN | NaN | en | ... | 894669466675621889 | 15.0 | True | {'created_at': 'Wed Aug 09 06:15:01 +0000 2017... | <a href="http://twitter.com/#!/download/ipad" ... | RT @Pinboard: This letter to Google from a pot... | NaN | [] | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 55525953, 'name': 'Pinboard', 'screen_... |
1 | Wed Aug 09 06:57:20 +0000 2017 | NaN | NaN | [] | 895177159039430656 | 895177159039430656 | NaN | NaN | NaN | en | ... | NaN | 4.0 | True | {'created_at': 'Wed Aug 09 06:28:50 +0000 2017... | <a href="http://twitter.com/#!/download/ipad" ... | RT @glcarlstrom: .@TheEconomist scenario of nu... | NaN | [] | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 14346260, 'name': 'Gregg Carlstrom', '... |
2 | Wed Aug 09 06:55:08 +0000 2017 | NaN | NaN | [] | 895176604950855680 | 895176604950855680 | NaN | NaN | NaN | en | ... | NaN | 73.0 | True | {'created_at': 'Tue Aug 08 22:22:25 +0000 2017... | <a href="http://twitter.com/#!/download/ipad" ... | RT @jonathanshainin: I'm biased, but this is o... | NaN | [] | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 46073276, 'name': 'Jonathan Shainin', ... |
3 | Wed Aug 09 06:53:36 +0000 2017 | NaN | NaN | [] | 895176215631462400 | 895176215631462400 | NaN | NaN | NaN | en | ... | NaN | 50.0 | True | {'created_at': 'Wed Aug 09 03:56:50 +0000 2017... | <a href="http://twitter.com/#!/download/ipad" ... | RT @Pinboard: Unpopular but correct opinion: t... | NaN | [] | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 55525953, 'name': 'Pinboard', 'screen_... |
4 | Wed Aug 09 06:36:08 +0000 2017 | NaN | NaN | [] | 895171819946356736 | 895171819946356736 | WorkingCopyApp | NaN | 7.993167e+17 | en | ... | NaN | NaN | NaN | NaN | <a href="http://twitter.com" rel="nofollow">Tw... | @WorkingCopyApp can the app display jupyter no... | NaN | [{'expanded_url': 'http://nbviewer.jupyter.org... | {'created_at': 'Tue Oct 10 08:35:25 +0000 2006... | [{'id': 799316732280274944, 'name': 'Working C... |
5 rows × 23 columns
Now we can do some analysis. Say we put all the tweets in a list so we can do something with them:
t = [u for u in tweets['text'].values]
t[:3]
['RT @Pinboard: This letter to Google from a potential recruit is a stand on principle, but I’m stuck on the first paragraph. Damn. https://t…', 'RT @glcarlstrom: .@TheEconomist scenario of nuclear war seems far more plausible now than when it was published (a whole week ago!). https:…', "RT @jonathanshainin: I'm biased, but this is one of the best things I've ever read about the psychology of American exceptionalism: https:/…"]
499
pk_search = api.GetSearch("pakistan")
pk = pd.DataFrame([s.AsDict() for s in pk_search])
print(pk.shape)
pk.head()
(15, 18)
created_at | favorite_count | hashtags | id | id_str | lang | media | quoted_status | quoted_status_id | quoted_status_id_str | retweet_count | retweeted_status | source | text | truncated | urls | user | user_mentions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tue Aug 08 06:04:24 +0000 2017 | 15925.0 | [] | 894801449384910848 | 894801449384910848 | en | [{'display_url': 'pic.twitter.com/DOcW7STnt6',... | NaN | NaN | NaN | 5116.0 | NaN | <a href="http://twitter.com/download/android" ... | It is so satisfying for me to see the reffores... | NaN | [] | {'created_at': 'Fri Mar 12 19:28:06 +0000 2010... | [] |
1 | Mon Aug 07 21:51:54 +0000 2017 | 1113.0 | [] | 894677507370254336 | 894677507370254336 | en | NaN | NaN | NaN | NaN | 585.0 | NaN | <a href="http://twitter.com/download/iphone" r... | The Guardian view on Pakistan and the Panama P... | NaN | [{'expanded_url': 'https://www.theguardian.com... | {'created_at': 'Thu Nov 27 16:37:52 +0000 2008... | [] |
2 | Mon Aug 07 17:51:05 +0000 2017 | 897.0 | [] | 894616901887840257 | 894616901887840257 | en | NaN | {'created_at': 'Mon Aug 07 13:14:19 +0000 2017... | 8.945473e+17 | 894547250860482561 | 326.0 | NaN | <a href="http://twitter.com/download/iphone" r... | Is that why Pakistan's per capita rape ratio i... | True | [{'expanded_url': 'https://twitter.com/i/web/s... | {'created_at': 'Mon Jul 25 11:10:59 +0000 2011... | [] |
3 | Wed Aug 09 07:26:33 +0000 2017 | NaN | [{'text': 'Pakistan'}, {'text': 'CPEC'}, {'tex... | 895184511482376192 | 895184511482376192 | en | NaN | NaN | NaN | NaN | NaN | NaN | <a href="http://twitter.com" rel="nofollow">Tw... | #Pakistan urges South Korea to invest in #CPEC... | NaN | [{'expanded_url': 'http://www.cpecinfo.com/cpe... | {'created_at': 'Tue Jan 26 06:23:32 +0000 2016... | [{'id': 4848532433, 'name': 'CPEC Official', '... |
4 | Wed Aug 09 07:26:32 +0000 2017 | NaN | [] | 895184505815912450 | 895184505815912450 | en | NaN | NaN | NaN | NaN | NaN | NaN | <a href="http://twitter.com/download/android" ... | A Pakistan army major and three soldiers sacri... | NaN | [{'expanded_url': 'https://paktimes.pk/pakista... | {'created_at': 'Mon Dec 05 06:12:04 +0000 2016... | [] |
for t in pk['text'].values:
if "CPEC" in t:
print(t)
#Pakistan urges South Korea to invest in #CPEC #SEZs https://t.co/FLa5LjS1jg via @CPEC_Official @zlj517