In [27]:

%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sns
from IPython import display

import pandas as pd
import twitter

A basic twitter grab and do something.

make a twitter dev account and get api keys¶

First, we need access to the twitter api, which one gets over at twitter's dev site. Sign up as a dev, then go to the twitter apps site and click create a new app. This gives you four, yes four thingamjigs u need to access the API. Why four? why can't it just one thing?

Now this notebook is in github, so step 1 is to put all four of the secret codes in a file which doesn't get uploaded to github. Twitter has a built in module called configparser which parses config files, so I have a config.ini txt file which looks like:

[twitter]

c_key = this_is_a_fake_to_be_replaced_by_real_thingamajig
c_secret = this_is_a_fake_to_be_replaced_by_real_thingamajig 

a_token = this_is_a_fake_to_be_replaced_by_real_thingamajig
a_secret = this_is_a_fake_to_be_replaced_by_real_thingamajig

Now to read the keys into our python script/notebook¶

In [17]:

# api keys are in config.ini to keep them outside of this public notebook
import configparser
config = configparser.ConfigParser()
config.read('config.ini')

print(f'The config file has the following sections: {config.sections()}')

if "twitter" in config:
    twit = config['twitter']

# check to see if we got all the keys needed to access the twitter api
[key for key in twit]

The config file has the following sections: ['twitter']

Out[17]:

['c_key', 'c_secret', 'a_token', 'a_secret']

using python to access the twitter api¶

Now, there are many twitter api libraries but I'm using the python-twitter module, just cause it seems popular and is the first one listed under python libraries.

In [237]:

## define the necessary keys
cKey = twit["c_key"]
cSecret = twit["c_secret"]
aKey = twit["a_token"]
aSecret = twit["a_secret"]

## create the api object with the twitter-python library
api = twitter.Api(consumer_key=cKey,
                  consumer_secret=cSecret,
                  access_token_key=aKey,
                  access_token_secret=aSecret)
api.VerifyCredentials()

Out[237]:

User(ID=7914, ScreenName=KO)

All right! we have a succesful api connection to twitter!

get tweets from a user¶

this grabs the tweets alongs with a bunch of metadata for each tweet:

In [238]:

## get the user timeline with screen_name = 'KO'
statuses = api.GetUserTimeline(screen_name = 'KO')
print(f"so we got {len(statuses)} statuses, printing the first:")
status = [s for s in statuses][0]
status

so we got 20 statuses, printing the first:

Out[238]:

Status(ID=895177279470489601, ScreenName=KO, Created=Wed Aug 09 06:57:49 +0000 2017, Text='RT @Pinboard: This letter to Google from a potential recruit is a stand on principle, but I’m stuck on the first paragraph. Damn. https://t…')

So each status is an object holding all the info about a tweet.

Now, the status object can be resturned as a dictionary, which is handy since we can use that to build a pandas dataframe:

In [239]:

## create a data frame
## first get a list of panda Series
tweets = [t.AsDict() for t in statuses]

## then create the data frame
data = pd.DataFrame(tweets)

data.head()

Out[239]:

	created_at	favorite_count	favorited	hashtags	id	id_str	in_reply_to_screen_name	in_reply_to_user_id	lang	media	...	quoted_status_id	quoted_status_id_str	retweet_count	retweeted	retweeted_status	source	text	urls	user	user_mentions
0	Wed Aug 09 06:57:49 +0000 2017	NaN	NaN	[]	895177279470489601	895177279470489601	NaN	NaN	en	NaN	...	8.946695e+17	894669466675621889	15.0	True	{'created_at': 'Wed Aug 09 06:15:01 +0000 2017...	<a href="http://twitter.com/#!/download/ipad" ...	RT @Pinboard: This letter to Google from a pot...	[]	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 55525953, 'name': 'Pinboard', 'screen_...
1	Wed Aug 09 06:57:20 +0000 2017	NaN	NaN	[]	895177159039430656	895177159039430656	NaN	NaN	en	NaN	...	NaN	NaN	4.0	True	{'created_at': 'Wed Aug 09 06:28:50 +0000 2017...	<a href="http://twitter.com/#!/download/ipad" ...	RT @glcarlstrom: .@TheEconomist scenario of nu...	[]	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 14346260, 'name': 'Gregg Carlstrom', '...
2	Wed Aug 09 06:55:08 +0000 2017	NaN	NaN	[]	895176604950855680	895176604950855680	NaN	NaN	en	NaN	...	NaN	NaN	73.0	True	{'created_at': 'Tue Aug 08 22:22:25 +0000 2017...	<a href="http://twitter.com/#!/download/ipad" ...	RT @jonathanshainin: I'm biased, but this is o...	[]	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 46073276, 'name': 'Jonathan Shainin', ...
3	Wed Aug 09 06:53:36 +0000 2017	NaN	NaN	[]	895176215631462400	895176215631462400	NaN	NaN	en	NaN	...	NaN	NaN	50.0	True	{'created_at': 'Wed Aug 09 03:56:50 +0000 2017...	<a href="http://twitter.com/#!/download/ipad" ...	RT @Pinboard: Unpopular but correct opinion: t...	[]	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 55525953, 'name': 'Pinboard', 'screen_...
4	Wed Aug 09 06:36:08 +0000 2017	NaN	NaN	[]	895171819946356736	895171819946356736	WorkingCopyApp	7.993167e+17	en	NaN	...	NaN	NaN	NaN	NaN	NaN	<a href="http://twitter.com" rel="nofollow">Tw...	@WorkingCopyApp can the app display jupyter no...	[{'expanded_url': 'http://nbviewer.jupyter.org...	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 799316732280274944, 'name': 'Working C...

5 rows × 21 columns

Now, there is a bunch of columns, most of which we probably won't need, so for analysis can probably drop some of them:

In [240]:

data.columns

Out[240]:

Index(['created_at', 'favorite_count', 'favorited', 'hashtags', 'id', 'id_str',
       'in_reply_to_screen_name', 'in_reply_to_user_id', 'lang', 'media',
       'quoted_status', 'quoted_status_id', 'quoted_status_id_str',
       'retweet_count', 'retweeted', 'retweeted_status', 'source', 'text',
       'urls', 'user', 'user_mentions'],
      dtype='object')

grabbing more tweets¶

See twitter timeline doc - this says you can grab at most 200 tweets in one request, for a max of 3,200 tweets altogether.

Now we only grabbed the first 20 tweets with the above, so we need a function which keeps making requests for tweets until we hit twitters 3,200 tweet limit:

In [241]:

def get_tweets(user="KO", limit=20):
    # initial batch of tweets
    statuses = api.GetUserTimeline(screen_name = user, count=limit)
    
    ## create a data frame
    ## first get a list of panda Series
    pdSeriesList = [t.AsDict() for t in statuses]

    ## then create the data frame
    tweets = pd.DataFrame(pdSeriesList)

    # now to grab the older ones
    
    while len(statuses) >= 20:
        # get the last tweet id and subtract one to make sure we don't get a duplicate tweet
        last_tweet_id = tweets.tail(1)["id"].values[0] -1
        statuses = api.GetUserTimeline(screen_name = 'KO', max_id=last_tweet_id, count=limit)
        
        pdSeriesList = [t.AsDict() for t in statuses]
        tweets = tweets.append(pdSeriesList, ignore_index=True)
        
    return tweets

tweets = get_tweets()

In [242]:

print(tweets.shape)
tweets.head()

(499, 23)

Out[242]:

	created_at	favorite_count	favorited	hashtags	id	id_str	in_reply_to_screen_name	in_reply_to_status_id	in_reply_to_user_id	lang	...	quoted_status_id_str	retweet_count	retweeted	retweeted_status	source	text	truncated	urls	user	user_mentions
0	Wed Aug 09 06:57:49 +0000 2017	NaN	NaN	[]	895177279470489601	895177279470489601	NaN	NaN	NaN	en	...	894669466675621889	15.0	True	{'created_at': 'Wed Aug 09 06:15:01 +0000 2017...	<a href="http://twitter.com/#!/download/ipad" ...	RT @Pinboard: This letter to Google from a pot...	NaN	[]	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 55525953, 'name': 'Pinboard', 'screen_...
1	Wed Aug 09 06:57:20 +0000 2017	NaN	NaN	[]	895177159039430656	895177159039430656	NaN	NaN	NaN	en	...	NaN	4.0	True	{'created_at': 'Wed Aug 09 06:28:50 +0000 2017...	<a href="http://twitter.com/#!/download/ipad" ...	RT @glcarlstrom: .@TheEconomist scenario of nu...	NaN	[]	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 14346260, 'name': 'Gregg Carlstrom', '...
2	Wed Aug 09 06:55:08 +0000 2017	NaN	NaN	[]	895176604950855680	895176604950855680	NaN	NaN	NaN	en	...	NaN	73.0	True	{'created_at': 'Tue Aug 08 22:22:25 +0000 2017...	<a href="http://twitter.com/#!/download/ipad" ...	RT @jonathanshainin: I'm biased, but this is o...	NaN	[]	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 46073276, 'name': 'Jonathan Shainin', ...
3	Wed Aug 09 06:53:36 +0000 2017	NaN	NaN	[]	895176215631462400	895176215631462400	NaN	NaN	NaN	en	...	NaN	50.0	True	{'created_at': 'Wed Aug 09 03:56:50 +0000 2017...	<a href="http://twitter.com/#!/download/ipad" ...	RT @Pinboard: Unpopular but correct opinion: t...	NaN	[]	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 55525953, 'name': 'Pinboard', 'screen_...
4	Wed Aug 09 06:36:08 +0000 2017	NaN	NaN	[]	895171819946356736	895171819946356736	WorkingCopyApp	NaN	7.993167e+17	en	...	NaN	NaN	NaN	NaN	<a href="http://twitter.com" rel="nofollow">Tw...	@WorkingCopyApp can the app display jupyter no...	NaN	[{'expanded_url': 'http://nbviewer.jupyter.org...	{'created_at': 'Tue Oct 10 08:35:25 +0000 2006...	[{'id': 799316732280274944, 'name': 'Working C...

5 rows × 23 columns

we got tweets in a dataframe!¶

Now we can do some analysis. Say we put all the tweets in a list so we can do something with them:

In [243]:

t = [u for u in tweets['text'].values]
t[:3]

Out[243]:

['RT @Pinboard: This letter to Google from a potential recruit is a stand on principle, but I’m stuck on the first paragraph. Damn. https://t…',
 'RT @glcarlstrom: .@TheEconomist scenario of nuclear war seems far more plausible now than when it was published (a whole week ago!). https:…',
 "RT @jonathanshainin: I'm biased, but this is one of the best things I've ever read about the psychology of American exceptionalism: https:/…"]

In [245]:

Out[245]:

Searching¶

In [246]:

pk_search = api.GetSearch("pakistan")

In [250]:

pk = pd.DataFrame([s.AsDict() for s in pk_search])
print(pk.shape)
pk.head()

(15, 18)

Out[250]:

	created_at	favorite_count	hashtags	id	id_str	lang	media	quoted_status	quoted_status_id	quoted_status_id_str	retweet_count	retweeted_status	source	text	truncated	urls	user	user_mentions
0	Tue Aug 08 06:04:24 +0000 2017	15925.0	[]	894801449384910848	894801449384910848	en	[{'display_url': 'pic.twitter.com/DOcW7STnt6',...	NaN	NaN	NaN	5116.0	NaN	<a href="http://twitter.com/download/android" ...	It is so satisfying for me to see the reffores...	NaN	[]	{'created_at': 'Fri Mar 12 19:28:06 +0000 2010...	[]
1	Mon Aug 07 21:51:54 +0000 2017	1113.0	[]	894677507370254336	894677507370254336	en	NaN	NaN	NaN	NaN	585.0	NaN	<a href="http://twitter.com/download/iphone" r...	The Guardian view on Pakistan and the Panama P...	NaN	[{'expanded_url': 'https://www.theguardian.com...	{'created_at': 'Thu Nov 27 16:37:52 +0000 2008...	[]
2	Mon Aug 07 17:51:05 +0000 2017	897.0	[]	894616901887840257	894616901887840257	en	NaN	{'created_at': 'Mon Aug 07 13:14:19 +0000 2017...	8.945473e+17	894547250860482561	326.0	NaN	<a href="http://twitter.com/download/iphone" r...	Is that why Pakistan's per capita rape ratio i...	True	[{'expanded_url': 'https://twitter.com/i/web/s...	{'created_at': 'Mon Jul 25 11:10:59 +0000 2011...	[]
3	Wed Aug 09 07:26:33 +0000 2017	NaN	[{'text': 'Pakistan'}, {'text': 'CPEC'}, {'tex...	895184511482376192	895184511482376192	en	NaN	NaN	NaN	NaN	NaN	NaN	<a href="http://twitter.com" rel="nofollow">Tw...	#Pakistan urges South Korea to invest in #CPEC...	NaN	[{'expanded_url': 'http://www.cpecinfo.com/cpe...	{'created_at': 'Tue Jan 26 06:23:32 +0000 2016...	[{'id': 4848532433, 'name': 'CPEC Official', '...
4	Wed Aug 09 07:26:32 +0000 2017	NaN	[]	895184505815912450	895184505815912450	en	NaN	NaN	NaN	NaN	NaN	NaN	<a href="http://twitter.com/download/android" ...	A Pakistan army major and three soldiers sacri...	NaN	[{'expanded_url': 'https://paktimes.pk/pakista...	{'created_at': 'Mon Dec 05 06:12:04 +0000 2016...	[]

In [253]:

for t in pk['text'].values:
    if "CPEC" in t:
        print(t)

#Pakistan urges South Korea to invest in #CPEC #SEZs 
https://t.co/FLa5LjS1jg via @CPEC_Official @zlj517

In [ ]: