Harvest social media data (Twitter)

I REST my case

Learning outcomes

Data from social media such as Twitter, Flickr and Instagram is potentially a rich source of spatial information.

During this exercise you will learn to use Python to:

  • connect to Twitter using the Twython package
  • fire a query to Twitter
  • parse the results and write them to a file

Connecting to Twitter

Before we harvest tweets, we need to connect to the Twitter platform as a developer. To do this you have to:

  1. Log in using your Twitter account (create one if you don't have one).
  2. Go to https://apps.twitter.com/ and go to manage your apps at the bottom of the page.
  3. Create a new app by filling in the fields (see figure).
  4. Creating (actually registering) a new app generates keys (tokens) that allow you to access the Twitter API (if you don't know what API means, look it up).
  5. You can find the tokens under the tab Keys and Access Tokens. A consumer key is generated for you.

After you have obtained your access keys for Twitter, we can start writing a Python script that gives us access to Twitter.

The first thing to do is import the necessary libraries. The most important libraries you need are Twython and json. Have a look at the Twython website and work out for yourself what it does.

Run the code below: if you don't get an error message, the Twython library is installed properly. If you do get an error, open a terminal and install it using conda (conda install --name astrolab -c conda-forge twython), pip (pip install twython) or easy_install (easy_install twython). More info here

In [2]:
from twython import Twython
import json
import datetime 


Before we start the real thing you have to understand something about JSON (if you already know JSON, skip this part). JSON is an important data format for many web applications, and Twitter uses it too. It is a lightweight alternative to XML. Make sure that you understand what JSON means and how it structures data: https://en.wikipedia.org/wiki/JSON
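To get a feel for how JSON maps onto Python, the following minimal sketch parses a small JSON string with the json module. The sample text and its field names are made up for illustration; it is not an actual Twitter response.

```python
import json

# A made-up JSON document (hypothetical, not a real Twitter response)
sample_json = '{"user": {"screen_name": "example_user", "followers_count": 10}, "place": null, "tags": ["a", "b"]}'

# json.loads turns JSON text into Python objects:
# objects become dicts, arrays become lists, null becomes None
data = json.loads(sample_json)

print(type(data))                    # the top-level JSON object is a dict
print(data["user"]["screen_name"])   # nested object -> nested dict
print(data["place"] is None)         # JSON null -> Python None
print(data["tags"])                  # JSON array -> Python list
```

Nested values are reached by chaining keys, one level at a time.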

Connecting to Twitter using Twython

As mentioned, Twython is the library that we use to connect to Twitter. Twitter offers two APIs: the REST API and the Streaming API. This tutorial uses the REST API.

The REST APIs provide programmatic access to read and write Twitter data, author a new Tweet, read author profiles and follower data, and more. The responses are available in JSON. Have a look at https://dev.twitter.com/rest/public to get an idea of what is offered.

If you want real-time (OK, almost real-time) access to tweets you can use the Streaming API. The Streaming APIs give developers low-latency access to Twitter’s global stream of Tweet data. For more information see: https://dev.twitter.com/streaming/overview

In this example we are going to use the REST API to fire some queries to Twitter, process the JSON response to extract what we need, and write it to a simple text file that can be imported into Excel, a database, or even a GIS package.

First of all instantiate a Twython object using the following:

In [3]:
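## codes to access the Twitter API (placeholders: replace with your own keys)
APP_KEY, APP_SECRET = 'YOUR_CONSUMER_KEY', 'YOUR_CONSUMER_SECRET'
OAUTH_TOKEN, OAUTH_TOKEN_SECRET = 'YOUR_ACCESS_TOKEN', 'YOUR_ACCESS_TOKEN_SECRET'

## initiating Twython object
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)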

Fire a question to Twitter

OK, we have a connection (at least if you filled in your credentials correctly). Now let's ask Twitter a simple question. For that we use the Twitter Search API (which is part of the REST API). You can find documentation at https://dev.twitter.com/rest/public/search.

The results of the query we store in a variable called search_results.


Have a look at the code below and answer the following questions:

Question 1: What is the meaning of 'q' and what does 'count' mean?

Question 2: What data structure is search_results?

Question 3: Why do we code like: search_results['statuses']?

Question 4: What data structure is result?

In [6]:
search_results = twitter.search(q='#amsterdam', count=1)

for result in search_results['statuses']:
    print(result)  # print the raw tweet
{u'contributors': None, u'truncated': False, u'text': u'#Politie #Amsterdam Man zonder rijbewijs maakt brokken: [ Woensdag 20 januari 2016 | Politie ] Man zonder rijb... https://t.co/hxPT9DwM3v', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 689743974907572224, u'favorite_count': 0, u'source': u'<a href="http://twitterfeed.com" rel="nofollow">twitterfeed</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'user_mentions': [], u'hashtags': [{u'indices': [0, 8], u'text': u'Politie'}, {u'indices': [9, 19], u'text': u'Amsterdam'}], u'urls': [{u'url': u'https://t.co/hxPT9DwM3v', u'indices': [114, 137], u'expanded_url': u'http://bit.ly/1WtvURy', u'display_url': u'bit.ly/1WtvURy'}]}, u'in_reply_to_screen_name': None, u'in_reply_to_user_id': None, u'retweet_count': 0, u'id_str': u'689743974907572224', u'favorited': False, u'user': {u'follow_request_sent': False, u'has_extended_profile': False, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 90489299, u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/641610498/z7hhovezh374pfmh1okp.png', u'verified': False, u'profile_text_color': u'000000', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/557631377164607488/RvjbcG5x_normal.png', u'profile_sidebar_fill_color': u'FF0000', u'entities': {u'url': {u'urls': [{u'url': u'http://t.co/PVPg3sxYGB', u'indices': [0, 22], u'expanded_url': u'http://worldtv.com/police-station/web', u'display_url': u'worldtv.com/police-station\u2026'}]}, u'description': {u'urls': [{u'url': u'http://t.co/Xoz1PYPkhq', u'indices': [112, 134], u'expanded_url': u'http://p2000.nl', u'display_url': u'p2000.nl'}, {u'url': u'http://t.co/NAJn2rjgsH', u'indices': [138, 160], u'expanded_url': u'http://oozo.nl/', u'display_url': u'oozo.nl'}]}}, u'followers_count': 1654, u'profile_sidebar_border_color': u'000000', u'id_str': u'90489299', u'profile_background_color': u'FFFFFF', 
u'listed_count': 57, u'is_translation_enabled': False, u'utc_offset': 3600, u'statuses_count': 68119, u'description': u'Thanks for Following !! #State #police #Messages - Look @ Favorites  for More #Politie #FBI #CIA Info , Website http://t.co/Xoz1PYPkhq or http://t.co/NAJn2rjgsH', u'friends_count': 1565, u'location': u'Netherlands ', u'profile_link_color': u'000000', u'profile_image_url': u'http://pbs.twimg.com/profile_images/557631377164607488/RvjbcG5x_normal.png', u'following': False, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/90489299/1411431556', u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/641610498/z7hhovezh374pfmh1okp.png', u'screen_name': u'politieregio', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 39, u'name': u'police Scanner', u'notifications': False, u'url': u'http://t.co/PVPg3sxYGB', u'created_at': u'Mon Nov 16 21:31:47 +0000 2009', u'contributors_enabled': False, u'time_zone': u'Amsterdam', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'nl', u'created_at': u'Wed Jan 20 09:39:13 +0000 2016', u'in_reply_to_status_id_str': None, u'place': None, u'metadata': {u'iso_language_code': u'nl', u'result_type': u'recent'}}

Get it out of JSON

Nasty, isn't it? This is just the plain JSON output, containing a lot of information about the user, the location, the tweet etc.

Have a look at the code below. Our task is to select the data we are interested in. The tweet information is available in statuses. So what we have to do is loop over all results, store each separate tweet in a variable (in this case called tweet) and extract specific JSON fields from the tweet. This is not so difficult, provided that you know what you are looking for and where it is located. The catch is that the JSON is a nested structure with several levels. You can study the structure at https://dev.twitter.com/overview/api/tweets. The code below shows how to get at the fields using Python. It helps to realize that the tweet is basically represented in Python as a dictionary with keys and values (thanks to the Twython and json libraries we imported).


  • Complete and test the code below by pulling the coordinates from the tweet.

Exercise for the real die-hards (optional):

  • The place object stores a bounding box that encloses the place. Write a function (using OGR for example) that calculates the centroid of this bounding box.
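As a starting point for this optional exercise, here is a minimal sketch in plain Python rather than OGR. It assumes the bounding box is a GeoJSON-style polygon whose first ring lists [longitude, latitude] corners, that the box is an axis-aligned rectangle, and that it does not cross the 180° meridian. The function name bbox_centroid and the sample values are invented for illustration.

```python
def bbox_centroid(bounding_box):
    """Return (lon, lat) of the centre of a rectangular bounding box."""
    # The outer ring of the polygon is the first element of 'coordinates';
    # each corner is assumed to be a [longitude, latitude] pair.
    ring = bounding_box["coordinates"][0]
    lons = [corner[0] for corner in ring]
    lats = [corner[1] for corner in ring]
    # For an axis-aligned rectangle the centroid is the midpoint of the extents
    return ((min(lons) + max(lons)) / 2.0, (min(lats) + max(lats)) / 2.0)


# Hypothetical bounding box (values invented for illustration)
sample_bbox = {
    "type": "Polygon",
    "coordinates": [[[4.0, 52.0], [5.0, 52.0], [5.0, 53.0], [4.0, 53.0]]],
}

centroid = bbox_centroid(sample_bbox)
print(centroid)  # (4.5, 52.5)
```

An OGR-based version would build a polygon geometry from the same corners and ask it for its centroid. For a rectangle the result should be the same.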

Answer these questions while doing the exercise:

Question 5: Have a look at the place object at https://dev.twitter.com/overview/api/tweets. What does Nullable mean?

Question 6: What is the meaning of if tweet['place'] != None: and why do we need to code it like this?

In [8]:
## parsing out
twitter_data = []  # create empty list
for tweet in search_results["statuses"]:
    username = tweet['user']['screen_name']
    followers_count = tweet['user']['followers_count']
    tweettext = tweet['text']
    if tweet['place'] != None:
        full_place_name = tweet['place']['full_name']
        place_type = tweet['place']['place_type']
    coordinates = tweet['coordinates']
    if coordinates != None:
        # do it yourself: enter code here to pull out the coordinates
        pass  # placeholder so the cell runs until you add your own code
    tweet_data = [username, followers_count, tweettext]
    # add coordinates to the list
    twitter_data.append(tweet_data)

print(twitter_data)  # print harvested twitter data
#Politie #Amsterdam Man zonder rijbewijs maakt brokken: [ Woensdag 20 januari 2016 | Politie ] Man zonder rijb... https://t.co/hxPT9DwM3v

Get it out of Python

OK, now you know how to harvest tweets. The only thing left to do is to make your data last, in other words to write it to a file or optionally to a database (for harvesting to a database see harvesting streaming data). The simplest approach is to adapt the code above and write the data to a delimited file (comma or tab). If you use a comma-delimited file, make sure that commas in the tweet text cannot be mistaken for delimiters: either replace them by something else (or by nothing), or let a CSV writer quote the field. Otherwise you will break the column structure of your file. The code snippet below gives some hints. Try to integrate it yourself with the code above.
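The comma problem can be illustrated with a small sketch. It writes one row to an in-memory buffer instead of a real file. The row content is hypothetical. With its default settings, Python's csv module quotes fields that contain the delimiter, so the comma in the text survives a round trip.

```python
import csv
import io

# Hypothetical row: username, followers_count, tweet text containing a comma
row = ["example_user", 10, "Hello, Amsterdam"]

buffer = io.StringIO()        # in-memory stand-in for a file
writer = csv.writer(buffer)
writer.writerow(row)          # the field containing a comma is quoted automatically

line = buffer.getvalue()
print(line)

# Reading the line back yields three fields again, with the comma intact
fields = next(csv.reader(io.StringIO(line)))
print(fields)
```

If you build each line yourself with string concatenation instead, no quoting happens and you have to remove or replace the commas yourself.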

In [9]:
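import csv

output_file = 'twitter_data.csv'

# newline='' (Python 3) prevents blank lines between rows on Windows
with open(output_file, 'w', newline='') as f:
    writer = csv.writer(f)
    for row in twitter_data:
        writer.writerow(row)  # one tweet per line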

Check if your CSV file with the Twitter data is there. If it is, hooray, you just harvested tweets.

Reading a CSV file can also be done with Python (see here for more info). CSV files are a popular import and export data format used in spreadsheets and databases. CSV files have a very simple format and most (if not all) data analysis software can read and write CSV files.
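A minimal sketch of reading such a file with the csv module follows. To keep it self-contained it reads from an in-memory string with made-up rows in the username, followers_count, text layout. With a real file you would use open('twitter_data.csv', newline='') instead of io.StringIO.

```python
import csv
import io

# Hypothetical file content: username, followers_count, tweet text
csv_text = 'user_a,1654,"Hello, Amsterdam"\r\nuser_b,12,Good morning\r\n'

rows = []
with io.StringIO(csv_text) as f:
    for username, followers_count, tweettext in csv.reader(f):
        # csv.reader returns every field as a string, so convert numbers yourself
        rows.append((username, int(followers_count), tweettext))

# A simple query: keep only tweets from users with more than 100 followers
popular = [r for r in rows if r[1] > 100]
print(popular)
```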

If you want to run queries on your Twitter data, you don't need to go to Excel or R: you can use Pandas in Python (Pandas tutorial) to read your CSV files and query them. Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. The dataframe it provides is similar to a dataframe in R, which makes it easy and intuitive to use.

After finishing this basic Twitter harvest tutorial, you can continue with the next tutorial "Harvesting Real-Time Tweets".