Harvest social media data (Twitter)

I REST my case

Learning outcomes

Data from social media such as Twitter, Flickr, and Instagram is potentially a rich source of spatial information.

During this exercise you will learn to use Python to:

  • connect to Twitter using the Twython package
  • fire a query to Twitter
  • parse the results and write them to a file

Connecting to Twitter

Before you can start you need access to Twitter data. To do this you have to:

  1. log in using your Twitter account (create one if you don't have one)
  2. go to https://apps.twitter.com/ and go to manage your apps at the bottom of the page
  3. create a new app by filling in the fields (see figure)
  4. The result of creating (actually registering) a new app is that keys (tokens) are generated which allow you to access the Twitter API (if you don't know what an API is, google it)
  5. You can find the tokens under the tab Keys and Access Tokens. By default only a consumer key is generated; you need to press the button on that tab to generate the access token as well

After you have obtained your access tokens, we can start coding a Python script that gives us access to Twitter.

The first thing to do is import the necessary libraries. The most important libraries you need are Twython and json. Have a look at the Twython website and figure out for yourself what it does.

Run the code below: if you don't get an error message, the Twython library is installed properly. If not, open a terminal and install it using pip (pip install twython) or easy_install (easy_install twython). More info here

In [2]:
from twython import Twython
import json
import datetime 

JSON

Before we start the real thing you have to understand something about JSON (if you already know JSON, skip this part). JSON is an important data format for many web applications, and Twitter also makes use of it. It is a lightweight alternative to XML. Make sure that you understand what JSON means and how it structures data: https://en.wikipedia.org/wiki/JSON
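A minimal illustration (the field names here are made up, not real Twitter fields): parsing a JSON string with Python's built-in json module gives you ordinary dictionaries and lists, which you can then index like any other Python data.

```python
import json

# a small, made-up JSON document (not a real tweet)
raw = '{"text": "hello #amsterdam", "user": {"name": "alice", "followers": 42}}'

data = json.loads(raw)             # JSON string -> Python dict
print(data['user']['name'])        # nested access into the parsed structure
print(json.dumps(data, indent=2))  # dict -> pretty-printed JSON string
```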

Connecting to Twitter using Twython

As mentioned, Twython is the library that we use to connect to Twitter. Twitter offers two APIs: the REST API and the Streaming API. This tutorial is about using the REST API.

The REST API provides programmatic access to read and write Twitter data: author a new Tweet, read author profiles and follower data, and more. The responses are available in JSON. Have a look at https://dev.twitter.com/rest/public to get a glance of what is offered.

If you want real-time (OK, almost real-time) access to tweets you can use the Streaming API. The Streaming API gives developers low-latency access to Twitter's global stream of Tweet data. See for more information: https://dev.twitter.com/streaming/overview

Exercise:

In this example we are going to use the REST API to fire some queries to Twitter, process the JSON response to extract what we need from it, and write it to a simple text file we can import into Excel, a database, or even a GIS package.

First of all instantiate a Twython object using the following:

In [3]:
##codes to access the Twitter API: fill in your own keys between the quotes
APP_KEY = ''
APP_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''

##instantiating the Twython object
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

##TODO: This should work as an alternative but it doesn't. Need to find out why
#twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
#ACCESS_TOKEN = twitter.obtain_access_token()
#print(ACCESS_TOKEN)
#twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)

Fire a query to Twitter

OK, we have a connection (at least if you filled in your credentials correctly). Now let's ask Twitter a simple question. For that we use the Twitter Search API (which is part of the REST API). You can find documentation at https://dev.twitter.com/rest/public/search.

We store the results of the query in a variable called search_results (but you may call it whatever you want).

Have a look at the code below and answer the following questions:

  1. what is the meaning of 'q' and what does 'count' mean?
  2. what data structure is search_results?
  3. why do we code it like: search_results['statuses']?
  4. what data structure is result?

Exercise:

In [6]:
 
search_results = twitter.search(q='#amsterdam', count=1)


for result in search_results['statuses']:
    print(result)
{u'contributors': None, u'truncated': False, u'text': u'#Politie #Amsterdam Man zonder rijbewijs maakt brokken: [ Woensdag 20 januari 2016 | Politie ] Man zonder rijb... https://t.co/hxPT9DwM3v', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 689743974907572224, u'favorite_count': 0, u'source': u'<a href="http://twitterfeed.com" rel="nofollow">twitterfeed</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'user_mentions': [], u'hashtags': [{u'indices': [0, 8], u'text': u'Politie'}, {u'indices': [9, 19], u'text': u'Amsterdam'}], u'urls': [{u'url': u'https://t.co/hxPT9DwM3v', u'indices': [114, 137], u'expanded_url': u'http://bit.ly/1WtvURy', u'display_url': u'bit.ly/1WtvURy'}]}, u'in_reply_to_screen_name': None, u'in_reply_to_user_id': None, u'retweet_count': 0, u'id_str': u'689743974907572224', u'favorited': False, u'user': {u'follow_request_sent': False, u'has_extended_profile': False, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 90489299, u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/641610498/z7hhovezh374pfmh1okp.png', u'verified': False, u'profile_text_color': u'000000', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/557631377164607488/RvjbcG5x_normal.png', u'profile_sidebar_fill_color': u'FF0000', u'entities': {u'url': {u'urls': [{u'url': u'http://t.co/PVPg3sxYGB', u'indices': [0, 22], u'expanded_url': u'http://worldtv.com/police-station/web', u'display_url': u'worldtv.com/police-station\u2026'}]}, u'description': {u'urls': [{u'url': u'http://t.co/Xoz1PYPkhq', u'indices': [112, 134], u'expanded_url': u'http://p2000.nl', u'display_url': u'p2000.nl'}, {u'url': u'http://t.co/NAJn2rjgsH', u'indices': [138, 160], u'expanded_url': u'http://oozo.nl/', u'display_url': u'oozo.nl'}]}}, u'followers_count': 1654, u'profile_sidebar_border_color': u'000000', u'id_str': u'90489299', u'profile_background_color': u'FFFFFF', 
u'listed_count': 57, u'is_translation_enabled': False, u'utc_offset': 3600, u'statuses_count': 68119, u'description': u'Thanks for Following !! #State #police #Messages - Look @ Favorites  for More #Politie #FBI #CIA Info , Website http://t.co/Xoz1PYPkhq or http://t.co/NAJn2rjgsH', u'friends_count': 1565, u'location': u'Netherlands ', u'profile_link_color': u'000000', u'profile_image_url': u'http://pbs.twimg.com/profile_images/557631377164607488/RvjbcG5x_normal.png', u'following': False, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/90489299/1411431556', u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/641610498/z7hhovezh374pfmh1okp.png', u'screen_name': u'politieregio', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 39, u'name': u'police Scanner', u'notifications': False, u'url': u'http://t.co/PVPg3sxYGB', u'created_at': u'Mon Nov 16 21:31:47 +0000 2009', u'contributors_enabled': False, u'time_zone': u'Amsterdam', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'nl', u'created_at': u'Wed Jan 20 09:39:13 +0000 2016', u'in_reply_to_status_id_str': None, u'place': None, u'metadata': {u'iso_language_code': u'nl', u'result_type': u'recent'}}

Get it out of JSON

Nasty, isn't it? This is just the plain JSON output, containing a lot of information about the user, the location, and the tweet itself.
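One way to make such a dump readable is json.dumps with an indent. A sketch, using a hand-made stand-in dict instead of a real element of search_results['statuses']:

```python
import json

# stand-in for one element of search_results['statuses'] (not real data)
status = {'text': 'example tweet',
          'user': {'screen_name': 'someone', 'followers_count': 1},
          'place': None}

# indent makes the nesting visible; sort_keys makes the output easier to scan
print(json.dumps(status, indent=2, sort_keys=True))
```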

Have a look at the code below. Our task is to get only the data we are interested in. The tweet information is available in statuses. So what we have to do is loop over all results, store each separate tweet in a variable (in this case called tweet) and extract specific JSON fields from it. Not so difficult, provided that you know what you are looking for and where it is located. The thing is that the JSON is a nested structure with various levels. As mentioned, you can study the structure at https://dev.twitter.com/overview/api/tweets. In the code below you see how to get at it using Python. It helps to realize that the tweet is basically represented in Python as a dictionary with keys and values (thanks to the Twython and json libraries we imported).
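As a small sketch (using a hand-made dict in place of a real tweet): nested fields are reached by chaining keys, and nullable fields such as place must be checked before descending into them, because indexing into None raises an error.

```python
# a minimal stand-in for one tweet from search_results['statuses']
tweet = {
    'text': 'hello #amsterdam',
    'user': {'screen_name': 'alice', 'followers_count': 42},
    'place': None,        # 'Nullable': may be None when no place is attached
    'coordinates': None,  # likewise nullable
}

print(tweet['user']['screen_name'])  # chained keys walk down the nesting

# guard against None before indexing further, otherwise a TypeError is raised
if tweet['place'] != None:
    print(tweet['place']['full_name'])
else:
    print('no place attached')
```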

Questions:

  1. have a look at the place object at https://dev.twitter.com/overview/api/tweets. What does Nullable mean?
  2. what is the meaning of if tweet['place'] != None: and why do we need to code it like this?

Exercise:

  • complete and test the code below by pulling the coordinates from the tweet.

Exercise for the real die-hards (optional):

  • in the place object a bounding box which encloses the place is stored. Write a function (using OGR for example) that calculates the centroid of this bounding box.
In [8]:
##parsing out the fields we need
for tweet in search_results["statuses"]:
    username = tweet['user']['screen_name']
    followers_count = tweet['user']['followers_count']
    tweettext = tweet['text']
    if tweet['place'] != None:
        full_place_name = tweet['place']['full_name']
        place_type = tweet['place']['place_type']
    coordinates = tweet['coordinates']
    if coordinates != None:
        print('oki')
        #do it yourself: enter code here to pull out the coordinates
    print(username)
    print(followers_count)
    print(tweettext)
    #add some output statements that print lat/lon if present
    print('===========================')
politieregio
1654
#Politie #Amsterdam Man zonder rijbewijs maakt brokken: [ Woensdag 20 januari 2016 | Politie ] Man zonder rijb... https://t.co/hxPT9DwM3v
===========================
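As a starting point for the bounding-box exercise, here is a plain-Python sketch of the centroid calculation (no OGR, so it stays self-contained). It assumes the bounding box has the GeoJSON-style shape Twitter uses: a 'coordinates' list holding one ring of [longitude, latitude] corner points; for such a rectangle the centroid is simply the mean of the corners. The example box below is made up.

```python
def bbox_centroid(bounding_box):
    """Return the (lon, lat) centroid of a GeoJSON-style bounding box."""
    ring = bounding_box['coordinates'][0]      # the single ring of corner points
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return (sum(lons) / len(lons), sum(lats) / len(lats))

# example: a made-up box roughly around Amsterdam
box = {'type': 'Polygon',
       'coordinates': [[[4.7, 52.3], [5.1, 52.3], [5.1, 52.4], [4.7, 52.4]]]}
print(bbox_centroid(box))  # approximately (4.9, 52.35)
```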

Get it out of Python

So OK, now you know. The only thing you have to do now is to make it last: in other words, write it to a file (or, if you like, a database; for that see the exercise on harvesting streaming data) so you can use your results in other software for analysis. The simplest thing to do is to adapt the code above so it writes to a delimited file (comma or tab). If you use a comma-delimited file, make sure that you replace existing commas in the tweet text by something else (or by nothing). Otherwise you will screw up the structure of your file. The code snippet below gives some hints. Try to interlace it yourself with the code above.

In [9]:
output_file = 'result.csv' 

target = open(output_file, 'a')  #append mode, so earlier results are kept

target.write(username)  #write one field
target.write('\n')      #end the line; use '\t' between fields for a tab-delimited file
target.close()
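Instead of stripping commas by hand, Python's built-in csv module quotes fields that contain delimiters or newlines for you. A sketch with made-up values standing in for the variables from the parsing loop:

```python
import csv

# stand-ins for values pulled out of tweets in the parsing loop
rows = [
    ('alice', 42, 'a tweet with, a comma'),
    ('bob', 7, 'another tweet'),
]

target = open('result.csv', 'a')     # append mode, as in the snippet above
writer = csv.writer(target)          # handles quoting of commas and newlines
for username, followers_count, tweettext in rows:
    writer.writerow([username, followers_count, tweettext])
target.close()
```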