Using the Twitter Streaming API

A basic example

In the previous exercise you learned how to harvest tweets that were already posted by using the REST API. In this exercise we continue by harvesting tweets as they are posted, in (semi) real time. It is just a basic example to get you started.

The lines below you already know from the previous exercise:

In [ ]:
from twython import TwythonStreamer
import json, pprint
import urllib
from datetime import datetime, date
import os, sys, subprocess, time
import psycopg2

# get access to the twitter API
APP_KEY = ""
APP_SECRET = ""
OAUTH_TOKEN = ""
OAUTH_TOKEN_SECRET = ""

## just a date and time hack to generate a unique filename if needed
output_file = 'result_' + datetime.now().strftime('%Y%m%d-%H%M%S') + '.csv' 

Setting up a streaming class

The new thing is that we are not going to use the Twython interface from the library but the TwythonStreamer interface. In the code below you see a Python class (MyStreamer) which inherits from the TwythonStreamer interface.

This class has a number of functions. The main ones are: on_success and on_error. The on_success function is called when data has been successfully received from the stream. The parameter data (a dictionary thanks to Twython) contains the tweet, which you can parse out the way you did previously.

In [ ]:
# Class to process JSON data coming from the Twitter streaming API and extract relevant fields
class MyStreamer(TwythonStreamer):
    def on_success(self, data):
        tweet_lat = 0.0
        tweet_lon = 0.0
        tweet_name = ""
        tweet_text = ""
        retweet_count = 0

        if 'id' in data:
            tweet_id = data['id']
        if 'text' in data:
            tweet_text = data['text'].replace("'", "''").replace(';', '')
        if 'coordinates' in data:
            geo = data['coordinates']
            if geo is not None:
                latlon = geo['coordinates']
                tweet_lon = latlon[0]
                tweet_lat = latlon[1]
        if 'created_at' in data:
            dt = data['created_at']
            tweet_datetime = datetime.strptime(dt, '%a %b %d %H:%M:%S +0000 %Y')
        if 'user' in data:
            user = data['user']
            tweet_name = user['screen_name']
        if 'retweet_count' in data:
            retweet_count = data['retweet_count']

        if tweet_lat != 0:
            # some elementary output to the console
            string_to_write = str(tweet_datetime) + ", " + str(tweet_lat) + ", " + str(tweet_lon) + ": " + str(tweet_text)
            print(string_to_write)
            #write_tweet(string_to_write)

    def on_error(self, status_code, data):
        print("OOPS Error: " + str(status_code))
        #self.disconnect()

Filtering the stream

OK, to do it nicely in a Pythonic way: below you see the main procedure, where the MyStreamer class is instantiated (with all authentication tokens) and then only those tweets within a certain bounding box are captured. Have a look at https://twython.readthedocs.org/en/latest/api.html#streaming-interface for more information on what and how to filter the incoming tweet stream.

In [ ]:
# Main procedure
def main():
    try:
        stream = MyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
        print('Connecting to twitter: will take a minute')
    except Exception as e:
        print('OOPS! that hurts, something went wrong while making connection with Twitter: ' + str(e))
        return

    # Filter based on a bounding box; see the Twitter API documentation for more info
    try:
        stream.statuses.filter(locations='3.00,50.00,7.35,53.65')
    except Exception as e:
        print('OOPS! that hurts, something went wrong while getting the stream from Twitter: ' + str(e))


if __name__ == '__main__':
    main()
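The `locations` parameter is a comma-separated bounding box given as west longitude, south latitude, east longitude, north latitude (here roughly covering the Netherlands). A small sketch that builds the same string from named corners, so the coordinate order stays explicit:

```python
# Bounding box for the streaming filter: lon/lat of the south-west corner,
# then lon/lat of the north-east corner (roughly the Netherlands)
west, south, east, north = 3.00, 50.00, 7.35, 53.65
locations = "{:.2f},{:.2f},{:.2f},{:.2f}".format(west, south, east, north)
print(locations)  # 3.00,50.00,7.35,53.65
```

You could then call stream.statuses.filter(locations=locations) with this string.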

And just for completeness: a basic function to write tweets to a file, though you probably figured that out yourself. You can add this write function to the streamer class to write your tweets to a file.

In [ ]:
def write_tweet(tweet, output_file):
    # Append one tweet per line to the output file
    with open(output_file, 'a') as target:
        target.write(tweet)
        target.write('\n')

Beyond the basics

If you are bored and need a challenge, it would be nice not to write to a dull text file but to a real PostGIS database.

PostGIS is installed on OSGeo-Live (see the quick start: http://live.osgeo.org/en/quickstart/postgis_quickstart.html). The cool thing is that you can connect to it directly from QGIS and/or do spatial queries on the database. Another way to view your data is via pgAdmin III, which can be used to inspect and query your PostGIS database.

To be able to connect to a PostGIS database from your Python script you need the psycopg2 library, which you can install with Conda (conda install psycopg2); installing Python libraries this way is easy. Secondly, you need to change one PostgreSQL configuration setting. Open the settings file in the nano text editor with the bash command below:

sudo nano /etc/postgresql/9.5/main/pg_hba.conf

Be careful not to change anything else! Scroll down to the bottom of the text file and change:

local all postgres peer

into:

local all postgres md5

After changing the line press ctrl+x, press y for yes and press enter to save with the same filename. PostgreSQL rereads this file after a reload (sudo service postgresql reload) or a restart; after that you can locally connect to the PostGIS database.

Below you can see how to make a connection to the database. We already prepared a PostGIS database for you: dbname = geoscripting, user = geoscripting and password = geoscripting. If setting up the connection to the database does not work, have a look at this psycopg2 tutorial, this psycopg2 tutorial with PostgreSQL or this Jupyter Notebook with PostgreSQL tutorial.

In [ ]:
## Create a connection to the PostGIS database and create a cursor

try:
    conn = psycopg2.connect("dbname=geoscripting user=geoscripting password=geoscripting host=localhost")
    cur = conn.cursor()
except psycopg2.Error as e:
    print("OOPS error: " + str(e))
# Optionally use host=/var/run/postgresql when running outside of a conda/virtual environment

Once you have a connection and a cursor to the database, you can execute SQL queries (some SQL examples) from Python: creating a table, inserting data into a table, dropping tables, retrieving information from a table, etc.

Using Python as a scripting language to do both the harvesting via Twython and the saving in PostGIS with psycopg2/SQL makes your life easy! Some benefits of PostGIS are: you can write and read features from the database at the same time, save spatial features based on the Simple Features standard, and do spatial processing in the database from third-party applications (see this discussion about PostGIS vs MySQL).

Go ahead and create a new table in the PostGIS database to store your tweet data, and save some artificial tweet data in the database.

In [ ]:
# Create a table in the PostGIS database
create_query = """CREATE TABLE {table_name} 
                (tweet_name varchar(50),
                tweet_lat varchar(50),
                tweet_lon varchar(50),
                tweet_text varchar(500));""".format(table_name="geoscripting")
# the format function formats your string to: """CREATE TABLE geoscripting (etc.);"""

# Execute and commit the query
cur.execute(create_query)
conn.commit()
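Note that the .format call is plain Python string formatting, so you can inspect the exact SQL string before it is sent. This is only safe because the table name is hard-coded; never format untrusted input into a query (psycopg2 provides placeholders and its sql module for that). A quick check of what the query looks like:

```python
# Build the same CREATE TABLE string and inspect it before sending it off
create_query = """CREATE TABLE {table_name}
                (tweet_name varchar(50),
                tweet_lat varchar(50),
                tweet_lon varchar(50),
                tweet_text varchar(500));""".format(table_name="geoscripting")

print(create_query.splitlines()[0])  # CREATE TABLE geoscripting
```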

You can inspect your data in the PostGIS database with QGIS, pgAdmin III or via psql in the terminal. Give pgAdmin III a go: go to Applications --> Programming --> pgAdmin III. Once the software has opened, add a new connection to a server. Fill in:

  • name: geoscripting
  • host: localhost
  • maintenance db: postgres
  • username: geoscripting
  • password: geoscripting

In the object browser you should see a server called geoscripting: open it --> databases --> geoscripting (name of database) --> schemas --> public --> tables --> geoscripting (name of table). You should see four columns.

Add some data to the columns by firing another query.

In [ ]:
# Mock-up data
tweet_name = "Geoscripting"
tweet_lat = 52.1235
tweet_lon = 5.1425
tweet_text = "#Geoscripting"

# Load into postgis database
insert_query = """INSERT INTO {table_name} VALUES(%s, %s, %s, %s)""".format(table_name="geoscripting")
# %s is a psycopg2 placeholder: the values are passed separately as a tuple and quoted safely
data = (str(tweet_name), str(tweet_lat), str(tweet_lon), str(tweet_text))

# Execute and commit query
cur.execute(insert_query, data)
conn.commit()

Retrieve some of the data from your PostGIS database by letting the cursor fetch your data.

In [ ]:
# Retrieve data from PostGIS database
cur.execute("SELECT * FROM {table_name}".format(table_name = "geoscripting")) 
twitter_data = cur.fetchall()
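fetchall() returns a list of tuples, one per row, in the column order of the table. A minimal sketch of unpacking the rows (using mock data here in place of cur.fetchall(), so it runs without a database connection):

```python
# Mock result standing in for twitter_data = cur.fetchall()
twitter_data = [("Geoscripting", "52.1235", "5.1425", "#Geoscripting")]

# Each row is a tuple in the column order of the table
for tweet_name, tweet_lat, tweet_lon, tweet_text in twitter_data:
    line = tweet_name + " @ (" + tweet_lat + ", " + tweet_lon + "): " + tweet_text
    print(line)  # Geoscripting @ (52.1235, 5.1425): #Geoscripting
```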

When you no longer need the connection to the database, close the cursor and the connection!

In [ ]:
# Close cursor that executed query
cur.close()

# Close connection to database
conn.close()

We have seen real-time tweet harvesting and saving at work. Now get to work yourself and add the code that saves your tweets in the PostGIS database to the streamer class you used before. This could be the result of your exercise script, or you can try something else.
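As a starting point, here is a sketch of a helper that turns an incoming tweet dictionary into the parameter tuple expected by the INSERT query above; inside MyStreamer.on_success you could then call cur.execute(insert_query, tweet_to_row(data)) followed by conn.commit(). The helper name and the fallback values are our own choices for this exercise, not part of the Twitter API:

```python
# Sketch: convert a tweet dictionary (as delivered to on_success) into the
# (name, lat, lon, text) tuple used by the INSERT query; the helper name
# and defaults are hypothetical choices for this exercise
def tweet_to_row(data):
    geo = data.get('coordinates') or {'coordinates': [0.0, 0.0]}
    lon, lat = geo['coordinates']
    return (data.get('user', {}).get('screen_name', ''),
            str(lat), str(lon),
            data.get('text', '').replace("'", "''").replace(';', ''))

example = {'user': {'screen_name': 'Geoscripting'},
           'coordinates': {'coordinates': [5.1425, 52.1235]},
           'text': '#Geoscripting'}
print(tweet_to_row(example))  # ('Geoscripting', '52.1235', '5.1425', '#Geoscripting')
```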

Don't forget to close the connection to the database when you don't need it anymore!!!