Wrangling of WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.
This document describes our wrangling and data analysis work on the WeRateDogs data. The purpose of this effort was to lead us to interesting insights about the data.
It is an internal report of the wrangling effort: it shows the path we followed and shares the key technical points of this work.
We gather data from three sources:
twitter_archive_enhanced.csv
We load the CSV file into a dataframe using the pandas read_csv() function.
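A minimal sketch of this loading step (df_archive is just the variable name we use in these sketches):

import pandas as pd

# Load the WeRateDogs enhanced archive into a dataframe
df_archive = pd.read_csv('twitter_archive_enhanced.csv')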
First we use the Python requests library to get the file from the provided URL and write it to a local file. Then we load this local TSV file into a dataframe using the pandas read_csv() function, with the option sep='\t'.
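A minimal sketch of this download-and-load step (the URL placeholder and the local file name are illustrative):

import requests
import pandas as pd

# Download the file from the provided URL and save it locally
url = 'PROVIDED URL'  # placeholder for the provided URL
response = requests.get(url)
with open('image_predictions.tsv', mode='wb') as file:  # local file name is illustrative
    file.write(response.content)

# Load the local TSV file into a dataframe
df_image = pd.read_csv('image_predictions.tsv', sep='\t')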
"retweet count" and "favorite count" were missing from the WeRateDogs twitter archive provided to us. We gather this additional information using twitter API. Using the tweet IDs from the WeRateDogs twitter archive, we gather all this missing info through queries towards Twitter's API - we used tweepy.
To connect to Twitter through the API, it was necessary to create a Twitter account and request a developer account here: https://developer.twitter.com/en/docs/basics/developer-portal/overview
The gathering process was the following:
- For each tweet ID in the archive dataframe (df_archive_clean), query the Twitter API for the tweet's JSON data.
- Write each tweet's JSON data to a file, tweet_json.txt, on a new line.
- Once tweet_json.txt is completed, read it line by line into a pandas DataFrame (at a minimum with tweet ID, retweet count and favorite count).
Twitter API creation, how-to:
import tweepy

# Keys and tokens obtained from the Twitter developer account
consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
access_token = 'YOUR ACCESS TOKEN'
access_secret = 'YOUR ACCESS SECRET'

# OAuth authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
The following parameters were important during the API object creation, so that tweepy waits and warns when Twitter's rate limits are reached: wait_on_rate_limit and wait_on_rate_limit_notify. The code is the following:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
Because of those rate limits, creating the JSON file takes roughly 30 minutes.
We got a basic understanding of the tweet JSON object from here: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html. So we chose to collect "created_at", "id", "retweet_count", "favorite_count" and "full_text".
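A minimal sketch of this gathering loop, assuming tweepy 3.x, the api object created above, and the archive dataframe from the first sketch (df_archive); error handling for deleted tweets is simplified:

import json
import pandas as pd

# Query the API for each tweet ID and write the JSON data to tweet_json.txt, one tweet per line
with open('tweet_json.txt', mode='w') as file:
    for tweet_id in df_archive['tweet_id']:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(status._json, file)
            file.write('\n')
        except tweepy.TweepError:
            # Some tweets may have been deleted since the archive was built
            pass

# Read tweet_json.txt line by line and keep only the fields we need
rows = []
with open('tweet_json.txt', mode='r') as file:
    for line in file:
        data = json.loads(line)
        rows.append({'tweet_id': data['id'],
                     'created_at': data['created_at'],
                     'retweet_count': data['retweet_count'],
                     'favorite_count': data['favorite_count'],
                     'full_text': data['full_text']})
df_twitter = pd.DataFrame(rows)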
We used a systematic approach to assess each dataset. During this step-by-step approach, we raised and numbered each issue we encountered. We focused on quality issues (missing data, format, completeness) and tidiness issues.
We prioritised the issues according to our objective for the insights: trying to understand what makes a tweet successful on the WeRateDogs account.
Here is the approach we used, with a sketch after this list:
- Visual inspection with the head() or sample() functions
- Dimensions with shape
- Duplicates with the duplicated() function
- Missing values with isnull().sum()
- Distinct values with unique()
- Data types with dtypes. Go deeper with the type() function if required.
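As an illustration, a minimal sketch of this assessment pass (df stands for any of the three dataframes; the column names are just examples):

# Visual inspection
df.head()
df.sample(5)

# Dimensions, duplicates, missing values
df.shape
df.duplicated().sum()
df.isnull().sum()

# Distinct values, data types (rating_denominator exists in the archive dataframe)
df['rating_denominator'].unique()
df.dtypes
type(df['tweet_id'][0])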
With this way of walking through, we identified the following issues:
For each issue, we used:
While cleaning, we were looking for a way to apply a function to every row of a pandas DataFrame. We found how to do it by searching "function every row pandas" on a search engine:
http://jonathansoma.com/lede/foundations/classes/pandas%20columns%20and%20functions/apply-a-function-to-every-row-in-a-pandas-dataframe/
We used this approach a lot during the cleaning.
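For instance, a minimal sketch of that pattern (the helper function, threshold and new column are purely illustrative, not the actual cleaning operations):

# Apply a function to every row of a dataframe (axis=1 iterates over rows)
def is_popular(row):
    # Illustrative rule: flag tweets with many favorites
    return row['favorite_count'] > 10000

df_twitter_clean['popular'] = df_twitter_clean.apply(is_popular, axis=1)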
For regular expressions, we tested our patterns with https://regex101.com/
On how to use them in a Python program: https://docs.python.org/3/library/re.html#re.findall
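As an illustration of re.findall (the pattern and sample text are just an example, not the exact expressions used during cleaning):

import re

# Illustrative example: extract a 'numerator/denominator' rating from a tweet's text
text = 'This is an example tweet. 13/10 would pet again.'
matches = re.findall(r'(\d+(?:\.\d+)?)/(\d+)', text)
# matches -> [('13', '10')]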
The cleaning of quality issue 9 was done together with the correction of tidiness issue 2.
After the cleaning, we enforced the following requirement: we do not keep tweets beyond August 1st, 2017, because we would not have the associated image predictions for them. We did this by deleting the corresponding rows in the Twitter dataframes.
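A minimal sketch of that filter, assuming the tweet dates live in a 'timestamp' column of the archive dataframe (the column name is an assumption):

import pandas as pd

# Keep only tweets posted before August 1st 2017 ('timestamp' is an assumed column name)
df_archive_clean['timestamp'] = pd.to_datetime(df_archive_clean['timestamp'])
df_archive_clean = df_archive_clean[df_archive_clean['timestamp'] < '2017-08-01']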
We merge the already-cleaned Twitter enhanced archive, df_archive_clean, with the also already-cleaned data collected via the Twitter API, df_twitter_clean. Then we merge the result with the prediction dataframe, df_image_clean. All merges are on the tweet_id column.
We created a single rating column, computed as rating_numerator / rating_denominator.
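A minimal sketch of these merge and rating steps (df_master is just our name for the result):

import pandas as pd

# Merge the cleaned archive with the cleaned API data, then with the image predictions
df_master = pd.merge(df_archive_clean, df_twitter_clean, on='tweet_id')
df_master = pd.merge(df_master, df_image_clean, on='tweet_id')

# Single rating column
df_master['rating'] = df_master['rating_numerator'] / df_master['rating_denominator']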
The merged master dataset is stored as twitter_archive_master.csv. We also stored it in an SQLite database; we learnt how to do this from here: https://stackoverflow.com/questions/50803109/how-to-store-pandas-dataframe-in-sqlite-db#50803252
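A minimal sketch of the storage step (the database file name and table name are illustrative):

import sqlite3

# Store the master dataset as a CSV file
df_master.to_csv('twitter_archive_master.csv', index=False)

# Also store it in an SQLite database
conn = sqlite3.connect('twitter_archive_master.db')  # file name is illustrative
df_master.to_sql('master', conn, if_exists='replace', index=False)
conn.close()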
The last step was to perform analysis and visualization using seaborn.
We tried to develop an understanding of what makes a dog picture a success.
We used categorical plots, as discovered here: https://seaborn.pydata.org/tutorial/categorical.html#categorical-scatterplots
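A minimal sketch of such a plot (the 'dog_stage' column is an assumption about the tidied dataset; favorite_count comes from the API data):

import seaborn as sns
import matplotlib.pyplot as plt

# Categorical plot: favorite count per dog stage ('dog_stage' is an assumed column name)
sns.catplot(data=df_master, x='dog_stage', y='favorite_count', kind='box')
plt.show()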
# Generate the HTML version of this notebook
from subprocess import call
call(['python', '-m', 'nbconvert', 'wrangle_report.ipynb'])