Wrangling of WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.
This document describes our wrangling and data analysis work on the WeRateDogs data. The purpose of this effort was to lead us to interesting insights about the data.
It is an internal report of the wrangling effort: it shows the path we followed and shares the key technical points of this work.
We gather data from three sources:
twitter_archive_enhanced.csv
We load the CSV file into a dataframe using the pandas read_csv() function.
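A minimal sketch of this loading step (df_archive is just the variable name we use in these sketches):

import pandas as pd

# Load the WeRateDogs enhanced archive into a dataframe
df_archive = pd.read_csv('twitter_archive_enhanced.csv')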
First we use the Python requests library to get the file from the provided URL and write it to a local file. Then we load this local TSV file into a dataframe using the pandas read_csv() function, with the option sep='\t'.
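A minimal sketch of this download-and-load step (the URL placeholder and the local file name are illustrative):

import requests
import pandas as pd

# Download the file from the provided URL and save it locally
url = 'PROVIDED URL'  # placeholder for the provided URL
response = requests.get(url)
with open('image_predictions.tsv', mode='wb') as file:  # local file name is illustrative
    file.write(response.content)

# Load the local TSV file into a dataframe
df_image = pd.read_csv('image_predictions.tsv', sep='\t')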
"retweet count" and "favorite count" were missing from the WeRateDogs twitter archive provided to us. We gather this additional information using twitter API. Using the tweet IDs from the WeRateDogs twitter archive, we gather all this missing info through queries towards Twitter's API - we used tweepy.
To connect to Twitter through the API, it was necessary to create a Twitter account and request a developer account here: https://developer.twitter.com/en/docs/basics/developer-portal/overview
The gathering process was the following:
- For each tweet ID in the archive dataframe (df_archive_clean), query the Twitter API for the tweet's JSON data.
- Write each tweet's JSON data to a file, tweet_json.txt, on a new line.
- Once tweet_json.txt is completed, read it line by line into a pandas DataFrame (at a minimum with tweet ID, retweet count and favorite count).
Twitter API creation, how-to:
import tweepy

# Keys and tokens obtained from the Twitter developer account
consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
access_token = 'YOUR ACCESS TOKEN'
access_secret = 'YOUR ACCESS SECRET'

# OAuth authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
The following parameters were important during the API object creation, so that tweepy waits and warns when Twitter's rate limits are reached: wait_on_rate_limit and wait_on_rate_limit_notify. The code is the following:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
Because of those rate limits, creating the JSON file takes roughly 30 minutes.
We got a basic understanding of the tweet JSON object from here: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html. So we chose to collect "created_at", "id", "retweet_count", "favorite_count" and "full_text".
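A minimal sketch of this gathering loop, assuming tweepy 3.x, the api object created above, and the archive dataframe from the first sketch (df_archive); error handling for deleted tweets is simplified:

import json
import pandas as pd

# Query the API for each tweet ID and write the JSON data to tweet_json.txt, one tweet per line
with open('tweet_json.txt', mode='w') as file:
    for tweet_id in df_archive['tweet_id']:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(status._json, file)
            file.write('\n')
        except tweepy.TweepError:
            # Some tweets may have been deleted since the archive was built
            pass

# Read tweet_json.txt line by line and keep only the fields we need
rows = []
with open('tweet_json.txt', mode='r') as file:
    for line in file:
        data = json.loads(line)
        rows.append({'tweet_id': data['id'],
                     'created_at': data['created_at'],
                     'retweet_count': data['retweet_count'],
                     'favorite_count': data['favorite_count'],
                     'full_text': data['full_text']})
df_twitter = pd.DataFrame(rows)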
We used a systematic approach to assess each dataset. During this step-by-step approach, we raised and numbered each issue we encountered. We focused on quality issues (missing data, format, completeness) and tidiness issues.
We prioritised the issues according to our objective for the insights: trying to understand what makes a tweet successful on the WeRateDogs account.
Here is the approach we used, with a sketch after this list:
- Visual inspection with the head() or sample() functions
- Dimensions with shape
- Duplicates with the duplicated() function
- Missing values with isnull().sum()
- Distinct values with unique()
- Data types with dtypes. Go deeper with the type() function if required.
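As an illustration, a minimal sketch of this assessment pass (df stands for any of the three dataframes; the column names are just examples):

# Visual inspection
df.head()
df.sample(5)

# Dimensions, duplicates, missing values
df.shape
df.duplicated().sum()
df.isnull().sum()

# Distinct values, data types (rating_denominator exists in the archive dataframe)
df['rating_denominator'].unique()
df.dtypes
type(df['tweet_id'][0])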
With this way of walking through, we identified the following issues:
For each issue, we used:
While cleaning, we were looking for a way to apply a function to every row of a pandas DataFrame. We found how to do it by searching "function every row pandas" on a search engine:
http://jonathansoma.com/lede/foundations/classes/pandas%20columns%20and%20functions/apply-a-function-to-every-row-in-a-pandas-dataframe/
We used this approach a lot during the cleaning.
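For instance, a minimal sketch of that pattern (the helper function, threshold and new column are purely illustrative, not the actual cleaning operations):

# Apply a function to every row of a dataframe (axis=1 iterates over rows)
def is_popular(row):
    # Illustrative rule: flag tweets with many favorites
    return row['favorite_count'] > 10000

df_twitter_clean['popular'] = df_twitter_clean.apply(is_popular, axis=1)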
For regular expressions, we tested our patterns with https://regex101.com/
On how to use them in a Python program: https://docs.python.org/3/library/re.html#re.findall
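As an illustration of re.findall (the pattern and sample text are just an example, not the exact expressions used during cleaning):

import re

# Illustrative example: extract a 'numerator/denominator' rating from a tweet's text
text = 'This is an example tweet. 13/10 would pet again.'
matches = re.findall(r'(\d+(?:\.\d+)?)/(\d+)', text)
# matches -> [('13', '10')]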
The cleaning of quality issue 9 was done together with the correction of tidiness issue 2.
After the cleaning, we enforced the following requirement: we do not keep tweets beyond August 1st, 2017, because we would not have the associated image predictions for them. We did this by deleting the corresponding rows in the Twitter dataframes.
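A minimal sketch of that filter, assuming the tweet dates live in a 'timestamp' column of the archive dataframe (the column name is an assumption):

import pandas as pd

# Keep only tweets posted before August 1st 2017 ('timestamp' is an assumed column name)
df_archive_clean['timestamp'] = pd.to_datetime(df_archive_clean['timestamp'])
df_archive_clean = df_archive_clean[df_archive_clean['timestamp'] < '2017-08-01']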
We merge the already-cleaned Twitter enhanced archive, df_archive_clean, with the also already-cleaned data collected via the Twitter API, df_twitter_clean. Then we merge the result with the prediction dataframe, df_image_clean. All merges are on the tweet_id column.
We created a single rating column, computed as rating_numerator / rating_denominator.
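A minimal sketch of these merge and rating steps (df_master is just our name for the result):

import pandas as pd

# Merge the cleaned archive with the cleaned API data, then with the image predictions
df_master = pd.merge(df_archive_clean, df_twitter_clean, on='tweet_id')
df_master = pd.merge(df_master, df_image_clean, on='tweet_id')

# Single rating column
df_master['rating'] = df_master['rating_numerator'] / df_master['rating_denominator']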
The merged master dataset is stored as twitter_archive_master.csv. We also stored it in an SQLite database; we learnt how to do this from here: https://stackoverflow.com/questions/50803109/how-to-store-pandas-dataframe-in-sqlite-db#50803252
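A minimal sketch of the storage step (the database file name and table name are illustrative):

import sqlite3

# Store the master dataset as a CSV file
df_master.to_csv('twitter_archive_master.csv', index=False)

# Also store it in an SQLite database
conn = sqlite3.connect('twitter_archive_master.db')  # file name is illustrative
df_master.to_sql('master', conn, if_exists='replace', index=False)
conn.close()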
The last step was to perform analysis and visualization using seaborn.
We tried to develop an understanding of what makes a dog picture a success.
We used categorical plots, as discovered here: https://seaborn.pydata.org/tutorial/categorical.html#categorical-scatterplots
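A minimal sketch of such a plot (the 'dog_stage' column is an assumption about the tidied dataset; favorite_count comes from the API data):

import seaborn as sns
import matplotlib.pyplot as plt

# Categorical plot: favorite count per dog stage ('dog_stage' is an assumed column name)
sns.catplot(data=df_master, x='dog_stage', y='favorite_count', kind='box')
plt.show()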
# Generate the HTML version of this notebook
from subprocess import call
call(['python', '-m', 'nbconvert', 'wrangle_report.ipynb'])