#!/usr/bin/env python
# coding: utf-8
#
#
#
#
#
# # About me: Joris Van den Bossche
#
# - PhD student at Ghent University and VITO, Belgium
# - bio-science engineer, air quality research
# - pandas core dev
#
# ->
#
# - https://github.com/jorisvandenbossche
# - [@jorisvdbossche](https://twitter.com/jorisvdbossche)
#
#
# Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)
#
# # Content of this talk
#
# - Why do you need pandas?
# - Basic introduction to the data structures
# - Guided tour through some of the pandas features with a **case study about air quality**
#
# If you want to follow along, this is a notebook that you can view or run yourself:
#
# - All materials (notebook, data, link to nbviewer): https://github.com/jorisvandenbossche/2015-PyDataParis
# - You need `pandas` > 0.15 (easy solution is using Anaconda)
# Some imports:
# In[1]:
get_ipython().run_line_magic('matplotlib', 'inline')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
pd.options.display.max_rows = 8
# # Let's start with a showcase
# ## Case study: air quality in Europe
#
# AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe
#
# Starting from these hourly data for different stations:
# In[2]:
import airbase
data = airbase.load_data()
# In[3]:
data
# to answering questions about this data in a few lines of code:
#
# **Does the air pollution show a decreasing trend over the years?**
# In[4]:
data['1999':].resample('A').plot(ylim=[0,100])
# **How many exceedances of the limit values?**
#
#
# In[5]:
exceedances = data > 200
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')
# **What is the difference in diurnal profile between weekdays and weekend?**
# In[6]:
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()
# We will come back to these example, and build them up step by step.
# # Why do you need pandas?
# ## Why do you need pandas?
#
# When working with *tabular or structured data* (like R dataframe, SQL table, Excel spreadsheet, ...):
#
# - Import data
# - Clean up messy data
# - Explore data, gain insight into data
# - Process and prepare your data for analysis
# - Analyse your data (together with scikit-learn, statsmodels, ...)
# # Pandas: data analysis in python
#
# For data-intensive work in Python the [Pandas](http://pandas.pydata.org) library has become essential.
#
# What is ``pandas``?
#
# * Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
# * Pandas can also be thought of as `R`'s `data.frame` in Python.
# * Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...
#
# It's documentation: http://pandas.pydata.org/pandas-docs/stable/
# ## Key features
#
# * Fast, easy and flexible input/output for a lot of different data formats
# * Working with missing data (`.dropna()`, `pd.isnull()`)
# * Merging and joining (`concat`, `join`)
# * Grouping: `groupby` functionality
# * Reshaping (`stack`, `pivot`)
# * Powerful time series manipulation (resampling, timezones, ..)
# * Easy plotting
# # Basic data structures
#
# Pandas does this through two fundamental object types, both built upon NumPy arrays: the ``Series`` object, and the ``DataFrame`` object.
# ## Series
#
# A Series is a basic holder for **one-dimensional labeled data**. It can be created much as a NumPy array is created:
# In[7]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s
# ### Attributes of a Series: `index` and `values`
#
# The series has a built-in concept of an **index**, which by default is the numbers *0* through *N - 1*
# In[8]:
s.index
# You can access the underlying numpy array representation with the `.values` attribute:
# In[9]:
s.values
# We can access series values via the index, just like for NumPy arrays:
# In[10]:
s[0]
# Unlike the NumPy array, though, this index can be something other than integers:
# In[11]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2
# In[12]:
s2['c']
# In this way, a ``Series`` object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value:
# In[13]:
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 'United Kingdom': 64.9, 'Netherlands': 16.9})
population
# In[14]:
population['France']
# but with the power of numpy arrays:
# In[15]:
population * 1000
# We can index or slice the populations as expected:
# In[16]:
population['Belgium']
# In[17]:
population['Belgium':'Germany']
# Many things you can do with numpy arrays, can also be applied on objects.
# Fancy indexing, like indexing with a list or boolean indexing:
# In[18]:
population[['France', 'Netherlands']]
# In[19]:
population[population > 20]
# Element-wise operations:
# In[20]:
population / 100
# A range of methods:
# In[21]:
population.mean()
# ### Alignment!
#
# Only, pay attention to **alignment**: operations between series will align on the index:
# In[22]:
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]
# In[23]:
s1
# In[24]:
s2
# In[25]:
s1 + s2
# ## DataFrames: Multi-dimensional Data
#
# A DataFrame is a **tablular data structure** (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series object which share the same index.
#
#
#
#
#
# One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.
#
# Note that in the IPython notebook, the dataframe will display in a rich HTML view:
# In[26]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
# ### Attributes of the DataFrame
#
# A DataFrame has besides a `index` attribute, also a `columns` attribute:
# In[27]:
countries.index
# In[28]:
countries.columns
# To check the data types of the different columns:
# In[29]:
countries.dtypes
# An overview of that information can be given with the `info()` method:
# In[30]:
countries.info()
# Also a DataFrame has a `values` attribute, but attention: when you have heterogeneous data, all values will be upcasted:
# In[31]:
countries.values
# If we don't like what the index looks like, we can reset it and set one of our columns:
# In[32]:
countries = countries.set_index('country')
countries
# To access a Series representing a column in the data, use typical indexing syntax:
# In[33]:
countries['area']
# As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.
#
# Let's compute density of each country:
# In[34]:
countries['population']*1000000 / countries['area']
# Adding a new column to the dataframe is very simple:
# In[35]:
countries['density'] = countries['population']*1000000 / countries['area']
countries
# We can use masking to select certain data:
# In[36]:
countries[countries['density'] > 300]
# And we can do things like sorting the items in the array, and indexing to take the first two rows:
# In[37]:
countries.sort_index(by='density', ascending=False)
# One useful method to use is the ``describe`` method, which computes summary statistics for each column:
# In[38]:
countries.describe()
# The `plot` method can be used to quickly visualize the data in different ways:
# In[39]:
countries.plot()
# However, for this dataset, it does not say that much.
# In[40]:
countries['population'].plot(kind='bar')
# In[41]:
countries.plot(kind='scatter', x='population', y='area')
# The available plotting types: ‘line’ (default), ‘bar’, ‘barh’, ‘hist’, ‘box’ , ‘kde’, ‘area’, ‘pie’, ‘scatter’, ‘hexbin’.
#
#
# In[42]:
countries = countries.drop(['density'], axis=1)
# ## Some notes on selecting data
#
# One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy. We now have to distuinguish between:
#
# - selection by label
# - selection by position.
# For a DataFrame, basic indexing selects the columns.
#
# Selecting a single column:
# In[43]:
countries['area']
# or multiple columns:
# In[44]:
countries[['area', 'density']]
# But, slicing accesses the rows:
# In[45]:
countries['France':'Netherlands']
# For more advanced indexing, you have some extra attributes:
#
# * `loc`: selection by label
# * `iloc`: selection by position
# In[46]:
countries.loc['Germany', 'area']
# In[47]:
countries.loc['France':'Germany', :]
# In[49]:
countries.loc[countries['density']>300, ['capital', 'population']]
# Selecting by position with `iloc` works similar as indexing numpy arrays:
# In[50]:
countries.iloc[0:2,1:3]
# The different indexing methods can also be used to assign data:
# In[ ]:
countries.loc['Belgium':'Germany', 'population'] = 10
# In[ ]:
countries
# There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.
# # Case study: air quality data of European monitoring stations (AirBase)
# ## AirBase (The European Air quality dataBase)
#
# AirBase: hourly measurements of all air quality monitoring stations from Europe.
# In[12]:
from IPython.display import HTML
HTML('')
# # Importing and cleaning the data
# ## Importing and exporting data with pandas
#
# A wide range of input/output formats are natively supported by pandas:
#
# * CSV, text
# * SQL database
# * Excel
# * HDF5
# * json
# * html
# * pickle
# * ...
# In[ ]:
pd.read
# In[ ]:
countries.to
# ## Now for our case study
# I downloaded some of the raw data files of AirBase and included it in the repo:
#
# > station code: BETR801, pollutant code: 8 (nitrogen dioxide)
# In[26]:
get_ipython().system('head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012')
# Just reading the tab-delimited data:
# In[43]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
# In[44]:
data.head()
# Not really what we want.
# With using some more options of `read_csv`:
# In[45]:
colnames = ['date'] + [item for pair in zip(["{:02d}".format(i) for i in range(24)], ['flag']*24) for item in pair]
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012",
sep='\t', header=None, na_values=[-999, -9999], names=colnames)
# In[46]:
data.head()
# So what did we do:
#
# - specify that the values of -999 and -9999 should be regarded as NaN
# - specified are own column names
# For now, we disregard the 'flag' columns
# In[47]:
data = data.drop('flag', axis=1)
data
# Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index.
# ## Intermezzo: reshaping your data with `stack`, `unstack` and `pivot`
# The docs say:
#
# > Pivot a level of the (possibly hierarchical) column labels, returning a
# DataFrame (or Series in the case of an object with a single level of
# column labels) having a hierarchical index with a new inner-most level
# of row labels.
#
#
# In[48]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})
df
# To use `stack`/`unstack`, we need the values we want to shift from rows to columns or the other way around as the index:
# In[49]:
df = df.set_index(['A', 'B'])
df
# In[50]:
result = df['C'].unstack()
result
# In[51]:
df = result.stack().reset_index(name='C')
df
# `pivot` is similar to `unstack`, but let you specify column names:
# In[52]:
df.pivot(index='A', columns='B', values='C')
# `pivot_table` is similar as `pivot`, but can work with duplicate indices and let you specify an aggregation function:
# In[53]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})
df
# In[54]:
df.pivot_table(index='A', columns='B', values='C', aggfunc='count') #'mean'
# ## Back to our case study
# We can now use `stack` to create a timeseries:
# In[55]:
data = data.set_index('date')
# In[56]:
data_stacked = data.stack()
# In[57]:
data_stacked
# Now, lets combine the two levels of the index:
# In[58]:
data_stacked = data_stacked.reset_index(name='BETR801')
# In[59]:
data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H")
# In[60]:
data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)
# In[61]:
data_stacked
# For this talk, I put the above code in a separate function, and repeated this for some different monitoring stations:
# In[62]:
import airbase
no2 = airbase.load_data()
# - FR04037 (PARIS 13eme): urban background site at Square de Choisy
# - FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia
# - BETR802: urban traffic site in Antwerp, Belgium
# - BETN029: rural background site in Houtem, Belgium
#
# See http://www.eea.europa.eu/themes/air/interactive/no2
# # Exploring the data
# Some useful methods:
#
# `head` and `tail`
# In[63]:
no2.head(3)
# In[64]:
no2.tail()
# `info()`
# In[65]:
no2.info()
# Getting some basic summary statistics about the data with `describe`:
# In[66]:
no2.describe()
# Quickly visualizing the data
# In[67]:
no2.plot(kind='box', ylim=[0,250])
# In[68]:
no2['BETR801'].plot(kind='hist', bins=50)
# In[69]:
no2.plot(figsize=(12,6))
# This does not say too much ..
# We can select part of the data (eg the latest 500 data points):
# In[70]:
no2[-500:].plot(figsize=(12,6))
# Or we can use some more advanced time series features -> next section!
# ## Working with time series data
# When we ensure the DataFrame has a `DatetimeIndex`, time-series related functionality becomes available:
# In[71]:
no2.index
# Indexing a time series works with strings:
# In[72]:
no2["2010-01-01 09:00": "2010-01-01 12:00"]
# A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
#
# E.g. all data of 2012:
# In[73]:
no2['2012']
# Or all data of January up to March 2012:
# In[74]:
data['2012-01':'2012-03']
# Time and date components can be accessed from the index:
# In[75]:
no2.index.hour
# In[76]:
no2.index.year
# ## The power of pandas: `resample`
# A very powerfull method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data).
#
# The time series has a frequency of 1 hour. I want to change this to daily:
# In[77]:
no2.resample('D').head()
# By default, `resample` takes the mean as aggregation function, but other methods can also be specified:
# In[78]:
no2.resample('D', how='max').head()
# The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases
# These strings can also be combined with numbers, eg `'10D'`.
# Further exploring the data:
# In[79]:
no2.resample('M').plot() # 'A'
# In[ ]:
# no2['2012'].resample('D').plot()
# In[80]:
no2.loc['2009':, 'FR04037'].resample('M', how=['mean', 'median']).plot()
# #### Question: The evolution of the yearly averages with, and the overall mean of all stations
# In[81]:
no2_1999 = no2['1999':]
no2_1999.resample('A').plot()
no2_1999.mean(axis=1).resample('A').plot(color='k', linestyle='--', linewidth=4)
# # Analysing the data
# ## Intermezzo - the groupby operation (split-apply-combine)
#
# By "group by" we are referring to a process involving one or more of the following steps
#
# * **Splitting** the data into groups based on some criteria
# * **Applying** a function to each group independently
# * **Combining** the results into a data structure
#
#
#
# Similar to SQL `GROUP BY`
# The example of the image in pandas syntax:
# In[82]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
# In[83]:
df.groupby('key').aggregate('sum') # np.sum
# In[84]:
df.groupby('key').sum()
# ## Back to the air quality data
# **Question: how does the *typical monthly profile* look like for the different stations?**
# First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):
# In[85]:
no2['month'] = no2.index.month
# Now, we can calculate the mean of each month over the different years:
# In[86]:
no2.groupby('month').mean()
# In[87]:
no2.groupby('month').mean().plot()
# #### Question: The typical diurnal profile for the different stations
# In[88]:
no2.groupby(no2.index.hour).mean().plot()
# #### Question: What is the difference in the typical diurnal profile between week and weekend days.
# In[89]:
get_ipython().run_line_magic('pinfo', 'no2.index.weekday')
# In[90]:
no2['weekday'] = no2.index.weekday
# Add a column indicating week/weekend
# In[91]:
no2['weekend'] = no2['weekday'].isin([5, 6])
# In[92]:
data_weekend = no2.groupby(['weekend', no2.index.hour]).mean()
data_weekend.head()
# In[93]:
data_weekend_FR04012 = data_weekend['FR04012'].unstack(level=0)
data_weekend_FR04012.head()
# In[94]:
data_weekend_FR04012.plot()
# #### Question: What are the number of exceedances of hourly values above the European limit 200 µg/m3 ?
# In[95]:
exceedances = no2 > 200
# In[96]:
# group by year and count exceedances (sum of boolean)
exceedances = exceedances.groupby(exceedances.index.year).sum()
# In[97]:
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')
# #### Question: Visualize the typical week profile for the different stations as boxplots.
#
# Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns)
# In[98]:
# add a weekday and week column
no2['weekday'] = no2.index.weekday
no2['week'] = no2.index.week
no2.head()
# In[99]:
# pivot table so that the weekdays are the different columns
data_pivoted = no2['2012'].pivot_table(columns='weekday', index='week', values='FR04037')
data_pivoted.head()
# In[100]:
box = data_pivoted.boxplot()
# **Exercise**: Calculate the correlation between the different stations
# In[101]:
no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].corr()
# In[102]:
no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].resample('D').corr()
# In[103]:
no2 = no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']]
# # Further reading
#
# - the documentation: http://pandas.pydata.org/pandas-docs/stable/
# - Wes McKinney's book "Python for Data Analysis"
# - lots of tutorials on the internet, eg http://github.com/jvns/pandas-cookbook
#
# # What's new in pandas
#
# Some recent enhancements of the last year (versions 0.14 to 0.16):
#
# - Better integration for categorical data (`Categorical` and `CategoricalIndex`)
# - The same for `Timedelta` and `TimedeltaIndex`
# - More flexible SQL interface based on `sqlalchemy`
# - MultiIndexing using slicers
# - `.dt` accessor for accesing datetime-properties from columns
# - Groupby enhancements
# - And a lot of enhancements and bug fixes
# # How can you help?
#
# **We need you!**
#
# Contributions are very welcome and can be in different domains:
#
# - reporting issues
# - improving the documentation
# - testing release candidates and provide feedback
# - triaging and fixing bugs
# - implementing new features
# - spreading the word
#
# -> https://github.com/pydata/pandas
#
# ## Thanks for listening! Questions?
#
#
# - https://github.com/jorisvandenbossche
# -
# - [@jorisvdbossche](https://twitter.com/jorisvdbossche)
#
#
# Slides and data: Source: https://github.com/jorisvandenbossche/2015-PyDataParis
#
#
# Slides presented with 'live reveal' https://github.com/damianavila/RISE
#