#!/usr/bin/env python
# coding: utf-8

# # Introduction to Pandas
#
# April 3rd, 2015
#
# Joris Van den Bossche
#
# Source: https://github.com/jorisvandenbossche/2015-PyDataParis
# # About me: Joris Van den Bossche
#
# - PhD student at Ghent University and VITO, Belgium
# - bio-science engineer, air quality research
# - pandas core dev
#
# - https://github.com/jorisvandenbossche
# - [@jorisvdbossche](https://twitter.com/jorisvdbossche)
#
# Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)

# # Content of this talk
#
# - Why do you need pandas?
# - Basic introduction to the data structures
# - Guided tour through some of the pandas features with a **case study about air quality**
#
# If you want to follow along, this is a notebook that you can view or run yourself:
#
# - All materials (notebook, data, link to nbviewer): https://github.com/jorisvandenbossche/2015-PyDataParis
# - You need `pandas` > 0.15 (an easy solution is using Anaconda)

# Some imports:

# In[1]:

get_ipython().run_line_magic('matplotlib', 'inline')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

pd.options.display.max_rows = 8

# # Let's start with a showcase

# ## Case study: air quality in Europe
#
# AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.
#
# Starting from these hourly data for different stations:

# In[2]:

import airbase
data = airbase.load_data()

# In[3]:

data

# we get to answering questions about this data in a few lines of code:
#
# **Does the air pollution show a decreasing trend over the years?**

# In[4]:

data['1999':].resample('A').plot(ylim=[0, 100])

# **How many exceedances of the limit values?**

# In[5]:

exceedances = data > 200
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')

# **What is the difference in diurnal profile between weekdays and weekend?**

# In[6]:

data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()

# We will come back to these examples and build them up step by step.

# # Why do you need pandas?

# ## Why do you need pandas?
#
# When working with *tabular or structured data* (like an R dataframe, SQL table, Excel spreadsheet, ...):
#
# - Import data
# - Clean up messy data
# - Explore data, gain insight into data
# - Process and prepare your data for analysis
# - Analyse your data (together with scikit-learn, statsmodels, ...)

# # Pandas: data analysis in Python
#
# For data-intensive work in Python the [Pandas](http://pandas.pydata.org) library has become essential.
#
# What is ``pandas``?
#
# * Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
# * Pandas can also be thought of as `R`'s `data.frame` in Python.
# * Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...
#
# Its documentation: http://pandas.pydata.org/pandas-docs/stable/

# ## Key features
#
# * Fast, easy and flexible input/output for a lot of different data formats
# * Working with missing data (`.dropna()`, `pd.isnull()`)
# * Merging and joining (`concat`, `join`)
# * Grouping: `groupby` functionality
# * Reshaping (`stack`, `pivot`)
# * Powerful time series manipulation (resampling, timezones, ...)
# * Easy plotting
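# To make a few of these features concrete, a minimal sketch on made-up toy frames
# (`df_a` and `df_b` are illustrative names, not part of the case study below):

# In[ ]:

df_a = pd.DataFrame({'key': ['x', 'y'], 'value': [1.0, np.nan]})
df_b = pd.DataFrame({'key': ['x', 'z'], 'value': [3.0, 4.0]})

combined = pd.concat([df_a, df_b])  # merging/joining: stack the two tables
combined = combined.dropna()        # missing data: drop the row containing NaN
combined.groupby('key').sum()       # grouping: total 'value' per key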
# # Basic data structures
#
# Pandas provides this functionality through two fundamental object types, both built upon NumPy arrays: the ``Series`` object and the ``DataFrame`` object.

# ## Series
#
# A Series is a basic holder for **one-dimensional labeled data**. It can be created much as a NumPy array is created:

# In[7]:

s = pd.Series([0.1, 0.2, 0.3, 0.4])
s

# ### Attributes of a Series: `index` and `values`
#
# The series has a built-in concept of an **index**, which by default is the numbers *0* through *N - 1*:

# In[8]:

s.index

# You can access the underlying numpy array representation with the `.values` attribute:

# In[9]:

s.values

# We can access series values via the index, just like for NumPy arrays:

# In[10]:

s[0]

# Unlike the NumPy array, though, this index can be something other than integers:

# In[11]:

s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2

# In[12]:

s2['c']

# In this way, a ``Series`` object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value:

# In[13]:

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})
population

# In[14]:

population['France']

# but with the power of numpy arrays:

# In[15]:

population * 1000

# We can index or slice the populations as expected:

# In[16]:

population['Belgium']

# In[17]:

population['Belgium':'Germany']

# Many things you can do with numpy arrays can also be applied to these objects.
#
# Fancy indexing, like indexing with a list or boolean indexing:

# In[18]:

population[['France', 'Netherlands']]

# In[19]:

population[population > 20]

# Element-wise operations:

# In[20]:

population / 100

# A range of methods:

# In[21]:

population.mean()

# ### Alignment!
#
# Only, pay attention to **alignment**: operations between series will align on the index:

# In[22]:

s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

# In[23]:

s1

# In[24]:

s2

# In[25]:

s1 + s2
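# The NaN values above appear because 'Belgium' and 'Germany' each occur in only
# one of the two series. If you prefer to treat missing labels as a given value,
# the binary operator methods such as `add` accept a `fill_value` argument:

# In[ ]:

s1.add(s2, fill_value=0)  # missing labels count as 0 instead of producing NaN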
# ## DataFrames: Multi-dimensional Data
#
# A DataFrame is a **tabular data structure** (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects which share the same index.
#
# One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.
#
# Note that in the IPython notebook, the dataframe will display in a rich HTML view:

# In[26]:

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries

# ### Attributes of the DataFrame
#
# Besides an `index` attribute, a DataFrame also has a `columns` attribute:

# In[27]:

countries.index

# In[28]:

countries.columns

# To check the data types of the different columns:

# In[29]:

countries.dtypes

# An overview of that information can be given with the `info()` method:

# In[30]:

countries.info()

# A DataFrame also has a `values` attribute, but pay attention: when you have heterogeneous data, all values will be upcast to a common type (here, `object`):

# In[31]:

countries.values

# If we don't like what the index looks like, we can reset it and set one of our columns:

# In[32]:

countries = countries.set_index('country')
countries

# To access a Series representing a column in the data, use typical indexing syntax:

# In[33]:

countries['area']

# As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.
#
# Let's compute the density of each country:

# In[34]:

countries['population']*1000000 / countries['area']

# Adding a new column to the dataframe is very simple:

# In[35]:

countries['density'] = countries['population']*1000000 / countries['area']
countries

# We can use masking to select certain data:

# In[36]:

countries[countries['density'] > 300]

# And we can do things like sorting the items in the array, and indexing to take the first two rows:

# In[37]:

countries.sort_index(by='density', ascending=False)  # in modern pandas: countries.sort_values('density', ascending=False)

# One useful method to use is the ``describe`` method, which computes summary statistics for each column:

# In[38]:

countries.describe()

# The `plot` method can be used to quickly visualize the data in different ways:

# In[39]:

countries.plot()

# However, for this dataset, that does not say that much:

# In[40]:

countries['population'].plot(kind='bar')

# In[41]:

countries.plot(kind='scatter', x='population', y='area')

# The available plotting types: 'line' (default), 'bar', 'barh', 'hist', 'box', 'kde', 'area', 'pie', 'scatter', 'hexbin'.
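# These `plot` methods return a matplotlib `Axes` object, so figures can be
# customized further with plain matplotlib. A minimal sketch (the label text is
# only an illustration):

# In[ ]:

ax = countries['population'].plot(kind='bar')
ax.set_ylabel('population (millions)')  # plain matplotlib call on the returned Axes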
# ## Some notes on selecting data
#
# One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex compared to numpy. We now have to distinguish between:
#
# - selection by label
# - selection by position
#
# For a DataFrame, basic indexing selects the columns.
#
# Selecting a single column:

# In[43]:

countries['area']

# or multiple columns:

# In[44]:

countries[['area', 'density']]

# But, slicing accesses the rows:

# In[45]:

countries['France':'Netherlands']

# For more advanced indexing, you have some extra attributes:
#
# * `loc`: selection by label
# * `iloc`: selection by position

# In[46]:

countries.loc['Germany', 'area']

# In[47]:

countries.loc['France':'Germany', :]

# In[49]:

countries.loc[countries['density'] > 300, ['capital', 'population']]

# Selecting by position with `iloc` works similarly to indexing numpy arrays:

# In[50]:

countries.iloc[0:2, 1:3]

# The different indexing methods can also be used to assign data:

# In[ ]:

countries.loc['Belgium':'Germany', 'population'] = 10

# In[ ]:

countries

# We don't need the derived 'density' column anymore, so we drop it again:

# In[42]:

countries = countries.drop(['density'], axis=1)

# There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.

# # Case study: air quality data of European monitoring stations (AirBase)

# ## AirBase (The European Air quality dataBase)
#
# AirBase: hourly measurements of all air quality monitoring stations from Europe.

# # Importing and cleaning the data

# ## Importing and exporting data with pandas
#
# A wide range of input/output formats are natively supported by pandas:
#
# * CSV, text
# * SQL database
# * Excel
# * HDF5
# * json
# * html
# * pickle
# * ...

# In[ ]:

# pd.read_<TAB> in the notebook lists all reader functions (read_csv, read_excel, ...)

# In[ ]:

# countries.to_<TAB> lists all writer methods (to_csv, to_excel, ...)

# ## Now for our case study
#
# I downloaded some of the raw data files of AirBase and included them in the repo:
#
# > station code: BETR801, pollutant code: 8 (nitrogen dioxide)

# In[26]:

get_ipython().system('head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012')

# Just reading the tab-delimited data:

# In[43]:

data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')

# In[44]:

data.head()

# Not really what we want.
#
# Using some more options of `read_csv`:

# In[45]:

colnames = ['date'] + [item for pair in zip(["{:02d}".format(i) for i in range(24)], ['flag'] * 24)
                       for item in pair]
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t',
                   header=None, na_values=[-999, -9999], names=colnames)

# In[46]:

data.head()

# So what did we do:
#
# - specified that the values -999 and -9999 should be regarded as NaN
# - specified our own column names
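# `read_csv` has many more options that are often useful for this kind of cleanup.
# A hypothetical sketch (the file name and column names are made up):

# In[ ]:

df = pd.read_csv("some_file.csv",           # hypothetical file
                 sep=',',
                 usecols=['date', 'value'],  # only read these columns
                 parse_dates=['date'],       # parse this column as datetimes
                 index_col='date',           # ... and use it as the index
                 na_values=['NA', -999])     # extra markers to treat as NaN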
# For now, we disregard the 'flag' columns:

# In[47]:

data = data.drop('flag', axis=1)
data

# Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime index.

# ## Intermezzo: reshaping your data with `stack`, `unstack` and `pivot`
#
# The docs say:
#
# > Pivot a level of the (possibly hierarchical) column labels, returning a
# > DataFrame (or Series in the case of an object with a single level of
# > column labels) having a hierarchical index with a new inner-most level
# > of row labels.

# In[48]:

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'B': ['a', 'b', 'a', 'b'],
                   'C': range(4)})
df

# To use `stack`/`unstack`, we need the values we want to shift from rows to columns (or the other way around) in the index:

# In[49]:

df = df.set_index(['A', 'B'])
df

# In[50]:

result = df['C'].unstack()
result

# In[51]:

df = result.stack().reset_index(name='C')
df

# `pivot` is similar to `unstack`, but lets you specify column names:

# In[52]:

df.pivot(index='A', columns='B', values='C')

# `pivot_table` is similar to `pivot`, but can work with duplicate indices and lets you specify an aggregation function:

# In[53]:

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one', 'two'],
                   'B': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'C': range(6)})
df

# In[54]:

df.pivot_table(index='A', columns='B', values='C', aggfunc='count')  # try 'mean' as well

# ## Back to our case study
#
# We can now use `stack` to create a timeseries:

# In[55]:

data = data.set_index('date')

# In[56]:

data_stacked = data.stack()

# In[57]:

data_stacked

# Now, let's combine the two levels of the index:

# In[58]:

data_stacked = data_stacked.reset_index(name='BETR801')

# In[59]:

data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H")

# In[60]:

data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)

# In[61]:

data_stacked

# For this talk, I put the above code in a separate function, and repeated this for some different monitoring stations:

# In[62]:

import airbase
no2 = airbase.load_data()

# - FR04037 (PARIS 13eme): urban background site at Square de Choisy
# - FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia
# - BETR802: urban traffic site in Antwerp, Belgium
# - BETN029: rural background site in Houtem, Belgium
#
# See http://www.eea.europa.eu/themes/air/interactive/no2

# # Exploring the data
#
# Some useful methods:
#
# `head` and `tail`

# In[63]:

no2.head(3)

# In[64]:

no2.tail()

# `info()`

# In[65]:

no2.info()

# Getting some basic summary statistics about the data with `describe`:

# In[66]:

no2.describe()

# Quickly visualizing the data:

# In[67]:

no2.plot(kind='box', ylim=[0, 250])

# In[68]:

no2['BETR801'].plot(kind='hist', bins=50)

# In[69]:

no2.plot(figsize=(12, 6))

# This does not say too much ...
#
# We can select a part of the data (e.g. the last 500 data points):

# In[70]:

no2[-500:].plot(figsize=(12, 6))

# Or we can use some more advanced time series features -> next section!

# ## Working with time series data
#
# When we ensure the DataFrame has a `DatetimeIndex`, time series related functionality becomes available:

# In[71]:

no2.index

# Indexing a time series works with strings:

# In[72]:

no2["2010-01-01 09:00": "2010-01-01 12:00"]

# A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
#
# E.g. all data of 2012:

# In[73]:

no2['2012']

# Or all data of January up to March 2012:

# In[74]:

no2['2012-01':'2012-03']

# Time and date components can be accessed from the index:

# In[75]:

no2.index.hour

# In[76]:

no2.index.year
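# Both the partial string indexing and the date components work on any series or
# dataframe with a `DatetimeIndex`. A self-contained sketch on made-up random data:

# In[ ]:

ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('2012-01-01', periods=1000, freq='H'))
ts['2012-02']                   # partial string indexing: all of February 2012
ts['2012-01-05':'2012-01-06']   # slicing with date strings
ts.index.month                  # datetime component of the index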
# ## The power of pandas: `resample`
#
# A very powerful method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data).
#
# The time series has a frequency of 1 hour. I want to change this to daily:

# In[77]:

no2.resample('D').head()

# By default, `resample` takes the mean as aggregation function, but other methods can also be specified:

# In[78]:

no2.resample('D', how='max').head()

# The strings to specify the new time frequency: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases
#
# These strings can also be combined with numbers, e.g. `'10D'`.

# Further exploring the data:

# In[79]:

no2.resample('M').plot()  # try 'A' as well

# In[ ]:

# no2['2012'].resample('D').plot()

# In[80]:

no2.loc['2009':, 'FR04037'].resample('M', how=['mean', 'median']).plot()

# #### Question: the evolution of the yearly averages, together with the overall mean of all stations

# In[81]:

no2_1999 = no2['1999':]
no2_1999.resample('A').plot()
no2_1999.mean(axis=1).resample('A').plot(color='k', linestyle='--', linewidth=4)

# # Analysing the data

# ## Intermezzo - the groupby operation (split-apply-combine)
#
# By "group by" we are referring to a process involving one or more of the following steps:
#
# * **Splitting** the data into groups based on some criteria
# * **Applying** a function to each group independently
# * **Combining** the results into a data structure
#
# Similar to SQL `GROUP BY`.

# The example from the split-apply-combine figure in the slides, in pandas syntax:

# In[82]:

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

# In[83]:

df.groupby('key').aggregate('sum')  # np.sum works as well

# In[84]:

df.groupby('key').sum()

# ## Back to the air quality data
#
# **Question: what does the *typical monthly profile* look like for the different stations?**
#
# First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):

# In[85]:

no2['month'] = no2.index.month

# Now, we can calculate the mean of each month over the different years:

# In[86]:

no2.groupby('month').mean()

# In[87]:

no2.groupby('month').mean().plot()

# #### Question: the typical diurnal profile for the different stations

# In[88]:

no2.groupby(no2.index.hour).mean().plot()

# #### Question: what is the difference in the typical diurnal profile between week and weekend days?

# In[89]:

get_ipython().run_line_magic('pinfo', 'no2.index.weekday')

# In[90]:

no2['weekday'] = no2.index.weekday

# Add a column indicating week/weekend:

# In[91]:

no2['weekend'] = no2['weekday'].isin([5, 6])

# In[92]:

data_weekend = no2.groupby(['weekend', no2.index.hour]).mean()
data_weekend.head()

# In[93]:

data_weekend_FR04012 = data_weekend['FR04012'].unstack(level=0)
data_weekend_FR04012.head()

# In[94]:

data_weekend_FR04012.plot()

# #### Question: what is the number of exceedances of hourly values above the European limit of 200 µg/m3?

# In[95]:

exceedances = no2 > 200

# In[96]:

# group by year and count exceedances (sum of boolean)
exceedances = exceedances.groupby(exceedances.index.year).sum()

# In[97]:

ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')
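# Note that a groupby is not limited to a single statistic: the `agg` method
# accepts a list of aggregation functions, giving one column per statistic.
# A small sketch, reusing the toy `df` from the groupby intermezzo above:

# In[ ]:

df.groupby('key')['data'].agg(['mean', 'min', 'max'])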
# #### Question: visualize the typical week profile for the different stations as boxplots.
#
# Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns.

# In[98]:

# add a weekday and week column
no2['weekday'] = no2.index.weekday
no2['week'] = no2.index.week
no2.head()

# In[99]:

# pivot table so that the weekdays are the different columns
data_pivoted = no2['2012'].pivot_table(columns='weekday', index='week', values='FR04037')
data_pivoted.head()

# In[100]:

box = data_pivoted.boxplot()

# **Exercise**: calculate the correlation between the different stations

# In[101]:

no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].corr()

# In[102]:

no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].resample('D').corr()

# In[103]:

no2 = no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']]

# # Further reading
#
# - the documentation: http://pandas.pydata.org/pandas-docs/stable/
# - Wes McKinney's book "Python for Data Analysis"
# - lots of tutorials on the internet, e.g. http://github.com/jvns/pandas-cookbook

# # What's new in pandas
#
# Some recent enhancements of the last year (versions 0.14 to 0.16):
#
# - Better integration for categorical data (`Categorical` and `CategoricalIndex`)
# - The same for `Timedelta` and `TimedeltaIndex`
# - More flexible SQL interface based on `sqlalchemy`
# - MultiIndexing using slicers
# - `.dt` accessor for accessing datetime properties from columns
# - Groupby enhancements
# - And a lot of enhancements and bug fixes

# # How can you help?
#
# **We need you!**
#
# Contributions are very welcome and can be in different domains:
#
# - reporting issues
# - improving the documentation
# - testing release candidates and providing feedback
# - triaging and fixing bugs
# - implementing new features
# - spreading the word
#
# -> https://github.com/pydata/pandas

# ## Thanks for listening! Questions?
#
# - https://github.com/jorisvandenbossche
# - [@jorisvdbossche](https://twitter.com/jorisvdbossche)
#
# Slides and data: https://github.com/jorisvandenbossche/2015-PyDataParis
#
# Slides presented with 'live reveal': https://github.com/damianavila/RISE