pandas 101
name from "panel data" in econometrics, created by Wes McKinney
designed for split-apply-combine and time-series manipulations; combines many behaviors of R's plyr, reshape2, ... packages into one
generally favor immutability (most operations will / should produce new objects)
fundamental data structures: Series, DataFrame
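A quick sketch of the immutability point above (toy data, made-up names):

```python
import pandas as pd

s = pd.Series([3, 1, 2])

# most operations hand back a *new* object...
doubled = s * 2
renamed = s.rename('scores')

# ...and leave the original untouched
assert list(s) == [3, 1, 2]
assert list(doubled) == [6, 2, 4]
assert renamed.name == 'scores'
```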
Already have pandas installed? Use the script magic function below to see. If it's not installed, get your pip on and get all the packages from my email.
%%script bash
whoami
pip freeze | grep "pandas\|numpy\|matplotlib"
# standard import naming conventions
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
For some more background on the relationship between matplotlib, pyplot, and pylab, check out this link. The inline magic function displays our plots in an Output notebook cell.
There are two* main structures in pandas: Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled structure).
* there is also a TimeSeries (a flavor of Series that contains datetimes), Panel (3-dimensional), and Panel4D (4-dimensional). The last two are 'less used,' according to the docs. I haven't experimented with them yet.
Series can hold any data type, and the axis label is called an index. Series is dict-like in that you can get and set values by index label.
s1 = pd.Series([1,3,5,np.nan,6,8])
s1
# by default (without specifying them explicitly), the index label is just an int
s1[5]
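To show the dict-like get/set behavior mentioned above, here's a minimal sketch with explicit string labels instead of the default integer index (labels invented for illustration):

```python
import pandas as pd

# explicit string labels instead of the default integer index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

assert s['b'] == 20   # get by label, like a dict
s['d'] = 40           # set a new label, also like a dict
assert s['d'] == 40 and len(s) == 4
```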
Columns can be of different data types. Index and column names are optional. If individual Series have different indexes, the DataFrame index will be the union of the individual ones.
Can create from:
dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Series
another DataFrame
N.B.: there are other helper methods for constructing DataFrames from varying data types; see the docs for more options.
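To make a couple of those construction options concrete, a minimal sketch (column and index labels made up) covering the dict-of-lists case and the union-of-indexes behavior noted above:

```python
import pandas as pd

# from a dict of lists (all the same length)
df_a = pd.DataFrame({'x': [1, 2], 'y': [3.0, 4.0]})

# from Series with *different* indexes: the frame's index is their union
s_a = pd.Series([1, 2], index=['r1', 'r2'])
s_b = pd.Series([3, 4], index=['r2', 'r3'])
df_b = pd.DataFrame({'A': s_a, 'B': s_b})

assert list(df_a.columns) == ['x', 'y']
assert list(df_b.index) == ['r1', 'r2', 'r3']
assert int(df_b.isnull().sum().sum()) == 2  # non-overlapping labels become NaN
```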
# create a couple more Series
s2, s3 = pd.Series(np.random.randn(6)), pd.Series(np.random.randn(6))
# combine multiple Series into a DataFrame with column labels
df_1 = pd.DataFrame({'A': s1, 'B': s2, 'C': s3})
df_1
# when Series are different lengths, DataFrame fills in gaps with NaN
s4 = pd.Series(np.random.randn(8)) # whoaaaaaa this Series has extra entries!
df1 = pd.DataFrame({'A': s1, 'B': s2, 'C': s3, 'D': s4})
df1
# create a DataFrame from numpy array
df2 = pd.DataFrame(np.random.randn(6,4))
df2 # can only have one 'pretty' output per cell (if it's the last command)
#print df2 # otherwise, can print arb number of results w/o pretty format
#print df1 # (uncomment both of these print statements)
Can inspect your DataFrames with the head() and tail() methods - each takes a number of lines as an argument.
Without specifying them, DataFrames have default index and column name attributes.
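A quick sketch of those defaults - just inspecting a fresh, unlabeled DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2))

# without explicit assignment, both axes default to integer ranges
assert list(df.columns) == [0, 1]
assert list(df.index) == [0, 1, 2]
```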
# recall current dataframe
df2.head(2)
But you can assign to those attributes of the DataFrame...
cols = ['a', 'b', 'c', 'd']
# assign columns attribute (names)
df2.columns = cols
# create an index:
# generate a sequence of dates with pandas' date_range() method,
# then assign the index attribute
dates = pd.date_range(start='2013-11-24 13:45:27', freq='W', periods=6)
df2.index = dates
df2
# an aside: inspecting the dates object...
print 'what is a date_range object?\n\n', dates
Do some indexing / subsetting...
# select a row by index label by using .loc
df2.loc['2013-12-01 13:45:27']
# select a single element
df2.loc['2013-12-22 13:45:27','c']
# select a set of rows --- n.b.: this is broken, possibly a bug with DatetimeIndex labels?
df2.loc[['2013-12-01 13:45:27','2013-12-08 13:45:27']]
Sadness!!
To make up for the sadness of that last cell, here's an example that does work from the docs....
# new dataframe with random numbers
df1 = pd.DataFrame(np.random.randn(6,4), index=list('abcdef'),columns=list('ABCD'))
df1
# address two separate rows, and a range of three columns
df1.loc[['d','f'],'A':'C']
pandas has a melt function that behaves very similarly to the one in R's reshape2 package. To convert a 'long' DataFrame back into a 'wide' one, use the pivot method.
# .core.* apparently isn't imported with the general 'import pandas'
from pandas.core.reshape import melt
# recall the df we're using
df2
# melt and keep 'a' and 'b' columns as id variables (like R)
df3 = melt(df2, id_vars=['a','b'])
df3
Note that the melted DataFrame lost its indexes. If that index was a measurement or otherwise useful parameter, the approach would be to create a new column with the appropriate values in it, then apply the melt - that column would become a variable just like c and d in the example.
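The reverse trip - from 'long' back to 'wide' with pivot - isn't demonstrated above, so here's a minimal sketch with a made-up long DataFrame:

```python
import pandas as pd

# a small 'long' frame: one row per (id, variable) pair
long_df = pd.DataFrame({'id':       ['r1', 'r1', 'r2', 'r2'],
                        'variable': ['c',  'd',  'c',  'd'],
                        'value':    [1.0,  2.0,  3.0,  4.0]})

# pivot back to 'wide': ids become the index, variables become columns
wide_df = long_df.pivot(index='id', columns='variable', values='value')

assert list(wide_df.columns) == ['c', 'd']
assert wide_df.loc['r2', 'd'] == 4.0
```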
In the data/ directory is the sample of parsed twitter data that floats around with gnacs. To create the string of column names, I just ran the explain option together with all the other options.
gnacs_x = "id|postedTime|body|None|['twitter_entiteis:urls:url']|['None']|['actor:languages_list-items']|gnip:language:value|twitter_lang|[u'geo:coordinates_list-items']|geo:type|None|None|None|None|actor:utcOffset|None|None|None|None|None|None|None|None|None|actor:displayName|actor:preferredUsername|actor:id|gnip:klout_score|actor:followersCount|actor:friendsCount|actor:listedCount|actor:statusesCount|Tweet|None|None|None"
colnames = gnacs_x.split('|')
# prevent the automatic compression of wide dataframes (add scroll bar)
pd.set_option("display.max_columns", None)
# get some data, inspect
df1 = pd.read_csv('data/twitter_sample.csv', sep='|', names=colnames)
df1.tail(7)
Since there are so many explain fields that come back with 'None', let's just get rid of them for now.
(In the future, we might try to find a way to make that field more descriptive, too.)
# n.b.: this is an *in-place* delete -- unusual for a pandas structure
del df1['None']
# The command below is how the docs suggest carrying this out (creating a new df).
# But, it doesn't seem to work -- possibly due to multiple cols with same name. Oh well.
#new_df = df1.drop('None', axis=1) # return new df
# have a peek
df1.tail(6)
Subsetting a DataFrame is very similar to the syntax in R. There are two ways to select columns: 'dot' (attribute) notation, and 'square bracket' (index) notation. Sometimes, the column names will dictate which you have to use.
# inspect those rows with twitter-classified lang 'en' (scroll to the right to see)
df1[df1.twitter_lang == 'en'].head()
# the colons in the column name below won't allow dot-access to the column, so we can quote them and still filter.
#df1[df1["gnip:language:value"] == 'en'].head()
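A tiny sketch of the two access styles, with toy column names (one containing a colon to force bracket notation):

```python
import pandas as pd

df = pd.DataFrame({'lang': ['en', 'fr', 'en'],
                   'gnip:score': [1, 2, 3]})

# dot (attribute) notation works for simple names...
en_rows = df[df.lang == 'en']

# ...but a name containing ':' forces bracket notation
high = df[df['gnip:score'] > 1]

assert len(en_rows) == 2
assert len(high) == 2
```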
Let's get a subset of this dataframe that has numerical values so we can eventually do some stuff.
# create new dataframe from numerical columns
df2 = df1[["gnip:klout_score","actor:followersCount", "actor:friendsCount", "actor:listedCount"]]
df2.head()
# because I happen to know the answer, let's check data types of the columns...
df2.dtypes
The object type means that the column has multiple types of data in it. This is a good opportunity to 'fix' a section of the DataFrame by way of a custom function and the map() method.
# convert ints / strings to floats; give up on anything else (call it 0.0)
def floatify(val):
    try:
        return float(val)
    except (TypeError, ValueError):
        return 0.0
# assigning to an existing column overwrites that column
df2['gnip:klout_score'] = df2['gnip:klout_score'].map(floatify)
# check again
df2.dtypes
# use all floats just for fun.
# this only works if the elements can all be converted to floats (e.g. ints or something python can handle)
df2 = df2.astype(float)
df2.dtypes
Since they're all numbers now, we can do math and also add new columns to the DataFrame. Combining values from separate columns occurs on a row-by-row basis, as expected.
# look at some activity ratios - add col to df
df2['fol/fr'] = df2['gnip:klout_score'] / df2['actor:followersCount']
df2.head()
# can also use the built-in describe() method to get quick descriptive stats on the dataframe
#df2.describe()
groupby() is used for the split-apply-combine process. I'm led to believe that this is one of the stronger aspects of pandas' approach to DataFrames (versus R's), but haven't yet had a chance to really see the power.
# back to bigger df, without 'None' cols
df1.head()
Use a groupby to collect all rows by language value, and subsequently use some of the methods available to GroupBy DataFrames. Note that the GroupBy methods will only act on (and the method call only return) values for columns where numerical calculation makes sense.
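Before running it on the twitter data, here's the split-apply-combine pattern on a toy frame (column names invented):

```python
import pandas as pd

df = pd.DataFrame({'lang': ['en', 'en', 'fr'],
                   'followers': [10, 30, 5]})

# split on 'lang', then apply an aggregation per group
gb = df.groupby('lang')

assert gb.size()['en'] == 2                  # rows per group
assert gb['followers'].mean()['en'] == 20.0  # per-group mean
assert gb['followers'].sum()['fr'] == 5      # per-group sum
```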
# subset df, create new df with only 'popular' accounts -- those matching the filter condition given
pop_df = df1[df1["actor:followersCount"] >= 100]
# fix the klout scores again
pop_df['gnip:klout_score'] = pop_df['gnip:klout_score'].map(floatify)
# in case you need to remind yourself of the dataframe
#pop_df.head()
# use GroupBy methods for stats on each group:
pop_df.groupby("twitter_lang").size() # number of elements per group
#pop_df.groupby("twitter_lang").sum() # sum of elements in each group (obviously doesn't make sense for some cols)
#pop_df.groupby("twitter_lang").mean() # algebraic mean of elements per group
# though this looks like a normal dataframe, the DataFrameGroupBy object has a hierarchical index
# this means it may not act as you might expect.
lang_gb = pop_df[['twitter_lang',\
'gnip:klout_score',\
'actor:followersCount',\
'actor:friendsCount',\
'actor:statusesCount']].groupby('twitter_lang')
# note the new index 'twitter_lang' -- in this case, .head(n) returns <= n elements for each index
lang_gb.head(2)
# see that the type is a DataFrameGroupBy object
#lang_gb
# to get a DataFrame object that responds more like I'm used to, create a new one using the
# aggregate method, which results in a single-index DataFrame
lang_gb_mean = lang_gb.aggregate(np.mean)
lang_gb_mean.head()
# verify the single index
#lang_gb_mean.index
As far as I can tell, plotting in Python was not fun in the past. Below is some easy, base matplotlib, but 'nice' graphics take a lot of code. This situation is changing quite quickly now, with the success of ggplot2 in the R world and the attempts to a) make matplotlib look less sucky, and b) implement the Grammar of Graphics in Python.
# .plot() is a pandas wrapper for matplotlib's plt.plot()
lang_gb_mean['actor:followersCount'].plot(kind='bar', color='g')
# more base matplotlib
plt.scatter(x=lang_gb_mean['actor:followersCount'],\
y=lang_gb_mean['actor:friendsCount'],\
alpha=0.5,\
s=50,\
color='red',\
marker='o')
# now read the docs and copypasta a neat-looking plot
from pandas.tools.plotting import scatter_matrix
scatter_matrix(lang_gb_mean, alpha=0.5, figsize=(12,12), diagonal='kde', s=100)
Finally, a short taste of some other plotting libraries. My munging + plotting skillz in this world are still a work in progress, so I will definitely return to this section with an actual use-case in the future. For now, we'll make up some data for illustrative purposes.
# make up some data with large-scale patterns and a datetime index
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
df.head()
df.plot()
df.hist()
Now, let's use some other matplotlib wrappers to get a sense of how we can make this look a little better....
prettyplotlib essentially just fixes a bunch of matplotlib settings behind the scenes so your base methods lead to e.g. ColorBrewer palettes. This overwrites the .matplotlibrc settings for this session with palettes and settings that are slightly nicer. The settings will remain until you start a new session.
import prettyplotlib
df.plot()
df.hist()
ggplot is a very new port of R's ggplot2 into Python and is in very active development. Many of the really powerful aspects of R's ggplot2 are yet to be implemented, but even the man himself is involved in the development, so be sure to check it out later down the road.
df['date'] = pd.date_range('1/1/2000', periods=1000)
df.head()
from ggplot import *
df_lng = pd.melt(df, id_vars='date')
df_lng.head()
ggplot(df_lng, aes(x='date', y='value', color='variable')) + \
geom_line() + \
stat_smooth(color='red')
ggplot(df_lng, aes(x='value', color='variable', alpha=1/2.)) + \
geom_histogram(binwidth=0.1) + \
ggtitle("Holy histogram!") + labs("Count", "Frequency") + \
ylim(0, 15)