This session draws primarily on Chapter 7 in Python for Data Analysis. It covers methods that are used heavily in 'data wrangling', which refers to the data manipulation that is often needed to transform raw data into a form that is useful for analysis. We'll stick to the data and examples used in the book for most of this session, since the examples are clearer on the tiny datasets. After that we will work through some of these methods again using real data.
Key methods covered include: merge, concat, stack and unstack, duplicated and drop_duplicates, replace, cut, and get_dummies.
import pandas as pd
import numpy as np
Merging two datasets is a very common operation in preparing data for analysis. It generally means adding columns from one table to columns from another, wherever the value of some key, or merge field, matches.
Let's begin by creating two simple DataFrames to be merged.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
print(df1)
print(df2)
Here is a many-to-one merge. The join field is implicit: pandas uses whatever columns the two dataframes have in common. Note that they share some values of the key field (a, b), but do not share key values c and d. What do you expect to happen when we merge them? The result contains the values from both inputs wherever they both have a value of the merge field, which is 'key' in this example. By default, a key value has to appear in both inputs to be kept; in set terms, the result is an intersection of the two sets of keys.
pd.merge(df1,df2)
Here is the same merge, but making the join field explicit.
pd.merge(df1,df2, on='key')
# What if a key value appears more than once in both dataframes? This is a many-to-many merge.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'],'data2': range(4)})
print(df1)
print(df3)
pd.merge(df1,df3, on='key')
# This produces a Cartesian product of the occurrences of each key value in the two dataframes:
# (b shows up 3 times in df1 and 2 times in df3, so we get 6 occurrences in the result of the merge)
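Many-to-many blowups like this are sometimes a surprise. As a sketch, merge's `validate` argument (available in modern pandas versions) can assert the relationship you expect and raise an error when the data violates it:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'data2': range(4)})

# 'b' appears 3 times in df1 and 2 times in df3, so the merge yields 3 * 2 = 6 'b' rows
merged = pd.merge(df1, df3, on='key')
print((merged['key'] == 'b').sum())  # 6

# validate raises MergeError when the stated relationship does not hold;
# here df1's keys are not unique, so 'one_to_many' fails
try:
    pd.merge(df1, df3, on='key', validate='one_to_many')
    validated = True
except pd.errors.MergeError:
    validated = False
print(validated)  # False
```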
# There are several types of joins: left, right, inner, and outer. Let's compare them.
# How does a 'left' join compare to our initial join? Note that it keeps a row for every key that shows up in df1,
# regardless of whether it also shows up in df3. It fills in NaN for the missing values from df3.
pd.merge(df1,df3, on='key', how='left')
# How does a 'right' join compare? Same idea, but this time it keeps a row for every key that shows up in df3,
# regardless of whether it also shows up in df1.
pd.merge(df1,df3, on='key', how='right')
#How does an 'inner' join compare?
pd.merge(df1,df3, on='key', how='inner')
# 'inner' is the default value of the how argument, so this matches our very first merge.
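We can verify that 'inner' is the default by comparing the two calls directly:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'data2': range(4)})

# The default join type is 'inner', so these two results should be identical
default_merge = pd.merge(df1, df3, on='key')
inner_merge = pd.merge(df1, df3, on='key', how='inner')
print(default_merge.equals(inner_merge))  # True
```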
#How does an 'outer' join compare? If inner joins are like an intersection of two sets, outer joins are unions.
pd.merge(df1,df3, on='key', how='outer')
#What if the join fields have different names? No problem - just specify the names.
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'key_2': ['a', 'b', 'b', 'd'],'data2': range(4)})
pd.merge(df4,df5, left_on='key_1', right_on='key_2')
# Here is an example that uses a combination of a data column and an index to merge two dataframes.
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'data2': [4,6,8,10]}, index=['a','b','c','d'])
pd.merge(df4,df5, left_on='key_1', right_index=True)
# Concatenating can append rows or columns, depending on which axis you use. The default is axis 0 (rows).
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1, s2, s3])
# Since we are concatenating series on axis 0, this creates a longer series, appending each of the three series
# What if we concatenate on axis 1?
pd.concat([s1, s2, s3], axis=1)
# Outer join is the default:
pd.concat([s1, s2, s3], axis=1, join='outer')
# What would an inner join produce?
pd.concat([s1, s2, s3], axis=1, join='inner')
# We need some overlapping index values for the inner join to produce non-empty results
s4 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s5 = pd.Series([1, 2, 3], index=['d', 'e', 'f'])
s6 = pd.Series([7, 8, 9, 10], index=['d', 'e', 'f', 'g'])
pd.concat([s4, s5, s6], axis=1, join='outer')
# Here is the inner join
pd.concat([s4, s5, s6], axis=1, join='inner')
# Note that it contains only entries that overlap in all three series.
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'], name='number'))
data
# Stack pivots the columns into rows, producing a Series with a hierarchical index:
result = data.stack()
result
# Unstack reverses this process:
result.unstack()
See also the related pivot method, which reshapes 'long' data to 'wide' format in one step.
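As a minimal sketch of pivot, using a small made-up long-format table (one row per date/item pair):

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) observation
long_df = pd.DataFrame({'date': ['2020-01', '2020-01', '2020-02', '2020-02'],
                        'item': ['apples', 'oranges', 'apples', 'oranges'],
                        'value': [10, 20, 30, 40]})

# pivot reshapes long to wide: one row per date, one column per item
wide = long_df.pivot(index='date', columns='item', values='value')
print(wide)
```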
# Start with a dataframe containing some duplicate values
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,'k2': [1, 1, 2, 3, 3, 4, 99]})
data
# duplicated flags each row that is identical to an earlier row
data.duplicated()
# How to remove duplicate values
data.drop_duplicates()
#If 99 is a code for missing data, we could replace any such values with NaNs
data['k2'].replace(99,np.nan)
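Note that replace returns a new Series rather than modifying the data in place, so to keep the change we assign it back. A dict also lets us map several sentinel codes at once. A small sketch:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 99]})

# replace returns a new Series; assign back to actually change the DataFrame
data['k2'] = data['k2'].replace(99, np.nan)
print(data['k2'].isna().sum())  # 1

# A dict maps several sentinel codes to NaN in one call (hypothetical codes)
codes = pd.Series([1, -999, 2, -1000])
cleaned = codes.replace({-999: np.nan, -1000: np.nan})
print(cleaned.isna().sum())  # 2
```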
# Let's look at how to create categories of data using ranges to bin the data using cut
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
type(cats)
cats.categories
cats.codes
pd.value_counts(cats)
# Consistent with mathematical notation for intervals, a parenthesis means that the side is open while the
#square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:
cats = pd.cut(ages, bins, right=False)
print(ages)
print(pd.value_counts(cats))
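cut also accepts a labels argument, so the bins can carry names of our choosing instead of interval notation. The group names below are just illustrative:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# labels= replaces the (a, b] interval notation with our own bin names
group_names = ['youth', 'young_adult', 'middle_aged', 'senior']
cats = pd.cut(ages, bins, labels=group_names)
counts = pd.Series(cats).value_counts()
print(counts)
```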
# Start by creating a dataframe with 4 columns of 1,000 random numbers
# We'll use a fixed seed for the random number generator to get repeatable results
np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
# This identifies any values in column 3 with absolute values > 3
col = data[3]
col[np.abs(col) > 3]
# This identifies all the rows with any column containing absolute values > 3
data[(np.abs(data) > 3).any(axis=1)]
# Now we can cap the values at -3 to 3 using this:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
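The same capping can be done with the clip method, which bounds values below and above in one call. A sketch, using the same seeded data:

```python
import pandas as pd
import numpy as np

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

# clip caps everything below -3 at -3 and everything above 3 at 3
capped = data.clip(lower=-3, upper=3)
print(capped.abs().max())  # no column exceeds 3 in absolute value
```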
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})
df
# This generates dummy variables for each value of key
# Dummy variables are useful in statistical modeling, to have 0/1 indicator
# variables for the presence of some condition
pd.get_dummies(df['key'])
# This generates dummy variables for each value of key and appends these to the dataframe
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
Notice that we used join instead of merge. The join method is very similar to merge, but it merges on indexes by default. From the documentation:
http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.
The related DataFrame.join method uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.
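A small sketch of the equivalence, using two made-up DataFrames indexed by letter:

```python
import pandas as pd

left = pd.DataFrame({'data1': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'data2': [10, 20, 30]}, index=['a', 'b', 'd'])

# join aligns on the index by default, performing a left join
joined = left.join(right)
print(joined)

# the same result via merge requires spelling out both index flags
merged = pd.merge(left, right, left_index=True, right_index=True, how='left')
print(joined.equals(merged))  # True
```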
# import libraries and read in the csv file
import re, pandas as pd, numpy as np, requests, json
df = pd.read_csv('bay.csv')
print(df[:5])
# clean price and neighborhood
df.price = df.price.str.strip('$').astype('float64')
df.neighborhood = df.neighborhood.str.strip().str.strip('(').str.strip(')')
# break out the date into month day year columns
df['month'] = df['date'].str.split().str[0]
df['day'] = df['date'].str.split().str[1].astype('int32')
df['year'] = df['date'].str.split().str[2].astype('int32')
del df['date']
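Assuming the dates look like 'Apr 15 2016' (three whitespace-separated fields, as the split above implies), pd.to_datetime can parse them in one step and expose the components through the .dt accessor. A sketch on hypothetical strings, since the csv itself isn't reproduced here:

```python
import pandas as pd

# Hypothetical date strings in the three-field format the split above implies
dates = pd.Series(['Apr 15 2016', 'May 1 2016'])

# to_datetime parses the strings; .dt pulls out month, day, and year
parsed = pd.to_datetime(dates)
print(parsed.dt.month.tolist())  # [4, 5]
print(parsed.dt.year.tolist())   # [2016, 2016]
```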
def clean_br(value):
    # Parse the bedroom count out of strings like '/ 2br - 850ft'
    if isinstance(value, str):
        end = value.find('br')
        if end == -1:
            return None
        else:
            start = value.find('/') + 2
            return int(value[start:end])
df['bedrooms'] = df['bedrooms'].map(clean_br)
def clean_sqft(value):
    # Parse the square footage out of strings like '/ 2br - 850ft' or '- 600ft'
    if isinstance(value, str):
        end = value.find('ft')
        if end == -1:
            return None
        else:
            # With no bedroom field the size follows '/', otherwise it follows '-'
            if value.find('br') == -1:
                start = value.find('/') + 2
            else:
                start = value.find('-') + 2
            return int(value[start:end])
df['sqft'] = df['sqft'].map(clean_sqft)
df.head()
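An alternative to the hand-rolled parsers above is a vectorized regex with str.extract, which pulls out the first capture group and yields NaN where there is no match. A sketch on hypothetical strings in the format the functions above assume:

```python
import pandas as pd

# Hypothetical raw strings like those the cleaning functions expect
raw = pd.Series(['/ 2br - 850ft', '/ 1br', '- 600ft', None])

# str.extract returns the first regex group; non-matches (and None) become NaN
bedrooms = raw.str.extract(r'(\d+)br', expand=False).astype('float64')
sqft = raw.str.extract(r'(\d+)ft', expand=False).astype('float64')
print(bedrooms.tolist())  # [2.0, 1.0, nan, nan]
print(sqft.tolist())      # [850.0, nan, 600.0, nan]
```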
df['price'].dropna().describe()
df['price'][(df['price'] < 200)].dropna().describe()
df['price'][(df['price'] > 10000)].dropna().describe()
# Let's get the value at the 99th percentile to see the threshold that the top one percent of our records exceed
df['price'].dropna().quantile(.99)
filtered = df[(df['price'] < 10000) & (df['price'] > 200)]
filtered.dropna().describe()
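Instead of hardcoded cutoffs like 200 and 10000, the bounds could themselves come from quantiles. A sketch on synthetic prices, since bay.csv is not reproduced here:

```python
import pandas as pd
import numpy as np

# Synthetic right-skewed prices standing in for the listings data
np.random.seed(0)
prices = pd.Series(np.random.lognormal(mean=7.5, sigma=0.5, size=1000))

# Keep rows between the 1st and 99th percentiles rather than fixed cutoffs
lo, hi = prices.quantile(0.01), prices.quantile(0.99)
filtered = prices[prices.between(lo, hi)]
print(len(filtered))  # roughly 98% of the rows survive
```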