#!/usr/bin/env python
# coding: utf-8

# # Data Manipulation Part 1 - Groupby Operations
#
# Chapter 9 in Python for Data Analysis demonstrates a variety of methods to analyze data via data aggregation and grouping operations. Those are the focus of this session. Our overall goal for this session is to do Exploratory Data Analysis: looking at and probing our data to learn about the patterns we can discover in them. This often generates a better understanding of problems in the data, as well as revealing relationships that might be worth exploring in more depth.
#
# We will use Census Summary File 1 data for much of this session to illustrate concepts using real data. The variables in our extract are:
#
# P1. POPULATION [1]
# * Total: P0010001
#
# P3. RACE [8] (Universe: Total population)
# * Total: P0030001
# * White alone: P0030002
# * Black or African American alone: P0030003
# * American Indian and Alaska Native alone: P0030004
# * Asian alone: P0030005
# * Native Hawaiian and Other Pacific Islander alone: P0030006
# * Some Other Race alone: P0030007
# * Two or More Races: P0030008
#
# P4. HISPANIC OR LATINO ORIGIN [3] (Universe: Total population)
# * Total: P0040001
# * Not Hispanic or Latino: P0040002
# * Hispanic or Latino: P0040003
#
# H1. HOUSING UNITS [1] (Universe: Housing units)
# * Total: H00010001
#
# H3. OCCUPANCY STATUS [3] (Universe: Housing units)
# * Total: H0030001
# * Occupied: H0030002
# * Vacant: H0030003
#
# H4. TENURE [4] (Universe: Occupied housing units)
# * Total: H0040001
# * Owned with a mortgage or a loan: H0040002
# * Owned free and clear: H0040003
# * Renter-occupied: H0040004
#
# H5. VACANCY STATUS [8] (Universe: Vacant housing units)
# * Total: H0050001
# * For rent: H0050002
# * Rented, not occupied: H0050003
# * For sale only: H0050004
# * Sold, not occupied: H0050005
# * For seasonal, recreational, or occasional use: H0050006
# * For migrant workers: H0050007
# * Other vacant: H0050008

# Below we load a table I created previously with an extract of data from the 2010 Census, Summary File 1, for the Bay Area. The file is loaded from an HDF5 datastore - this and other file formats are covered in Chapter 6. We then add some derived percentage columns, and drop blocks with zero population.

# In[1]:

import pandas as pd

sf1store = pd.HDFStore('data/bay_sf1_small.h5')
sf1 = sf1store['sf1_extract']
sf1['pct_rent'] = sf1['H0040004'] / sf1['H0040001'] * 100
sf1['pct_black'] = sf1['P0030003'] / sf1['P0030001'] * 100
sf1['pct_asian'] = sf1['P0030005'] / sf1['P0030001'] * 100
sf1['pct_white'] = sf1['P0030002'] / sf1['P0030001'] * 100
sf1['pct_hisp'] = sf1['P0040003'] / sf1['P0040001'] * 100
sf1['pct_vacant'] = sf1['H0050001'] / sf1['H00010001'] * 100
# arealand is in square meters; 2589988 is approximately the number of square meters per square mile
sf1['pop_sqmi'] = sf1['P0010001'] / (sf1['arealand'] / 2589988)
# Drop blocks with zero population, since the percentages above are undefined for them
sf1 = sf1[sf1['P0030001'] > 0]
sf1[:5]

# ## Groupby and Aggregation Operations
#
# Groupby is a powerful method in pandas that follows the split-apply-combine approach to data. As shown in Figure 9-1 of the book in the context of a sum operation, the data is first split into groups that share the same key values. Then an operation, in this case a sum, is applied to each group. Then the results are combined.
#
# The built-in aggregation methods available for groupby operations include:
# * count
# * sum
# * mean
# * median
# * std, var
# * min, max
# * first, last
#
# You can also apply your own functions as aggregation methods.
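# To make the split-apply-combine idea concrete before we run it on the census data, here is a minimal sketch of what a grouped sum does under the hood. The tiny DataFrame here is invented purely for illustration:

# In[ ]:

toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1, 2, 3, 4]})

# Split the rows into groups by key, apply a sum to each group, then combine the results
pieces = {key: group['value'].sum() for key, group in toy.groupby('key')}
print(pd.Series(pieces))
# This matches the one-line equivalent: toy.groupby('key')['value'].sum()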
# Let's apply this approach to computing total population in each county in our dataset. We can do this in two steps to help explain what is happening. First we create a groupby object, using county codes to group all the census blocks in sf1 into groups that share the same county code.

# In[2]:

grouped = sf1['P0010001'].groupby(sf1['county'])
grouped

# Now that we have this grouping object that represents the **split** part of the workflow, we can **apply** operations and **combine** the results using methods like sum:

# In[3]:

grouped.sum()

# Doing this in two steps like above is really just to clarify the split and apply steps that happen within a groupby operation. Normally we would not bother separately creating a groupby object -- we would just do this in one line:

# In[4]:

sf1['P0010001'].groupby(sf1['county']).sum()

# Let's add county names to the dataframe so we get more readable output, and rerun this aggregation.

# In[5]:

county_names = {'001': 'Alameda',
                '013': 'Contra Costa',
                '041': 'Marin',
                '055': 'Napa',
                '075': 'San Francisco',
                '081': 'San Mateo',
                '085': 'Santa Clara',
                '095': 'Solano',
                '097': 'Sonoma'}

# Let's add county_name as a column in the dataframe. It would be easy to append it as the last column with a merge, but let's see how to insert it in a specified location so that it is easier to read when we browse the data. We can insert it at position 4 (counting from zero), between county and tract, like so:

# In[6]:

sf1.insert(4, 'county_name', sf1['county'].replace(county_names))
sf1[:5]

# Now we can print the results of summing population by county_name:

# In[7]:

print('Total Population by County:')
print(sf1['P0010001'].groupby(sf1['county_name']).sum())

# We might want to capture the result in a DataFrame if we want to use it in other processing, like merging the results back onto the original DataFrame.

# In[8]:

county_pop = sf1['P0010001'].groupby(sf1['county_name']).sum().to_frame(name='total_population')
county_pop

# Here we merge the county total population with sf1 and create a new DataFrame.

# In[9]:

sf2 = pd.merge(sf1, county_pop, left_on='county_name', right_index=True)
sf2[:5]

# Let's say we wanted to compute the population per square mile by county using the groupby method. We could create another dataframe with total land area by county, and then divide total population by total area.

# In[10]:

county_land = sf1['arealand'].groupby(sf1['county_name']).sum().to_frame(name='total_area')
county_land

# In[11]:

county_pop_per_sqmi = county_pop['total_population'] / county_land['total_area'] * 2589988.11
county_pop_per_sqmi

# Or of course we could have done this whole thing in one line:

# In[12]:

sf1['P0010001'].groupby(sf1['county_name']).sum() / sf1['arealand'].groupby(sf1['county_name']).sum() * 2589988.11

# ## Your turn to practice:
#
# Count the number of census blocks per county.
#
# Calculate total households per county.
#
# Calculate percent renters by county. (Be careful not to calculate the mean percent rental across blocks in a county.)
#
# Calculate percent vacant by county.
#
# Calculate mean, min and max pop_sqmi (at the block level) by county.
#
# Calculate the 90th percentile of pop_sqmi (at the block level) by county.

# In[ ]:


# In[ ]:


# ## Transforming Data with Groupby
#
# In some cases you may want to apply a function to your data, by group. An example would be to normalize a column by the mean of each group. Say we wanted to subtract the mean population density of each county from the population density of each census block. We could write a function to subtract the mean from each value, and then use the transform operation to apply this to each group:

# In[13]:

def demean(arr):
    return arr - arr.mean()

# Now we can apply this transformation to columns in our dataframe.
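# As a quick sanity check first, on a toy Series invented here for illustration, demean shifts values so they average to zero:

# In[ ]:

print(demean(pd.Series([1.0, 2.0, 3.0, 6.0])))
# The mean is 3.0, so the result is -2.0, -1.0, 0.0, 3.0, which itself has a mean of zero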
# As examples, let's 'demean' the pop_sqmi and pct_rent columns, subtracting the county-wide mean of these values from the block-specific values, so that the result has a mean of zero within each county.
#
# To check the results, we print the means per county, then the original values for the first 5 rows, then the transformed results. We should be able to reproduce each transformed value by subtracting the appropriate county mean from the block value.

# In[14]:

normalized = sf1[['pop_sqmi', 'pct_rent']].groupby(sf1['county_name']).transform(demean)
print(sf1[['pop_sqmi', 'pct_rent']].groupby(sf1['county_name']).mean())
print(sf1[['county_name', 'pop_sqmi', 'pct_rent']][:5])
print(normalized[:5])

# We can merge these transformed results onto the original DataFrame, and check the means of the original variables and the transformed ones. The means of the transformed columns should be zero, up to floating-point rounding.

# In[15]:

# The merge appends _x (original) and _y (demeaned) suffixes to the overlapping column names
sf2 = pd.merge(sf1, normalized, left_index=True, right_index=True)
sf2.groupby('county_name')[['pop_sqmi_x', 'pop_sqmi_y', 'pct_rent_x', 'pct_rent_y']].mean()

# Apply is a method we have learned previously, which allows us to apply a function along an axis of a DataFrame. We can also combine apply with groupby to apply functions based on group membership. For example, the function 'top' below sorts a DataFrame and selects the top n rows from it. We provide defaults for the arguments giving how many rows to select, and the column to use for the selection:

# In[16]:

def top(df, n=5, column='pop_sqmi'):
    return df.sort_values(by=column, ascending=False).head(n)

# Using this on the full dataset, and overriding the default arguments to set the number of rows to 10 and the selection column to pct_rent, we get the top 10 blocks in the region in terms of percentage rental.

# In[17]:

top(sf1, n=10, column='pct_rent')

# Below we apply this with groupby and use the defaults for n and column. It applies the function within each county and concatenates the results, producing the top 5 blocks on pop_sqmi for each county in the region.

# In[18]:

sf1.groupby('county_name').apply(top)

# Here we pass arguments to the function to set n and the column to select the top value from.

# In[19]:

sf1.groupby('county_name').apply(top, n=1, column='arealand')

# ## Experimenting with Rental Listings Merged with SF1
#
# Let's read the geocoded rental listings for the Bay Area to begin. We make sure the fips_block column is read as a string dtype so we can merge properly with the census data, where it is a string with leading zeros.

# In[20]:

rentals = pd.read_csv('data/sfbay_geocoded.csv',
                      usecols=['rent', 'bedrooms', 'sqft', 'fips_block', 'longitude', 'latitude'],
                      dtype={'fips_block': str})
rentals[:5]

# And merge it with the census data using the FIPS block codes, which are named differently in the two DataFrames.

# In[21]:

rentals_sf1 = pd.merge(rentals, sf1, left_on='fips_block', right_on='blockfips')
rentals_sf1[:10]

# Now we can begin looking at this merged dataset. Let's start by computing mean rents by county.

# In[22]:

county_rents = rentals_sf1.groupby(rentals_sf1['county_name'])[['rent']].mean().sort_values(by='rent', ascending=False)
county_rents

# In[23]:

get_ipython().run_line_magic('matplotlib', 'inline')
county_rents.plot(kind='bar', color='g')

# Those patterns generally conform to our expectations, with San Francisco highest and Solano lowest.
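# Mean rents can be pulled upward by a few very expensive listings, so as a quick robustness check (a sketch reusing the merged DataFrame from above), we can compare medians the same way:

# In[ ]:

rentals_sf1.groupby('county_name')['rent'].median().sort_values(ascending=False)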
# Let's examine these patterns by county and number of bedrooms. We can just use two columns in the groupby to do this.

# In[24]:

rentals_sf1.groupby(['county_name', 'bedrooms'])[['rent']].mean()

# That output is a bit too long, since there are rows for many bedroom counts (and few units have more than 6 bedrooms!). Let's trim this by looking only at units with fewer than 4 bedrooms.

# In[25]:

rentals_sf1[rentals_sf1['bedrooms'] < 4].groupby(['county_name', 'bedrooms'])[['rent']].mean()

# Let's plot a bar chart of these rents by county and bedrooms.

# In[33]:

rentals_sf1[rentals_sf1['bedrooms'] < 4].groupby(['county_name', 'bedrooms'])[['rent']].mean().plot(kind='bar')

# That bar chart is not ideal: it is too small, and it would be nicer to use a separate color for each number of bedrooms. Notice that we have a hierarchical index (county, then bedrooms within county). We can use unstack on this to make it easier to work with for the bar charts, or even to look at as a table:

# In[32]:

rentals_sf1[rentals_sf1['bedrooms'] < 4].groupby(['county_name', 'bedrooms'])[['rent']].mean().unstack()

# Now we can plot a bar chart with the unstacked data, add a title, and set the figure size.

# In[35]:

rentals_sf1[rentals_sf1['bedrooms'] < 4].groupby(['county_name', 'bedrooms'])[['rent']].mean().unstack().plot(kind='bar', figsize=(14,6), title='Average Rents by County and Bedrooms')

# Pandas also provides a syntax for doing this that is familiar to spreadsheet users: pivot_table. You can see below that using pivot_table achieves exactly the same result.

# In[36]:

pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms'] < 4], values='rent', index=['county_name'], columns=['bedrooms'])

# In[38]:

pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms'] < 4], values='rent', index=['county_name'], columns=['bedrooms']).plot(kind='bar', figsize=(14,6), title='Average Rents by County and Bedrooms')

# In addition to providing a convenient interface to groupby, pivot_table can also add partial totals, or margins.

# In[39]:

pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms'] < 4], values='rent', index=['county_name'], columns=['bedrooms'], margins=True)

# Although mean is the default type of aggregation in pivot_table, you can use other methods like len to get a frequency crosstab, with the count of observations in each cell of the table as well as the margins. Let's remove the filter on number of bedrooms to see the full frequency distribution.

# In[40]:

pd.pivot_table(rentals_sf1, values='rent', index=['county_name'], columns=['bedrooms'], aggfunc=len, margins=True)

# A simpler way to generate a crosstabulation is also built in to pandas: the crosstab method.

# In[41]:

pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True)

# Besides the simpler syntax, the other advantage of crosstab is that you can normalize the crosstab, dividing all the cell values by the grand total. This tells us the fraction of the region's total listings that are in each combination of county and number of bedrooms.

# In[42]:

pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize=True)

# We also have the option to normalize on the index, which is county_name in this case, so we see the fraction of each county's listings that have each number of bedrooms. The rows sum to 1.

# In[43]:

pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize='index')
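# We can verify that directly; a quick sketch that recomputes the row-normalized crosstab (without margins this time) and sums across each row:

# In[ ]:

ct = pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], normalize='index')
ct.sum(axis=1)  # every county row sums to 1.0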
# The counterpart is to normalize by column, in which case we see the fraction of the region's 1-bedroom listings that are in each county, and so on for each number of bedrooms.

# In[44]:

pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize='columns')

# What if we want to look at more statistics than just the mean? We can combine several aggregation methods and compute them all at once.

# In[45]:

rentals_sf1[rentals_sf1['bedrooms'] < 4].groupby(['county_name', 'bedrooms'])[['rent']].agg(['mean', 'std', 'min', 'max'])

# Remember the 'top' function we defined earlier? We can use it here to get the top rental listing in each county, for each bedroom count.

# In[46]:

rentals_sf1.groupby(['county_name', 'bedrooms']).apply(top, n=1, column='rent')

# ## Exploring Correlations in the Data
#
# Pandas provides simple ways of computing correlation coefficients among the columns in your DataFrame. If you use corr() on a full DataFrame, it produces a large correlation table that is a bit hard to navigate, and mostly full of results you would not be interested in.

# In[47]:

rentals_sf1.corr()

# It is easy to compute correlation coefficients for a subset of columns.

# In[48]:

rentals_sf1[['rent', 'sqft']].corr()

# And this method can be combined with groupby to compute correlation tables by group.

# In[49]:

rentals_sf1.groupby('county_name')[['rent', 'sqft']].corr()

# In[50]:

rentals_sf1[rentals_sf1['bedrooms'] < 4].groupby('bedrooms')[['rent', 'sqft']].corr()

# ## Quantile and Bucket Analysis
#
# Below we use the cut method to create categories for ranges of a variable. In this example, cut gives us 4 equal-width intervals. If we used qcut instead, it would split the data at quantile boundaries, so that each bucket contains roughly the same number of observations.

# In[55]:

factor = pd.cut(rentals_sf1['sqft'], 4)
factor[:5]

# Define a function to compute grouped statistics.

# In[52]:

def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

# Apply the function to a specified column of the DataFrame, grouped by the sqft categories.

# In[53]:

grouped = rentals_sf1['rent'].groupby(factor)
grouped.apply(get_stats).unstack()

# ## Your turn to experiment
#
# Try practicing these techniques on your own, to do the following:
#
# * Calculate the mean sqft of rental listings by county
# * Calculate the standard deviation (std) of sqft of rental listings by county and bedroom
# * Add a new column with a normalized sqft, subtracting the mean sqft by bedroom from each listing's sqft
# * Compute correlation coefficients among rent, sqft, pct_white, pct_black, pct_asian and pct_hisp, by county and for the region
# * Redo the statistical profile on rents by categories of sqft range using 10 quantiles rather than 4 equal bins (a starting hint follows below)

# In[ ]:
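# As a starting hint for the last exercise: qcut has the same interface as cut, but splits at quantile boundaries. A minimal sketch creating 10 quantile buckets of sqft:

# In[ ]:

deciles = pd.qcut(rentals_sf1['sqft'], 10)
deciles[:5]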