Chapter 9 in Python for Data Analysis demonstrates a variety of methods to analyze data via data aggregation and grouping operations. Those are the focus of this session. Our overall goal for this session is to do Exploratory Data Analysis, which is essentially looking at and probing our data to learn about the patterns we can discover in them. Often this can generate a better understanding of problems in the data, as well as revealing relationships that might be worth exploring in more depth.
We will use Census Summary File 1 data for much of this session to illustrate concepts using real data.
Below we load a table I created previously with an extract of data from the 2010 Census, Summary File 1, for the Bay Area. The file is being loaded from an HDF5 datastore - this and other file formats are covered in Chapter 6.
import pandas as pd
sf1store = pd.HDFStore('data/bay_sf1_small.h5')
sf1 = sf1store['sf1_extract']
sf1['pct_rent'] = sf1['H0040004'] / sf1['H0040001'] * 100
sf1['pct_black'] = sf1['P0030003'] / sf1['P0030001'] * 100
sf1['pct_asian'] = sf1['P0030005'] / sf1['P0030001'] * 100
sf1['pct_white'] = sf1['P0030002'] / sf1['P0030001'] * 100
sf1['pct_hisp'] = sf1['P0040003'] / sf1['P0040001'] * 100
sf1['pct_vacant'] = sf1['H0050001'] / sf1['H00010001'] * 100
sf1['pop_sqmi'] = (sf1['P0010001'] / (sf1['arealand'] / 2589988))
sf1 = sf1[sf1['P0030001']>0]
sf1[:5]
logrecno | blockfips | state | county | tract | blkgrp | block | arealand | P0010001 | P0020001 | ... | H0050006 | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 26 | 060014271001001 | 06 | 001 | 427100 | 1 | 1001 | 79696 | 113 | 113 | ... | 0 | 0 | 0 | 80.000000 | 13.274336 | 5.309735 | 78.761062 | 1.769912 | 0.000000 | 3672.312839 |
3 | 28 | 060014271001003 | 06 | 001 | 427100 | 1 | 1003 | 19546 | 29 | 29 | ... | 0 | 0 | 2 | 70.000000 | 13.793103 | 27.586207 | 37.931034 | 24.137931 | 23.076923 | 3842.712166 |
4 | 29 | 060014271001004 | 06 | 001 | 427100 | 1 | 1004 | 14364 | 26 | 26 | ... | 0 | 0 | 0 | 75.000000 | 0.000000 | 38.461538 | 34.615385 | 0.000000 | 0.000000 | 4688.087441 |
6 | 31 | 060014271001006 | 06 | 001 | 427100 | 1 | 1006 | 1281 | 2 | 2 | ... | 0 | 0 | 0 | 100.000000 | 0.000000 | 50.000000 | 0.000000 | 100.000000 | 0.000000 | 4043.697112 |
7 | 32 | 060014271001007 | 06 | 001 | 427100 | 1 | 1007 | 19020 | 30 | 30 | ... | 0 | 0 | 0 | 33.333333 | 0.000000 | 43.333333 | 50.000000 | 0.000000 | 10.000000 | 4085.154574 |
5 rows × 49 columns
Groupby is a powerful method in pandas that follows the split-apply-combine approach to data. As shown in Figure 9-1 in the context of a sum operation, the data is first split into groups that share the same key values. Then an operation, in this case a sum, is applied to each group. Then the results are combined.
The built-in aggregation methods available for groupby operations include:
You can also apply your own functions as aggregation methods.
Let's apply this approach to computing total population in each county in our dataset. We can do this in two steps to help explain what is happening. First we create a groupby object, using county codes to group all the census blocks in sf1 into groups that share the same county code.
grouped = sf1['P0010001'].groupby(sf1['county'])
grouped
<pandas.core.groupby.SeriesGroupBy object at 0x103f3fe10>
Now that we have this grouping object that represents the split part of the workflow in the figure above, we can apply operations and combine the results using methods like sum:
grouped.sum()
county 001 1510271 013 1049025 041 252409 055 136484 075 805235 081 718451 085 1781642 095 413344 097 483878 Name: P0010001, dtype: int64
Doing this in two steps like above is really just to clarify the two parts of the split and apply process that happen within a groupy operation. Normally we would not bother separately creating a groupby object -- we would just do this in one line:
sf1['P0010001'].groupby(sf1['county']).sum()
county 001 1510271 013 1049025 041 252409 055 136484 075 805235 081 718451 085 1781642 095 413344 097 483878 Name: P0010001, dtype: int64
Let's add county names to the dataframe so we get more readable output, and rerun this aggregation.
county_names = {'001': 'Alameda', '013': 'Contra Costa', '041': 'Marin', '055': 'Napa', '075': 'San Francisco',
'081': 'San Mateo', '085': 'Santa Clara', '095': 'Solano', '097': 'Sonoma'}
Let's add county_name as a column in the dataframe. It would be easy to append it as the last column with a merge, but let's see how to insert it in a specified location so that it is easier to read when we browse the data. We can insert it as the 4th column, between county and tract, like so:
sf1.insert(4, 'county_name', sf1['county'].replace(county_names))
sf1[:5]
logrecno | blockfips | state | county | county_name | tract | blkgrp | block | arealand | P0010001 | ... | H0050006 | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 26 | 060014271001001 | 06 | 001 | Alameda | 427100 | 1 | 1001 | 79696 | 113 | ... | 0 | 0 | 0 | 80.000000 | 13.274336 | 5.309735 | 78.761062 | 1.769912 | 0.000000 | 3672.312839 |
3 | 28 | 060014271001003 | 06 | 001 | Alameda | 427100 | 1 | 1003 | 19546 | 29 | ... | 0 | 0 | 2 | 70.000000 | 13.793103 | 27.586207 | 37.931034 | 24.137931 | 23.076923 | 3842.712166 |
4 | 29 | 060014271001004 | 06 | 001 | Alameda | 427100 | 1 | 1004 | 14364 | 26 | ... | 0 | 0 | 0 | 75.000000 | 0.000000 | 38.461538 | 34.615385 | 0.000000 | 0.000000 | 4688.087441 |
6 | 31 | 060014271001006 | 06 | 001 | Alameda | 427100 | 1 | 1006 | 1281 | 2 | ... | 0 | 0 | 0 | 100.000000 | 0.000000 | 50.000000 | 0.000000 | 100.000000 | 0.000000 | 4043.697112 |
7 | 32 | 060014271001007 | 06 | 001 | Alameda | 427100 | 1 | 1007 | 19020 | 30 | ... | 0 | 0 | 0 | 33.333333 | 0.000000 | 43.333333 | 50.000000 | 0.000000 | 10.000000 | 4085.154574 |
5 rows × 50 columns
Now we can print the results of summing population by county_name:
print('Total Population by County:')
print(sf1['P0010001'].groupby(sf1['county_name']).sum())
Total Population by County: county_name Alameda 1510271 Contra Costa 1049025 Marin 252409 Napa 136484 San Francisco 805235 San Mateo 718451 Santa Clara 1781642 Solano 413344 Sonoma 483878 Name: P0010001, dtype: int64
We might want to capture the result in a DataFrame if we want to use it in other processing, like merging the results to the original DataFrame.
county_pop = sf1['P0010001'].groupby(sf1['county_name']).sum().to_frame(name='total_population').reset_index()
county_pop
county_name | total_population | |
---|---|---|
0 | Alameda | 1510271 |
1 | Contra Costa | 1049025 |
2 | Marin | 252409 |
3 | Napa | 136484 |
4 | San Francisco | 805235 |
5 | San Mateo | 718451 |
6 | Santa Clara | 1781642 |
7 | Solano | 413344 |
8 | Sonoma | 483878 |
Here we merge the county total population with sf1 and create a new DataFrame.
sf2 = pd.merge(sf1,county_pop, left_index=True, right_index=True)
sf2.shape
(5, 52)
Let's say we wanted to compute the population per square mile by county using the groupby method. We could go ahead and create another dataframe with total area by county than then do the division of total population by total area.
county_land = sf1['arealand'].groupby(sf1['county_name']).sum().to_frame(name='total_area').reset_index()
county_land
county_name | total_area | |
---|---|---|
0 | Alameda | 1190434861 |
1 | Contra Costa | 1095085515 |
2 | Marin | 1046029032 |
3 | Napa | 1556005658 |
4 | San Francisco | 95535946 |
5 | San Mateo | 884654868 |
6 | Santa Clara | 2378681334 |
7 | Solano | 1224964331 |
8 | Sonoma | 3206326062 |
county_pop_per_sqmi = county_pop['total_population'] / county_land['total_area'] * 2589988.11
county_pop_per_sqmi
0 3285.844577 1 2481.050329 2 624.969565 3 227.179082 4 21829.993454 5 2103.396042 6 1939.911635 7 873.948750 8 390.864261 dtype: float64
Or of course we could have done this whole thing in one line:
sf1['P0010001'].groupby(sf1['county_name']).sum() / sf1['arealand'].groupby(sf1['county_name']).sum() * 2589988.11
county_name Alameda 3285.844577 Contra Costa 2481.050329 Marin 624.969565 Napa 227.179082 San Francisco 21829.993454 San Mateo 2103.396042 Santa Clara 1939.911635 Solano 873.948750 Sonoma 390.864261 dtype: float64
Count the number of census blocks per county.
Calculate total households per county.
Calculate percent renters by county. (Careful not to calculate the mean percent rental across blocks in a county)
Calculate percent vacant by county.
Calculate mean, min and max pop_sqmi (at the block level) by county.
Calculate the 90th percentile of pop_sqmi (at the block level) by county.
print('Number of census blocks per county:')
sf1['block'].groupby(sf1['county_name']).count()
Number of census blocks per county:
county_name Alameda 14082 Contra Costa 11351 Marin 2857 Napa 1819 San Francisco 5137 San Mateo 6876 Santa Clara 14947 Solano 5188 Sonoma 6540 Name: block, dtype: int64
print('Total households per county')
sf1['H0030002'].groupby(sf1['county_name']).sum()
Total households per county
county_name Alameda 545138 Contra Costa 375364 Marin 103210 Napa 48876 San Francisco 345811 San Mateo 257837 Santa Clara 604204 Solano 141758 Sonoma 185825 Name: H0030002, dtype: int64
print('percent renters by county')
round(sf1['H0040004'].groupby(sf1['county_name']).sum() / sf1['H0040001'].groupby(sf1['county_name']).sum() * 100, 1)
percent renters by county
county_name Alameda 46.6 Contra Costa 32.9 Marin 37.4 Napa 37.4 San Francisco 64.2 San Mateo 40.6 Santa Clara 42.4 Solano 36.8 Sonoma 39.6 dtype: float64
print('Percent vacant by county')
round(sf1['H0030003'].groupby(sf1['county_name']).sum() / sf1['H0030001'].groupby(sf1['county_name']).sum() * 100, 1)
Percent vacant by county
county_name Alameda 6.3 Contra Costa 6.2 Marin 7.1 Napa 10.6 San Francisco 8.1 San Mateo 4.9 Santa Clara 4.4 Solano 7.0 Sonoma 9.0 dtype: float64
print('Min, Max and Mean Population per SQMI by Census Block')
(sf1['P0010001']/sf1['arealand']* 2589988.11).groupby(sf1['county_name']).agg(['min','max','mean'])
Min, Max and Mean Population per SQMI by Census Block
min | max | mean | |
---|---|---|---|
county_name | |||
Alameda | 0.223726 | 3.172735e+06 | 13753.632628 |
Contra Costa | 0.266695 | 3.647259e+05 | 8081.846587 |
Marin | 0.291499 | 4.192076e+05 | 6338.936421 |
Napa | 0.246661 | 2.929851e+05 | 6245.985286 |
San Francisco | 4.579713 | 5.604169e+05 | 28395.094743 |
San Mateo | 0.474623 | 2.736351e+05 | 11011.638956 |
Santa Clara | 0.240408 | 9.106351e+05 | 10598.597995 |
Solano | 0.195976 | 2.886576e+05 | 7203.793344 |
Sonoma | 0.238658 | 1.685384e+05 | 5415.877218 |
print('90th Percentile of Population per SQMI at block level by County')
(sf1['P0010001']/sf1['arealand']* 2589988.11).groupby(sf1['county_name']).quantile(.9)
90th Percentile of Population per SQMI at block level by County
county_name Alameda 23400.814119 Contra Costa 15173.977895 Marin 11928.112422 Napa 12409.618639 San Francisco 49067.265129 San Mateo 22779.855003 Santa Clara 20424.779379 Solano 12934.963266 Sonoma 12283.580392 dtype: float64
# Here is a compact way to capture several of these results in one table
sf1['total_hh']=sf1['H0030002']
sf1.groupby(['county_name']).agg({'block': 'count',
'total_hh' : 'sum',
'pop_sqmi': ['min','max','mean']})
block | pop_sqmi | total_hh | |||
---|---|---|---|---|---|
count | min | max | mean | sum | |
county_name | |||||
Alameda | 14082 | 0.223726 | 3.172735e+06 | 13753.632044 | 545138 |
Contra Costa | 11351 | 0.266695 | 3.647259e+05 | 8081.846244 | 375364 |
Marin | 2857 | 0.291499 | 4.192076e+05 | 6338.936151 | 103210 |
Napa | 1819 | 0.246661 | 2.929851e+05 | 6245.985021 | 48876 |
San Francisco | 5137 | 4.579713 | 5.604168e+05 | 28395.093537 | 345811 |
San Mateo | 6876 | 0.474623 | 2.736351e+05 | 11011.638488 | 257837 |
Santa Clara | 14947 | 0.240408 | 9.106351e+05 | 10598.597545 | 604204 |
Solano | 5188 | 0.195976 | 2.886576e+05 | 7203.793038 | 141758 |
Sonoma | 6540 | 0.238658 | 1.685384e+05 | 5415.876988 | 185825 |
In some cases you may want to apply a function to your data, by group. An example would be to normalize a column by a mean of each group. Say we wanted to subtract the mean population density of each county from the population density of each census block. We could write a function to subtract the mean from each value, and then use the transform operation to apply this to each group:
def demean(arr):
return arr - arr.mean()
Now we can apply this tranformation to columns in our dataframe. As examples, let's 'demean' the pop_sqmi and pct_rent columns, subtracting the county-wide mean of these values from the block-specific values, so that the result is transformed to have a mean of zero within each county.
To check the results, we print the means per county, then the original values for the first 5 rows, then the transformed results. The transformed results we should be able to calculate by subtracting the appropriate county mean from the block value.
normalized = sf1[['pop_sqmi', 'pct_rent']].groupby(sf1['county_name']).transform(demean)
print(sf1[['pop_sqmi', 'pct_rent']].groupby(sf1['county_name']).mean())
pop_sqmi pct_rent county_name Alameda 13753.632044 37.484398 Contra Costa 8081.846244 27.223329 Marin 6338.936151 33.169454 Napa 6245.985021 32.903209 San Francisco 28395.093537 51.927943 San Mateo 11011.638488 30.274104 Santa Clara 10598.597545 29.811290 Solano 7203.793038 34.598761 Sonoma 5415.876988 35.121698
print(sf1[['county_name','pop_sqmi', 'pct_rent']][:5])
print(normalized[:5])
county_name pop_sqmi pct_rent 1 Alameda 3672.312839 80.000000 3 Alameda 3842.712166 70.000000 4 Alameda 4688.087441 75.000000 6 Alameda 4043.697112 100.000000 7 Alameda 4085.154574 33.333333 pop_sqmi pct_rent 1 -10081.319205 42.515602 3 -9910.919878 32.515602 4 -9065.544603 37.515602 6 -9709.934932 62.515602 7 -9668.477470 -4.151064
We can merge these transformed results on to the original DataFrame, and check the means of the original variables and the tranformed ones. The transformed ones should be arbitrarily close to zero.
sf2 = pd.merge(sf1,normalized, left_index=True, right_index=True)
sf2.groupby('county_name')[['pop_sqmi_x', 'pop_sqmi_y', 'pct_rent_x', 'pct_rent_y']].mean()
pop_sqmi_x | pop_sqmi_y | pct_rent_x | pct_rent_y | |
---|---|---|---|---|
county_name | ||||
Alameda | 13753.632044 | -7.944529e-11 | 37.484398 | 6.856168e-14 |
Contra Costa | 8081.846244 | -7.040712e-12 | 27.223329 | -3.800049e-13 |
Marin | 6338.936151 | 1.461176e-12 | 33.169454 | 9.143583e-15 |
Napa | 6245.985021 | -1.304992e-13 | 32.903209 | -1.685373e-14 |
San Francisco | 28395.093537 | 6.833338e-12 | 51.927943 | -1.349699e-15 |
San Mateo | 11011.638488 | 8.192330e-12 | 30.274104 | -7.197617e-14 |
Santa Clara | 10598.597545 | 5.347880e-11 | 29.811290 | -1.211290e-13 |
Solano | 7203.793038 | 9.815285e-12 | 34.598761 | -5.029091e-14 |
Sonoma | 5415.876988 | 1.308518e-11 | 35.121698 | -5.937600e-14 |
Apply is a method we have learned previously, which allows us to apply a function to each row in a DataFrame. We can also combine apply with groupby to apply functions based on group membership. For example, the function 'top' sorts an array and selects the top n rows from it. We provide some defaults for the arguments of how many rows, and the column to use for the selection:
def top(df, n=5, column='pop_sqmi'):
return df.sort_values(by=column, ascending=False).head(n)
Using this on the full dataset and setting the number of rows and the column to get the top values for, in this case using pct_rent to override the default argument, we get the top 10 blocks in the region in terms of percentage rental.
top(sf1, n=10, column='pct_rent')
logrecno | blockfips | state | county | county_name | tract | blkgrp | block | arealand | P0010001 | ... | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | total_hh | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55399 | 593009 | 060750352022006 | 06 | 075 | San Francisco | 035202 | 2 | 2006 | 5420 | 38 | ... | 0 | 0 | 100.0 | 0.000000 | 18.421053 | 68.421053 | 2.631579 | 7.142857 | 18158.587454 | 26 |
55283 | 592876 | 060750332043011 | 06 | 075 | San Francisco | 033204 | 3 | 3011 | 19179 | 160 | ... | 0 | 0 | 100.0 | 10.000000 | 18.125000 | 50.625000 | 23.750000 | 10.000000 | 21606.865843 | 72 |
55270 | 592862 | 060750332042000 | 06 | 075 | San Francisco | 033204 | 2 | 2000 | 3140 | 7 | ... | 0 | 0 | 100.0 | 0.000000 | 100.000000 | 0.000000 | 0.000000 | 0.000000 | 5773.858599 | 3 |
55271 | 592863 | 060750332042001 | 06 | 075 | San Francisco | 033204 | 2 | 2001 | 70847 | 1480 | ... | 0 | 2 | 100.0 | 6.824324 | 34.054054 | 47.567568 | 12.364865 | 8.064516 | 54105.074880 | 627 |
55273 | 592866 | 060750332043001 | 06 | 075 | San Francisco | 033204 | 3 | 3001 | 46198 | 41 | ... | 0 | 0 | 100.0 | 9.756098 | 14.634146 | 48.780488 | 19.512195 | 6.250000 | 2298.573704 | 15 |
55277 | 592870 | 060750332043005 | 06 | 075 | San Francisco | 033204 | 3 | 3005 | 15537 | 85 | ... | 0 | 0 | 100.0 | 9.411765 | 30.588235 | 44.705882 | 14.117647 | 30.769231 | 14169.336423 | 45 |
55278 | 592871 | 060750332043006 | 06 | 075 | San Francisco | 033204 | 3 | 3006 | 2013 | 2 | ... | 0 | 0 | 100.0 | 0.000000 | 0.000000 | 100.000000 | 0.000000 | 0.000000 | 2573.261798 | 1 |
55279 | 592872 | 060750332043007 | 06 | 075 | San Francisco | 033204 | 3 | 3007 | 22700 | 697 | ... | 0 | 3 | 100.0 | 4.304161 | 42.467719 | 39.885222 | 13.199426 | 12.745098 | 79525.182203 | 267 |
55280 | 592873 | 060750332043008 | 06 | 075 | San Francisco | 033204 | 3 | 3008 | 21981 | 211 | ... | 0 | 1 | 100.0 | 5.213270 | 26.540284 | 47.867299 | 19.431280 | 8.000000 | 24861.811019 | 92 |
55281 | 592874 | 060750332043009 | 06 | 075 | San Francisco | 033204 | 3 | 3009 | 15863 | 137 | ... | 0 | 0 | 100.0 | 6.569343 | 29.927007 | 48.175182 | 17.518248 | 12.121212 | 22368.300826 | 58 |
10 rows × 51 columns
Below we apply this with groupby and use the defaults for n and column, and it applies the function within each county and concatenates the results, producing the top 5 blocks on pop_sqmi for each county in the region.
sf1.groupby('county_name').apply(top)
logrecno | blockfips | state | county | county_name | tract | blkgrp | block | arealand | P0010001 | ... | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | total_hh | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
county_name | ||||||||||||||||||||||
Alameda | 8494 | 9073 | 060014311002003 | 06 | 001 | Alameda | 431100 | 2 | 2003 | 320 | 392 | ... | 0 | 1 | 100.000000 | 20.663265 | 10.714286 | 34.438776 | 37.244898 | 8.720930 | 3.172735e+06 | 157 |
4786 | 5117 | 060014419251004 | 06 | 001 | Alameda | 441925 | 1 | 1004 | 1240 | 801 | ... | 1 | 3 | 100.000000 | 4.119850 | 64.544320 | 22.721598 | 5.617978 | 4.591837 | 1.673049e+06 | 374 | |
19003 | 20241 | 060014028002013 | 06 | 001 | Alameda | 402800 | 2 | 2013 | 941 | 370 | ... | 0 | 1 | 100.000000 | 39.729730 | 14.054054 | 36.486486 | 8.378378 | 17.307692 | 1.018380e+06 | 215 | |
9130 | 9766 | 060014351042022 | 06 | 001 | Alameda | 435104 | 2 | 2022 | 1013 | 246 | ... | 0 | 0 | 100.000000 | 2.032520 | 25.609756 | 22.357724 | 42.682927 | 3.225806 | 6.289606e+05 | 60 | |
9702 | 10388 | 060014369001004 | 06 | 001 | Alameda | 436900 | 1 | 1004 | 2148 | 431 | ... | 0 | 0 | 98.876404 | 3.712297 | 0.000000 | 24.593968 | 91.879350 | 1.111111 | 5.196857e+05 | 89 | |
Contra Costa | 37338 | 56116 | 060133551151034 | 06 | 013 | Contra Costa | 355115 | 1 | 1034 | 2386 | 336 | ... | 0 | 0 | 100.000000 | 12.797619 | 36.607143 | 36.904762 | 9.226190 | 3.726708 | 3.647259e+05 | 155 |
27063 | 45017 | 060133150001190 | 06 | 013 | Contra Costa | 315000 | 1 | 1190 | 900 | 126 | ... | 0 | 0 | 12.500000 | 3.174603 | 15.079365 | 65.873016 | 19.841270 | 5.882353 | 3.625983e+05 | 32 | |
25628 | 43440 | 060133131021012 | 06 | 013 | Contra Costa | 313102 | 1 | 1012 | 4525 | 568 | ... | 0 | 2 | 100.000000 | 43.661972 | 3.345070 | 28.345070 | 31.338028 | 3.252033 | 3.251079e+05 | 238 | |
27062 | 45016 | 060133150001170 | 06 | 013 | Contra Costa | 315000 | 1 | 1170 | 932 | 115 | ... | 0 | 1 | 18.421053 | 1.739130 | 11.304348 | 77.391304 | 7.826087 | 7.317073 | 3.195801e+05 | 38 | |
27904 | 45921 | 060133340042011 | 06 | 013 | Contra Costa | 334004 | 2 | 2011 | 799 | 60 | ... | 0 | 0 | 38.461538 | 5.000000 | 21.666667 | 61.666667 | 11.666667 | 13.333333 | 1.944922e+05 | 26 | |
Marin | 46754 | 316228 | 060411302021012 | 06 | 041 | Marin | 130202 | 1 | 1012 | 797 | 129 | ... | 0 | 1 | 16.455696 | 0.775194 | 1.550388 | 95.348837 | 3.875969 | 3.658537 | 4.192076e+05 | 79 |
46762 | 316236 | 060411302021020 | 06 | 041 | Marin | 130202 | 1 | 1020 | 1411 | 101 | ... | 0 | 1 | 17.543860 | 0.000000 | 1.980198 | 92.079208 | 5.940594 | 6.557377 | 1.853925e+05 | 57 | |
46748 | 316222 | 060411302021006 | 06 | 041 | Marin | 130202 | 1 | 1006 | 1744 | 84 | ... | 0 | 1 | 28.846154 | 0.000000 | 2.380952 | 94.047619 | 1.190476 | 16.129032 | 1.247471e+05 | 52 | |
44846 | 314094 | 060411220001000 | 06 | 041 | Marin | 122000 | 1 | 1000 | 103966 | 4854 | ... | 0 | 0 | NaN | 40.667491 | 1.112485 | 39.658014 | 19.736300 | NaN | 1.209222e+05 | 0 | |
45030 | 314297 | 060411060013037 | 06 | 041 | Marin | 106001 | 3 | 3037 | 3050 | 131 | ... | 0 | 0 | 100.000000 | 4.580153 | 13.740458 | 61.068702 | 21.374046 | 4.225352 | 1.112421e+05 | 68 | |
Napa | 47380 | 355511 | 060552010032003 | 06 | 055 | Napa | 201003 | 2 | 2003 | 442 | 50 | ... | 0 | 0 | 76.923077 | 26.000000 | 30.000000 | 22.000000 | 34.000000 | 0.000000 | 2.929851e+05 | 13 |
47402 | 355533 | 060552010032026 | 06 | 055 | Napa | 201003 | 2 | 2026 | 3779 | 125 | ... | 0 | 0 | 97.222222 | 20.000000 | 13.600000 | 31.200000 | 54.400000 | 0.000000 | 8.567042e+04 | 36 | |
48190 | 356373 | 060552005051002 | 06 | 055 | Napa | 200505 | 1 | 1002 | 3916 | 86 | ... | 0 | 0 | 33.333333 | 0.000000 | 87.209302 | 3.488372 | 6.976744 | 0.000000 | 5.687921e+04 | 6 | |
48904 | 357165 | 060552003013027 | 06 | 055 | Napa | 200301 | 3 | 3027 | 3639 | 62 | ... | 0 | 0 | 61.538462 | 1.612903 | 0.000000 | 41.935484 | 87.096774 | 0.000000 | 4.412730e+04 | 13 | |
48388 | 356588 | 060552007043003 | 06 | 055 | Napa | 200704 | 3 | 3003 | 1819 | 30 | ... | 0 | 0 | 83.333333 | 10.000000 | 0.000000 | 43.333333 | 63.333333 | 0.000000 | 4.271558e+04 | 12 | |
San Francisco | 55230 | 592818 | 060750332031009 | 06 | 075 | San Francisco | 033203 | 1 | 1009 | 1502 | 325 | ... | 0 | 0 | 100.000000 | 5.538462 | 38.461538 | 44.923077 | 10.769231 | 11.764706 | 5.604168e+05 | 135 |
50114 | 587161 | 060750125011003 | 06 | 075 | San Francisco | 012501 | 1 | 1003 | 15488 | 1994 | ... | 0 | 8 | 100.000000 | 24.373119 | 12.186560 | 47.693079 | 16.048144 | 9.946237 | 3.334476e+05 | 335 | |
56373 | 594081 | 060750611002003 | 06 | 075 | San Francisco | 061100 | 2 | 2003 | 7217 | 765 | ... | 0 | 3 | 99.415205 | 0.000000 | 99.215686 | 0.261438 | 0.000000 | 7.317073 | 2.745380e+05 | 342 | |
49731 | 586709 | 060750105002008 | 06 | 075 | San Francisco | 010500 | 2 | 2008 | 4426 | 461 | ... | 0 | 24 | 99.716714 | 2.819957 | 24.078091 | 69.631236 | 5.422993 | 18.097448 | 2.697660e+05 | 353 | |
50073 | 587114 | 060750124012001 | 06 | 075 | San Francisco | 012401 | 2 | 2001 | 15312 | 1498 | ... | 0 | 3 | 99.853801 | 6.208278 | 15.153538 | 35.647530 | 58.945260 | 8.921438 | 2.533831e+05 | 684 | |
San Mateo | 60283 | 625772 | 060816102021046 | 06 | 081 | San Mateo | 610202 | 1 | 1046 | 8954 | 946 | ... | 0 | 0 | NaN | 19.344609 | 8.668076 | 36.892178 | 42.389006 | NaN | 2.736351e+05 | 0 |
60573 | 626075 | 060816103042020 | 06 | 081 | San Mateo | 610304 | 2 | 2020 | 2596 | 210 | ... | 0 | 0 | 22.123894 | 0.000000 | 29.047619 | 67.142857 | 8.571429 | 1.739130 | 2.095137e+05 | 113 | |
60370 | 625862 | 060816102032006 | 06 | 081 | San Mateo | 610203 | 2 | 2006 | 5177 | 262 | ... | 0 | 0 | 94.736842 | 2.290076 | 0.763359 | 35.114504 | 92.748092 | 6.557377 | 1.310753e+05 | 57 | |
63541 | 629335 | 060816016011005 | 06 | 081 | San Mateo | 601601 | 1 | 1005 | 8475 | 397 | ... | 0 | 0 | 100.000000 | 5.541562 | 42.065491 | 31.989924 | 24.937028 | 4.575163 | 1.213245e+05 | 146 | |
63840 | 629667 | 060816008003006 | 06 | 081 | San Mateo | 600800 | 3 | 3006 | 19408 | 872 | ... | 0 | 1 | 99.764706 | 10.550459 | 41.857798 | 32.224771 | 18.922018 | 8.405172 | 1.163680e+05 | 425 | |
Santa Clara | 68422 | 646761 | 060855080013002 | 06 | 085 | Santa Clara | 508001 | 3 | 3002 | 438 | 154 | ... | 0 | 1 | 99.270073 | 1.948052 | 10.389610 | 86.363636 | 1.298701 | 17.964072 | 9.106351e+05 | 137 |
72470 | 651175 | 060855113011005 | 06 | 085 | Santa Clara | 511301 | 1 | 1005 | 1653 | 254 | ... | 0 | 0 | 79.695431 | 0.000000 | 4.724409 | 94.881890 | 0.393701 | 14.718615 | 3.979776e+05 | 197 | |
73117 | 651856 | 060855002002002 | 06 | 085 | Santa Clara | 500200 | 2 | 2002 | 4325 | 534 | ... | 0 | 0 | NaN | 9.550562 | 7.677903 | 29.213483 | 63.670412 | NaN | 3.197812e+05 | 0 | |
75335 | 654276 | 060855031084002 | 06 | 085 | Santa Clara | 503108 | 4 | 4002 | 4625 | 523 | ... | 0 | 0 | 100.000000 | 18.164436 | 17.399618 | 46.462715 | 39.961759 | 0.578035 | 2.928786e+05 | 172 | |
73520 | 652286 | 060855009022000 | 06 | 085 | Santa Clara | 500902 | 2 | 2000 | 20082 | 2051 | ... | 0 | 0 | 83.333333 | 11.116529 | 22.866894 | 50.901999 | 17.113603 | 0.000000 | 2.645187e+05 | 6 | |
Solano | 90767 | 701757 | 060952524014012 | 06 | 095 | Solano | 252401 | 4 | 4012 | 3266 | 364 | ... | 0 | 1 | 99.431818 | 29.670330 | 22.252747 | 27.197802 | 20.054945 | 12.000000 | 2.886576e+05 | 176 |
90347 | 701314 | 060952523122011 | 06 | 095 | Solano | 252312 | 2 | 2011 | 7742 | 440 | ... | 0 | 1 | 100.000000 | 17.500000 | 14.090909 | 49.318182 | 13.636364 | 4.435484 | 1.471964e+05 | 237 | |
97423 | 708802 | 060952509001009 | 06 | 095 | Solano | 250900 | 1 | 1009 | 8139 | 293 | ... | 0 | 0 | 100.000000 | 62.798635 | 11.262799 | 12.969283 | 10.580205 | 7.894737 | 9.323830e+04 | 140 | |
90421 | 701392 | 060952523133040 | 06 | 095 | Solano | 252313 | 3 | 3040 | 1171 | 38 | ... | 0 | 0 | 78.125000 | 0.000000 | 2.631579 | 97.368421 | 0.000000 | 13.513514 | 8.404743e+04 | 32 | |
91166 | 702183 | 060952526081007 | 06 | 095 | Solano | 252608 | 1 | 1007 | 3484 | 97 | ... | 0 | 1 | 53.125000 | 47.422680 | 8.247423 | 10.309278 | 36.082474 | 15.789474 | 7.210931e+04 | 32 | |
Sonoma | 106920 | 719441 | 060971514023013 | 06 | 097 | Sonoma | 151402 | 3 | 3013 | 1506 | 98 | ... | 0 | 0 | 14.285714 | 5.102041 | 5.102041 | 40.816327 | 87.755102 | 4.545455 | 1.685384e+05 | 21 |
106731 | 719231 | 060971533001020 | 06 | 097 | Sonoma | 153300 | 1 | 1020 | 1292 | 40 | ... | 0 | 0 | 100.000000 | 0.000000 | 0.000000 | 85.000000 | 25.000000 | 5.555556 | 8.018539e+04 | 17 | |
104931 | 717285 | 060971516023063 | 06 | 097 | Sonoma | 151602 | 3 | 3063 | 383 | 11 | ... | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 100.000000 | 0.000000 | 0.000000 | 7.438608e+04 | 6 | |
104670 | 717002 | 060971514025003 | 06 | 097 | Sonoma | 151402 | 5 | 5003 | 7555 | 213 | ... | 0 | 1 | 100.000000 | 1.877934 | 5.164319 | 61.502347 | 30.046948 | 3.260870 | 7.302018e+04 | 89 | |
106928 | 719450 | 060971514024003 | 06 | 097 | Sonoma | 151402 | 4 | 4003 | 1038 | 29 | ... | 0 | 0 | 100.000000 | 0.000000 | 0.000000 | 24.137931 | 100.000000 | 0.000000 | 7.235997e+04 | 7 |
45 rows × 51 columns
Here we pass arguments to the function to set n and the column to select the top value from.
sf1.groupby('county_name').apply(top, n=1, column='arealand')
logrecno | blockfips | state | county | county_name | tract | blkgrp | block | arealand | P0010001 | ... | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | total_hh | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
county_name | ||||||||||||||||||||||
Alameda | 12544 | 13434 | 060014301021000 | 06 | 001 | Alameda | 430102 | 1 | 1000 | 31296322 | 166 | ... | 0 | 3 | 45.283019 | 1.204819 | 3.614458 | 84.337349 | 6.024096 | 7.017544 | 13.737653 | 53 |
Contra Costa | 37789 | 56579 | 060133551121158 | 06 | 013 | Contra Costa | 355112 | 1 | 1158 | 19441925 | 141 | ... | 0 | 3 | 23.809524 | 2.127660 | 0.709220 | 75.886525 | 24.822695 | 12.500000 | 18.783547 | 42 |
Marin | 42773 | 311828 | 060411322003008 | 06 | 041 | Marin | 132200 | 3 | 3008 | 48884156 | 112 | ... | 6 | 0 | 91.666667 | 0.000000 | 0.892857 | 64.285714 | 66.071429 | 36.842105 | 5.934002 | 36 |
Napa | 47066 | 355191 | 060552018001000 | 06 | 055 | Napa | 201800 | 1 | 1000 | 93228090 | 13 | ... | 1 | 2 | 83.333333 | 0.000000 | 0.000000 | 92.307692 | 0.000000 | 53.846154 | 0.361156 | 6 |
San Francisco | 56139 | 593833 | 060750604001013 | 06 | 075 | San Francisco | 060400 | 1 | 1013 | 1036262 | 3 | ... | 0 | 0 | NaN | 0.000000 | 33.333333 | 66.666667 | 0.000000 | NaN | 7.498069 | 0 |
San Mateo | 57660 | 622919 | 060816138001035 | 06 | 081 | San Mateo | 613800 | 1 | 1035 | 28976148 | 105 | ... | 0 | 4 | 55.172414 | 0.952381 | 0.000000 | 47.619048 | 60.952381 | 19.444444 | 9.385262 | 29 |
Santa Clara | 66451 | 644610 | 060855135001202 | 06 | 085 | Santa Clara | 513500 | 1 | 1202 | 277483160 | 62 | ... | 0 | 1 | 16.666667 | 3.225806 | 0.000000 | 82.258065 | 1.612903 | 14.285714 | 0.578699 | 24 |
Solano | 92499 | 703629 | 060952527026009 | 06 | 095 | Solano | 252702 | 6 | 6009 | 30318073 | 14 | ... | 0 | 0 | 33.333333 | 0.000000 | 35.714286 | 64.285714 | 0.000000 | 14.285714 | 1.195981 | 6 |
Sonoma | 99632 | 711527 | 060971542022000 | 06 | 097 | Sonoma | 154202 | 2 | 2000 | 35658559 | 106 | ... | 0 | 1 | 36.585366 | 0.000000 | 0.000000 | 89.622642 | 14.150943 | 28.070175 | 7.699098 | 41 |
9 rows × 51 columns
Let's read the geocoded rental listings for the Bay Area to begin. We will make sure the fips_block column is read as a string dtype so we can merge properly with the census data. It has leading zeros and is a string in the census data.
rentals = pd.read_csv('data/sfbay_geocoded.csv', usecols=['rent', 'bedrooms', 'sqft', 'fips_block', 'longitude', 'latitude'], dtype={'fips_block': str})
#rentals = pd.read_csv('sfbay_geocoded.csv')
rentals[:5]
rent | bedrooms | sqft | longitude | latitude | fips_block | |
---|---|---|---|---|---|---|
0 | 4500.0 | 2.0 | 1200.0 | -122.438300 | 37.745000 | 060750216002015 |
1 | 2650.0 | 2.0 | 1040.0 | -122.008131 | 37.353699 | 060855085053008 |
2 | 3100.0 | 2.0 | 1000.0 | -122.439743 | 37.731584 | 060750311005011 |
3 | 1850.0 | 1.0 | 792.0 | -122.234294 | 37.491715 | 060816101001026 |
4 | 1325.0 | 1.0 | 642.0 | -122.087751 | 37.923448 | 060133400021004 |
And merge it with the census data using the FIPS block codes, which are named differently in the two DataFrames.
rentals_sf1 = pd.merge(rentals, sf1, left_on='fips_block', right_on='blockfips')
rentals_sf1[:10]
rent | bedrooms | sqft | longitude | latitude | fips_block | logrecno | blockfips | state | county | ... | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | total_hh | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4500.0 | 2.0 | 1200.0 | -122.438300 | 37.745000 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
1 | 6250.0 | 3.0 | 1215.0 | -122.438300 | 37.745000 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
2 | 6250.0 | 3.0 | 1215.0 | -122.438300 | 37.745000 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
3 | 6650.0 | 3.0 | 2900.0 | -122.440088 | 37.745296 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
4 | 2600.0 | 1.0 | 615.0 | -122.440088 | 37.745296 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
5 | 2615.0 | 1.0 | 615.0 | -122.440088 | 37.745296 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
6 | 2615.0 | 1.0 | 615.0 | -122.440088 | 37.745296 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
7 | 2600.0 | 1.0 | 615.0 | -122.440088 | 37.745296 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
8 | 3200.0 | NaN | 900.0 | -122.438300 | 37.745000 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
9 | 2600.0 | 1.0 | 615.0 | -122.440088 | 37.745296 | 060750216002015 | 589261 | 060750216002015 | 06 | 075 | ... | 0 | 5 | 88.888889 | 23.445826 | 10.301954 | 46.358792 | 23.268206 | 5.882353 | 24936.524053 | 288 |
10 rows × 57 columns
Now we can begin looking at this merged dataset. Let's start by computing mean rents by county.
county_rents = rentals_sf1.groupby(rentals_sf1['county_name'])[['rent']].mean().sort_values(by='rent', ascending=False)
county_rents
rent | |
---|---|
county_name | |
San Francisco | 3746.737974 |
Marin | 3277.287562 |
San Mateo | 2857.011248 |
Santa Clara | 2665.584276 |
Alameda | 2235.584293 |
Napa | 2117.797398 |
Contra Costa | 1955.075348 |
Sonoma | 1805.140760 |
Solano | 1359.965551 |
%matplotlib inline
county_rents.plot(kind='bar', color='g')
<matplotlib.axes._subplots.AxesSubplot at 0x1193d7470>
Those patterns generally conform to our expectations, with San Francisco highest and Solano lowest.
Let's examine these patterns by county and number of bedrooms. We can just use two columns in the groupby to do this.
rentals_sf1.groupby(['county_name', 'bedrooms'])[['rent']].mean()
rent | ||
---|---|---|
county_name | bedrooms | |
Alameda | 1.0 | 1862.396037 |
2.0 | 2284.538409 | |
3.0 | 2715.874074 | |
4.0 | 3315.346895 | |
5.0 | 4820.488152 | |
6.0 | 5898.392857 | |
7.0 | 7362.500000 | |
8.0 | 8400.000000 | |
Contra Costa | 1.0 | 1596.991079 |
2.0 | 1874.641106 | |
3.0 | 2390.508197 | |
4.0 | 2831.713831 | |
5.0 | 3547.841808 | |
6.0 | 3258.636364 | |
7.0 | 7172.500000 | |
8.0 | 3200.000000 | |
Marin | 1.0 | 2209.910211 |
2.0 | 2998.448575 | |
3.0 | 4476.835979 | |
4.0 | 5837.602273 | |
5.0 | 5145.666667 | |
6.0 | 5250.000000 | |
Napa | 1.0 | 1387.362745 |
2.0 | 1846.914110 | |
3.0 | 2447.333333 | |
4.0 | 2998.593750 | |
5.0 | 3009.000000 | |
San Francisco | 1.0 | 3099.179065 |
2.0 | 4225.756604 | |
3.0 | 5151.170096 | |
... | ... | |
5.0 | 7525.666667 | |
6.0 | 7750.000000 | |
7.0 | 9950.000000 | |
8.0 | 9900.000000 | |
San Mateo | 1.0 | 2266.792148 |
2.0 | 2975.375058 | |
3.0 | 3888.506952 | |
4.0 | 5292.336683 | |
5.0 | 5633.618182 | |
6.0 | 7260.000000 | |
Santa Clara | 1.0 | 2143.095373 |
2.0 | 2721.721386 | |
3.0 | 3451.772947 | |
4.0 | 4280.278307 | |
5.0 | 5642.350211 | |
6.0 | 6704.545455 | |
7.0 | 6028.571429 | |
8.0 | 2550.000000 | |
Solano | 1.0 | 1143.470514 |
2.0 | 1277.712283 | |
3.0 | 1640.565678 | |
4.0 | 2043.212766 | |
5.0 | 2491.250000 | |
7.0 | 3000.000000 | |
Sonoma | 1.0 | 1403.868078 |
2.0 | 1830.113469 | |
3.0 | 2284.307937 | |
4.0 | 2738.465347 | |
5.0 | 3037.500000 | |
6.0 | 3295.000000 |
61 rows × 1 columns
That output is a bit too long, since there are rows for many bedroom counts (and few units have > 6 bedrooms!). Let's trim this by looking only at units with less than 4 bedrooms.
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean()
rent | ||
---|---|---|
county_name | bedrooms | |
Alameda | 1.0 | 1862.396037 |
2.0 | 2284.538409 | |
3.0 | 2715.874074 | |
Contra Costa | 1.0 | 1596.991079 |
2.0 | 1874.641106 | |
3.0 | 2390.508197 | |
Marin | 1.0 | 2209.910211 |
2.0 | 2998.448575 | |
3.0 | 4476.835979 | |
Napa | 1.0 | 1387.362745 |
2.0 | 1846.914110 | |
3.0 | 2447.333333 | |
San Francisco | 1.0 | 3099.179065 |
2.0 | 4225.756604 | |
3.0 | 5151.170096 | |
San Mateo | 1.0 | 2266.792148 |
2.0 | 2975.375058 | |
3.0 | 3888.506952 | |
Santa Clara | 1.0 | 2143.095373 |
2.0 | 2721.721386 | |
3.0 | 3451.772947 | |
Solano | 1.0 | 1143.470514 |
2.0 | 1277.712283 | |
3.0 | 1640.565678 | |
Sonoma | 1.0 | 1403.868078 |
2.0 | 1830.113469 | |
3.0 | 2284.307937 |
Let's plot a bar chart of these rents by county and bedroom.
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1194bba90>
That bar chart is not ideal. Too small, and it would be nicer to make it separate colors for each number of bedrooms. Notice that we have a hierarchical index (county, then bedrooms within county). We can use unstack on this to make it easier to work with for the bar charts, or even to look at as a table:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean().unstack()
rent | |||
---|---|---|---|
bedrooms | 1.0 | 2.0 | 3.0 |
county_name | |||
Alameda | 1862.396037 | 2284.538409 | 2715.874074 |
Contra Costa | 1596.991079 | 1874.641106 | 2390.508197 |
Marin | 2209.910211 | 2998.448575 | 4476.835979 |
Napa | 1387.362745 | 1846.914110 | 2447.333333 |
San Francisco | 3099.179065 | 4225.756604 | 5151.170096 |
San Mateo | 2266.792148 | 2975.375058 | 3888.506952 |
Santa Clara | 2143.095373 | 2721.721386 | 3451.772947 |
Solano | 1143.470514 | 1277.712283 | 1640.565678 |
Sonoma | 1403.868078 | 1830.113469 | 2284.307937 |
Now we can plot a bar chart with the unstacked data, and add a title and set the figure size.
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean().unstack().plot(kind='bar', figsize=(14,6), title='Average Rents by County and Bedrooms')
<matplotlib.axes._subplots.AxesSubplot at 0x119443198>
Pandas also provides a syntax for doing this that is familiar for spreadsheet users: the pivot_table. You can see below that using pivot_table achieves exactly the same result.
pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms']<4], values='rent', index=['county_name'], columns=['bedrooms'])
bedrooms | 1.0 | 2.0 | 3.0 |
---|---|---|---|
county_name | |||
Alameda | 1862.396037 | 2284.538409 | 2715.874074 |
Contra Costa | 1596.991079 | 1874.641106 | 2390.508197 |
Marin | 2209.910211 | 2998.448575 | 4476.835979 |
Napa | 1387.362745 | 1846.914110 | 2447.333333 |
San Francisco | 3099.179065 | 4225.756604 | 5151.170096 |
San Mateo | 2266.792148 | 2975.375058 | 3888.506952 |
Santa Clara | 2143.095373 | 2721.721386 | 3451.772947 |
Solano | 1143.470514 | 1277.712283 | 1640.565678 |
Sonoma | 1403.868078 | 1830.113469 | 2284.307937 |
pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms']<4], values='rent', index=['county_name'], columns=['bedrooms']).plot(kind='bar', figsize=(14,6), title='Average Rents by County and Bedrooms')
<matplotlib.axes._subplots.AxesSubplot at 0x115825e80>
In addition to providing a convenience interface to groupby, pivot_table also can add partial totals, or margins.
pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms']<4], values='rent', index=['county_name'], columns=['bedrooms'], margins=True)
bedrooms | 1.0 | 2.0 | 3.0 | All |
---|---|---|---|---|
county_name | ||||
Alameda | 1862.396037 | 2284.538409 | 2715.874074 | 2161.681932 |
Contra Costa | 1596.991079 | 1874.641106 | 2390.508197 | 1856.947814 |
Marin | 2209.910211 | 2998.448575 | 4476.835979 | 3061.734740 |
Napa | 1387.362745 | 1846.914110 | 2447.333333 | 1999.539130 |
San Francisco | 3099.179065 | 4225.756604 | 5151.170096 | 3861.692972 |
San Mateo | 2266.792148 | 2975.375058 | 3888.506952 | 2801.402538 |
Santa Clara | 2143.095373 | 2721.721386 | 3451.772947 | 2590.512772 |
Solano | 1143.470514 | 1277.712283 | 1640.565678 | 1300.941814 |
Sonoma | 1403.868078 | 1830.113469 | 2284.307937 | 1775.032962 |
All | 2083.173271 | 2547.455154 | 3271.394562 | 2472.355483 |
Although mean is the default type of aggregation in pivot_table, you can use other methods like len, to get a frequency crosstab with the count of observations in each cell of the table, as well as the margins. Let's remove the filter on number of bedrooms to see the full frequency distribution.
pd.pivot_table(rentals_sf1, values='rent', index=['county_name'], columns=['bedrooms'], aggfunc=len, margins=True)
bedrooms | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | All |
---|---|---|---|---|---|---|---|---|---|
county_name | |||||||||
Alameda | 4290.0 | 4361.0 | 1350.0 | 467.0 | 211.0 | 28.0 | 4.0 | 1.0 | 10712.0 |
Contra Costa | 2242.0 | 3508.0 | 976.0 | 629.0 | 177.0 | 33.0 | 4.0 | 1.0 | 7570.0 |
Marin | 568.0 | 807.0 | 378.0 | 176.0 | 15.0 | 2.0 | NaN | NaN | 1946.0 |
Napa | 102.0 | 163.0 | 195.0 | 64.0 | 5.0 | NaN | NaN | NaN | 529.0 |
San Francisco | 2245.0 | 2120.0 | 729.0 | 207.0 | 15.0 | 4.0 | 1.0 | 1.0 | 5322.0 |
San Mateo | 2598.0 | 2141.0 | 935.0 | 199.0 | 55.0 | 5.0 | NaN | NaN | 5933.0 |
Santa Clara | 7413.0 | 8973.0 | 2484.0 | 945.0 | 237.0 | 11.0 | 7.0 | 1.0 | 20071.0 |
Solano | 797.0 | 1498.0 | 472.0 | 188.0 | 32.0 | NaN | 10.0 | NaN | 2997.0 |
Sonoma | 614.0 | 1225.0 | 315.0 | 101.0 | 4.0 | 1.0 | NaN | NaN | 2260.0 |
All | 20869.0 | 24796.0 | 7834.0 | 2976.0 | 751.0 | 84.0 | 26.0 | 4.0 | 57340.0 |
A simpler way to generate a crosstabulation is also built in to pandas: the crosstab method.
pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True)
bedrooms | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | All |
---|---|---|---|---|---|---|---|---|---|
county_name | |||||||||
Alameda | 4290 | 4361 | 1350 | 467 | 211 | 28 | 4 | 1 | 10712 |
Contra Costa | 2242 | 3508 | 976 | 629 | 177 | 33 | 4 | 1 | 7570 |
Marin | 568 | 807 | 378 | 176 | 15 | 2 | 0 | 0 | 1946 |
Napa | 102 | 163 | 195 | 64 | 5 | 0 | 0 | 0 | 529 |
San Francisco | 2245 | 2120 | 729 | 207 | 15 | 4 | 1 | 1 | 5322 |
San Mateo | 2598 | 2141 | 935 | 199 | 55 | 5 | 0 | 0 | 5933 |
Santa Clara | 7413 | 8973 | 2484 | 945 | 237 | 11 | 7 | 1 | 20071 |
Solano | 797 | 1498 | 472 | 188 | 32 | 0 | 10 | 0 | 2997 |
Sonoma | 614 | 1225 | 315 | 101 | 4 | 1 | 0 | 0 | 2260 |
All | 20869 | 24796 | 7834 | 2976 | 751 | 84 | 26 | 4 | 57340 |
Besides the simpler syntax, the other advantage of crosstab for doing this is that you can normalize the crosstab, dividing all the cell values by the grand total. This tells us the fraction of the region's total listings that are in each combination of county and number of bedrooms.
pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize=True)
bedrooms | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | All |
---|---|---|---|---|---|---|---|---|---|
county_name | |||||||||
Alameda | 0.074817 | 0.076055 | 0.023544 | 0.008144 | 0.003680 | 0.000488 | 0.000070 | 0.000017 | 0.186815 |
Contra Costa | 0.039100 | 0.061179 | 0.017021 | 0.010970 | 0.003087 | 0.000576 | 0.000070 | 0.000017 | 0.132020 |
Marin | 0.009906 | 0.014074 | 0.006592 | 0.003069 | 0.000262 | 0.000035 | 0.000000 | 0.000000 | 0.033938 |
Napa | 0.001779 | 0.002843 | 0.003401 | 0.001116 | 0.000087 | 0.000000 | 0.000000 | 0.000000 | 0.009226 |
San Francisco | 0.039152 | 0.036972 | 0.012714 | 0.003610 | 0.000262 | 0.000070 | 0.000017 | 0.000017 | 0.092815 |
San Mateo | 0.045309 | 0.037339 | 0.016306 | 0.003471 | 0.000959 | 0.000087 | 0.000000 | 0.000000 | 0.103471 |
Santa Clara | 0.129281 | 0.156488 | 0.043321 | 0.016481 | 0.004133 | 0.000192 | 0.000122 | 0.000017 | 0.350035 |
Solano | 0.013900 | 0.026125 | 0.008232 | 0.003279 | 0.000558 | 0.000000 | 0.000174 | 0.000000 | 0.052267 |
Sonoma | 0.010708 | 0.021364 | 0.005494 | 0.001761 | 0.000070 | 0.000017 | 0.000000 | 0.000000 | 0.039414 |
All | 0.363952 | 0.432438 | 0.136624 | 0.051901 | 0.013097 | 0.001465 | 0.000453 | 0.000070 | 1.000000 |
We have the option to normalize on the index, which is the county_name in this case, so we see the fraction of the county listings that are of each number of bedrooms. The rows sum to 1.
pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize='index')
bedrooms | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 |
---|---|---|---|---|---|---|---|---|
county_name | ||||||||
Alameda | 0.400485 | 0.407114 | 0.126027 | 0.043596 | 0.019698 | 0.002614 | 0.000373 | 0.000093 |
Contra Costa | 0.296169 | 0.463408 | 0.128930 | 0.083091 | 0.023382 | 0.004359 | 0.000528 | 0.000132 |
Marin | 0.291881 | 0.414697 | 0.194245 | 0.090442 | 0.007708 | 0.001028 | 0.000000 | 0.000000 |
Napa | 0.192817 | 0.308129 | 0.368620 | 0.120983 | 0.009452 | 0.000000 | 0.000000 | 0.000000 |
San Francisco | 0.421834 | 0.398346 | 0.136979 | 0.038895 | 0.002818 | 0.000752 | 0.000188 | 0.000188 |
San Mateo | 0.437890 | 0.360863 | 0.157593 | 0.033541 | 0.009270 | 0.000843 | 0.000000 | 0.000000 |
Santa Clara | 0.369339 | 0.447063 | 0.123761 | 0.047083 | 0.011808 | 0.000548 | 0.000349 | 0.000050 |
Solano | 0.265933 | 0.499833 | 0.157491 | 0.062729 | 0.010677 | 0.000000 | 0.003337 | 0.000000 |
Sonoma | 0.271681 | 0.542035 | 0.139381 | 0.044690 | 0.001770 | 0.000442 | 0.000000 | 0.000000 |
All | 0.363952 | 0.432438 | 0.136624 | 0.051901 | 0.013097 | 0.001465 | 0.000453 | 0.000070 |
The counterpart is to normalize by column, in which case we see the fraction of 1 bedroom listings in the region that are in each county, and so on for different bedroom counts.
pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize='columns')
bedrooms | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | All |
---|---|---|---|---|---|---|---|---|---|
county_name | |||||||||
Alameda | 0.205568 | 0.175875 | 0.172326 | 0.156922 | 0.280959 | 0.333333 | 0.153846 | 0.25 | 0.186815 |
Contra Costa | 0.107432 | 0.141474 | 0.124585 | 0.211358 | 0.235686 | 0.392857 | 0.153846 | 0.25 | 0.132020 |
Marin | 0.027217 | 0.032546 | 0.048251 | 0.059140 | 0.019973 | 0.023810 | 0.000000 | 0.00 | 0.033938 |
Napa | 0.004888 | 0.006574 | 0.024891 | 0.021505 | 0.006658 | 0.000000 | 0.000000 | 0.00 | 0.009226 |
San Francisco | 0.107576 | 0.085498 | 0.093056 | 0.069556 | 0.019973 | 0.047619 | 0.038462 | 0.25 | 0.092815 |
San Mateo | 0.124491 | 0.086345 | 0.119352 | 0.066868 | 0.073236 | 0.059524 | 0.000000 | 0.00 | 0.103471 |
Santa Clara | 0.355216 | 0.361873 | 0.317079 | 0.317540 | 0.315579 | 0.130952 | 0.269231 | 0.25 | 0.350035 |
Solano | 0.038191 | 0.060413 | 0.060250 | 0.063172 | 0.042610 | 0.000000 | 0.384615 | 0.00 | 0.052267 |
Sonoma | 0.029422 | 0.049403 | 0.040209 | 0.033938 | 0.005326 | 0.011905 | 0.000000 | 0.00 | 0.039414 |
What if we want to look at more statistics than just mean? We can combine several aggregation methods and compute them at one time.
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].agg(['mean', 'std', 'min', 'max'])
rent | |||||
---|---|---|---|---|---|
mean | std | min | max | ||
county_name | bedrooms | ||||
Alameda | 1.0 | 1862.396037 | 393.300923 | 500.0 | 4950.0 |
2.0 | 2284.538409 | 534.999766 | 800.0 | 4998.0 | |
3.0 | 2715.874074 | 806.606556 | 350.0 | 7500.0 | |
Contra Costa | 1.0 | 1596.991079 | 375.069796 | 496.0 | 3125.0 |
2.0 | 1874.641106 | 505.719679 | 689.0 | 3895.0 | |
3.0 | 2390.508197 | 775.568225 | 793.0 | 5995.0 | |
Marin | 1.0 | 2209.910211 | 618.305211 | 430.0 | 5600.0 |
2.0 | 2998.448575 | 950.448793 | 1325.0 | 9500.0 | |
3.0 | 4476.835979 | 1423.768448 | 1657.0 | 9000.0 | |
Napa | 1.0 | 1387.362745 | 224.556583 | 750.0 | 1800.0 |
2.0 | 1846.914110 | 500.127260 | 850.0 | 4050.0 | |
3.0 | 2447.333333 | 997.897308 | 1250.0 | 9775.0 | |
San Francisco | 1.0 | 3099.179065 | 720.636564 | 725.0 | 7500.0 |
2.0 | 4225.756604 | 1284.521670 | 750.0 | 9800.0 | |
3.0 | 5151.170096 | 1738.374226 | 1600.0 | 10250.0 | |
San Mateo | 1.0 | 2266.792148 | 411.746679 | 880.0 | 5000.0 |
2.0 | 2975.375058 | 612.783070 | 832.0 | 6500.0 | |
3.0 | 3888.506952 | 1192.532121 | 1800.0 | 10000.0 | |
Santa Clara | 1.0 | 2143.095373 | 451.835127 | 550.0 | 5202.0 |
2.0 | 2721.721386 | 664.622785 | 900.0 | 9065.0 | |
3.0 | 3451.772947 | 1013.365050 | 650.0 | 9999.0 | |
Solano | 1.0 | 1143.470514 | 214.261987 | 590.0 | 1849.0 |
2.0 | 1277.712283 | 238.327602 | 799.0 | 2300.0 | |
3.0 | 1640.565678 | 408.047945 | 950.0 | 4500.0 | |
Sonoma | 1.0 | 1403.868078 | 280.041020 | 550.0 | 3200.0 |
2.0 | 1830.113469 | 363.751789 | 995.0 | 4500.0 | |
3.0 | 2284.307937 | 545.081841 | 1200.0 | 5000.0 |
Remember the 'top' function we defined earlier? We can use it here to get the top rental listing in each county, for each bedroom size.
rentals_sf1.groupby(['county_name', 'bedrooms']).apply(top, n=1, column='rent')
rent | bedrooms | sqft | longitude | latitude | fips_block | logrecno | blockfips | state | county | ... | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | total_hh | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
county_name | bedrooms | ||||||||||||||||||||||
Alameda | 1.0 | 29483 | 4950.0 | 1.0 | 800.0 | -122.280400 | 37.836500 | 060014251041007 | 18519 | 060014251041007 | 06 | 001 | ... | 0 | 0 | 66.666667 | 57.142857 | 14.285714 | 28.571429 | 0.000000 | 0.000000 | 1965.089530 | 6 |
2.0 | 36889 | 4998.0 | 2.0 | 1400.0 | -122.264292 | 37.879443 | 060014216004005 | 2094 | 060014216004005 | 06 | 001 | ... | 0 | 1 | 62.264151 | 0.000000 | 15.841584 | 76.237624 | 8.910891 | 14.516129 | 13043.569584 | 53 | |
3.0 | 58544 | 7500.0 | 3.0 | 2500.0 | -122.226934 | 37.830734 | 060014261001000 | 25365 | 060014261001000 | 06 | 001 | ... | 0 | 0 | 10.000000 | 1.298701 | 23.376623 | 74.025974 | 1.298701 | 6.250000 | 2159.258077 | 30 | |
4.0 | 43448 | 8200.0 | 4.0 | 3247.0 | -122.229400 | 37.828817 | 060014261001006 | 25371 | 060014261001006 | 06 | 001 | ... | 0 | 1 | 6.000000 | 1.351351 | 29.054054 | 58.783784 | 6.081081 | 5.660377 | 10261.222401 | 50 | |
5.0 | 22134 | 7800.0 | 5.0 | 1856.0 | -122.266723 | 37.863312 | 060014235001001 | 3192 | 060014235001001 | 06 | 001 | ... | 0 | 0 | NaN | 0.000000 | 0.000000 | 100.000000 | 0.000000 | NaN | 907.176182 | 0 | |
6.0 | 45415 | 10200.0 | 6.0 | 2600.0 | -122.255180 | 37.860126 | 060014236011001 | 3257 | 060014236011001 | 06 | 001 | ... | 0 | 1 | 68.493151 | 4.787234 | 12.765957 | 72.340426 | 5.851064 | 3.947368 | 21795.780842 | 73 | |
7.0 | 44877 | 9200.0 | 7.0 | 2600.0 | -122.250000 | 37.857100 | 060014238003015 | 3446 | 060014238003015 | 06 | 001 | ... | 0 | 0 | 10.000000 | 0.000000 | 5.882353 | 94.117647 | 0.000000 | 0.000000 | 10746.838174 | 30 | |
8.0 | 59696 | 8400.0 | 8.0 | 3000.0 | -122.266181 | 37.853644 | 060014239011007 | 3469 | 060014239011007 | 06 | 001 | ... | 0 | 2 | 58.536585 | 7.228916 | 7.228916 | 79.518072 | 6.024096 | 6.818182 | 14735.006100 | 41 | |
Contra Costa | 1.0 | 27811 | 3125.0 | 1.0 | 539.0 | -122.073700 | 37.954000 | 060133230004016 | 49621 | 060133230004016 | 06 | 013 | ... | 0 | 1 | 0.000000 | 0.000000 | 20.689655 | 79.310345 | 6.896552 | 10.000000 | 6952.022584 | 9 |
2.0 | 21614 | 3895.0 | 2.0 | 1050.0 | -122.128766 | 37.856406 | 060133522011000 | 48681 | 060133522011000 | 06 | 013 | ... | 0 | 26 | 59.819121 | 2.034525 | 21.331689 | 66.399507 | 9.556104 | 5.953827 | 3136.152191 | 774 | |
3.0 | 48922 | 5995.0 | 3.0 | 2800.0 | -122.103853 | 37.873058 | 060133512003007 | 47756 | 060133512003007 | 06 | 013 | ... | 0 | 5 | 6.250000 | 0.549451 | 7.142857 | 87.637363 | 5.219780 | 3.030303 | 1547.798629 | 256 | |
4.0 | 47943 | 8900.0 | 4.0 | 2014.0 | -122.294780 | 37.912424 | 060133901003006 | 57494 | 060133901003006 | 06 | 013 | ... | 0 | 3 | 27.500000 | 4.938272 | 13.580247 | 71.604938 | 6.172840 | 6.976744 | 4824.288921 | 40 | |
5.0 | 55529 | 8500.0 | 5.0 | 5000.0 | -121.933726 | 37.778898 | 060133451122037 | 50456 | 060133451122037 | 06 | 013 | ... | 0 | 0 | 7.407407 | 2.200000 | 45.200000 | 48.400000 | 4.200000 | 1.459854 | 2651.698323 | 135 | |
6.0 | 42145 | 6400.0 | 6.0 | 4667.0 | -122.024503 | 37.869687 | 060133461011024 | 44514 | 060133461011024 | 06 | 013 | ... | 0 | 0 | 9.677419 | 0.000000 | 12.244898 | 84.693878 | 1.020408 | 0.000000 | 2843.398651 | 31 | |
7.0 | 21197 | 8630.0 | 7.0 | 5000.0 | -122.298500 | 37.915600 | 060133901003002 | 57490 | 060133901003002 | 06 | 013 | ... | 0 | 1 | 16.666667 | 13.461538 | 15.384615 | 69.230769 | 9.615385 | 4.000000 | 3687.825192 | 24 | |
8.0 | 45048 | 3200.0 | 8.0 | 4400.0 | -121.862755 | 37.991022 | 060133131031020 | 43501 | 060133131031020 | 06 | 013 | ... | 0 | 4 | 15.476190 | 11.870504 | 16.007194 | 48.561151 | 32.553957 | 3.448276 | 5212.752588 | 168 | |
Marin | 1.0 | 40634 | 5600.0 | 1.0 | 850.0 | -122.580085 | 37.861099 | 060411311002063 | 311311 | 060411311002063 | 06 | 041 | ... | 0 | 6 | 28.000000 | 2.688172 | 2.688172 | 91.935484 | 2.150538 | 15.966387 | 937.628859 | 100 |
2.0 | 46206 | 9500.0 | 2.0 | 2000.0 | -122.453098 | 37.873702 | 060411242002006 | 316065 | 060411242002006 | 06 | 041 | ... | 0 | 1 | 23.255814 | 0.000000 | 3.267974 | 94.117647 | 1.307190 | 16.504854 | 4418.986150 | 86 | |
3.0 | 58235 | 9000.0 | 3.0 | 3000.0 | -122.533016 | 37.935281 | 060411200001008 | 313544 | 060411200001008 | 06 | 041 | ... | 0 | 0 | 11.111111 | 0.000000 | 0.000000 | 100.000000 | 19.047619 | 0.000000 | 4554.492380 | 9 | |
4.0 | 55933 | 10000.0 | 4.0 | 3000.0 | -122.491484 | 37.897525 | 060411241004016 | 316051 | 060411241004016 | 06 | 041 | ... | 0 | 0 | 15.384615 | 9.375000 | 6.250000 | 84.375000 | 0.000000 | 7.142857 | 3744.279015 | 13 | |
5.0 | 36041 | 8000.0 | 5.0 | 2965.0 | -122.547233 | 37.944722 | 060411200004018 | 313639 | 060411200004018 | 06 | 041 | ... | 0 | 0 | 84.269663 | 0.684932 | 8.904110 | 85.616438 | 3.424658 | 6.315789 | 5596.741578 | 89 | |
6.0 | 54010 | 5500.0 | 6.0 | 4500.0 | -122.612793 | 38.101920 | 060411031003005 | 312560 | 060411031003005 | 06 | 041 | ... | 0 | 0 | 4.545455 | 1.621622 | 2.162162 | 90.810811 | 7.027027 | 0.000000 | 3398.886162 | 66 | |
Napa | 1.0 | 60137 | 1800.0 | 1.0 | 600.0 | -122.290361 | 38.299445 | 060552002021007 | 355859 | 060552002021007 | 06 | 055 | ... | 0 | 0 | 66.666667 | 0.000000 | 14.285714 | 85.714286 | 7.142857 | 0.000000 | 3619.829490 | 9 |
2.0 | 11868 | 4050.0 | 2.0 | 1200.0 | -122.473290 | 38.499868 | 060552016023023 | 357788 | 060552016023023 | 06 | 055 | ... | 0 | 2 | 26.086957 | 0.000000 | 0.000000 | 94.230769 | 15.384615 | 32.352941 | 3790.902018 | 23 | |
3.0 | 56172 | 9775.0 | 3.0 | 4000.0 | -122.445300 | 38.649300 | 060552018001087 | 355278 | 060552018001087 | 06 | 055 | ... | 0 | 0 | 45.000000 | 0.000000 | 0.000000 | 67.346939 | 36.734694 | 13.043478 | 7.209965 | 20 | |
4.0 | 59306 | 7500.0 | 4.0 | 3500.0 | -122.143250 | 38.342195 | 060552014031023 | 357617 | 060552014031023 | 06 | 055 | ... | 0 | 1 | 10.344828 | 0.000000 | 0.000000 | 100.000000 | 4.918033 | 19.444444 | 7.667160 | 29 | |
5.0 | 60288 | 3950.0 | 5.0 | 4000.0 | -122.242298 | 38.422125 | 060552014012000 | 357550 | 060552014012000 | 06 | 055 | ... | 0 | 5 | 9.600000 | 2.768166 | 2.422145 | 90.311419 | 7.266436 | 10.714286 | 27.513183 | 125 | |
San Francisco | 1.0 | 53478 | 7500.0 | 1.0 | 1526.0 | -122.395813 | 37.791153 | 060750615001009 | 594215 | 060750615001009 | 06 | 075 | ... | 0 | 0 | 100.000000 | 27.272727 | 18.181818 | 54.545455 | 0.000000 | 0.000000 | 1371.749627 | 2 |
2.0 | 16350 | 9800.0 | 2.0 | 1900.0 | -122.409100 | 37.800200 | 060750106003004 | 586748 | 060750106003004 | 06 | 075 | ... | 0 | 0 | 91.044776 | 1.265823 | 46.202532 | 47.468354 | 3.164557 | 12.987013 | 51364.140078 | 67 | |
3.0 | 48749 | 10250.0 | 3.0 | 3598.0 | -122.448787 | 37.759003 | 060750305001000 | 591563 | 060750305001000 | 06 | 075 | ... | 0 | 2 | 7.894737 | 0.000000 | 15.853659 | 78.048780 | 6.097561 | 9.523810 | 9056.674456 | 38 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
5.0 | 37111 | 10000.0 | 5.0 | 2700.0 | -122.449183 | 37.790572 | 060750133001006 | 587484 | 060750133001006 | 06 | 075 | ... | 0 | 1 | 14.285714 | 0.000000 | 11.475410 | 88.524590 | 1.639344 | 3.448276 | 10888.302412 | 28 | |
6.0 | 56555 | 9850.0 | 6.0 | 1654.0 | -122.417557 | 37.798429 | 060750109001004 | 586820 | 060750109001004 | 06 | 075 | ... | 0 | 4 | 63.815789 | 1.229508 | 12.295082 | 83.196721 | 10.245902 | 12.643678 | 40964.352888 | 152 | |
7.0 | 56553 | 9950.0 | 7.0 | 1894.0 | -122.417727 | 37.798408 | 060750109001004 | 586820 | 060750109001004 | 06 | 075 | ... | 0 | 4 | 63.815789 | 1.229508 | 12.295082 | 83.196721 | 10.245902 | 12.643678 | 40964.352888 | 152 | |
8.0 | 49849 | 9900.0 | 8.0 | 2500.0 | -122.453516 | 37.764395 | 060750301021000 | 591271 | 060750301021000 | 06 | 075 | ... | 0 | 5 | 44.140625 | 2.218115 | 13.123845 | 77.634011 | 6.284658 | 5.882353 | 6726.337329 | 256 | |
San Mateo | 1.0 | 50707 | 5000.0 | 1.0 | 1050.0 | -122.220976 | 37.455238 | 060816113001009 | 628573 | 060816113001009 | 06 | 081 | ... | 0 | 0 | 19.047619 | 0.000000 | 0.000000 | 97.826087 | 6.521739 | 4.545455 | 6614.448590 | 21 |
2.0 | 45175 | 6500.0 | 2.0 | 1660.0 | -122.192476 | 37.425639 | 060816130002002 | 625012 | 060816130002002 | 06 | 081 | ... | 0 | 0 | 0.000000 | 0.000000 | 7.407407 | 82.716049 | 13.580247 | 0.000000 | 1953.070130 | 28 | |
3.0 | 34850 | 10000.0 | 3.0 | 2800.0 | -122.186400 | 37.439600 | 060816128001007 | 624939 | 060816128001007 | 06 | 081 | ... | 0 | 4 | 15.909091 | 0.000000 | 16.000000 | 74.400000 | 12.000000 | 8.333333 | 4020.821431 | 44 | |
4.0 | 34871 | 10000.0 | 4.0 | 3200.0 | -122.186400 | 37.439600 | 060816128001007 | 624939 | 060816128001007 | 06 | 081 | ... | 0 | 4 | 15.909091 | 0.000000 | 16.000000 | 74.400000 | 12.000000 | 8.333333 | 4020.821431 | 44 | |
5.0 | 34855 | 10000.0 | 5.0 | 3000.0 | -122.186400 | 37.439600 | 060816128001007 | 624939 | 060816128001007 | 06 | 081 | ... | 0 | 4 | 15.909091 | 0.000000 | 16.000000 | 74.400000 | 12.000000 | 8.333333 | 4020.821431 | 44 | |
6.0 | 54429 | 9500.0 | 6.0 | 3195.0 | -122.192152 | 37.466122 | 060816115002009 | 623360 | 060816115002009 | 06 | 081 | ... | 0 | 0 | 0.000000 | 0.000000 | 18.867925 | 75.471698 | 0.000000 | 6.250000 | 1882.413593 | 15 | |
Santa Clara | 1.0 | 8833 | 5202.0 | 1.0 | 792.0 | -122.016435 | 37.332699 | 060855081013057 | 646985 | 060855081013057 | 06 | 085 | ... | 0 | 1 | 8.333333 | 0.000000 | 50.000000 | 38.095238 | 2.380952 | 20.000000 | 1655.750495 | 12 |
2.0 | 1228 | 9065.0 | 2.0 | 2750.0 | -121.935600 | 37.316300 | 060855021011001 | 647181 | 060855021011001 | 06 | 085 | ... | 0 | 1 | 88.888889 | 2.531646 | 12.658228 | 59.493671 | 45.569620 | 2.173913 | 5552.183111 | 45 | |
3.0 | 59425 | 9999.0 | 3.0 | 2267.0 | -122.137241 | 37.434522 | 060855114004009 | 651397 | 060855114004009 | 06 | 085 | ... | 0 | 0 | 18.181818 | 0.000000 | 8.333333 | 91.666667 | 0.000000 | 0.000000 | 7054.785155 | 11 | |
4.0 | 50057 | 10000.0 | 4.0 | 2700.0 | -122.105673 | 37.381340 | 060855103001018 | 647505 | 060855103001018 | 06 | 085 | ... | 0 | 2 | 3.125000 | 0.000000 | 17.553191 | 75.531915 | 3.191489 | 3.030303 | 4371.287764 | 64 | |
5.0 | 59076 | 10000.0 | 5.0 | 2500.0 | -122.141243 | 37.430643 | 060855114004024 | 651412 | 060855114004024 | 06 | 085 | ... | 0 | 1 | 12.500000 | 0.000000 | 3.846154 | 88.461538 | 0.000000 | 11.111111 | 9717.126696 | 8 | |
6.0 | 26178 | 10000.0 | 6.0 | 3485.0 | -122.134174 | 37.406939 | 060855106003006 | 650612 | 060855106003006 | 06 | 085 | ... | 0 | 1 | 31.250000 | 0.332779 | 29.118136 | 65.224626 | 4.159734 | 2.608696 | 4067.041311 | 224 | |
7.0 | 845 | 9900.0 | 7.0 | 2000.0 | -122.146875 | 37.438339 | 060855114001010 | 651340 | 060855114001010 | 06 | 085 | ... | 0 | 2 | 7.692308 | 0.000000 | 6.896552 | 86.206897 | 0.000000 | 13.333333 | 3187.745183 | 13 | |
8.0 | 31771 | 2550.0 | 8.0 | 774.0 | -122.003174 | 37.321574 | 060855080041001 | 646829 | 060855080041001 | 06 | 085 | ... | 0 | 1 | 20.833333 | 4.878049 | 57.317073 | 25.609756 | 6.097561 | 4.000000 | 13447.667701 | 24 | |
Solano | 1.0 | 22728 | 1849.0 | 1.0 | 862.0 | -122.149733 | 38.060886 | 060952521071006 | 707812 | 060952521071006 | 06 | 095 | ... | 0 | 2 | 69.327731 | 7.104796 | 12.255773 | 65.541741 | 11.722913 | 7.031250 | 6673.057886 | 238 |
2.0 | 57504 | 2300.0 | 2.0 | 1000.0 | -122.215571 | 38.066721 | 060952506042000 | 708407 | 060952506042000 | 06 | 095 | ... | 0 | 2 | 12.195122 | 21.264368 | 19.540230 | 52.873563 | 4.022989 | 8.888889 | 3699.496881 | 82 | |
3.0 | 33286 | 4500.0 | 3.0 | 2300.0 | -122.186894 | 38.214379 | 060952522014011 | 703152 | 060952522014011 | 06 | 095 | ... | 0 | 1 | 9.090909 | 7.954545 | 15.909091 | 73.863636 | 9.090909 | 5.714286 | 18.738056 | 33 | |
4.0 | 60319 | 3495.0 | 4.0 | 3050.0 | -122.154064 | 38.086228 | 060952521043005 | 707728 | 060952521043005 | 06 | 095 | ... | 0 | 1 | 6.756757 | 6.200000 | 22.600000 | 58.000000 | 8.000000 | 1.333333 | 2798.165950 | 148 | |
5.0 | 50440 | 4000.0 | 5.0 | 4088.0 | -122.159041 | 38.250363 | 060952522012030 | 702575 | 060952522012030 | 06 | 095 | ... | 0 | 0 | 0.000000 | 12.820513 | 12.820513 | 71.794872 | 0.000000 | 0.000000 | 537.934272 | 15 | |
7.0 | 38787 | 3200.0 | 7.0 | 2518.0 | -121.988700 | 38.384700 | 060952532041008 | 706468 | 060952532041008 | 06 | 095 | ... | 0 | 1 | 27.150538 | 7.075472 | 6.415094 | 67.735849 | 18.867925 | 4.615385 | 1126.346001 | 372 | |
Sonoma | 1.0 | 41429 | 3200.0 | 1.0 | 875.0 | -122.554700 | 38.416800 | 060971505002007 | 721202 | 060971505002007 | 06 | 097 | ... | 0 | 2 | 16.393443 | 0.000000 | 1.600000 | 88.000000 | 11.200000 | 6.153846 | 2305.129337 | 61 |
2.0 | 57190 | 4500.0 | 2.0 | 1800.0 | -122.420998 | 38.242798 | 060971501003025 | 721599 | 060971501003025 | 06 | 097 | ... | 0 | 0 | 28.571429 | 0.000000 | 0.000000 | 62.857143 | 40.000000 | 6.666667 | 680.782396 | 14 | |
3.0 | 43260 | 5000.0 | 3.0 | 1800.0 | -123.037300 | 38.330900 | 060971543024017 | 715003 | 060971543024017 | 06 | 097 | ... | 1 | 6 | 39.568345 | 0.000000 | 4.391892 | 83.108108 | 20.945946 | 39.301310 | 126.122201 | 139 | |
4.0 | 58319 | 4850.0 | 4.0 | 3932.0 | -122.678661 | 38.232416 | 060971511003001 | 714770 | 060971511003001 | 06 | 097 | ... | 0 | 4 | 40.000000 | 0.954654 | 0.477327 | 93.078759 | 2.863962 | 10.447761 | 162.388156 | 180 | |
5.0 | 41918 | 3750.0 | 5.0 | 3320.0 | -122.586050 | 38.244525 | 060971506112012 | 713335 | 060971506112012 | 06 | 097 | ... | 0 | 0 | 6.250000 | 0.719424 | 7.194245 | 88.489209 | 5.035971 | 4.000000 | 3300.527449 | 48 | |
6.0 | 60559 | 3295.0 | 6.0 | 3400.0 | -122.637792 | 38.463417 | 060971526002043 | 718340 | 060971526002043 | 06 | 097 | ... | 0 | 0 | 5.263158 | 0.000000 | 15.942029 | 82.608696 | 5.072464 | 0.000000 | 6260.283117 | 38 |
61 rows × 57 columns
Pandas provides simple ways of computing correlation coefficients among the columns in your DataFrame. If you use corr() on a full DF, it will produce a large correlation table. A bit hard to navigate and you mostly would not be interested in some of these results.
rentals_sf1.corr()
rent | bedrooms | sqft | longitude | latitude | logrecno | blkgrp | arealand | P0010001 | P0020001 | ... | H0050007 | H0050008 | pct_rent | pct_black | pct_asian | pct_white | pct_hisp | pct_vacant | pop_sqmi | total_hh | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rent | 1.000000 | 0.481234 | 0.616653 | -0.189488 | -0.208624 | 0.185653 | -0.011133 | -0.002037 | -0.116857 | -0.116857 | ... | -0.015596 | 0.010655 | -0.137076 | -0.222322 | 0.073785 | 0.191424 | -0.290056 | 0.030535 | 0.020475 | -0.098466 |
bedrooms | 0.481234 | 1.000000 | 0.819533 | 0.035085 | 0.065029 | -0.039371 | 0.028106 | 0.051052 | -0.120218 | -0.120218 | ... | -0.014990 | -0.034599 | -0.330290 | -0.000024 | -0.014917 | 0.040864 | -0.035118 | -0.049176 | -0.125802 | -0.165445 |
sqft | 0.616653 | 0.819533 | 1.000000 | 0.009677 | 0.041737 | -0.016225 | 0.008702 | 0.075221 | -0.113589 | -0.113589 | ... | -0.009650 | -0.031854 | -0.331541 | -0.045550 | -0.010680 | 0.090546 | -0.105976 | -0.018599 | -0.132978 | -0.150321 |
longitude | -0.189488 | 0.035085 | 0.009677 | 1.000000 | -0.548065 | -0.130319 | -0.098020 | -0.043462 | 0.094442 | 0.094442 | ... | 0.030678 | -0.099010 | -0.067935 | -0.023140 | 0.235135 | -0.284017 | 0.156813 | -0.135010 | -0.072233 | 0.054147 |
latitude | -0.208624 | 0.065029 | 0.041737 | -0.548065 | 1.000000 | -0.306019 | 0.069752 | 0.080858 | -0.105149 | -0.105149 | ... | 0.010343 | 0.072333 | -0.013489 | 0.265711 | -0.407035 | 0.253876 | -0.003537 | 0.123641 | -0.011506 | -0.084385 |
logrecno | 0.185653 | -0.039371 | -0.016225 | -0.130319 | -0.306019 | 1.000000 | 0.050819 | -0.011586 | 0.020363 | 0.020363 | ... | -0.029056 | -0.085817 | 0.069324 | -0.245896 | 0.140300 | -0.032165 | 0.055755 | -0.018806 | 0.069862 | 0.007127 |
blkgrp | -0.011133 | 0.028106 | 0.008702 | -0.098020 | 0.069752 | 0.050819 | 1.000000 | -0.001086 | -0.018809 | -0.018809 | ... | -0.009702 | -0.023974 | 0.015964 | -0.039342 | 0.005227 | 0.018223 | -0.009363 | 0.032782 | 0.030694 | -0.011555 |
arealand | -0.002037 | 0.051052 | 0.075221 | -0.043462 | 0.080858 | -0.011586 | -0.001086 | 1.000000 | 0.069781 | 0.069781 | ... | 0.011564 | 0.030822 | -0.084488 | -0.038112 | -0.063535 | 0.094328 | -0.026265 | 0.022772 | -0.087006 | 0.056401 |
P0010001 | -0.116857 | -0.120218 | -0.113589 | 0.094442 | -0.105149 | 0.020363 | -0.018809 | 0.069781 | 1.000000 | 1.000000 | ... | 0.086167 | 0.175744 | 0.278962 | 0.018317 | 0.203882 | -0.231680 | 0.056364 | 0.008534 | 0.206183 | 0.921130 |
P0020001 | -0.116857 | -0.120218 | -0.113589 | 0.094442 | -0.105149 | 0.020363 | -0.018809 | 0.069781 | 1.000000 | 1.000000 | ... | 0.086167 | 0.175744 | 0.278962 | 0.018317 | 0.203882 | -0.231680 | 0.056364 | 0.008534 | 0.206183 | 0.921130 |
P0020002 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
P0020003 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
P0020004 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
P0020005 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
P0020006 | -0.116857 | -0.120218 | -0.113589 | 0.094442 | -0.105149 | 0.020363 | -0.018809 | 0.069781 | 1.000000 | 1.000000 | ... | 0.086167 | 0.175744 | 0.278962 | 0.018317 | 0.203882 | -0.231680 | 0.056364 | 0.008534 | 0.206183 | 0.921130 |
P0030001 | -0.116857 | -0.120218 | -0.113589 | 0.094442 | -0.105149 | 0.020363 | -0.018809 | 0.069781 | 1.000000 | 1.000000 | ... | 0.086167 | 0.175744 | 0.278962 | 0.018317 | 0.203882 | -0.231680 | 0.056364 | 0.008534 | 0.206183 | 0.921130 |
P0030002 | -0.066322 | -0.093093 | -0.073550 | -0.026435 | 0.017446 | 0.004837 | -0.014263 | 0.132031 | 0.854742 | 0.854742 | ... | 0.047394 | 0.185614 | 0.173508 | -0.083516 | -0.071235 | 0.110595 | 0.012326 | 0.044099 | 0.126198 | 0.801007 |
P0030003 | -0.201005 | -0.064711 | -0.088316 | 0.025098 | 0.127130 | -0.166137 | -0.008567 | 0.008576 | 0.537023 | 0.537023 | ... | 0.031352 | 0.152642 | 0.229637 | 0.520310 | -0.039777 | -0.289330 | 0.119065 | 0.064829 | 0.126377 | 0.487629 |
P0030004 | -0.165203 | -0.090338 | -0.115262 | 0.082768 | -0.052450 | 0.051150 | 0.040074 | 0.024509 | 0.598688 | 0.598688 | ... | 0.062006 | 0.083645 | 0.218691 | 0.069526 | -0.005272 | -0.176332 | 0.270649 | -0.029922 | 0.125312 | 0.521462 |
P0030005 | -0.026636 | -0.102957 | -0.088583 | 0.166245 | -0.259710 | 0.080939 | -0.024014 | 0.004551 | 0.787608 | 0.787608 | ... | 0.064970 | 0.109526 | 0.237828 | -0.071530 | 0.555367 | -0.424008 | -0.135453 | -0.034732 | 0.214667 | 0.754255 |
P0030006 | -0.143584 | -0.066546 | -0.090380 | 0.059618 | -0.051114 | -0.031959 | 0.005681 | 0.030011 | 0.447145 | 0.447145 | ... | 0.070441 | 0.056157 | 0.151555 | 0.151273 | 0.021010 | -0.230884 | 0.178386 | -0.036454 | 0.070447 | 0.389712 |
P0030007 | -0.192221 | -0.074380 | -0.106523 | 0.122080 | -0.059414 | 0.035007 | 0.009119 | 0.008469 | 0.641693 | 0.641693 | ... | 0.139980 | 0.050649 | 0.237256 | 0.071188 | -0.066213 | -0.215288 | 0.479125 | -0.030471 | 0.133771 | 0.501186 |
P0030008 | -0.171300 | -0.118709 | -0.115239 | 0.094721 | -0.027325 | -0.047382 | -0.038164 | 0.049658 | 0.892282 | 0.892282 | ... | 0.076637 | 0.163941 | 0.257777 | 0.111917 | 0.080781 | -0.213042 | 0.102429 | 0.021904 | 0.150980 | 0.792140 |
P0040001 | -0.116857 | -0.120218 | -0.113589 | 0.094442 | -0.105149 | 0.020363 | -0.018809 | 0.069781 | 1.000000 | 1.000000 | ... | 0.086167 | 0.175744 | 0.278962 | 0.018317 | 0.203882 | -0.231680 | 0.056364 | 0.008534 | 0.206183 | 0.921130 |
P0040002 | -0.063643 | -0.117265 | -0.095832 | 0.063237 | -0.105363 | 0.010651 | -0.024778 | 0.080094 | 0.956944 | 0.956944 | ... | 0.057461 | 0.193467 | 0.247470 | -0.005573 | 0.283091 | -0.211665 | -0.121733 | 0.020981 | 0.204262 | 0.918629 |
P0040003 | -0.199757 | -0.079945 | -0.112001 | 0.132892 | -0.063399 | 0.035869 | 0.003009 | 0.017469 | 0.712001 | 0.712001 | ... | 0.121817 | 0.063953 | 0.245416 | 0.068926 | -0.067680 | -0.189243 | 0.465083 | -0.025047 | 0.129977 | 0.565955 |
H00010001 | -0.092694 | -0.167299 | -0.150461 | 0.041489 | -0.072383 | 0.002331 | -0.015419 | 0.056135 | 0.914833 | 0.914833 | ... | 0.070336 | 0.264189 | 0.343574 | 0.024092 | 0.190429 | -0.173990 | -0.041511 | 0.078374 | 0.269096 | 0.995539 |
H0030001 | -0.092694 | -0.167299 | -0.150461 | 0.041489 | -0.072383 | 0.002331 | -0.015419 | 0.056135 | 0.914833 | 0.914833 | ... | 0.070336 | 0.264189 | 0.343574 | 0.024092 | 0.190429 | -0.173990 | -0.041511 | 0.078374 | 0.269096 | 0.995539 |
H0030002 | -0.098466 | -0.165445 | -0.150321 | 0.054147 | -0.084385 | 0.007127 | -0.011555 | 0.056401 | 0.921130 | 0.921130 | ... | 0.067513 | 0.229187 | 0.338230 | 0.017712 | 0.197087 | -0.180456 | -0.035022 | 0.017050 | 0.264373 | 1.000000 |
H0030003 | -0.012294 | -0.120409 | -0.095463 | -0.074482 | 0.049902 | -0.036622 | -0.040349 | 0.033087 | 0.523654 | 0.523654 | ... | 0.066522 | 0.443568 | 0.257462 | 0.065759 | 0.066542 | -0.057760 | -0.077547 | 0.535213 | 0.206244 | 0.588844 |
H0040001 | -0.098466 | -0.165445 | -0.150321 | 0.054147 | -0.084385 | 0.007127 | -0.011555 | 0.056401 | 0.921130 | 0.921130 | ... | 0.067513 | 0.229187 | 0.338230 | 0.017712 | 0.197087 | -0.180456 | -0.035022 | 0.017050 | 0.264373 | 1.000000 |
H0040002 | 0.003280 | 0.062530 | 0.087009 | 0.043090 | -0.005836 | -0.054497 | -0.067187 | 0.155032 | 0.471111 | 0.471111 | ... | -0.003670 | 0.185045 | -0.328432 | -0.069112 | 0.061866 | 0.029861 | -0.115989 | -0.046065 | -0.097770 | 0.494101 |
H0040003 | 0.048578 | 0.066642 | 0.089497 | -0.070018 | 0.001907 | -0.021529 | -0.018293 | 0.175840 | 0.292418 | 0.292418 | ... | -0.006291 | 0.153030 | -0.336150 | -0.101425 | -0.012773 | 0.124782 | -0.128222 | -0.051179 | -0.130538 | 0.303376 |
H0040004 | -0.117207 | -0.216751 | -0.209725 | 0.052274 | -0.093749 | 0.029572 | 0.012649 | -0.007570 | 0.848889 | 0.848889 | ... | 0.078426 | 0.179623 | 0.531463 | 0.054046 | 0.202412 | -0.226591 | 0.013496 | 0.040480 | 0.346601 | 0.929069 |
H0050001 | -0.012294 | -0.120409 | -0.095463 | -0.074482 | 0.049902 | -0.036622 | -0.040349 | 0.033087 | 0.523654 | 0.523654 | ... | 0.066522 | 0.443568 | 0.257462 | 0.065759 | 0.066542 | -0.057760 | -0.077547 | 0.535213 | 0.206244 | 0.588844 |
H0050002 | -0.083147 | -0.128702 | -0.119278 | -0.024962 | 0.033684 | 0.000885 | -0.032613 | 0.006208 | 0.615918 | 0.615918 | ... | 0.094438 | 0.236658 | 0.356730 | 0.081402 | 0.054431 | -0.089416 | -0.011010 | 0.308367 | 0.205073 | 0.628722 |
H0050003 | 0.005707 | -0.099994 | -0.077529 | 0.006925 | -0.052954 | 0.013185 | -0.007182 | -0.002627 | 0.280286 | 0.280286 | ... | 0.060043 | 0.164427 | 0.222865 | -0.008653 | 0.064074 | -0.027387 | -0.060657 | 0.203786 | 0.103148 | 0.341604 |
H0050004 | 0.024617 | -0.014218 | -0.000359 | -0.017959 | 0.016459 | -0.055300 | -0.051306 | 0.003200 | 0.087417 | 0.087417 | ... | -0.004086 | 0.051370 | -0.019252 | 0.047000 | 0.029440 | -0.026150 | -0.040381 | 0.399948 | 0.041855 | 0.126174 |
H0050005 | 0.058702 | -0.016908 | -0.002828 | -0.065529 | 0.020884 | 0.002664 | -0.025691 | 0.031964 | 0.114451 | 0.114451 | ... | -0.002102 | 0.055245 | -0.088754 | -0.004807 | 0.042804 | -0.000322 | -0.074880 | 0.182020 | 0.067681 | 0.148416 |
H0050006 | 0.087468 | -0.078880 | -0.044212 | -0.109002 | 0.033156 | 0.022854 | 0.041022 | 0.079438 | 0.188282 | 0.188282 | ... | 0.014174 | 0.205880 | 0.104297 | -0.059490 | 0.048751 | 0.055863 | -0.138841 | 0.233885 | 0.178326 | 0.281515 |
H0050007 | -0.015596 | -0.014990 | -0.009650 | 0.030678 | 0.010343 | -0.029056 | -0.009702 | 0.011564 | 0.086167 | 0.086167 | ... | 1.000000 | -0.007758 | 0.071269 | -0.001634 | 0.042330 | -0.052361 | 0.013735 | 0.041681 | 0.010471 | 0.067513 |
H0050008 | 0.010655 | -0.034599 | -0.031854 | -0.099010 | 0.072333 | -0.085817 | -0.023974 | 0.030822 | 0.175744 | 0.175744 | ... | -0.007758 | 1.000000 | 0.066034 | 0.046249 | 0.006824 | -0.013860 | -0.049633 | 0.392266 | 0.036943 | 0.229187 |
pct_rent | -0.137076 | -0.330290 | -0.331541 | -0.067935 | -0.013489 | 0.069324 | 0.015964 | -0.084488 | 0.278962 | 0.278962 | ... | 0.071269 | 0.066034 | 1.000000 | 0.176972 | 0.039507 | -0.218981 | 0.187086 | 0.107243 | 0.354959 | 0.338230 |
pct_black | -0.222322 | -0.000024 | -0.045550 | -0.023140 | 0.265711 | -0.245896 | -0.039342 | -0.038112 | 0.018317 | 0.018317 | ... | -0.001634 | 0.046249 | 0.176972 | 1.000000 | -0.173398 | -0.369745 | 0.084477 | 0.103343 | 0.052245 | 0.017712 |
pct_asian | 0.073785 | -0.014917 | -0.010680 | 0.235135 | -0.407035 | 0.140300 | 0.005227 | -0.063535 | 0.203882 | 0.203882 | ... | 0.042330 | 0.006824 | 0.039507 | -0.173398 | 1.000000 | -0.694488 | -0.325704 | -0.061251 | 0.156089 | 0.197087 |
pct_white | 0.191424 | 0.040864 | 0.090546 | -0.284017 | 0.253876 | -0.032165 | 0.018223 | 0.094328 | -0.231680 | -0.231680 | ... | -0.052361 | -0.013860 | -0.218981 | -0.369745 | -0.694488 | 1.000000 | -0.209620 | 0.031902 | -0.177767 | -0.180456 |
pct_hisp | -0.290056 | -0.035118 | -0.105976 | 0.156813 | -0.003537 | 0.055755 | -0.009363 | -0.026265 | 0.056364 | 0.056364 | ... | 0.013735 | -0.049633 | 0.187086 | 0.084477 | -0.325704 | -0.209620 | 1.000000 | -0.048769 | 0.001433 | -0.035022 |
pct_vacant | 0.030535 | -0.049176 | -0.018599 | -0.135010 | 0.123641 | -0.018806 | 0.032782 | 0.022772 | 0.008534 | 0.008534 | ... | 0.041681 | 0.392266 | 0.107243 | 0.103343 | -0.061251 | 0.031902 | -0.048769 | 1.000000 | 0.039410 | 0.017050 |
pop_sqmi | 0.020475 | -0.125802 | -0.132978 | -0.072233 | -0.011506 | 0.069862 | 0.030694 | -0.087006 | 0.206183 | 0.206183 | ... | 0.010471 | 0.036943 | 0.354959 | 0.052245 | 0.156089 | -0.177767 | 0.001433 | 0.039410 | 1.000000 | 0.264373 |
total_hh | -0.098466 | -0.165445 | -0.150321 | 0.054147 | -0.084385 | 0.007127 | -0.011555 | 0.056401 | 0.921130 | 0.921130 | ... | 0.067513 | 0.229187 | 0.338230 | 0.017712 | 0.197087 | -0.180456 | -0.035022 | 0.017050 | 0.264373 | 1.000000 |
50 rows × 50 columns
It is easy to compute correlation coeffients for a subset of columns.
rentals_sf1[['rent', 'sqft']].corr()
rent | sqft | |
---|---|---|
rent | 1.000000 | 0.616653 |
sqft | 0.616653 | 1.000000 |
And this method can be combined with groupby to compute correlation tables by group.
rentals_sf1.groupby('county_name')[['rent', 'sqft']].corr()
rent | sqft | ||
---|---|---|---|
county_name | |||
Alameda | rent | 1.000000 | 0.784616 |
sqft | 0.784616 | 1.000000 | |
Contra Costa | rent | 1.000000 | 0.688621 |
sqft | 0.688621 | 1.000000 | |
Marin | rent | 1.000000 | 0.799108 |
sqft | 0.799108 | 1.000000 | |
Napa | rent | 1.000000 | 0.711565 |
sqft | 0.711565 | 1.000000 | |
San Francisco | rent | 1.000000 | 0.740660 |
sqft | 0.740660 | 1.000000 | |
San Mateo | rent | 1.000000 | 0.816791 |
sqft | 0.816791 | 1.000000 | |
Santa Clara | rent | 1.000000 | 0.771930 |
sqft | 0.771930 | 1.000000 | |
Solano | rent | 1.000000 | 0.816569 |
sqft | 0.816569 | 1.000000 | |
Sonoma | rent | 1.000000 | 0.823559 |
sqft | 0.823559 | 1.000000 |
rentals_sf1[rentals_sf1['bedrooms']<4].groupby('bedrooms')[['rent', 'sqft']].corr()
rent | sqft | ||
---|---|---|---|
bedrooms | |||
1.0 | rent | 1.000000 | 0.350723 |
sqft | 0.350723 | 1.000000 | |
2.0 | rent | 1.000000 | 0.485617 |
sqft | 0.485617 | 1.000000 | |
3.0 | rent | 1.000000 | 0.445744 |
sqft | 0.445744 | 1.000000 |
We use below the cut method to create categories for ranges of a variable. In this example we use 4 even intevals using cut. If we used qcut instead, it would generate quantiles for us.
factor = pd.cut(rentals_sf1['sqft'], 4)
factor[:5]
0 (216.064, 1455.0] 1 (216.064, 1455.0] 2 (216.064, 1455.0] 3 (2689.0, 3923.0] 4 (216.064, 1455.0] Name: sqft, dtype: category Categories (4, interval[float64]): [(216.064, 1455.0] < (1455.0, 2689.0] < (2689.0, 3923.0] < (3923.0, 5157.0]]
Define a function to get grouped statistics.
def get_stats(group):
return{'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}
Apply the function to the DataFrame for a specified column.
grouped = rentals_sf1['rent'].groupby(factor)
grouped.apply(get_stats).unstack()
count | max | mean | min | |
---|---|---|---|---|
sqft | ||||
(216.064, 1455.0] | 53054.0 | 8500.0 | 2317.507521 | 350.0 |
(1455.0, 2689.0] | 7003.0 | 10200.0 | 3742.959732 | 295.0 |
(2689.0, 3923.0] | 983.0 | 10250.0 | 5334.371312 | 750.0 |
(3923.0, 5157.0] | 163.0 | 10000.0 | 5913.877301 | 1300.0 |
Try practicing these techniques on your own, to do the following:
print('Mean sqft of rental listings by county')
rentals_sf1.groupby(['county_name'])[['sqft']].agg(['mean'])
Mean sqft of rental listings by county
sqft | |
---|---|
mean | |
county_name | |
Alameda | 994.309424 |
Contra Costa | 1093.182721 |
Marin | 1195.318408 |
Napa | 1276.574349 |
San Francisco | 993.078210 |
San Mateo | 1018.273606 |
Santa Clara | 1015.654471 |
Solano | 1036.303606 |
Sonoma | 1080.441710 |
print('Standard deviation (std) of sqft of rental listings by county and bedroom')
# The line below does this for bedrooms < 4
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['sqft']].agg('std')
# The line below does this for all bedrooms
#rentals_sf1.groupby(['county_name', 'bedrooms'])[['sqft']].agg('std')
Standard deviation (std) of sqft of rental listings by county and bedroom
sqft | ||
---|---|---|
county_name | bedrooms | |
Alameda | 1.0 | 145.339524 |
2.0 | 182.889661 | |
3.0 | 391.567824 | |
Contra Costa | 1.0 | 125.849542 |
2.0 | 196.317132 | |
3.0 | 401.888005 | |
Marin | 1.0 | 151.617528 |
2.0 | 262.306487 | |
3.0 | 377.109246 | |
Napa | 1.0 | 108.029275 |
2.0 | 160.604700 | |
3.0 | 453.097339 | |
San Francisco | 1.0 | 197.229158 |
2.0 | 275.711931 | |
3.0 | 430.562968 | |
San Mateo | 1.0 | 164.091809 |
2.0 | 220.181812 | |
3.0 | 430.838329 | |
Santa Clara | 1.0 | 136.437022 |
2.0 | 182.463528 | |
3.0 | 355.383819 | |
Solano | 1.0 | 267.559314 |
2.0 | 143.200480 | |
3.0 | 365.966708 | |
Sonoma | 1.0 | 140.078349 |
2.0 | 280.080267 | |
3.0 | 402.328830 |
# Add a new column with a normalized sqft, substracting the mean sqft by bedroom from each listing's sqft
rentals_sf1['sqft'].groupby(rentals_sf1['bedrooms']).apply()
#rentals_sf1['normalized_sqft']= rentals_sf1['sqft'].groupby(rentals_sf1['bedrooms']).transform(demean)
#rentals_sf1.head()
<pandas.core.groupby.SeriesGroupBy object at 0x10d0cc160>
# Compute correlation coefficients among rent, sqft, pct_white, pct_black, pct_asian and pct_hisp, by county and for the region
print('Correlations by county')
rentals_sf1.groupby('county_name')[['rent', 'sqft','pct_white', 'pct_black','pct_asian','pct_hisp']].corr()
Correlations by county
rent | sqft | pct_white | pct_black | pct_asian | pct_hisp | ||
---|---|---|---|---|---|---|---|
county_name | |||||||
Alameda | rent | 1.000000 | 0.784616 | 0.170268 | -0.096534 | 0.053136 | -0.307377 |
sqft | 0.784616 | 1.000000 | 0.086466 | -0.065043 | 0.040900 | -0.142701 | |
pct_white | 0.170268 | 0.086466 | 1.000000 | -0.446842 | -0.557046 | -0.192825 | |
pct_black | -0.096534 | -0.065043 | -0.446842 | 1.000000 | -0.302742 | -0.008881 | |
pct_asian | 0.053136 | 0.040900 | -0.557046 | -0.302742 | 1.000000 | -0.285257 | |
pct_hisp | -0.307377 | -0.142701 | -0.192825 | -0.008881 | -0.285257 | 1.000000 | |
Contra Costa | rent | 1.000000 | 0.688621 | 0.254659 | -0.228891 | 0.059985 | -0.354806 |
sqft | 0.688621 | 1.000000 | 0.013136 | 0.027911 | -0.036097 | -0.008240 | |
pct_white | 0.254659 | 0.013136 | 1.000000 | -0.681387 | -0.523865 | -0.432210 | |
pct_black | -0.228891 | 0.027911 | -0.681387 | 1.000000 | -0.072675 | 0.310377 | |
pct_asian | 0.059985 | -0.036097 | -0.523865 | -0.072675 | 1.000000 | -0.250950 | |
pct_hisp | -0.354806 | -0.008240 | -0.432210 | 0.310377 | -0.250950 | 1.000000 | |
Marin | rent | 1.000000 | 0.799108 | 0.316734 | -0.236780 | 0.012182 | -0.302109 |
sqft | 0.799108 | 1.000000 | 0.229527 | -0.235192 | 0.009825 | -0.161819 | |
pct_white | 0.316734 | 0.229527 | 1.000000 | -0.539401 | -0.430646 | -0.699893 | |
pct_black | -0.236780 | -0.235192 | -0.539401 | 1.000000 | 0.064810 | 0.088286 | |
pct_asian | 0.012182 | 0.009825 | -0.430646 | 0.064810 | 1.000000 | -0.018038 | |
pct_hisp | -0.302109 | -0.161819 | -0.699893 | 0.088286 | -0.018038 | 1.000000 | |
Napa | rent | 1.000000 | 0.711565 | 0.268350 | -0.016171 | -0.095403 | -0.276374 |
sqft | 0.711565 | 1.000000 | 0.036981 | 0.147153 | 0.224958 | -0.350268 | |
pct_white | 0.268350 | 0.036981 | 1.000000 | -0.382098 | -0.723362 | -0.388602 | |
pct_black | -0.016171 | 0.147153 | -0.382098 | 1.000000 | 0.320864 | -0.025858 | |
pct_asian | -0.095403 | 0.224958 | -0.723362 | 0.320864 | 1.000000 | -0.276874 | |
pct_hisp | -0.276374 | -0.350268 | -0.388602 | -0.025858 | -0.276874 | 1.000000 | |
San Francisco | rent | 1.000000 | 0.740660 | 0.264373 | -0.038178 | -0.227130 | -0.118455 |
sqft | 0.740660 | 1.000000 | -0.008948 | -0.011674 | 0.009207 | 0.001649 | |
pct_white | 0.264373 | -0.008948 | 1.000000 | -0.376206 | -0.810304 | -0.363823 | |
pct_black | -0.038178 | -0.011674 | -0.376206 | 1.000000 | -0.060872 | 0.180683 | |
pct_asian | -0.227130 | 0.009207 | -0.810304 | -0.060872 | 1.000000 | -0.066804 | |
pct_hisp | -0.118455 | 0.001649 | -0.363823 | 0.180683 | -0.066804 | 1.000000 | |
San Mateo | rent | 1.000000 | 0.816791 | 0.186228 | -0.074474 | -0.065291 | -0.174153 |
sqft | 0.816791 | 1.000000 | 0.095011 | -0.063527 | 0.016771 | -0.126785 | |
pct_white | 0.186228 | 0.095011 | 1.000000 | -0.393266 | -0.643022 | -0.462254 | |
pct_black | -0.074474 | -0.063527 | -0.393266 | 1.000000 | 0.140799 | 0.152291 | |
pct_asian | -0.065291 | 0.016771 | -0.643022 | 0.140799 | 1.000000 | -0.297553 | |
pct_hisp | -0.174153 | -0.126785 | -0.462254 | 0.152291 | -0.297553 | 1.000000 | |
Santa Clara | rent | 1.000000 | 0.771930 | 0.110132 | -0.180542 | 0.075833 | -0.286915 |
sqft | 0.771930 | 1.000000 | 0.066089 | -0.114692 | 0.030431 | -0.129721 | |
pct_white | 0.110132 | 0.066089 | 1.000000 | -0.051713 | -0.813892 | 0.054657 | |
pct_black | -0.180542 | -0.114692 | -0.051713 | 1.000000 | -0.203472 | 0.161685 | |
pct_asian | 0.075833 | 0.030431 | -0.813892 | -0.203472 | 1.000000 | -0.545422 | |
pct_hisp | -0.286915 | -0.129721 | 0.054657 | 0.161685 | -0.545422 | 1.000000 | |
Solano | rent | 1.000000 | 0.816569 | 0.238070 | -0.213447 | 0.093823 | -0.245141 |
sqft | 0.816569 | 1.000000 | 0.176588 | -0.170842 | -0.008864 | -0.124703 | |
pct_white | 0.238070 | 0.176588 | 1.000000 | -0.722191 | -0.597865 | -0.045126 | |
pct_black | -0.213447 | -0.170842 | -0.722191 | 1.000000 | 0.256453 | -0.213463 | |
pct_asian | 0.093823 | -0.008864 | -0.597865 | 0.256453 | 1.000000 | -0.338282 | |
pct_hisp | -0.245141 | -0.124703 | -0.045126 | -0.213463 | -0.338282 | 1.000000 | |
Sonoma | rent | 1.000000 | 0.823559 | 0.260039 | -0.178540 | 0.104650 | -0.288495 |
sqft | 0.823559 | 1.000000 | 0.246950 | -0.130935 | 0.081570 | -0.307017 | |
pct_white | 0.260039 | 0.246950 | 1.000000 | -0.434451 | -0.301256 | -0.794879 | |
pct_black | -0.178540 | -0.130935 | -0.434451 | 1.000000 | 0.073768 | 0.241904 | |
pct_asian | 0.104650 | 0.081570 | -0.301256 | 0.073768 | 1.000000 | -0.050895 | |
pct_hisp | -0.288495 | -0.307017 | -0.794879 | 0.241904 | -0.050895 | 1.000000 |
print('Correlations for the region')
rentals_sf1[['rent', 'sqft','pct_white', 'pct_black','pct_asian','pct_hisp']].corr()
Correlations for the region
rent | sqft | pct_white | pct_black | pct_asian | pct_hisp | |
---|---|---|---|---|---|---|
rent | 1.000000 | 0.616653 | 0.191424 | -0.222322 | 0.073785 | -0.290056 |
sqft | 0.616653 | 1.000000 | 0.090546 | -0.045550 | -0.010680 | -0.105976 |
pct_white | 0.191424 | 0.090546 | 1.000000 | -0.369745 | -0.694488 | -0.209620 |
pct_black | -0.222322 | -0.045550 | -0.369745 | 1.000000 | -0.173398 | 0.084477 |
pct_asian | 0.073785 | -0.010680 | -0.694488 | -0.173398 | 1.000000 | -0.325704 |
pct_hisp | -0.290056 | -0.105976 | -0.209620 | 0.084477 | -0.325704 | 1.000000 |
# Redo the statistical profile on rents by categories of sqft range using 10 quantiles rather than 4 equal bins
q= pd.qcut(rentals_sf1['sqft'], 10)
rentals_sf1['rent'].groupby(q).apply(get_stats).unstack()
count | max | mean | min | |
---|---|---|---|---|
sqft | ||||
(220.999, 600.0] | 6654.0 | 3829.0 | 1776.498046 | 445.0 |
(600.0, 695.0] | 5637.0 | 4330.0 | 1956.469399 | 600.0 |
(695.0, 750.0] | 6431.0 | 4750.0 | 2077.531333 | 650.0 |
(750.0, 830.0] | 5792.0 | 6000.0 | 2112.290746 | 430.0 |
(830.0, 901.0] | 6088.0 | 6500.0 | 2141.594448 | 650.0 |
(901.0, 1000.0] | 7051.0 | 8300.0 | 2424.039285 | 350.0 |
(1000.0, 1100.0] | 5744.0 | 7000.0 | 2634.580084 | 689.0 |
(1100.0, 1247.0] | 5606.0 | 8350.0 | 2909.535319 | 500.0 |
(1247.0, 1600.0] | 6100.0 | 9250.0 | 3301.863279 | 500.0 |
(1600.0, 5157.0] | 6100.0 | 10250.0 | 4133.284426 | 295.0 |