Data Manipulation Part 1 - Groupby Operations

Chapter 9 in Python for Data Analysis demonstrates a variety of methods for analyzing data via aggregation and grouping operations, which are the focus of this session. Our overall goal for this session is to do Exploratory Data Analysis: looking at and probing our data to learn about the patterns we can discover in them. Often this generates a better understanding of problems in the data, and reveals relationships that might be worth exploring in more depth.

We will use Census Summary File 1 data for much of this session to illustrate concepts using real data.

P1. POPULATION [1]
  • Total: P0010001

P3. RACE [8] Universe: Total population
  • Total: P0030001
  • White alone: P0030002
  • Black or African American alone: P0030003
  • American Indian and Alaska Native alone: P0030004
  • Asian alone: P0030005
  • Native Hawaiian and Other Pacific Islander alone: P0030006
  • Some Other Race alone: P0030007
  • Two or More Races: P0030008

P4. HISPANIC OR LATINO ORIGIN [3] Universe: Total population
  • Total: P0040001
  • Not Hispanic or Latino: P0040002
  • Hispanic or Latino: P0040003

H1. HOUSING UNITS [1] Universe: Housing units
  • Total: H00010001

H3. OCCUPANCY STATUS [3] Universe: Housing units
  • Total: H0030001
  • Occupied: H0030002
  • Vacant: H0030003

H4. TENURE [4] Universe: Occupied housing units
  • Total: H0040001
  • Owned with a mortgage or a loan: H0040002
  • Owned free and clear: H0040003
  • Renter-occupied: H0040004

H5. VACANCY STATUS [8] Universe: Vacant housing units
  • Total: H0050001
  • For rent: H0050002
  • Rented, not occupied: H0050003
  • For sale only: H0050004
  • Sold, not occupied: H0050005
  • For seasonal, recreational, or occasional use: H0050006
  • For migrant workers: H0050007
  • Other vacant: H0050008

Below we load a table I created previously with an extract of data from the 2010 Census, Summary File 1, for the Bay Area. The file is being loaded from an HDF5 datastore - this and other file formats are covered in Chapter 6.

In [1]:
import pandas as pd

sf1store = pd.HDFStore('data/bay_sf1_small.h5')
sf1 = sf1store['sf1_extract']

# Derived percentage columns: renter share of occupied units, race/ethnicity
# shares of total population, and vacant share of all housing units
sf1['pct_rent'] = sf1['H0040004'] / sf1['H0040001'] * 100
sf1['pct_black'] = sf1['P0030003'] / sf1['P0030001'] * 100
sf1['pct_asian'] = sf1['P0030005'] / sf1['P0030001'] * 100
sf1['pct_white'] = sf1['P0030002'] / sf1['P0030001'] * 100
sf1['pct_hisp'] = sf1['P0040003'] / sf1['P0040001'] * 100
sf1['pct_vacant'] = sf1['H0050001'] / sf1['H00010001'] * 100

# Population density: arealand is in square meters; ~2,589,988 sq meters per sq mile
sf1['pop_sqmi'] = sf1['P0010001'] / (sf1['arealand'] / 2589988)

# Keep only blocks with nonzero population
sf1 = sf1[sf1['P0030001'] > 0]
sf1[:5]
Out[1]:
logrecno blockfips state county tract blkgrp block arealand P0010001 P0020001 ... H0050006 H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi
1 26 060014271001001 06 001 427100 1 1001 79696 113 113 ... 0 0 0 80.000000 13.274336 5.309735 78.761062 1.769912 0.000000 3672.312839
3 28 060014271001003 06 001 427100 1 1003 19546 29 29 ... 0 0 2 70.000000 13.793103 27.586207 37.931034 24.137931 23.076923 3842.712166
4 29 060014271001004 06 001 427100 1 1004 14364 26 26 ... 0 0 0 75.000000 0.000000 38.461538 34.615385 0.000000 0.000000 4688.087441
6 31 060014271001006 06 001 427100 1 1006 1281 2 2 ... 0 0 0 100.000000 0.000000 50.000000 0.000000 100.000000 0.000000 4043.697112
7 32 060014271001007 06 001 427100 1 1007 19020 30 30 ... 0 0 0 33.333333 0.000000 43.333333 50.000000 0.000000 10.000000 4085.154574

5 rows × 49 columns

Groupby and Aggregation Operations

Groupby is a powerful method in pandas that follows the split-apply-combine approach to data. As shown in Figure 9-1 in the context of a sum operation, the data is first split into groups that share the same key values. Then an operation, in this case a sum, is applied to each group. Then the results are combined.

The built-in aggregation methods available for groupby operations include:

  • count
  • sum
  • mean
  • median
  • std, var
  • min, max
  • first, last

You can also apply your own functions as aggregation methods.
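As a minimal sketch with made-up data (not the census extract used below), both built-in and user-defined aggregations follow the same split-apply-combine pattern:

```python
import pandas as pd

# Toy data: two groups, 'a' and 'b'
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'value': [1, 2, 3, 4]})

# Built-in aggregation: sum within each group (a: 3, b: 7)
print(df.groupby('key')['value'].sum())

# User-defined aggregation passed to agg: range (max - min) within each group
def value_range(arr):
    return arr.max() - arr.min()

print(df.groupby('key')['value'].agg(value_range))
```

Any function that reduces a Series to a single value can be used this way.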

Let's apply this approach to computing total population in each county in our dataset. We can do this in two steps to help explain what is happening. First we create a groupby object, using county codes to group all the census blocks in sf1 into groups that share the same county code.

In [2]:
grouped = sf1['P0010001'].groupby(sf1['county'])
grouped
Out[2]:
<pandas.core.groupby.SeriesGroupBy object at 0x10efa6dd8>

Now that we have this grouping object, which represents the 'split' part of the split-apply-combine workflow, we can apply operations and combine the results using methods like sum:

In [3]:
grouped.sum()
Out[3]:
county
001    1510271
013    1049025
041     252409
055     136484
075     805235
081     718451
085    1781642
095     413344
097     483878
Name: P0010001, dtype: int64

Doing this in two steps as above just clarifies the split and apply parts of the process that happen within a groupby operation. Normally we would not bother separately creating a groupby object -- we would just do this in one line:

In [4]:
sf1['P0010001'].groupby(sf1['county']).sum()
Out[4]:
county
001    1510271
013    1049025
041     252409
055     136484
075     805235
081     718451
085    1781642
095     413344
097     483878
Name: P0010001, dtype: int64

Let's add county names to the dataframe so we get more readable output, and rerun this aggregation.

In [5]:
county_names = {'001': 'Alameda', '013': 'Contra Costa', '041': 'Marin', '055': 'Napa', '075': 'San Francisco',
                '081': 'San Mateo', '085': 'Santa Clara', '095': 'Solano', '097': 'Sonoma'}

Let's add county_name as a column in the dataframe. It would be easy to append it as the last column with a merge, but let's see how to insert it in a specified location so that it is easier to read when we browse the data. We can insert it at position 4, between county and tract, like so:

In [6]:
sf1.insert(4, 'county_name', sf1['county'].replace(county_names))
sf1[:5]
Out[6]:
logrecno blockfips state county county_name tract blkgrp block arealand P0010001 ... H0050006 H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi
1 26 060014271001001 06 001 Alameda 427100 1 1001 79696 113 ... 0 0 0 80.000000 13.274336 5.309735 78.761062 1.769912 0.000000 3672.312839
3 28 060014271001003 06 001 Alameda 427100 1 1003 19546 29 ... 0 0 2 70.000000 13.793103 27.586207 37.931034 24.137931 23.076923 3842.712166
4 29 060014271001004 06 001 Alameda 427100 1 1004 14364 26 ... 0 0 0 75.000000 0.000000 38.461538 34.615385 0.000000 0.000000 4688.087441
6 31 060014271001006 06 001 Alameda 427100 1 1006 1281 2 ... 0 0 0 100.000000 0.000000 50.000000 0.000000 100.000000 0.000000 4043.697112
7 32 060014271001007 06 001 Alameda 427100 1 1007 19020 30 ... 0 0 0 33.333333 0.000000 43.333333 50.000000 0.000000 10.000000 4085.154574

5 rows × 50 columns

Now we can print the results of summing population by county_name:

In [7]:
print('Total Population by County:')
print(sf1['P0010001'].groupby(sf1['county_name']).sum())
Total Population by County:
county_name
Alameda          1510271
Contra Costa     1049025
Marin             252409
Napa              136484
San Francisco     805235
San Mateo         718451
Santa Clara      1781642
Solano            413344
Sonoma            483878
Name: P0010001, dtype: int64

We might want to capture the result in a DataFrame so we can use it in further processing, such as merging the results back onto the original DataFrame.

In [8]:
county_pop = sf1['P0010001'].groupby(sf1['county_name']).sum().to_frame(name='total_population')
county_pop
Out[8]:
total_population
county_name
Alameda 1510271
Contra Costa 1049025
Marin 252409
Napa 136484
San Francisco 805235
San Mateo 718451
Santa Clara 1781642
Solano 413344
Sonoma 483878

Here we merge the county total population with sf1 and create a new DataFrame.

In [9]:
sf2 = pd.merge(sf1,county_pop, left_on='county_name', right_index=True)
sf2[:5]
Out[9]:
logrecno blockfips state county county_name tract blkgrp block arealand P0010001 ... H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi total_population
1 26 060014271001001 06 001 Alameda 427100 1 1001 79696 113 ... 0 0 80.000000 13.274336 5.309735 78.761062 1.769912 0.000000 3672.312839 1510271
3 28 060014271001003 06 001 Alameda 427100 1 1003 19546 29 ... 0 2 70.000000 13.793103 27.586207 37.931034 24.137931 23.076923 3842.712166 1510271
4 29 060014271001004 06 001 Alameda 427100 1 1004 14364 26 ... 0 0 75.000000 0.000000 38.461538 34.615385 0.000000 0.000000 4688.087441 1510271
6 31 060014271001006 06 001 Alameda 427100 1 1006 1281 2 ... 0 0 100.000000 0.000000 50.000000 0.000000 100.000000 0.000000 4043.697112 1510271
7 32 060014271001007 06 001 Alameda 427100 1 1007 19020 30 ... 0 0 33.333333 0.000000 43.333333 50.000000 0.000000 10.000000 4085.154574 1510271

5 rows × 51 columns

Let's say we wanted to compute the population per square mile by county using the groupby method. We could create another dataframe with total land area by county, and then divide total population by total area.

In [10]:
county_land = sf1['arealand'].groupby(sf1['county_name']).sum().to_frame(name='total_area')
county_land
Out[10]:
total_area
county_name
Alameda 1190434861
Contra Costa 1095085515
Marin 1046029032
Napa 1556005658
San Francisco 95535946
San Mateo 884654868
Santa Clara 2378681334
Solano 1224964331
Sonoma 3206326062
In [11]:
county_pop_per_sqmi = county_pop['total_population'] / county_land['total_area'] * 2589988.11
county_pop_per_sqmi
Out[11]:
county_name
Alameda           3285.844577
Contra Costa      2481.050329
Marin              624.969565
Napa               227.179082
San Francisco    21829.993454
San Mateo         2103.396042
Santa Clara       1939.911635
Solano             873.948750
Sonoma             390.864261
dtype: float64

Or of course we could have done this whole thing in one line:

In [12]:
sf1['P0010001'].groupby(sf1['county_name']).sum() / sf1['arealand'].groupby(sf1['county_name']).sum() * 2589988.11
Out[12]:
county_name
Alameda           3285.844577
Contra Costa      2481.050329
Marin              624.969565
Napa               227.179082
San Francisco    21829.993454
San Mateo         2103.396042
Santa Clara       1939.911635
Solano             873.948750
Sonoma             390.864261
dtype: float64

Your turn to practice:

  • Count the number of census blocks per county.
  • Calculate total households per county.
  • Calculate percent renters by county. (Be careful not to calculate the mean percent rental across blocks in a county.)
  • Calculate percent vacant by county.
  • Calculate mean, min and max pop_sqmi (at the block level) by county.
  • Calculate the 90th percentile of pop_sqmi (at the block level) by county.
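As a hint for the last exercise, groupby objects have a quantile method. Here is a sketch on a toy DataFrame (the column names and values are made up, not the sf1 data):

```python
import pandas as pd

# Toy stand-in for block-level densities in two hypothetical counties
toy = pd.DataFrame({'county': ['A'] * 5 + ['B'] * 5,
                    'density': [10, 20, 30, 40, 50, 5, 10, 15, 20, 25]})

# 90th percentile of density within each county
# (default linear interpolation: A -> 46.0, B -> 23.0)
print(toy.groupby('county')['density'].quantile(0.9))
```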

Transforming Data with Groupby

In some cases you may want to apply a function to your data, by group. An example would be to normalize a column by the mean of each group. Say we wanted to subtract the mean population density of each county from the population density of each census block in that county. We could write a function to subtract the mean from each value, and then use the transform operation to apply it to each group:

In [13]:
def demean(arr):
    return arr - arr.mean()

Now we can apply this transformation to columns in our dataframe. As examples, let's 'demean' the pop_sqmi and pct_rent columns, subtracting the county-wide mean of these values from the block-specific values, so that the result has a mean of zero within each county.

To check the results, we print the means per county, then the original values for the first 5 rows, then the transformed results. We should be able to reproduce each transformed value by subtracting the appropriate county mean from the block value.

In [14]:
normalized = sf1[['pop_sqmi', 'pct_rent']].groupby(sf1['county_name']).transform(demean)
print(sf1[['pop_sqmi', 'pct_rent']].groupby(sf1['county_name']).mean())
print(sf1[['county_name','pop_sqmi', 'pct_rent']][:5])
print(normalized[:5])
                   pop_sqmi   pct_rent
county_name                           
Alameda        13753.632044  37.484398
Contra Costa    8081.846244  27.223329
Marin           6338.936151  33.169454
Napa            6245.985021  32.903209
San Francisco  28395.093537  51.927943
San Mateo      11011.638488  30.274104
Santa Clara    10598.597545  29.811290
Solano          7203.793038  34.598761
Sonoma          5415.876988  35.121698
  county_name     pop_sqmi    pct_rent
1     Alameda  3672.312839   80.000000
3     Alameda  3842.712166   70.000000
4     Alameda  4688.087441   75.000000
6     Alameda  4043.697112  100.000000
7     Alameda  4085.154574   33.333333
       pop_sqmi   pct_rent
1 -10081.319205  42.515602
3  -9910.919878  32.515602
4  -9065.544603  37.515602
6  -9709.934932  62.515602
7  -9668.477470  -4.151064

We can merge these transformed results onto the original DataFrame, and check the means of the original variables and the transformed ones. The transformed means should be arbitrarily close to zero.

In [15]:
sf2 = pd.merge(sf1,normalized, left_index=True, right_index=True)

sf2.groupby('county_name')[['pop_sqmi_x', 'pop_sqmi_y', 'pct_rent_x', 'pct_rent_y']].mean()
Out[15]:
pop_sqmi_x pop_sqmi_y pct_rent_x pct_rent_y
county_name
Alameda 13753.632044 -7.944529e-11 37.484398 6.856168e-14
Contra Costa 8081.846244 -7.040712e-12 27.223329 -3.800049e-13
Marin 6338.936151 1.461176e-12 33.169454 9.143583e-15
Napa 6245.985021 -1.304992e-13 32.903209 -1.685373e-14
San Francisco 28395.093537 6.833338e-12 51.927943 -1.349699e-15
San Mateo 11011.638488 8.192330e-12 30.274104 -7.197617e-14
Santa Clara 10598.597545 5.347880e-11 29.811290 -1.211290e-13
Solano 7203.793038 9.815285e-12 34.598761 -5.029091e-14
Sonoma 5415.876988 1.308518e-11 35.121698 -5.937600e-14

Apply is a method we have learned previously, which allows us to apply a function along an axis of a DataFrame. We can also combine apply with groupby to apply a function to each group. For example, the function 'top' below sorts a DataFrame and selects the top n rows from it. We provide defaults for its arguments: how many rows, and the column to use for the selection:

In [16]:
def top(df, n=5, column='pop_sqmi'):
    return df.sort_values(by=column, ascending=False).head(n)

Using this on the full dataset, and setting the number of rows and the column to get the top values for (in this case using pct_rent to override the default), we get the top 10 blocks in the region in terms of percentage rental:

In [17]:
top(sf1, n=10, column='pct_rent')
Out[17]:
logrecno blockfips state county county_name tract blkgrp block arealand P0010001 ... H0050006 H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi
55399 593009 060750352022006 06 075 San Francisco 035202 2 2006 5420 38 ... 0 0 0 100.0 0.000000 18.421053 68.421053 2.631579 7.142857 18158.587454
55283 592876 060750332043011 06 075 San Francisco 033204 3 3011 19179 160 ... 0 0 0 100.0 10.000000 18.125000 50.625000 23.750000 10.000000 21606.865843
55270 592862 060750332042000 06 075 San Francisco 033204 2 2000 3140 7 ... 0 0 0 100.0 0.000000 100.000000 0.000000 0.000000 0.000000 5773.858599
55271 592863 060750332042001 06 075 San Francisco 033204 2 2001 70847 1480 ... 2 0 2 100.0 6.824324 34.054054 47.567568 12.364865 8.064516 54105.074880
55273 592866 060750332043001 06 075 San Francisco 033204 3 3001 46198 41 ... 0 0 0 100.0 9.756098 14.634146 48.780488 19.512195 6.250000 2298.573704
55277 592870 060750332043005 06 075 San Francisco 033204 3 3005 15537 85 ... 0 0 0 100.0 9.411765 30.588235 44.705882 14.117647 30.769231 14169.336423
55278 592871 060750332043006 06 075 San Francisco 033204 3 3006 2013 2 ... 0 0 0 100.0 0.000000 0.000000 100.000000 0.000000 0.000000 2573.261798
55279 592872 060750332043007 06 075 San Francisco 033204 3 3007 22700 697 ... 2 0 3 100.0 4.304161 42.467719 39.885222 13.199426 12.745098 79525.182203
55280 592873 060750332043008 06 075 San Francisco 033204 3 3008 21981 211 ... 1 0 1 100.0 5.213270 26.540284 47.867299 19.431280 8.000000 24861.811019
55281 592874 060750332043009 06 075 San Francisco 033204 3 3009 15863 137 ... 0 0 0 100.0 6.569343 29.927007 48.175182 17.518248 12.121212 22368.300826

10 rows × 50 columns

Below we apply this with groupby and use the defaults for n and column, and it applies the function within each county and concatenates the results, producing the top 5 blocks on pop_sqmi for each county in the region.

In [18]:
sf1.groupby('county_name').apply(top)
Out[18]:
logrecno blockfips state county county_name tract blkgrp block arealand P0010001 ... H0050006 H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi
county_name
Alameda 8494 9073 060014311002003 06 001 Alameda 431100 2 2003 320 392 ... 0 0 1 100.000000 20.663265 10.714286 34.438776 37.244898 8.720930 3.172735e+06
4786 5117 060014419251004 06 001 Alameda 441925 1 1004 1240 801 ... 5 1 3 100.000000 4.119850 64.544320 22.721598 5.617978 4.591837 1.673049e+06
19003 20241 060014028002013 06 001 Alameda 402800 2 2013 941 370 ... 5 0 1 100.000000 39.729730 14.054054 36.486486 8.378378 17.307692 1.018380e+06
9130 9766 060014351042022 06 001 Alameda 435104 2 2022 1013 246 ... 0 0 0 100.000000 2.032520 25.609756 22.357724 42.682927 3.225806 6.289606e+05
9702 10388 060014369001004 06 001 Alameda 436900 1 1004 2148 431 ... 0 0 0 98.876404 3.712297 0.000000 24.593968 91.879350 1.111111 5.196857e+05
Contra Costa 37338 56116 060133551151034 06 013 Contra Costa 355115 1 1034 2386 336 ... 0 0 0 100.000000 12.797619 36.607143 36.904762 9.226190 3.726708 3.647259e+05
27063 45017 060133150001190 06 013 Contra Costa 315000 1 1190 900 126 ... 1 0 0 12.500000 3.174603 15.079365 65.873016 19.841270 5.882353 3.625983e+05
25628 43440 060133131021012 06 013 Contra Costa 313102 1 1012 4525 568 ... 0 0 2 100.000000 43.661972 3.345070 28.345070 31.338028 3.252033 3.251079e+05
27062 45016 060133150001170 06 013 Contra Costa 315000 1 1170 932 115 ... 0 0 1 18.421053 1.739130 11.304348 77.391304 7.826087 7.317073 3.195801e+05
27904 45921 060133340042011 06 013 Contra Costa 334004 2 2011 799 60 ... 0 0 0 38.461538 5.000000 21.666667 61.666667 11.666667 13.333333 1.944922e+05
Marin 46754 316228 060411302021012 06 041 Marin 130202 1 1012 797 129 ... 0 0 1 16.455696 0.775194 1.550388 95.348837 3.875969 3.658537 4.192076e+05
46762 316236 060411302021020 06 041 Marin 130202 1 1020 1411 101 ... 2 0 1 17.543860 0.000000 1.980198 92.079208 5.940594 6.557377 1.853925e+05
46748 316222 060411302021006 06 041 Marin 130202 1 1006 1744 84 ... 6 0 1 28.846154 0.000000 2.380952 94.047619 1.190476 16.129032 1.247471e+05
44846 314094 060411220001000 06 041 Marin 122000 1 1000 103966 4854 ... 0 0 0 NaN 40.667491 1.112485 39.658014 19.736300 NaN 1.209222e+05
45030 314297 060411060013037 06 041 Marin 106001 3 3037 3050 131 ... 0 0 0 100.000000 4.580153 13.740458 61.068702 21.374046 4.225352 1.112421e+05
Napa 47380 355511 060552010032003 06 055 Napa 201003 2 2003 442 50 ... 0 0 0 76.923077 26.000000 30.000000 22.000000 34.000000 0.000000 2.929851e+05
47402 355533 060552010032026 06 055 Napa 201003 2 2026 3779 125 ... 0 0 0 97.222222 20.000000 13.600000 31.200000 54.400000 0.000000 8.567042e+04
48190 356373 060552005051002 06 055 Napa 200505 1 1002 3916 86 ... 0 0 0 33.333333 0.000000 87.209302 3.488372 6.976744 0.000000 5.687921e+04
48904 357165 060552003013027 06 055 Napa 200301 3 3027 3639 62 ... 0 0 0 61.538462 1.612903 0.000000 41.935484 87.096774 0.000000 4.412730e+04
48388 356588 060552007043003 06 055 Napa 200704 3 3003 1819 30 ... 0 0 0 83.333333 10.000000 0.000000 43.333333 63.333333 0.000000 4.271558e+04
San Francisco 55230 592818 060750332031009 06 075 San Francisco 033203 1 1009 1502 325 ... 2 0 0 100.000000 5.538462 38.461538 44.923077 10.769231 11.764706 5.604168e+05
50114 587161 060750125011003 06 075 San Francisco 012501 1 1003 15488 1994 ... 0 0 8 100.000000 24.373119 12.186560 47.693079 16.048144 9.946237 3.334476e+05
56373 594081 060750611002003 06 075 San Francisco 061100 2 2003 7217 765 ... 2 0 3 99.415205 0.000000 99.215686 0.261438 0.000000 7.317073 2.745380e+05
49731 586709 060750105002008 06 075 San Francisco 010500 2 2008 4426 461 ... 31 0 24 99.716714 2.819957 24.078091 69.631236 5.422993 18.097448 2.697660e+05
50073 587114 060750124012001 06 075 San Francisco 012401 2 2001 15312 1498 ... 1 0 3 99.853801 6.208278 15.153538 35.647530 58.945260 8.921438 2.533831e+05
San Mateo 60283 625772 060816102021046 06 081 San Mateo 610202 1 1046 8954 946 ... 0 0 0 NaN 19.344609 8.668076 36.892178 42.389006 NaN 2.736351e+05
60573 626075 060816103042020 06 081 San Mateo 610304 2 2020 2596 210 ... 0 0 0 22.123894 0.000000 29.047619 67.142857 8.571429 1.739130 2.095137e+05
60370 625862 060816102032006 06 081 San Mateo 610203 2 2006 5177 262 ... 0 0 0 94.736842 2.290076 0.763359 35.114504 92.748092 6.557377 1.310753e+05
63541 629335 060816016011005 06 081 San Mateo 601601 1 1005 8475 397 ... 1 0 0 100.000000 5.541562 42.065491 31.989924 24.937028 4.575163 1.213245e+05
63840 629667 060816008003006 06 081 San Mateo 600800 3 3006 19408 872 ... 2 0 1 99.764706 10.550459 41.857798 32.224771 18.922018 8.405172 1.163680e+05
Santa Clara 68422 646761 060855080013002 06 085 Santa Clara 508001 3 3002 438 154 ... 1 0 1 99.270073 1.948052 10.389610 86.363636 1.298701 17.964072 9.106351e+05
72470 651175 060855113011005 06 085 Santa Clara 511301 1 1005 1653 254 ... 1 0 0 79.695431 0.000000 4.724409 94.881890 0.393701 14.718615 3.979776e+05
73117 651856 060855002002002 06 085 Santa Clara 500200 2 2002 4325 534 ... 0 0 0 NaN 9.550562 7.677903 29.213483 63.670412 NaN 3.197812e+05
75335 654276 060855031084002 06 085 Santa Clara 503108 4 4002 4625 523 ... 0 0 0 100.000000 18.164436 17.399618 46.462715 39.961759 0.578035 2.928786e+05
73520 652286 060855009022000 06 085 Santa Clara 500902 2 2000 20082 2051 ... 0 0 0 83.333333 11.116529 22.866894 50.901999 17.113603 0.000000 2.645187e+05
Solano 90767 701757 060952524014012 06 095 Solano 252401 4 4012 3266 364 ... 2 0 1 99.431818 29.670330 22.252747 27.197802 20.054945 12.000000 2.886576e+05
90347 701314 060952523122011 06 095 Solano 252312 2 2011 7742 440 ... 0 0 1 100.000000 17.500000 14.090909 49.318182 13.636364 4.435484 1.471964e+05
97423 708802 060952509001009 06 095 Solano 250900 1 1009 8139 293 ... 0 0 0 100.000000 62.798635 11.262799 12.969283 10.580205 7.894737 9.323830e+04
90421 701392 060952523133040 06 095 Solano 252313 3 3040 1171 38 ... 0 0 0 78.125000 0.000000 2.631579 97.368421 0.000000 13.513514 8.404743e+04
91166 702183 060952526081007 06 095 Solano 252608 1 1007 3484 97 ... 0 0 1 53.125000 47.422680 8.247423 10.309278 36.082474 15.789474 7.210931e+04
Sonoma 106920 719441 060971514023013 06 097 Sonoma 151402 3 3013 1506 98 ... 0 0 0 14.285714 5.102041 5.102041 40.816327 87.755102 4.545455 1.685384e+05
106731 719231 060971533001020 06 097 Sonoma 153300 1 1020 1292 40 ... 0 0 0 100.000000 0.000000 0.000000 85.000000 25.000000 5.555556 8.018539e+04
104931 717285 060971516023063 06 097 Sonoma 151602 3 3063 383 11 ... 0 0 0 0.000000 0.000000 0.000000 100.000000 0.000000 0.000000 7.438608e+04
104670 717002 060971514025003 06 097 Sonoma 151402 5 5003 7555 213 ... 0 0 1 100.000000 1.877934 5.164319 61.502347 30.046948 3.260870 7.302018e+04
106928 719450 060971514024003 06 097 Sonoma 151402 4 4003 1038 29 ... 0 0 0 100.000000 0.000000 0.000000 24.137931 100.000000 0.000000 7.235997e+04

45 rows × 50 columns

Here we pass arguments to the function to set n and the column to select the top value from.

In [19]:
sf1.groupby('county_name').apply(top, n=1, column='arealand')
Out[19]:
logrecno blockfips state county county_name tract blkgrp block arealand P0010001 ... H0050006 H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi
county_name
Alameda 12544 13434 060014301021000 06 001 Alameda 430102 1 1000 31296322 166 ... 0 0 3 45.283019 1.204819 3.614458 84.337349 6.024096 7.017544 13.737653
Contra Costa 37789 56579 060133551121158 06 013 Contra Costa 355112 1 1158 19441925 141 ... 2 0 3 23.809524 2.127660 0.709220 75.886525 24.822695 12.500000 18.783547
Marin 42773 311828 060411322003008 06 041 Marin 132200 3 3008 48884156 112 ... 14 6 0 91.666667 0.000000 0.892857 64.285714 66.071429 36.842105 5.934002
Napa 47066 355191 060552018001000 06 055 Napa 201800 1 1000 93228090 13 ... 4 1 2 83.333333 0.000000 0.000000 92.307692 0.000000 53.846154 0.361156
San Francisco 56139 593833 060750604001013 06 075 San Francisco 060400 1 1013 1036262 3 ... 0 0 0 NaN 0.000000 33.333333 66.666667 0.000000 NaN 7.498069
San Mateo 57660 622919 060816138001035 06 081 San Mateo 613800 1 1035 28976148 105 ... 0 0 4 55.172414 0.952381 0.000000 47.619048 60.952381 19.444444 9.385262
Santa Clara 66451 644610 060855135001202 06 085 Santa Clara 513500 1 1202 277483160 62 ... 3 0 1 16.666667 3.225806 0.000000 82.258065 1.612903 14.285714 0.578699
Solano 92499 703629 060952527026009 06 095 Solano 252702 6 6009 30318073 14 ... 1 0 0 33.333333 0.000000 35.714286 64.285714 0.000000 14.285714 1.195981
Sonoma 99632 711527 060971542022000 06 097 Sonoma 154202 2 2000 35658559 106 ... 14 0 1 36.585366 0.000000 0.000000 89.622642 14.150943 28.070175 7.699098

9 rows × 50 columns

Experimenting with Rental Listings Merged with SF1

Let's read the geocoded rental listings for the Bay Area to begin. We will make sure the fips_block column is read as a string dtype so we can merge properly with the census data, where the block code is a string with leading zeros.

In [20]:
rentals = pd.read_csv('data/sfbay_geocoded.csv',
                      usecols=['rent', 'bedrooms', 'sqft', 'fips_block', 'longitude', 'latitude'],
                      dtype={'fips_block': str})
rentals[:5]
Out[20]:
rent bedrooms sqft longitude latitude fips_block
0 4500.0 2.0 1200.0 -122.438300 37.745000 060750216002015
1 2650.0 2.0 1040.0 -122.008131 37.353699 060855085053008
2 3100.0 2.0 1000.0 -122.439743 37.731584 060750311005011
3 1850.0 1.0 792.0 -122.234294 37.491715 060816101001026
4 1325.0 1.0 642.0 -122.087751 37.923448 060133400021004

And merge it with the census data using the FIPS block codes, which are named differently in the two DataFrames.

In [21]:
rentals_sf1 = pd.merge(rentals, sf1, left_on='fips_block', right_on='blockfips')
rentals_sf1[:10]
Out[21]:
rent bedrooms sqft longitude latitude fips_block logrecno blockfips state county ... H0050006 H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi
0 4500.0 2.0 1200.0 -122.438300 37.745000 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
1 6250.0 3.0 1215.0 -122.438300 37.745000 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
2 6250.0 3.0 1215.0 -122.438300 37.745000 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
3 6650.0 3.0 2900.0 -122.440088 37.745296 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
4 2600.0 1.0 615.0 -122.440088 37.745296 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
5 2615.0 1.0 615.0 -122.440088 37.745296 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
6 2615.0 1.0 615.0 -122.440088 37.745296 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
7 2600.0 1.0 615.0 -122.440088 37.745296 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
8 3200.0 NaN 900.0 -122.438300 37.745000 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053
9 2600.0 1.0 615.0 -122.440088 37.745296 060750216002015 589261 060750216002015 06 075 ... 3 0 5 88.888889 23.445826 10.301954 46.358792 23.268206 5.882353 24936.524053

10 rows × 56 columns

Now we can begin looking at this merged dataset. Let's start by computing mean rents by county.

In [22]:
county_rents = rentals_sf1.groupby(rentals_sf1['county_name'])[['rent']].mean().sort_values(by='rent', ascending=False)
county_rents
Out[22]:
rent
county_name
San Francisco 3746.737974
Marin 3277.287562
San Mateo 2857.011248
Santa Clara 2665.584276
Alameda 2235.584293
Napa 2117.797398
Contra Costa 1955.075348
Sonoma 1805.140760
Solano 1359.965551
In [23]:
%matplotlib inline

county_rents.plot(kind='bar', color='g')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c8cb3c8>

Those patterns generally conform to our expectations, with San Francisco highest and Solano lowest.

Let's examine these patterns by county and number of bedrooms. We can just use two columns in the groupby to do this.

In [24]:
rentals_sf1.groupby(['county_name', 'bedrooms'])[['rent']].mean()
Out[24]:
rent
county_name bedrooms
Alameda 1.0 1862.396037
2.0 2284.538409
3.0 2715.874074
4.0 3315.346895
5.0 4820.488152
6.0 5898.392857
7.0 7362.500000
8.0 8400.000000
Contra Costa 1.0 1596.991079
2.0 1874.641106
3.0 2390.508197
4.0 2831.713831
5.0 3547.841808
6.0 3258.636364
7.0 7172.500000
8.0 3200.000000
Marin 1.0 2209.910211
2.0 2998.448575
3.0 4476.835979
4.0 5837.602273
5.0 5145.666667
6.0 5250.000000
Napa 1.0 1387.362745
2.0 1846.914110
3.0 2447.333333
4.0 2998.593750
5.0 3009.000000
San Francisco 1.0 3099.179065
2.0 4225.756604
3.0 5151.170096
... ...
5.0 7525.666667
6.0 7750.000000
7.0 9950.000000
8.0 9900.000000
San Mateo 1.0 2266.792148
2.0 2975.375058
3.0 3888.506952
4.0 5292.336683
5.0 5633.618182
6.0 7260.000000
Santa Clara 1.0 2143.095373
2.0 2721.721386
3.0 3451.772947
4.0 4280.278307
5.0 5642.350211
6.0 6704.545455
7.0 6028.571429
8.0 2550.000000
Solano 1.0 1143.470514
2.0 1277.712283
3.0 1640.565678
4.0 2043.212766
5.0 2491.250000
7.0 3000.000000
Sonoma 1.0 1403.868078
2.0 1830.113469
3.0 2284.307937
4.0 2738.465347
5.0 3037.500000
6.0 3295.000000

61 rows × 1 columns

That output is a bit too long, since there are rows for many bedroom counts (and few units have more than 6 bedrooms!). Let's trim it by looking only at units with fewer than 4 bedrooms.

In [25]:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean()
Out[25]:
rent
county_name bedrooms
Alameda 1.0 1862.396037
2.0 2284.538409
3.0 2715.874074
Contra Costa 1.0 1596.991079
2.0 1874.641106
3.0 2390.508197
Marin 1.0 2209.910211
2.0 2998.448575
3.0 4476.835979
Napa 1.0 1387.362745
2.0 1846.914110
3.0 2447.333333
San Francisco 1.0 3099.179065
2.0 4225.756604
3.0 5151.170096
San Mateo 1.0 2266.792148
2.0 2975.375058
3.0 3888.506952
Santa Clara 1.0 2143.095373
2.0 2721.721386
3.0 3451.772947
Solano 1.0 1143.470514
2.0 1277.712283
3.0 1640.565678
Sonoma 1.0 1403.868078
2.0 1830.113469
3.0 2284.307937

Let's plot a bar chart of these rents by county and bedroom.

In [33]:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean().plot(kind='bar')
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x12435b668>

That bar chart is not ideal: it is too small, and it would be nicer to use a separate color for each number of bedrooms. Notice that we have a hierarchical index (county, then bedrooms within county). We can use unstack on this to make it easier to work with for the bar charts, or even to look at as a table:

In [32]:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean().unstack()
Out[32]:
rent
bedrooms 1.0 2.0 3.0
county_name
Alameda 1862.396037 2284.538409 2715.874074
Contra Costa 1596.991079 1874.641106 2390.508197
Marin 2209.910211 2998.448575 4476.835979
Napa 1387.362745 1846.914110 2447.333333
San Francisco 3099.179065 4225.756604 5151.170096
San Mateo 2266.792148 2975.375058 3888.506952
Santa Clara 2143.095373 2721.721386 3451.772947
Solano 1143.470514 1277.712283 1640.565678
Sonoma 1403.868078 1830.113469 2284.307937
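On a small hypothetical frame, the effect of unstack is easy to see: the inner index level becomes a column level, turning the long table wide.

```python
import pandas as pd

# Made-up counties and rents, just to illustrate the reshape
df = pd.DataFrame({
    'county_name': ['A', 'A', 'B', 'B'],
    'bedrooms':    [1, 2, 1, 2],
    'rent':        [1000, 1500, 900, 1400],
})

long = df.groupby(['county_name', 'bedrooms'])[['rent']].mean()
wide = long.unstack()   # bedrooms move from the row index into the columns
print(wide)
```

After unstacking, the columns form a MultiIndex of ('rent', bedrooms), with one row per county.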

Now we can plot a bar chart with the unstacked data, and add a title and set the figure size.

In [35]:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].mean().unstack().plot(kind='bar', figsize=(14,6), title='Average Rents by County and Bedrooms')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1249842b0>

Pandas also provides a syntax for doing this that is familiar to spreadsheet users: pivot_table. You can see below that using pivot_table achieves exactly the same result.

In [36]:
pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms']<4], values='rent', index=['county_name'], columns=['bedrooms'])
Out[36]:
bedrooms 1.0 2.0 3.0
county_name
Alameda 1862.396037 2284.538409 2715.874074
Contra Costa 1596.991079 1874.641106 2390.508197
Marin 2209.910211 2998.448575 4476.835979
Napa 1387.362745 1846.914110 2447.333333
San Francisco 3099.179065 4225.756604 5151.170096
San Mateo 2266.792148 2975.375058 3888.506952
Santa Clara 2143.095373 2721.721386 3451.772947
Solano 1143.470514 1277.712283 1640.565678
Sonoma 1403.868078 1830.113469 2284.307937
In [38]:
pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms']<4], values='rent', index=['county_name'], columns=['bedrooms']).plot(kind='bar', figsize=(14,6), title='Average Rents by County and Bedrooms')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x124d6d358>

In addition to providing a convenience interface to groupby, pivot_table can also add partial totals, or margins.

In [39]:
pd.pivot_table(rentals_sf1[rentals_sf1['bedrooms']<4], values='rent', index=['county_name'], columns=['bedrooms'], margins=True)
Out[39]:
bedrooms 1.0 2.0 3.0 All
county_name
Alameda 1862.396037 2284.538409 2715.874074 2161.681932
Contra Costa 1596.991079 1874.641106 2390.508197 1856.947814
Marin 2209.910211 2998.448575 4476.835979 3061.734740
Napa 1387.362745 1846.914110 2447.333333 1999.539130
San Francisco 3099.179065 4225.756604 5151.170096 3861.692972
San Mateo 2266.792148 2975.375058 3888.506952 2801.402538
Santa Clara 2143.095373 2721.721386 3451.772947 2590.512772
Solano 1143.470514 1277.712283 1640.565678 1300.941814
Sonoma 1403.868078 1830.113469 2284.307937 1775.032962
All 2083.173271 2547.455154 3271.394562 2472.355483
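A minimal sketch of pivot_table with margins, on made-up values: the 'All' row and column hold the mean over each column, each row, and the grand total.

```python
import pandas as pd

# Hypothetical data; aggfunc defaults to the mean
df = pd.DataFrame({
    'county_name': ['A', 'A', 'B', 'B'],
    'bedrooms':    [1, 2, 1, 2],
    'rent':        [1000, 1500, 900, 1400],
})

pt = pd.pivot_table(df, values='rent', index='county_name',
                    columns='bedrooms', margins=True)
print(pt)
```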

Although mean is the default aggregation in pivot_table, you can pass other functions via the aggfunc argument, such as len, to get a frequency crosstab with the count of observations in each cell of the table, as well as the margins. Let's remove the filter on number of bedrooms to see the full frequency distribution.

In [40]:
pd.pivot_table(rentals_sf1, values='rent', index=['county_name'], columns=['bedrooms'], aggfunc=len, margins=True)
Out[40]:
bedrooms 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 All
county_name
Alameda 4290.0 4361.0 1350.0 467.0 211.0 28.0 4.0 1.0 10712.0
Contra Costa 2242.0 3508.0 976.0 629.0 177.0 33.0 4.0 1.0 7570.0
Marin 568.0 807.0 378.0 176.0 15.0 2.0 NaN NaN 1946.0
Napa 102.0 163.0 195.0 64.0 5.0 NaN NaN NaN 529.0
San Francisco 2245.0 2120.0 729.0 207.0 15.0 4.0 1.0 1.0 5322.0
San Mateo 2598.0 2141.0 935.0 199.0 55.0 5.0 NaN NaN 5933.0
Santa Clara 7413.0 8973.0 2484.0 945.0 237.0 11.0 7.0 1.0 20071.0
Solano 797.0 1498.0 472.0 188.0 32.0 NaN 10.0 NaN 2997.0
Sonoma 614.0 1225.0 315.0 101.0 4.0 1.0 NaN NaN 2260.0
All 20869.0 24796.0 7834.0 2976.0 751.0 84.0 26.0 4.0 57340.0

A simpler way to generate a crosstabulation is also built in to pandas: the crosstab function.

In [41]:
pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True)
Out[41]:
bedrooms 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 All
county_name
Alameda 4290 4361 1350 467 211 28 4 1 10712
Contra Costa 2242 3508 976 629 177 33 4 1 7570
Marin 568 807 378 176 15 2 0 0 1946
Napa 102 163 195 64 5 0 0 0 529
San Francisco 2245 2120 729 207 15 4 1 1 5322
San Mateo 2598 2141 935 199 55 5 0 0 5933
Santa Clara 7413 8973 2484 945 237 11 7 1 20071
Solano 797 1498 472 188 32 0 10 0 2997
Sonoma 614 1225 315 101 4 1 0 0 2260
All 20869 24796 7834 2976 751 84 26 4 57340

Besides the simpler syntax, the other advantage of crosstab for doing this is that you can normalize the crosstab, dividing all the cell values by the grand total. This tells us the fraction of the region's total listings that are in each combination of county and number of bedrooms.

In [42]:
pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize=True)
Out[42]:
bedrooms 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 All
county_name
Alameda 0.074817 0.076055 0.023544 0.008144 0.003680 0.000488 0.000070 0.000017 0.186815
Contra Costa 0.039100 0.061179 0.017021 0.010970 0.003087 0.000576 0.000070 0.000017 0.132020
Marin 0.009906 0.014074 0.006592 0.003069 0.000262 0.000035 0.000000 0.000000 0.033938
Napa 0.001779 0.002843 0.003401 0.001116 0.000087 0.000000 0.000000 0.000000 0.009226
San Francisco 0.039152 0.036972 0.012714 0.003610 0.000262 0.000070 0.000017 0.000017 0.092815
San Mateo 0.045309 0.037339 0.016306 0.003471 0.000959 0.000087 0.000000 0.000000 0.103471
Santa Clara 0.129281 0.156488 0.043321 0.016481 0.004133 0.000192 0.000122 0.000017 0.350035
Solano 0.013900 0.026125 0.008232 0.003279 0.000558 0.000000 0.000174 0.000000 0.052267
Sonoma 0.010708 0.021364 0.005494 0.001761 0.000070 0.000017 0.000000 0.000000 0.039414
All 0.363952 0.432438 0.136624 0.051901 0.013097 0.001465 0.000453 0.000070 1.000000

We have the option to normalize on the index, which is the county_name in this case, so we see the fraction of the county listings that are of each number of bedrooms. The rows sum to 1.

In [43]:
pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize='index')
Out[43]:
bedrooms 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
county_name
Alameda 0.400485 0.407114 0.126027 0.043596 0.019698 0.002614 0.000373 0.000093
Contra Costa 0.296169 0.463408 0.128930 0.083091 0.023382 0.004359 0.000528 0.000132
Marin 0.291881 0.414697 0.194245 0.090442 0.007708 0.001028 0.000000 0.000000
Napa 0.192817 0.308129 0.368620 0.120983 0.009452 0.000000 0.000000 0.000000
San Francisco 0.421834 0.398346 0.136979 0.038895 0.002818 0.000752 0.000188 0.000188
San Mateo 0.437890 0.360863 0.157593 0.033541 0.009270 0.000843 0.000000 0.000000
Santa Clara 0.369339 0.447063 0.123761 0.047083 0.011808 0.000548 0.000349 0.000050
Solano 0.265933 0.499833 0.157491 0.062729 0.010677 0.000000 0.003337 0.000000
Sonoma 0.271681 0.542035 0.139381 0.044690 0.001770 0.000442 0.000000 0.000000
All 0.363952 0.432438 0.136624 0.051901 0.013097 0.001465 0.000453 0.000070

The counterpart is to normalize by column, in which case we see the fraction of 1 bedroom listings in the region that are in each county, and so on for different bedroom counts.

In [44]:
pd.crosstab(rentals_sf1['county_name'], rentals_sf1['bedrooms'], margins=True, normalize='columns')
Out[44]:
bedrooms 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 All
county_name
Alameda 0.205568 0.175875 0.172326 0.156922 0.280959 0.333333 0.153846 0.25 0.186815
Contra Costa 0.107432 0.141474 0.124585 0.211358 0.235686 0.392857 0.153846 0.25 0.132020
Marin 0.027217 0.032546 0.048251 0.059140 0.019973 0.023810 0.000000 0.00 0.033938
Napa 0.004888 0.006574 0.024891 0.021505 0.006658 0.000000 0.000000 0.00 0.009226
San Francisco 0.107576 0.085498 0.093056 0.069556 0.019973 0.047619 0.038462 0.25 0.092815
San Mateo 0.124491 0.086345 0.119352 0.066868 0.073236 0.059524 0.000000 0.00 0.103471
Santa Clara 0.355216 0.361873 0.317079 0.317540 0.315579 0.130952 0.269231 0.25 0.350035
Solano 0.038191 0.060413 0.060250 0.063172 0.042610 0.000000 0.384615 0.00 0.052267
Sonoma 0.029422 0.049403 0.040209 0.033938 0.005326 0.011905 0.000000 0.00 0.039414
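The three normalize options can be compared side by side on a tiny hypothetical example (the county labels and counts are made up):

```python
import pandas as pd

county = pd.Series(['A', 'A', 'A', 'B'], name='county')
beds = pd.Series([1, 1, 2, 2], name='bedrooms')

# Each cell as a fraction of the grand total
frac_all = pd.crosstab(county, beds, normalize=True)
# Each row sums to 1 (within-county distribution)
frac_row = pd.crosstab(county, beds, normalize='index')
# Each column sums to 1 (within-bedroom-count distribution)
frac_col = pd.crosstab(county, beds, normalize='columns')
print(frac_all, frac_row, frac_col, sep='\n\n')
```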

What if we want to look at more statistics than just mean? We can combine several aggregation methods and compute them at one time.

In [45]:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby(['county_name', 'bedrooms'])[['rent']].agg(['mean', 'std', 'min', 'max'])
Out[45]:
rent
mean std min max
county_name bedrooms
Alameda 1.0 1862.396037 393.300923 500.0 4950.0
2.0 2284.538409 534.999766 800.0 4998.0
3.0 2715.874074 806.606556 350.0 7500.0
Contra Costa 1.0 1596.991079 375.069796 496.0 3125.0
2.0 1874.641106 505.719679 689.0 3895.0
3.0 2390.508197 775.568225 793.0 5995.0
Marin 1.0 2209.910211 618.305211 430.0 5600.0
2.0 2998.448575 950.448793 1325.0 9500.0
3.0 4476.835979 1423.768448 1657.0 9000.0
Napa 1.0 1387.362745 224.556583 750.0 1800.0
2.0 1846.914110 500.127260 850.0 4050.0
3.0 2447.333333 997.897308 1250.0 9775.0
San Francisco 1.0 3099.179065 720.636564 725.0 7500.0
2.0 4225.756604 1284.521670 750.0 9800.0
3.0 5151.170096 1738.374226 1600.0 10250.0
San Mateo 1.0 2266.792148 411.746679 880.0 5000.0
2.0 2975.375058 612.783070 832.0 6500.0
3.0 3888.506952 1192.532121 1800.0 10000.0
Santa Clara 1.0 2143.095373 451.835127 550.0 5202.0
2.0 2721.721386 664.622785 900.0 9065.0
3.0 3451.772947 1013.365050 650.0 9999.0
Solano 1.0 1143.470514 214.261987 590.0 1849.0
2.0 1277.712283 238.327602 799.0 2300.0
3.0 1640.565678 408.047945 950.0 4500.0
Sonoma 1.0 1403.868078 280.041020 550.0 3200.0
2.0 1830.113469 363.751789 995.0 4500.0
3.0 2284.307937 545.081841 1200.0 5000.0
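Passing a list of function names to agg computes each statistic per group in one pass; here is the pattern on made-up data:

```python
import pandas as pd

# Hypothetical rents in two counties
df = pd.DataFrame({
    'county_name': ['A', 'A', 'B', 'B'],
    'rent':        [1000, 1500, 900, 1100],
})

# One column per (variable, statistic) pair
stats = df.groupby('county_name')[['rent']].agg(['mean', 'std', 'min', 'max'])
print(stats)
```

The result has a MultiIndex on the columns, so individual cells are addressed with tuples like ('rent', 'mean').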

Remember the 'top' function we defined earlier? We can use it here to get the top rental listing in each county, for each bedroom size.

In [46]:
rentals_sf1.groupby(['county_name', 'bedrooms']).apply(top, n=1, column='rent')
Out[46]:
rent bedrooms sqft longitude latitude fips_block logrecno blockfips state county ... H0050006 H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi
county_name bedrooms
Alameda 1.0 29483 4950.0 1.0 800.0 -122.280400 37.836500 060014251041007 18519 060014251041007 06 001 ... 0 0 0 66.666667 57.142857 14.285714 28.571429 0.000000 0.000000 1965.089530
2.0 36889 4998.0 2.0 1400.0 -122.264292 37.879443 060014216004005 2094 060014216004005 06 001 ... 6 0 1 62.264151 0.000000 15.841584 76.237624 8.910891 14.516129 13043.569584
3.0 58544 7500.0 3.0 2500.0 -122.226934 37.830734 060014261001000 25365 060014261001000 06 001 ... 0 0 0 10.000000 1.298701 23.376623 74.025974 1.298701 6.250000 2159.258077
4.0 43448 8200.0 4.0 3247.0 -122.229400 37.828817 060014261001006 25371 060014261001006 06 001 ... 0 0 1 6.000000 1.351351 29.054054 58.783784 6.081081 5.660377 10261.222401
5.0 22134 7800.0 5.0 1856.0 -122.266723 37.863312 060014235001001 3192 060014235001001 06 001 ... 0 0 0 NaN 0.000000 0.000000 100.000000 0.000000 NaN 907.176182
6.0 45415 10200.0 6.0 2600.0 -122.255180 37.860126 060014236011001 3257 060014236011001 06 001 ... 0 0 1 68.493151 4.787234 12.765957 72.340426 5.851064 3.947368 21795.780842
7.0 44877 9200.0 7.0 2600.0 -122.250000 37.857100 060014238003015 3446 060014238003015 06 001 ... 0 0 0 10.000000 0.000000 5.882353 94.117647 0.000000 0.000000 10746.838174
8.0 59696 8400.0 8.0 3000.0 -122.266181 37.853644 060014239011007 3469 060014239011007 06 001 ... 0 0 2 58.536585 7.228916 7.228916 79.518072 6.024096 6.818182 14735.006100
Contra Costa 1.0 27811 3125.0 1.0 539.0 -122.073700 37.954000 060133230004016 49621 060133230004016 06 013 ... 0 0 1 0.000000 0.000000 20.689655 79.310345 6.896552 10.000000 6952.022584
2.0 21614 3895.0 2.0 1050.0 -122.128766 37.856406 060133522011000 48681 060133522011000 06 013 ... 1 0 26 59.819121 2.034525 21.331689 66.399507 9.556104 5.953827 3136.152191
3.0 48922 5995.0 3.0 2800.0 -122.103853 37.873058 060133512003007 47756 060133512003007 06 013 ... 0 0 5 6.250000 0.549451 7.142857 87.637363 5.219780 3.030303 1547.798629
4.0 47943 8900.0 4.0 2014.0 -122.294780 37.912424 060133901003006 57494 060133901003006 06 013 ... 0 0 3 27.500000 4.938272 13.580247 71.604938 6.172840 6.976744 4824.288921
5.0 55529 8500.0 5.0 5000.0 -121.933726 37.778898 060133451122037 50456 060133451122037 06 013 ... 0 0 0 7.407407 2.200000 45.200000 48.400000 4.200000 1.459854 2651.698323
6.0 42145 6400.0 6.0 4667.0 -122.024503 37.869687 060133461011024 44514 060133461011024 06 013 ... 0 0 0 9.677419 0.000000 12.244898 84.693878 1.020408 0.000000 2843.398651
7.0 21197 8630.0 7.0 5000.0 -122.298500 37.915600 060133901003002 57490 060133901003002 06 013 ... 0 0 1 16.666667 13.461538 15.384615 69.230769 9.615385 4.000000 3687.825192
8.0 45048 3200.0 8.0 4400.0 -121.862755 37.991022 060133131031020 43501 060133131031020 06 013 ... 0 0 4 15.476190 11.870504 16.007194 48.561151 32.553957 3.448276 5212.752588
Marin 1.0 40634 5600.0 1.0 850.0 -122.580085 37.861099 060411311002063 311311 060411311002063 06 041 ... 10 0 6 28.000000 2.688172 2.688172 91.935484 2.150538 15.966387 937.628859
2.0 46206 9500.0 2.0 2000.0 -122.453098 37.873702 060411242002006 316065 060411242002006 06 041 ... 12 0 1 23.255814 0.000000 3.267974 94.117647 1.307190 16.504854 4418.986150
3.0 58235 9000.0 3.0 3000.0 -122.533016 37.935281 060411200001008 313544 060411200001008 06 041 ... 0 0 0 11.111111 0.000000 0.000000 100.000000 19.047619 0.000000 4554.492380
4.0 55933 10000.0 4.0 3000.0 -122.491484 37.897525 060411241004016 316051 060411241004016 06 041 ... 0 0 0 15.384615 9.375000 6.250000 84.375000 0.000000 7.142857 3744.279015
5.0 36041 8000.0 5.0 2965.0 -122.547233 37.944722 060411200004018 313639 060411200004018 06 041 ... 0 0 0 84.269663 0.684932 8.904110 85.616438 3.424658 6.315789 5596.741578
6.0 54010 5500.0 6.0 4500.0 -122.612793 38.101920 060411031003005 312560 060411031003005 06 041 ... 0 0 0 4.545455 1.621622 2.162162 90.810811 7.027027 0.000000 3398.886162
Napa 1.0 60137 1800.0 1.0 600.0 -122.290361 38.299445 060552002021007 355859 060552002021007 06 055 ... 0 0 0 66.666667 0.000000 14.285714 85.714286 7.142857 0.000000 3619.829490
2.0 11868 4050.0 2.0 1200.0 -122.473290 38.499868 060552016023023 357788 060552016023023 06 055 ... 5 0 2 26.086957 0.000000 0.000000 94.230769 15.384615 32.352941 3790.902018
3.0 56172 9775.0 3.0 4000.0 -122.445300 38.649300 060552018001087 355278 060552018001087 06 055 ... 3 0 0 45.000000 0.000000 0.000000 67.346939 36.734694 13.043478 7.209965
4.0 59306 7500.0 4.0 3500.0 -122.143250 38.342195 060552014031023 357617 060552014031023 06 055 ... 5 0 1 10.344828 0.000000 0.000000 100.000000 4.918033 19.444444 7.667160
5.0 60288 3950.0 5.0 4000.0 -122.242298 38.422125 060552014012000 357550 060552014012000 06 055 ... 6 0 5 9.600000 2.768166 2.422145 90.311419 7.266436 10.714286 27.513183
San Francisco 1.0 53478 7500.0 1.0 1526.0 -122.395813 37.791153 060750615001009 594215 060750615001009 06 075 ... 0 0 0 100.000000 27.272727 18.181818 54.545455 0.000000 0.000000 1371.749627
2.0 16350 9800.0 2.0 1900.0 -122.409100 37.800200 060750106003004 586748 060750106003004 06 075 ... 3 0 0 91.044776 1.265823 46.202532 47.468354 3.164557 12.987013 51364.140078
3.0 48749 10250.0 3.0 3598.0 -122.448787 37.759003 060750305001000 591563 060750305001000 06 075 ... 0 0 2 7.894737 0.000000 15.853659 78.048780 6.097561 9.523810 9056.674456
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5.0 37111 10000.0 5.0 2700.0 -122.449183 37.790572 060750133001006 587484 060750133001006 06 075 ... 0 0 1 14.285714 0.000000 11.475410 88.524590 1.639344 3.448276 10888.302412
6.0 56555 9850.0 6.0 1654.0 -122.417557 37.798429 060750109001004 586820 060750109001004 06 075 ... 8 0 4 63.815789 1.229508 12.295082 83.196721 10.245902 12.643678 40964.352888
7.0 56553 9950.0 7.0 1894.0 -122.417727 37.798408 060750109001004 586820 060750109001004 06 075 ... 8 0 4 63.815789 1.229508 12.295082 83.196721 10.245902 12.643678 40964.352888
8.0 49849 9900.0 8.0 2500.0 -122.453516 37.764395 060750301021000 591271 060750301021000 06 075 ... 6 0 5 44.140625 2.218115 13.123845 77.634011 6.284658 5.882353 6726.337329
San Mateo 1.0 50707 5000.0 1.0 1050.0 -122.220976 37.455238 060816113001009 628573 060816113001009 06 081 ... 0 0 0 19.047619 0.000000 0.000000 97.826087 6.521739 4.545455 6614.448590
2.0 45175 6500.0 2.0 1660.0 -122.192476 37.425639 060816130002002 625012 060816130002002 06 081 ... 0 0 0 0.000000 0.000000 7.407407 82.716049 13.580247 0.000000 1953.070130
3.0 34850 10000.0 3.0 2800.0 -122.186400 37.439600 060816128001007 624939 060816128001007 06 081 ... 0 0 4 15.909091 0.000000 16.000000 74.400000 12.000000 8.333333 4020.821431
4.0 34871 10000.0 4.0 3200.0 -122.186400 37.439600 060816128001007 624939 060816128001007 06 081 ... 0 0 4 15.909091 0.000000 16.000000 74.400000 12.000000 8.333333 4020.821431
5.0 34855 10000.0 5.0 3000.0 -122.186400 37.439600 060816128001007 624939 060816128001007 06 081 ... 0 0 4 15.909091 0.000000 16.000000 74.400000 12.000000 8.333333 4020.821431
6.0 54429 9500.0 6.0 3195.0 -122.192152 37.466122 060816115002009 623360 060816115002009 06 081 ... 0 0 0 0.000000 0.000000 18.867925 75.471698 0.000000 6.250000 1882.413593
Santa Clara 1.0 8833 5202.0 1.0 792.0 -122.016435 37.332699 060855081013057 646985 060855081013057 06 085 ... 0 0 1 8.333333 0.000000 50.000000 38.095238 2.380952 20.000000 1655.750495
2.0 1228 9065.0 2.0 2750.0 -121.935600 37.316300 060855021011001 647181 060855021011001 06 085 ... 0 0 1 88.888889 2.531646 12.658228 59.493671 45.569620 2.173913 5552.183111
3.0 59425 9999.0 3.0 2267.0 -122.137241 37.434522 060855114004009 651397 060855114004009 06 085 ... 0 0 0 18.181818 0.000000 8.333333 91.666667 0.000000 0.000000 7054.785155
4.0 50057 10000.0 4.0 2700.0 -122.105673 37.381340 060855103001018 647505 060855103001018 06 085 ... 0 0 2 3.125000 0.000000 17.553191 75.531915 3.191489 3.030303 4371.287764
5.0 59076 10000.0 5.0 2500.0 -122.141243 37.430643 060855114004024 651412 060855114004024 06 085 ... 0 0 1 12.500000 0.000000 3.846154 88.461538 0.000000 11.111111 9717.126696
6.0 26178 10000.0 6.0 3485.0 -122.134174 37.406939 060855106003006 650612 060855106003006 06 085 ... 3 0 1 31.250000 0.332779 29.118136 65.224626 4.159734 2.608696 4067.041311
7.0 845 9900.0 7.0 2000.0 -122.146875 37.438339 060855114001010 651340 060855114001010 06 085 ... 0 0 2 7.692308 0.000000 6.896552 86.206897 0.000000 13.333333 3187.745183
8.0 31771 2550.0 8.0 774.0 -122.003174 37.321574 060855080041001 646829 060855080041001 06 085 ... 0 0 1 20.833333 4.878049 57.317073 25.609756 6.097561 4.000000 13447.667701
Solano 1.0 22728 1849.0 1.0 862.0 -122.149733 38.060886 060952521071006 707812 060952521071006 06 095 ... 2 0 2 69.327731 7.104796 12.255773 65.541741 11.722913 7.031250 6673.057886
2.0 57504 2300.0 2.0 1000.0 -122.215571 38.066721 060952506042000 708407 060952506042000 06 095 ... 0 0 2 12.195122 21.264368 19.540230 52.873563 4.022989 8.888889 3699.496881
3.0 33286 4500.0 3.0 2300.0 -122.186894 38.214379 060952522014011 703152 060952522014011 06 095 ... 1 0 1 9.090909 7.954545 15.909091 73.863636 9.090909 5.714286 18.738056
4.0 60319 3495.0 4.0 3050.0 -122.154064 38.086228 060952521043005 707728 060952521043005 06 095 ... 0 0 1 6.756757 6.200000 22.600000 58.000000 8.000000 1.333333 2798.165950
5.0 50440 4000.0 5.0 4088.0 -122.159041 38.250363 060952522012030 702575 060952522012030 06 095 ... 0 0 0 0.000000 12.820513 12.820513 71.794872 0.000000 0.000000 537.934272
7.0 38787 3200.0 7.0 2518.0 -121.988700 38.384700 060952532041008 706468 060952532041008 06 095 ... 0 0 1 27.150538 7.075472 6.415094 67.735849 18.867925 4.615385 1126.346001
Sonoma 1.0 41429 3200.0 1.0 875.0 -122.554700 38.416800 060971505002007 721202 060971505002007 06 097 ... 2 0 2 16.393443 0.000000 1.600000 88.000000 11.200000 6.153846 2305.129337
2.0 57190 4500.0 2.0 1800.0 -122.420998 38.242798 060971501003025 721599 060971501003025 06 097 ... 1 0 0 28.571429 0.000000 0.000000 62.857143 40.000000 6.666667 680.782396
3.0 43260 5000.0 3.0 1800.0 -123.037300 38.330900 060971543024017 715003 060971543024017 06 097 ... 63 1 6 39.568345 0.000000 4.391892 83.108108 20.945946 39.301310 126.122201
4.0 58319 4850.0 4.0 3932.0 -122.678661 38.232416 060971511003001 714770 060971511003001 06 097 ... 7 0 4 40.000000 0.954654 0.477327 93.078759 2.863962 10.447761 162.388156
5.0 41918 3750.0 5.0 3320.0 -122.586050 38.244525 060971506112012 713335 060971506112012 06 097 ... 0 0 0 6.250000 0.719424 7.194245 88.489209 5.035971 4.000000 3300.527449
6.0 60559 3295.0 6.0 3400.0 -122.637792 38.463417 060971526002043 718340 060971526002043 06 097 ... 0 0 0 5.263158 0.000000 15.942029 82.608696 5.072464 0.000000 6260.283117

61 rows × 56 columns
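The top function is defined earlier in the session and not repeated here; a minimal version consistent with the keyword arguments in this call (n and column) might look like the following sketch, though the original definition may differ:

```python
import pandas as pd

def top(df, n=5, column='rent'):
    # Return the n rows with the largest values in `column`
    return df.sort_values(column, ascending=False).head(n)

# Made-up listings, just to show the groupby + apply pattern
listings = pd.DataFrame({
    'county_name': ['A', 'A', 'A', 'B', 'B'],
    'rent':        [1000, 2500, 1800, 900, 1600],
})

# Highest-rent listing within each county
result = listings.groupby('county_name').apply(top, n=1, column='rent')
print(result)
```

apply runs the function once per group and stacks the returned rows under a hierarchical index of (group key, original row index).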

Exploring Correlations in the Data

Pandas provides simple ways of computing correlation coefficients among the columns in your DataFrame. Calling corr() on a full DataFrame produces a large correlation table, which is hard to navigate and includes many pairs you are unlikely to care about.

In [47]:
rentals_sf1.corr()
Out[47]:
rent bedrooms sqft longitude latitude logrecno blkgrp arealand P0010001 P0020001 ... H0050006 H0050007 H0050008 pct_rent pct_black pct_asian pct_white pct_hisp pct_vacant pop_sqmi
rent 1.000000 0.481234 0.616653 -0.189488 -0.208624 0.185653 -0.011133 -0.002037 -0.116857 -0.116857 ... 0.087468 -0.015596 0.010655 -0.137076 -0.222322 0.073785 0.191424 -0.290056 0.030535 0.020475
bedrooms 0.481234 1.000000 0.819533 0.035085 0.065029 -0.039371 0.028106 0.051052 -0.120218 -0.120218 ... -0.078880 -0.014990 -0.034599 -0.330290 -0.000024 -0.014917 0.040864 -0.035118 -0.049176 -0.125802
sqft 0.616653 0.819533 1.000000 0.009677 0.041737 -0.016225 0.008702 0.075221 -0.113589 -0.113589 ... -0.044212 -0.009650 -0.031854 -0.331541 -0.045550 -0.010680 0.090546 -0.105976 -0.018599 -0.132978
longitude -0.189488 0.035085 0.009677 1.000000 -0.548065 -0.130319 -0.098020 -0.043462 0.094442 0.094442 ... -0.109002 0.030678 -0.099010 -0.067935 -0.023140 0.235135 -0.284017 0.156813 -0.135010 -0.072233
latitude -0.208624 0.065029 0.041737 -0.548065 1.000000 -0.306019 0.069752 0.080858 -0.105149 -0.105149 ... 0.033156 0.010343 0.072333 -0.013489 0.265711 -0.407035 0.253876 -0.003537 0.123641 -0.011506
logrecno 0.185653 -0.039371 -0.016225 -0.130319 -0.306019 1.000000 0.050819 -0.011586 0.020363 0.020363 ... 0.022854 -0.029056 -0.085817 0.069324 -0.245896 0.140300 -0.032165 0.055755 -0.018806 0.069862
blkgrp -0.011133 0.028106 0.008702 -0.098020 0.069752 0.050819 1.000000 -0.001086 -0.018809 -0.018809 ... 0.041022 -0.009702 -0.023974 0.015964 -0.039342 0.005227 0.018223 -0.009363 0.032782 0.030694
arealand -0.002037 0.051052 0.075221 -0.043462 0.080858 -0.011586 -0.001086 1.000000 0.069781 0.069781 ... 0.079438 0.011564 0.030822 -0.084488 -0.038112 -0.063535 0.094328 -0.026265 0.022772 -0.087006
P0010001 -0.116857 -0.120218 -0.113589 0.094442 -0.105149 0.020363 -0.018809 0.069781 1.000000 1.000000 ... 0.188282 0.086167 0.175744 0.278962 0.018317 0.203882 -0.231680 0.056364 0.008534 0.206183
P0020001 -0.116857 -0.120218 -0.113589 0.094442 -0.105149 0.020363 -0.018809 0.069781 1.000000 1.000000 ... 0.188282 0.086167 0.175744 0.278962 0.018317 0.203882 -0.231680 0.056364 0.008534 0.206183
P0020002 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
P0020003 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
P0020004 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
P0020005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
P0020006 -0.116857 -0.120218 -0.113589 0.094442 -0.105149 0.020363 -0.018809 0.069781 1.000000 1.000000 ... 0.188282 0.086167 0.175744 0.278962 0.018317 0.203882 -0.231680 0.056364 0.008534 0.206183
P0030001 -0.116857 -0.120218 -0.113589 0.094442 -0.105149 0.020363 -0.018809 0.069781 1.000000 1.000000 ... 0.188282 0.086167 0.175744 0.278962 0.018317 0.203882 -0.231680 0.056364 0.008534 0.206183
P0030002 -0.066322 -0.093093 -0.073550 -0.026435 0.017446 0.004837 -0.014263 0.132031 0.854742 0.854742 ... 0.239367 0.047394 0.185614 0.173508 -0.083516 -0.071235 0.110595 0.012326 0.044099 0.126198
P0030003 -0.201005 -0.064711 -0.088316 0.025098 0.127130 -0.166137 -0.008567 0.008576 0.537023 0.537023 ... 0.034097 0.031352 0.152642 0.229637 0.520310 -0.039777 -0.289330 0.119065 0.064829 0.126377
P0030004 -0.165203 -0.090338 -0.115262 0.082768 -0.052450 0.051150 0.040074 0.024509 0.598688 0.598688 ... 0.027137 0.062006 0.083645 0.218691 0.069526 -0.005272 -0.176332 0.270649 -0.029922 0.125312
P0030005 -0.026636 -0.102957 -0.088583 0.166245 -0.259710 0.080939 -0.024014 0.004551 0.787608 0.787608 ... 0.152383 0.064970 0.109526 0.237828 -0.071530 0.555367 -0.424008 -0.135453 -0.034732 0.214667
P0030006 -0.143584 -0.066546 -0.090380 0.059618 -0.051114 -0.031959 0.005681 0.030011 0.447145 0.447145 ... -0.014347 0.070441 0.056157 0.151555 0.151273 0.021010 -0.230884 0.178386 -0.036454 0.070447
P0030007 -0.192221 -0.074380 -0.106523 0.122080 -0.059414 0.035007 0.009119 0.008469 0.641693 0.641693 ... -0.018246 0.139980 0.050649 0.237256 0.071188 -0.066213 -0.215288 0.479125 -0.030471 0.133771
P0030008 -0.171300 -0.118709 -0.115239 0.094721 -0.027325 -0.047382 -0.038164 0.049658 0.892282 0.892282 ... 0.120928 0.076637 0.163941 0.257777 0.111917 0.080781 -0.213042 0.102429 0.021904 0.150980
P0040001 -0.116857 -0.120218 -0.113589 0.094442 -0.105149 0.020363 -0.018809 0.069781 1.000000 1.000000 ... 0.188282 0.086167 0.175744 0.278962 0.018317 0.203882 -0.231680 0.056364 0.008534 0.206183
P0040002 -0.063643 -0.117265 -0.095832 0.063237 -0.105363 0.010651 -0.024778 0.080094 0.956944 0.956944 ... 0.235095 0.057461 0.193467 0.247470 -0.005573 0.283091 -0.211665 -0.121733 0.020981 0.204262
P0040003 -0.199757 -0.079945 -0.112001 0.132892 -0.063399 0.035869 0.003009 0.017469 0.712001 0.712001 ... 0.001203 0.121817 0.063953 0.245416 0.068926 -0.067680 -0.189243 0.465083 -0.025047 0.129977
H00010001 -0.092694 -0.167299 -0.150461 0.041489 -0.072383 0.002331 -0.015419 0.056135 0.914833 0.914833 ... 0.319312 0.070336 0.264189 0.343574 0.024092 0.190429 -0.173990 -0.041511 0.078374 0.269096
H0030001 -0.092694 -0.167299 -0.150461 0.041489 -0.072383 0.002331 -0.015419 0.056135 0.914833 0.914833 ... 0.319312 0.070336 0.264189 0.343574 0.024092 0.190429 -0.173990 -0.041511 0.078374 0.269096
H0030002 -0.098466 -0.165445 -0.150321 0.054147 -0.084385 0.007127 -0.011555 0.056401 0.921130 0.921130 ... 0.281515 0.067513 0.229187 0.338230 0.017712 0.197087 -0.180456 -0.035022 0.017050 0.264373
H0030003 -0.012294 -0.120409 -0.095463 -0.074482 0.049902 -0.036622 -0.040349 0.033087 0.523654 0.523654 ... 0.500332 0.066522 0.443568 0.257462 0.065759 0.066542 -0.057760 -0.077547 0.535213 0.206244
H0040001 -0.098466 -0.165445 -0.150321 0.054147 -0.084385 0.007127 -0.011555 0.056401 0.921130 0.921130 ... 0.281515 0.067513 0.229187 0.338230 0.017712 0.197087 -0.180456 -0.035022 0.017050 0.264373
H0040002 0.003280 0.062530 0.087009 0.043090 -0.005836 -0.054497 -0.067187 0.155032 0.471111 0.471111 ... 0.157948 -0.003670 0.185045 -0.328432 -0.069112 0.061866 0.029861 -0.115989 -0.046065 -0.097770
H0040003 0.048578 0.066642 0.089497 -0.070018 0.001907 -0.021529 -0.018293 0.175840 0.292418 0.292418 ... 0.158014 -0.006291 0.153030 -0.336150 -0.101425 -0.012773 0.124782 -0.128222 -0.051179 -0.130538
H0040004 -0.117207 -0.216751 -0.209725 0.052274 -0.093749 0.029572 0.012649 -0.007570 0.848889 0.848889 ... 0.248214 0.078426 0.179623 0.531463 0.054046 0.202412 -0.226591 0.013496 0.040480 0.346601
H0050001 -0.012294 -0.120409 -0.095463 -0.074482 0.049902 -0.036622 -0.040349 0.033087 0.523654 0.523654 ... 0.500332 0.066522 0.443568 0.257462 0.065759 0.066542 -0.057760 -0.077547 0.535213 0.206244
H0050002 -0.083147 -0.128702 -0.119278 -0.024962 0.033684 0.000885 -0.032613 0.006208 0.615918 0.615918 ... 0.232303 0.094438 0.236658 0.356730 0.081402 0.054431 -0.089416 -0.011010 0.308367 0.205073
H0050003 0.005707 -0.099994 -0.077529 0.006925 -0.052954 0.013185 -0.007182 -0.002627 0.280286 0.280286 ... 0.439021 0.060043 0.164427 0.222865 -0.008653 0.064074 -0.027387 -0.060657 0.203786 0.103148
H0050004 0.024617 -0.014218 -0.000359 -0.017959 0.016459 -0.055300 -0.051306 0.003200 0.087417 0.087417 ... 0.069392 -0.004086 0.051370 -0.019252 0.047000 0.029440 -0.026150 -0.040381 0.399948 0.041855
H0050005 0.058702 -0.016908 -0.002828 -0.065529 0.020884 0.002664 -0.025691 0.031964 0.114451 0.114451 ... 0.199205 -0.002102 0.055245 -0.088754 -0.004807 0.042804 -0.000322 -0.074880 0.182020 0.067681
H0050006 0.087468 -0.078880 -0.044212 -0.109002 0.033156 0.022854 0.041022 0.079438 0.188282 0.188282 ... 1.000000 0.014174 0.205880 0.104297 -0.059490 0.048751 0.055863 -0.138841 0.233885 0.178326
H0050007 -0.015596 -0.014990 -0.009650 0.030678 0.010343 -0.029056 -0.009702 0.011564 0.086167 0.086167 ... 0.014174 1.000000 -0.007758 0.071269 -0.001634 0.042330 -0.052361 0.013735 0.041681 0.010471
H0050008 0.010655 -0.034599 -0.031854 -0.099010 0.072333 -0.085817 -0.023974 0.030822 0.175744 0.175744 ... 0.205880 -0.007758 1.000000 0.066034 0.046249 0.006824 -0.013860 -0.049633 0.392266 0.036943
pct_rent -0.137076 -0.330290 -0.331541 -0.067935 -0.013489 0.069324 0.015964 -0.084488 0.278962 0.278962 ... 0.104297 0.071269 0.066034 1.000000 0.176972 0.039507 -0.218981 0.187086 0.107243 0.354959
pct_black -0.222322 -0.000024 -0.045550 -0.023140 0.265711 -0.245896 -0.039342 -0.038112 0.018317 0.018317 ... -0.059490 -0.001634 0.046249 0.176972 1.000000 -0.173398 -0.369745 0.084477 0.103343 0.052245
pct_asian 0.073785 -0.014917 -0.010680 0.235135 -0.407035 0.140300 0.005227 -0.063535 0.203882 0.203882 ... 0.048751 0.042330 0.006824 0.039507 -0.173398 1.000000 -0.694488 -0.325704 -0.061251 0.156089
pct_white 0.191424 0.040864 0.090546 -0.284017 0.253876 -0.032165 0.018223 0.094328 -0.231680 -0.231680 ... 0.055863 -0.052361 -0.013860 -0.218981 -0.369745 -0.694488 1.000000 -0.209620 0.031902 -0.177767
pct_hisp -0.290056 -0.035118 -0.105976 0.156813 -0.003537 0.055755 -0.009363 -0.026265 0.056364 0.056364 ... -0.138841 0.013735 -0.049633 0.187086 0.084477 -0.325704 -0.209620 1.000000 -0.048769 0.001433
pct_vacant 0.030535 -0.049176 -0.018599 -0.135010 0.123641 -0.018806 0.032782 0.022772 0.008534 0.008534 ... 0.233885 0.041681 0.392266 0.107243 0.103343 -0.061251 0.031902 -0.048769 1.000000 0.039410
pop_sqmi 0.020475 -0.125802 -0.132978 -0.072233 -0.011506 0.069862 0.030694 -0.087006 0.206183 0.206183 ... 0.178326 0.010471 0.036943 0.354959 0.052245 0.156089 -0.177767 0.001433 0.039410 1.000000

49 rows × 49 columns

It is easy to compute correlation coefficients for a subset of columns.

In [48]:
rentals_sf1[['rent', 'sqft']].corr()
Out[48]:
rent sqft
rent 1.000000 0.616653
sqft 0.616653 1.000000

And this method can be combined with groupby to compute correlation tables by group.

In [49]:
rentals_sf1.groupby('county_name')[['rent', 'sqft']].corr()
Out[49]:
rent sqft
county_name
Alameda rent 1.000000 0.784616
sqft 0.784616 1.000000
Contra Costa rent 1.000000 0.688621
sqft 0.688621 1.000000
Marin rent 1.000000 0.799108
sqft 0.799108 1.000000
Napa rent 1.000000 0.711565
sqft 0.711565 1.000000
San Francisco rent 1.000000 0.740660
sqft 0.740660 1.000000
San Mateo rent 1.000000 0.816791
sqft 0.816791 1.000000
Santa Clara rent 1.000000 0.771930
sqft 0.771930 1.000000
Solano rent 1.000000 0.816569
sqft 0.816569 1.000000
Sonoma rent 1.000000 0.823559
sqft 0.823559 1.000000
In [50]:
rentals_sf1[rentals_sf1['bedrooms']<4].groupby('bedrooms')[['rent', 'sqft']].corr()
Out[50]:
rent sqft
bedrooms
1.0 rent 1.000000 0.350723
sqft 0.350723 1.000000
2.0 rent 1.000000 0.485617
sqft 0.485617 1.000000
3.0 rent 1.000000 0.445744
sqft 0.445744 1.000000
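The groupby + corr combination can be sketched on hypothetical data; in county 'A' below, rent is exactly twice sqft, so the within-group correlation is 1.0:

```python
import pandas as pd

# Made-up listings in two counties
df = pd.DataFrame({
    'county_name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'rent': [1000, 1500, 2000, 900, 1300, 1000],
    'sqft': [500, 750, 1000, 600, 900, 650],
})

# One small correlation matrix per county, stacked under a MultiIndex
by_county = df.groupby('county_name')[['rent', 'sqft']].corr()
print(by_county)
```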

Quantile and Bucket Analysis

Below we use the cut function to create categories from ranges of a variable. In this example cut produces 4 equal-width intervals. If we used qcut instead, it would create quantile-based bins, each containing roughly the same number of observations.

In [55]:
factor = pd.cut(rentals_sf1['sqft'], 4)
factor[:5]
Out[55]:
0    (216.064, 1455.0]
1    (216.064, 1455.0]
2    (216.064, 1455.0]
3     (2689.0, 3923.0]
4    (216.064, 1455.0]
Name: sqft, dtype: category
Categories (4, interval[float64]): [(216.064, 1455.0] < (1455.0, 2689.0] < (2689.0, 3923.0] < (3923.0, 5157.0]]

Define a function to get grouped statistics.

In [52]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

Apply the function to the DataFrame for a specified column.

In [53]:
grouped = rentals_sf1['rent'].groupby(factor)
grouped.apply(get_stats).unstack()
Out[53]:
count max mean min
sqft
(216.064, 1455.0] 53054.0 8500.0 2317.507521 350.0
(1455.0, 2689.0] 7003.0 10200.0 3742.959732 295.0
(2689.0, 3923.0] 983.0 10250.0 5334.371312 750.0
(3923.0, 5157.0] 163.0 10000.0 5913.877301 1300.0
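The bucket-analysis pattern end to end, on made-up data (a pd.Series return works like the dict used above, and unstacks the same way):

```python
import pandas as pd

# Hypothetical listings; rent rises with sqft
df = pd.DataFrame({
    'sqft': [400, 800, 1200, 1600, 2000, 2400],
    'rent': [900, 1100, 1500, 1900, 2300, 2700],
})

# Two equal-width sqft bins; pd.qcut(df['sqft'], 2) would use quantiles instead
bins = pd.cut(df['sqft'], 2)

def get_stats(group):
    return pd.Series({'min': group.min(), 'max': group.max(),
                      'count': group.count(), 'mean': group.mean()})

# Group rent by the sqft bins and pivot the stats into columns
stats = df['rent'].groupby(bins).apply(get_stats).unstack()
print(stats)
```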

Your turn to experiment

Try practicing these techniques on your own, to do the following:

  • Calculate the mean sqft of rental listings by county
  • Calculate the standard deviation (std) of sqft of rental listings by county and bedroom
  • Add a new column with a normalized sqft, subtracting the mean sqft by bedroom from each listing's sqft
  • Compute correlation coefficients among rent, sqft, pct_white, pct_black, pct_asian and pct_hisp, by county and for the region
  • Redo the statistical profile on rents by categories of sqft range using 10 quantiles rather than 4 equal bins