In this class we'll learn how to slice and dice our data using pandas, a Python package created specifically for data analysis.
We'll be doing all this in Jupyter Notebooks, a supremely nifty, handy-dandy tool for writing, experimenting with, and editing your code.
A few keyboard shortcuts:

* 'A' creates a new cell above; 'B' creates a new cell below.
* Press 'D' twice to delete a cell.
* 'Option+Enter' runs a cell and creates a new one.
* What you need might pop up in the dropdown.
* Select a cell without being in it and press 'M' to switch it to Markdown.
To use pandas, we have to tell Python to pull the bundle of tools into this specific Python session. At the same time, we call it 'pd,' because that allows us to call on those tools without having to write out 'pandas' every single time.
import pandas as pd
Let's look at our data.
What we want to do:
... we have to get the data into this thing.
How? Simple.
Create a DATAFRAME using the 'read_csv' tool that comes with pandas.
df = pd.read_csv('data/facilities.csv')
You've now created a pandas dataframe. To see what it looks like, show the first five rows of the variable 'df' that you just created.
df.head()
| | facid | fac_type | capacity | fac_name | fac_address | city_state_zip | Unnamed: 6 | owner | operator |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 385008 | NF | 96.0 | Presbyterian Community Care Center | 1085 N Oregon St | Ontario, OR 97914 | NaN | Presbyterian Nursing Home, Inc. | Presbyterian Nursing Home, Inc. |
| 1 | 385010 | NF | 159.0 | Laurelhurst Village Rehabilitation Center | 3060 SE Stark St | Portland, OR 97214 | NaN | Laurelhurst Operations, LLC | Laurelhurst Operations, LLC |
| 2 | 385015 | NF | 128.0 | Regency Gresham Nursing & Rehabilitation Center | 5905 SE Powell Valley Rd | Gresham, OR 97080 | NaN | Regency Gresham Nursing & Rehabilitation Cente... | Regency Pacific Management, LLC |
| 3 | 385018 | NF | 98.0 | Providence Benedictine Nursing Center | 540 South Main St | Mt. Angel, OR 97362 | NaN | Providence Health & Services - Oregon | Providence Health & Services - Oregon |
| 4 | 385024 | NF | 91.0 | Avamere Health Services of Rogue Valley | 625 Stevens St | Medford, OR 97504 | NaN | Medford Operations, LLC | Medford Operations, LLC |
***'.head()' selects the first five rows of a dataframe. We'll be using it throughout the notebook because displaying an entire dataframe takes up a lot of space.***
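If you don't have the class's 'data/facilities.csv' handy, here's a minimal sketch that builds a stand-in dataframe by hand. The column names mirror the notebook's CSV; the rows are made up.

```python
import pandas as pd

# A tiny stand-in for the facilities data: same column names, invented rows
df = pd.DataFrame({
    'facid': ['385008', '385010', '50R062'],
    'fac_type': ['NF', 'NF', 'RCF'],
    'capacity': [96.0, 159.0, 7.0],
})
df.head()  # shows the first five rows (here, all three)
```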
Select a column:
df[['fac_type']].head()
| | fac_type |
|---|---|
| 0 | NF |
| 1 | NF |
| 2 | NF |
| 3 | NF |
| 4 | NF |
Select two columns:
df[['fac_type','capacity']].head()
| | fac_type | capacity |
|---|---|---|
| 0 | NF | 96.0 |
| 1 | NF | 159.0 |
| 2 | NF | 128.0 |
| 3 | NF | 98.0 |
| 4 | NF | 91.0 |
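One detail worth knowing: single brackets hand you back a Series, while double brackets hand back a dataframe. A quick sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'fac_type': ['NF', 'NF', 'RCF'],
                   'capacity': [96.0, 159.0, 7.0]})

series = df['fac_type']              # single brackets: a Series
frame = df[['fac_type']]             # double brackets: a one-column dataframe
both = df[['fac_type', 'capacity']]  # a list of names: a multi-column dataframe
```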
Count your rows:
df.count()
facid             644
fac_type          642
capacity          642
fac_name          644
fac_address       644
city_state_zip    644
Unnamed: 6          0
owner             644
operator          644
dtype: int64
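Notice '.count()' counts non-null values, which is why 'fac_type' and 'capacity' come back with 642 instead of 644: two rows are blank there. A tiny sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'fac_type': ['NF', None, 'RCF'],
                   'capacity': [96.0, 159.0, np.nan]})

counts = df.count()  # non-null values per column
```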
Filtering dataframes is a funky business.
First, you create a list of 'Trues' and 'Falses' (a boolean array) based on the fields that interest you, like so:
df['fac_type']=='NF'
0       True
1       True
2       True
3       True
4       True
        ...
639    False
640    False
641    False
642    False
643    False
Name: fac_type, Length: 644, dtype: bool
Then, you run that list against the dataframe. Pandas will spit back every row where there's a True. Here's what it looks like:
df[df['fac_type']=='NF'].head()
| | facid | fac_type | capacity | fac_name | fac_address | city_state_zip | Unnamed: 6 | owner | operator |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 385008 | NF | 96.0 | Presbyterian Community Care Center | 1085 N Oregon St | Ontario, OR 97914 | NaN | Presbyterian Nursing Home, Inc. | Presbyterian Nursing Home, Inc. |
| 1 | 385010 | NF | 159.0 | Laurelhurst Village Rehabilitation Center | 3060 SE Stark St | Portland, OR 97214 | NaN | Laurelhurst Operations, LLC | Laurelhurst Operations, LLC |
| 2 | 385015 | NF | 128.0 | Regency Gresham Nursing & Rehabilitation Center | 5905 SE Powell Valley Rd | Gresham, OR 97080 | NaN | Regency Gresham Nursing & Rehabilitation Cente... | Regency Pacific Management, LLC |
| 3 | 385018 | NF | 98.0 | Providence Benedictine Nursing Center | 540 South Main St | Mt. Angel, OR 97362 | NaN | Providence Health & Services - Oregon | Providence Health & Services - Oregon |
| 4 | 385024 | NF | 91.0 | Avamere Health Services of Rogue Valley | 625 Stevens St | Medford, OR 97504 | NaN | Medford Operations, LLC | Medford Operations, LLC |
Here's what filtering on multiple fields looks like. 'AND' is '&' and 'OR' is '|'. Just make sure to put each condition inside parentheses.
df[(df['fac_type']=='NF') & (df['capacity']>10)].head()
| | facid | fac_type | capacity | fac_name | fac_address | city_state_zip | Unnamed: 6 | owner | operator |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 385008 | NF | 96.0 | Presbyterian Community Care Center | 1085 N Oregon St | Ontario, OR 97914 | NaN | Presbyterian Nursing Home, Inc. | Presbyterian Nursing Home, Inc. |
| 1 | 385010 | NF | 159.0 | Laurelhurst Village Rehabilitation Center | 3060 SE Stark St | Portland, OR 97214 | NaN | Laurelhurst Operations, LLC | Laurelhurst Operations, LLC |
| 2 | 385015 | NF | 128.0 | Regency Gresham Nursing & Rehabilitation Center | 5905 SE Powell Valley Rd | Gresham, OR 97080 | NaN | Regency Gresham Nursing & Rehabilitation Cente... | Regency Pacific Management, LLC |
| 3 | 385018 | NF | 98.0 | Providence Benedictine Nursing Center | 540 South Main St | Mt. Angel, OR 97362 | NaN | Providence Health & Services - Oregon | Providence Health & Services - Oregon |
| 4 | 385024 | NF | 91.0 | Avamere Health Services of Rogue Valley | 625 Stevens St | Medford, OR 97504 | NaN | Medford Operations, LLC | Medford Operations, LLC |
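The same pattern, sketched on a tiny made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'fac_type': ['NF', 'RCF', 'NF'],
                   'capacity': [96.0, 7.0, 5.0]})

# Each condition gets its own parentheses, joined with & (AND)
mask = (df['fac_type'] == 'NF') & (df['capacity'] > 10)
filtered = df[mask]
```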
How do you count the number of results you get when you filter?
df[(df['fac_type']=='NF') & (df['capacity']>10)].count()[0]
135
Piece of cake.
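If '.count()[0]' (which grabs the first column's count) feels cryptic, 'len()' on the filtered dataframe gets you the same number. A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'fac_type': ['NF', 'RCF', 'NF'],
                   'capacity': [96.0, 7.0, 50.0]})

# len() of the filtered dataframe is the number of matching rows
n = len(df[(df['fac_type'] == 'NF') & (df['capacity'] > 10)])
```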
df.sort_values('capacity').head()
| | facid | fac_type | capacity | fac_name | fac_address | city_state_zip | Unnamed: 6 | owner | operator |
|---|---|---|---|---|---|---|---|---|---|
| 110 | 385265 | NF | 5.0 | Mary's Woods at Marylhurst | 17360 Holy Names Drive | Lake Oswego, OR 97034 | NaN | Mary's Woods at Marylhurst, Inc. | Mary's Woods at Marylhurst, Inc. |
| 67 | 385200 | NF | 6.0 | Willamette View Health Center | 13145 SE River Rd | Milwaukie, OR 97222 | NaN | Willamette View, Inc. dba | Willamette View, Inc. |
| 207 | 50R062 | RCF | 7.0 | Hubbard Residential Care Facility | 647 Junction Rd | Glendale, OR 97442 | NaN | Norma Ann Hubbard | Norma Ann Hubbard |
| 339 | 50R386 | RCF | 7.0 | Thanksgiving House | 184 N 2nd St | St. Helens, OR 97051 | NaN | Cecile Molden | Cecile Molden |
| 317 | 50R364 | RCF | 8.0 | Summit Springs Village MCU | 120 S. Church St. | Condon, OR 97823 | NaN | Summit Springs Village Corporation | Summit Springs Village Corporation |
If you like bigger numbers...
df.sort_values('capacity', ascending=False).head()
| | facid | fac_type | capacity | fac_name | fac_address | city_state_zip | Unnamed: 6 | owner | operator |
|---|---|---|---|---|---|---|---|---|---|
| 93 | 385240 | NF | 214.0 | Marian Estates | 390 Church St | Sublimity, OR 97385 | NaN | Ernmaur, Inc. | Marian Estates Support Services |
| 259 | 50R293 | RCF | 186.0 | Miramont Pointe | 11520 SE Sunnyside Rd | Clackamas, OR 97015 | NaN | MP, LLC | MP, LLC |
| 571 | 70M080 | ALF | 180.0 | Rose Schnitzer Manor | 6140 SW Boundary St | Portland, OR 97221 | NaN | Robison Jewish Home | Robison Jewish Home |
| 20 | 385112 | NF | 180.0 | West Hills Health & Rehabilitation Center | 5701 SW Multnomah Blvd | Portland, OR 97219 | NaN | West Hills Convalescent Center Limited Partner... | West Hills Convalescent Center Limited Partner... |
| 50 | 385166 | NF | 165.0 | Maryville Nursing Home | 14645 SW Farmington Rd | Beaverton, OR 97007 | NaN | Sisters of St. Mary of Oregon Maryville Corp. | Sisters of St. Mary of Oregon Maryville Corp. |
Simply a breeze.
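Both sorting moves, sketched on made-up values:

```python
import pandas as pd

df = pd.DataFrame({'fac_name': ['A', 'B', 'C'],
                   'capacity': [96.0, 7.0, 159.0]})

smallest_first = df.sort_values('capacity')                  # ascending by default
biggest_first = df.sort_values('capacity', ascending=False)  # descending
```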
df.mean()
capacity      57.61838
Unnamed: 6         NaN
dtype: float64
df.median()
capacity      53.0
Unnamed: 6     NaN
dtype: float64
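One hedge: newer versions of pandas won't silently skip text columns when averaging a whole dataframe; you may need to pass 'numeric_only=True'. A sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({'fac_type': ['NF', 'RCF'],
                   'capacity': [96.0, 8.0]})

# numeric_only=True tells pandas to skip the text columns
avg = df.mean(numeric_only=True)
med = df.median(numeric_only=True)
```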
The concept behind grouping is a little gnarly (I think).
First, you split the dataframe up into chunks, creating a grouped object. Like so:
df.groupby('fac_type')
<pandas.core.groupby.DataFrameGroupBy object at 0x1101fa390>
Then, you tell pandas what you want to do to that group of slices. Let's get a count:
df.groupby('fac_type').count()
| fac_type | facid | capacity | fac_name | fac_address | city_state_zip | Unnamed: 6 | owner | operator |
|---|---|---|---|---|---|---|---|---|
| ALF | 221 | 221 | 221 | 221 | 221 | 0 | 221 | 221 |
| NF | 137 | 137 | 137 | 137 | 137 | 0 | 137 | 137 |
| RCF | 284 | 283 | 284 | 284 | 284 | 0 | 284 | 284 |
OK, so we get a count of each type of facility. As you see, it gives us a result for each column that it can. It'll do the same for other kinds of calculations.
Let's get a mean.
df.groupby('fac_type').mean()
| fac_type | capacity | Unnamed: 6 |
|---|---|---|
| ALF | 66.610860 | NaN |
| NF | 81.759124 | NaN |
| RCF | 38.759717 | NaN |
And a sum.
df.groupby('fac_type').sum()
| fac_type | capacity | Unnamed: 6 |
|---|---|---|
| ALF | 14721.0 | NaN |
| NF | 11201.0 | NaN |
| RCF | 10969.0 | NaN |
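The split-then-calculate pattern, sketched on a toy dataframe with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'fac_type': ['NF', 'NF', 'RCF'],
                   'capacity': [100.0, 60.0, 10.0]})

grouped = df.groupby('fac_type')   # split into one chunk per fac_type
sums = grouped['capacity'].sum()   # then calculate something per chunk
means = grouped['capacity'].mean()
```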
Reading an Excel file works almost the same way; the tool is called 'read_excel':

df2 = pd.read_excel('data/complaints.xlsx')
df2.head()
| | complaint_id | facility_id | facility_type | incident_date | notes | severity | fine | Facility Invest Results Abuse | Facility Invest Results Rule | Type Of Abuse |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OT105179A | 385008 | NF | 8/31/2010 | RV reported asking staff to change him/her pri... | 2.0 | 0.0 | Not Substantiated | Substantiated | NaN |
| 1 | OT105179B | 385008 | NF | 8/31/2010 | RV reported staff answered his/her call light,... | 2.0 | 0.0 | Not Substantiated | Substantiated | NaN |
| 2 | OT105179C | 385008 | NF | 8/31/2010 | RV reported an unknown "not RV's regular staff... | 2.0 | 0.0 | Not Substantiated | Substantiated | NaN |
| 3 | OR0000656000 | 385008 | NF | 12/21/2010 | Resident 1 was admitted with multiple diagnose... | 3.0 | 0.0 | Substantiated | Substantiated | Neglect |
| 4 | OT105397 | 385008 | NF | 9/17/2010 | RV was admitted 9/17/10 with multiple diagnose... | 2.0 | 0.0 | Not Substantiated | Substantiated | NaN |
df2.groupby('facility_id').count().head()
| facility_id | complaint_id | facility_type | incident_date | notes | severity | fine | Facility Invest Results Abuse | Facility Invest Results Rule | Type Of Abuse |
|---|---|---|---|---|---|---|---|---|---|
| 385008 | 9 | 9 | 9 | 9 | 9 | 7 | 9 | 9 | 3 |
| 385010 | 8 | 8 | 8 | 8 | 8 | 7 | 8 | 8 | 4 |
| 385015 | 17 | 17 | 17 | 17 | 17 | 11 | 17 | 17 | 4 |
| 385018 | 17 | 17 | 17 | 17 | 17 | 5 | 17 | 17 | 5 |
| 385024 | 40 | 40 | 40 | 40 | 40 | 28 | 40 | 40 | 12 |
We don't need all these columns, because they show more or less the same thing. So we select just the index column and the one with unique IDs for each complaint.
(You'll notice that facility_id became the index in the grouping. We undo that with 'reset_index.')
df2.groupby('facility_id').count().reset_index()[['facility_id','complaint_id']].head()
| | facility_id | complaint_id |
|---|---|---|
| 0 | 385008 | 9 |
| 1 | 385010 | 8 |
| 2 | 385015 | 17 |
| 3 | 385018 | 17 |
| 4 | 385024 | 40 |
Let's create a new dataframe.
fac_comp = df2.groupby('facility_id').count().reset_index()[['facility_id','complaint_id']]
fac_comp.head()
| | facility_id | complaint_id |
|---|---|---|
| 0 | 385008 | 9 |
| 1 | 385010 | 8 |
| 2 | 385015 | 17 |
| 3 | 385018 | 17 |
| 4 | 385024 | 40 |
OK, so now we've got one dataframe with a list of facilities and their characteristics, and another with the number of substantiated complaints per facility.
If we want to do any journalism with the data, we need to combine the two dataframes. For those familiar with SQL, this is called joining.
The variables in pandas joining are pretty similar to the ones in SQL: you tell pandas which fields to join on and whether it's a left, right, inner, or outer join.
The syntax is as follows:
dataframe1.merge(dataframe2,left_on='dataframe1.column',right_on='dataframe2.column',how='left')
df.merge(fac_comp, left_on='facid',right_on='facility_id',how='left').head()
| | facid | fac_type | capacity | fac_name | fac_address | city_state_zip | Unnamed: 6 | owner | operator | facility_id | complaint_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 385008 | NF | 96.0 | Presbyterian Community Care Center | 1085 N Oregon St | Ontario, OR 97914 | NaN | Presbyterian Nursing Home, Inc. | Presbyterian Nursing Home, Inc. | 385008 | 9.0 |
| 1 | 385010 | NF | 159.0 | Laurelhurst Village Rehabilitation Center | 3060 SE Stark St | Portland, OR 97214 | NaN | Laurelhurst Operations, LLC | Laurelhurst Operations, LLC | 385010 | 8.0 |
| 2 | 385015 | NF | 128.0 | Regency Gresham Nursing & Rehabilitation Center | 5905 SE Powell Valley Rd | Gresham, OR 97080 | NaN | Regency Gresham Nursing & Rehabilitation Cente... | Regency Pacific Management, LLC | 385015 | 17.0 |
| 3 | 385018 | NF | 98.0 | Providence Benedictine Nursing Center | 540 South Main St | Mt. Angel, OR 97362 | NaN | Providence Health & Services - Oregon | Providence Health & Services - Oregon | 385018 | 17.0 |
| 4 | 385024 | NF | 91.0 | Avamere Health Services of Rogue Valley | 625 Stevens St | Medford, OR 97504 | NaN | Medford Operations, LLC | Medford Operations, LLC | 385024 | 40.0 |
Recall the beauty of being able to easily create a new dataframe:
m1 = df.merge(fac_comp, left_on='facid',right_on='facility_id',how='left')
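A left-join sketch with made-up rows, showing that facilities with no complaints survive the merge (their complaint columns come back NaN):

```python
import pandas as pd

facilities = pd.DataFrame({'facid': ['385008', '385010'],
                           'capacity': [96.0, 159.0]})
complaints = pd.DataFrame({'facility_id': ['385008'],
                           'complaint_id': [9]})

# how='left' keeps every facility, complaints or not
m1 = facilities.merge(complaints, left_on='facid',
                      right_on='facility_id', how='left')
```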
Check which facilities had no complaints (filter for null).
null_count = m1[m1.complaint_id.isnull()]
null_count.count()[0]
77
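The isnull filter, sketched on made-up rows:

```python
import pandas as pd
import numpy as np

m1 = pd.DataFrame({'facid': ['385008', '385010', '385015'],
                   'complaint_id': [9.0, np.nan, 17.0]})

# Rows where complaint_id is null, i.e. facilities with no complaints
no_complaints = m1[m1['complaint_id'].isnull()]
```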
Create a new column with an overall complaint rate (complaints per bed).
(NB: Only the df['field'] syntax works when creating a new column.)
m1['comp_rate'] = m1.complaint_id/m1.capacity
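Creating the ratio column, sketched with made-up numbers:

```python
import pandas as pd

m1 = pd.DataFrame({'complaint_id': [9.0, 40.0],
                   'capacity': [90.0, 80.0]})

# Dividing two columns divides them row by row
m1['comp_rate'] = m1['complaint_id'] / m1['capacity']
```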
Cool. Let's see the 5 facilities with the highest complaint rates.
m1.sort_values('comp_rate',ascending=False).head()
| | facid | fac_type | capacity | fac_name | fac_address | city_state_zip | Unnamed: 6 | owner | operator | facility_id | complaint_id | comp_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 393 | 5MA170 | RCF | 37.0 | Brookdale McMinnville Westside | 320 SW Hill Road | McMinnville, OR 97128 | NaN | Brookdale Senior Living Communities, Inc. | Brookdale Senior Living Communities, Inc. | 5MA170 | 59.0 | 1.594595 |
| 403 | 5MA233 | RCF | 30.0 | Ashley Manor - Roseburg | 427 SE Ramp St. | Roseburg, OR 97470 | NaN | Ashley Manor LLC | Ashley Manor LLC | 5MA233 | 47.0 | 1.566667 |
| 390 | 5MA161 | RCF | 32.0 | Skylark Memory Care | 950 Skylark Place | Ashland, OR 97520 | NaN | Ashland Assisted Living, LLC | Mission Senior Living, LLC | 5MA161 | 48.0 | 1.500000 |
| 320 | 50R367 | RCF | 48.0 | Arbor Oaks Terrace Memory Care | 317 Werth Blvd. | Newberg, OR 97132 | NaN | Newberg Memory Associates, LLC | Frontier Management, LLC | 50R367 | 71.0 | 1.479167 |
| 398 | 5MA215 | RCF | 55.0 | Baycrest Memory Care | 955 Kentucky Avenue | Coos Bay, OR 97420 | NaN | Bay Area Properties, LLC | Radiant Senior Living, Inc. | 5MA215 | 77.0 | 1.400000 |
Cool Stuff.
The 'fac_type' field tells you if a facility is a nursing home, a residential care facility, or an assisted living facility. Each one is subject to different sets of regulations so who knows, maybe there's a difference.
We've already gone over the thing we'll need to use to do this. Does anybody remember?
That's right, grouping!
m1.groupby('fac_type').mean()
| fac_type | capacity | Unnamed: 6 | complaint_id | comp_rate |
|---|---|---|---|---|
| ALF | 66.610860 | NaN | 10.195122 | 0.154733 |
| NF | 81.759124 | NaN | 14.592000 | 0.178521 |
| RCF | 38.759717 | NaN | 10.516949 | 0.290260 |
mean_comp_rate = m1.groupby('fac_type').mean()
Wow, RCFs are super high! Wonder what could be going on there...
What if you want to show your results to someone not as techno-magical as you? Why, simply export your results into a csv. Decide where you're gonna output your dataframe and what you're gonna call it.
mean_comp_rate.to_csv('output/mean_comp_rate.csv')
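A sketch of 'to_csv' on a made-up results dataframe. Calling it with no filename returns the CSV as text, which is handy for a quick eyeball. (When you do pass a path, the 'output' folder has to exist before you write to it.)

```python
import pandas as pd

mean_comp_rate = pd.DataFrame({'comp_rate': [0.154733, 0.290260]},
                              index=['ALF', 'RCF'])

# With no filename, to_csv() returns the CSV as a string
csv_text = mean_comp_rate.to_csv()
```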
Stackoverflow is your friend.
Google is your friend.
Documentation is your friend.
News Nerdery's 'help me' channel is your friend.
Your friends are your friends.
Make a Google spreadsheet where you save snippets of code you're likely to reuse.
If you need to, find a good wall to bang your head against. There is no shame in that.
Never give up. (Within reason.)