Good afternoon!

In this class we'll learn how to slice and dice our data using pandas, a Python package created specifically for data analysis.

We'll be doing all this in Jupyter Notebooks, a supremely nifty, handy-dandy tool for writing, experimenting with, and editing your code.

A few handy keyboard shortcuts (these work in command mode, i.e. with a cell selected but your cursor not inside it): 'A' creates a new cell above, 'B' creates a new cell below, and pressing 'D' twice deletes a cell. 'Shift+Enter' runs a cell, and 'Option+Enter' runs a cell and creates a new one below it. 'Tab' inside a cell pops up autocomplete suggestions, so what you need might appear in the dropdown. And finally, select a cell without being in the cell and press 'M' to switch it to Markdown.

To use pandas, we have to tell Python to pull the bundle of tools into this specific Python session. At the same time, we nickname it 'pd,' because that allows us to call on those tools without having to write out 'pandas' every single time.

In [40]:
import pandas as pd

Let's look at our data.

  1. A csv file listing three kinds of old-folks homes in Oregon: nursing homes, residential care facilities, and assisted living facilities.
  2. An Excel file listing each substantiated case of abuse or neglect at each facility over 5 years.

What we want to do:

  1. Filter facilities
  2. Sort facilities
  3. Get summary stats on facilities (means and medians)
  4. Calculate complaints per facility (grouping and joining)
  5. Find worst/best facilities (more sorting)
  6. See which of the three categories of facilities has the most violations. (grouping with summary stats)

But first!

... we have to get the data into this thing.

How? Simple.

Create a DATAFRAME using the 'read_csv' tool that comes with pandas.

In [7]:
df = pd.read_csv('data/facilities.csv')

You've now created a pandas dataframe. To see what it looks like, show the first five rows of the variable 'df' that you just created.

In [45]:
df.head()
Out[45]:
facid fac_type capacity fac_name fac_address city_state_zip Unnamed: 6 owner operator
0 385008 NF 96.0 Presbyterian Community Care Center 1085 N Oregon St Ontario, OR 97914 NaN Presbyterian Nursing Home, Inc. Presbyterian Nursing Home, Inc.
1 385010 NF 159.0 Laurelhurst Village Rehabilitation Center 3060 SE Stark St Portland, OR 97214 NaN Laurelhurst Operations, LLC Laurelhurst Operations, LLC
2 385015 NF 128.0 Regency Gresham Nursing & Rehabilitation Center 5905 SE Powell Valley Rd Gresham, OR 97080 NaN Regency Gresham Nursing & Rehabilitation Cente... Regency Pacific Management, LLC
3 385018 NF 98.0 Providence Benedictine Nursing Center 540 South Main St Mt. Angel, OR 97362 NaN Providence Health & Services - Oregon Providence Health & Services - Oregon
4 385024 NF 91.0 Avamere Health Services of Rogue Valley 625 Stevens St Medford, OR 97504 NaN Medford Operations, LLC Medford Operations, LLC

***The '.head()' selects the first five rows of a dataframe. We'll be using it throughout the notebook because displaying an entire dataframe takes up a lot of space.***
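For the record, '.head()' also takes an optional row count, and '.tail()' shows the last rows instead. A quick sketch on a toy dataframe (the data here is invented for illustration):

```python
import pandas as pd

# Toy dataframe standing in for our facilities data
toy = pd.DataFrame({'fac_type': ['NF', 'NF', 'RCF', 'ALF', 'RCF', 'NF'],
                    'capacity': [96, 159, 30, 66, 40, 91]})

print(toy.head(3))  # first 3 rows instead of the default 5
print(toy.tail(2))  # last 2 rows
```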

BUT FIRST... THE BASICS

1. Select a column:

In [48]:
df[['fac_type']].head()
Out[48]:
fac_type
0 NF
1 NF
2 NF
3 NF
4 NF

Select two columns:

In [47]:
df[['fac_type','capacity']].head()
Out[47]:
fac_type capacity
0 NF 96.0
1 NF 159.0
2 NF 128.0
3 NF 98.0
4 NF 91.0

Count your rows:

In [11]:
df.count()
Out[11]:
facid             644
fac_type          642
capacity          642
fac_name          644
fac_address       644
city_state_zip    644
Unnamed: 6          0
owner             644
operator          644
dtype: int64
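Note what '.count()' is really doing: it tallies non-null values per column, which is why the empty 'Unnamed: 6' column shows 0 above. If you just want the total number of rows, nulls and all, use 'len()'. A toy sketch:

```python
import pandas as pd
import numpy as np

# Toy data with some missing values (invented for illustration)
toy = pd.DataFrame({'fac_type': ['NF', 'RCF', None, 'ALF'],
                    'capacity': [96, 30, 66, np.nan]})

print(len(toy))     # total rows, nulls included
print(toy.count())  # per-column count of non-null values
```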

1. FILTERING

Filtering dataframes is a funky business.

First, you create a list of 'Trues' and 'Falses' (a boolean array) based on the fields that interest you, like so:

In [12]:
df['fac_type']=='NF'
Out[12]:
0       True
1       True
2       True
3       True
4       True
5       True
6       True
7       True
8       True
9       True
10      True
11      True
12      True
13      True
14      True
15      True
16      True
17      True
18      True
19      True
20      True
21      True
22      True
23      True
24      True
25      True
26      True
27      True
28      True
29      True
       ...  
614    False
615    False
616    False
617    False
618    False
619    False
620    False
621    False
622    False
623    False
624    False
625    False
626    False
627    False
628    False
629    False
630    False
631    False
632    False
633    False
634    False
635    False
636    False
637    False
638    False
639    False
640    False
641    False
642    False
643    False
Name: fac_type, Length: 644, dtype: bool

Then, you run that list against the dataframe. Pandas will spit back every row where there's a True. Here's what it looks like:

In [13]:
df[df['fac_type']=='NF'].head()
Out[13]:
facid fac_type capacity fac_name fac_address city_state_zip Unnamed: 6 owner operator
0 385008 NF 96.0 Presbyterian Community Care Center 1085 N Oregon St Ontario, OR 97914 NaN Presbyterian Nursing Home, Inc. Presbyterian Nursing Home, Inc.
1 385010 NF 159.0 Laurelhurst Village Rehabilitation Center 3060 SE Stark St Portland, OR 97214 NaN Laurelhurst Operations, LLC Laurelhurst Operations, LLC
2 385015 NF 128.0 Regency Gresham Nursing & Rehabilitation Center 5905 SE Powell Valley Rd Gresham, OR 97080 NaN Regency Gresham Nursing & Rehabilitation Cente... Regency Pacific Management, LLC
3 385018 NF 98.0 Providence Benedictine Nursing Center 540 South Main St Mt. Angel, OR 97362 NaN Providence Health & Services - Oregon Providence Health & Services - Oregon
4 385024 NF 91.0 Avamere Health Services of Rogue Valley 625 Stevens St Medford, OR 97504 NaN Medford Operations, LLC Medford Operations, LLC

Here's what filtering on multiple fields looks like. 'AND' is '&' and 'OR' is '|'. Just make sure to put each condition inside parentheses.

In [14]:
df[(df['fac_type']=='NF') & (df['capacity']>10)].head()
Out[14]:
facid fac_type capacity fac_name fac_address city_state_zip Unnamed: 6 owner operator
0 385008 NF 96.0 Presbyterian Community Care Center 1085 N Oregon St Ontario, OR 97914 NaN Presbyterian Nursing Home, Inc. Presbyterian Nursing Home, Inc.
1 385010 NF 159.0 Laurelhurst Village Rehabilitation Center 3060 SE Stark St Portland, OR 97214 NaN Laurelhurst Operations, LLC Laurelhurst Operations, LLC
2 385015 NF 128.0 Regency Gresham Nursing & Rehabilitation Center 5905 SE Powell Valley Rd Gresham, OR 97080 NaN Regency Gresham Nursing & Rehabilitation Cente... Regency Pacific Management, LLC
3 385018 NF 98.0 Providence Benedictine Nursing Center 540 South Main St Mt. Angel, OR 97362 NaN Providence Health & Services - Oregon Providence Health & Services - Oregon
4 385024 NF 91.0 Avamere Health Services of Rogue Valley 625 Stevens St Medford, OR 97504 NaN Medford Operations, LLC Medford Operations, LLC

How do you count the number of results you get when you filter?

In [50]:
df[(df['fac_type']=='NF') & (df['capacity']>10)].count()[0]
Out[50]:
135
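We only demonstrated '&' above; '|' works the same way. A sketch on toy data (facility types and capacities invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'fac_type': ['NF', 'RCF', 'ALF', 'NF', 'RCF'],
                    'capacity': [96, 30, 180, 5, 40]})

# OR: keep rows that are either nursing facilities OR very large
either = toy[(toy['fac_type'] == 'NF') | (toy['capacity'] > 100)]
print(either)
```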

2. SORTING

Piece of cake.

In [53]:
df.sort_values('capacity').head()
Out[53]:
facid fac_type capacity fac_name fac_address city_state_zip Unnamed: 6 owner operator
110 385265 NF 5.0 Mary's Woods at Marylhurst 17360 Holy Names Drive Lake Oswego, OR 97034 NaN Mary's Woods at Marylhurst, Inc. Mary's Woods at Marylhurst, Inc.
67 385200 NF 6.0 Willamette View Health Center 13145 SE River Rd Milwaukie, OR 97222 NaN Willamette View, Inc. dba Willamette View, Inc.
207 50R062 RCF 7.0 Hubbard Residential Care Facility 647 Junction Rd Glendale, OR 97442 NaN Norma Ann Hubbard Norma Ann Hubbard
339 50R386 RCF 7.0 Thanksgiving House 184 N 2nd St St. Helens, OR 97051 NaN Cecile Molden Cecile Molden
317 50R364 RCF 8.0 Summit Springs Village MCU 120 S. Church St. Condon, OR 97823 NaN Summit Springs Village Corporation Summit Springs Village Corporation

If you like bigger numbers...

In [17]:
df.sort_values('capacity', ascending=False).head()
Out[17]:
facid fac_type capacity fac_name fac_address city_state_zip Unnamed: 6 owner operator
93 385240 NF 214.0 Marian Estates 390 Church St Sublimity, OR 97385 NaN Ernmaur, Inc. Marian Estates Support Services
259 50R293 RCF 186.0 Miramont Pointe 11520 SE Sunnyside Rd Clackamas, OR 97015 NaN MP, LLC MP, LLC
571 70M080 ALF 180.0 Rose Schnitzer Manor 6140 SW Boundary St Portland, OR 97221 NaN Robison Jewish Home Robison Jewish Home
20 385112 NF 180.0 West Hills Health & Rehabilitation Center 5701 SW Multnomah Blvd Portland, OR 97219 NaN West Hills Convalescent Center Limited Partner... West Hills Convalescent Center Limited Partner...
50 385166 NF 165.0 Maryville Nursing Home 14645 SW Farmington Rd Beaverton, OR 97007 NaN Sisters of St. Mary of Oregon Maryville Corp. Sisters of St. Mary of Oregon Maryville Corp.

3. SUMMARY STATS

Simply a breeze. (One caveat: in newer versions of pandas, calling '.mean()' or '.median()' on a whole dataframe with text columns raises an error instead of silently skipping them; there you'd write df.mean(numeric_only=True).)

In [18]:
df.mean()
Out[18]:
capacity      57.61838
Unnamed: 6         NaN
dtype: float64
In [19]:
df.median()
Out[19]:
capacity      53.0
Unnamed: 6     NaN
dtype: float64

4. GROUPING

The concept behind grouping is a little gnarly (I think).

First, you split the dataframe up into chunks, creating a grouped object. Like so:

In [20]:
df.groupby('fac_type')
Out[20]:
<pandas.core.groupby.DataFrameGroupBy object at 0x1101fa390>

Then, you tell pandas what you want to do to that group of slices. Let's get a count:

In [57]:
df.groupby('fac_type').count()
Out[57]:
facid capacity fac_name fac_address city_state_zip Unnamed: 6 owner operator
fac_type
ALF 221 221 221 221 221 0 221 221
NF 137 137 137 137 137 0 137 137
RCF 284 283 284 284 284 0 284 284

OK, so we get a count of each type of facility. As you see, it gives us a result for each column that it can. It'll do the same for other kinds of calculations.
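If you only want one number per group rather than a count for every column, you can select a single column before counting, or use '.size()', which counts all rows in each group, nulls included. A toy sketch:

```python
import pandas as pd

# Toy data with one missing capacity (invented for illustration)
toy = pd.DataFrame({'fac_type': ['NF', 'NF', 'RCF', 'ALF', 'RCF'],
                    'capacity': [96, 159, 30, 66, None]})

print(toy.groupby('fac_type')['capacity'].count())  # non-null capacities per type
print(toy.groupby('fac_type').size())               # total rows per type
```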

Let's get a mean.

In [59]:
df.groupby('fac_type').mean()
Out[59]:
capacity Unnamed: 6
fac_type
ALF 66.610860 NaN
NF 81.759124 NaN
RCF 38.759717 NaN

And a sum.

In [60]:
df.groupby('fac_type').sum()
Out[60]:
capacity Unnamed: 6
fac_type
ALF 14721.0 NaN
NF 11201.0 NaN
RCF 10969.0 NaN

SWITCHING GEARS: COMPLAINTS DATA

In [63]:
df2 = pd.read_excel('data/complaints.xlsx')
In [25]:
df2.head()
Out[25]:
complaint_id facility_id facility_type incident_date notes severity fine Facility Invest Results Abuse Facility Invest Results Rule Type Of Abuse
0 OT105179A 385008 NF 8/31/2010 RV reported asking staff to change him/her pri... 2.0 0.0 Not Substantiated Substantiated NaN
1 OT105179B 385008 NF 8/31/2010 RV reported staff answered his/her call light,... 2.0 0.0 Not Substantiated Substantiated NaN
2 OT105179C 385008 NF 8/31/2010 RV reported an unknown "not RV's regular staff... 2.0 0.0 Not Substantiated Substantiated NaN
3 OR0000656000 385008 NF 12/21/2010 Resident 1 was admitted with multiple diagnose... 3.0 0.0 Substantiated Substantiated Neglect
4 OT105397 385008 NF 9/17/2010 RV was admitted 9/17/10 with multiple diagnose... 2.0 0.0 Not Substantiated Substantiated NaN
In [26]:
df2.groupby('facility_id').count().head()
Out[26]:
complaint_id facility_type incident_date notes severity fine Facility Invest Results Abuse Facility Invest Results Rule Type Of Abuse
facility_id
385008 9 9 9 9 9 7 9 9 3
385010 8 8 8 8 8 7 8 8 4
385015 17 17 17 17 17 11 17 17 4
385018 17 17 17 17 17 5 17 17 5
385024 40 40 40 40 40 28 40 40 12

We don't need all these columns because they show more or less the same thing. So we select the index column and the one with unique IDs for each complaint.

(You'll notice that facility_id became the index in the grouping. We undo that by doing 'reset_index.')

In [27]:
df2.groupby('facility_id').count().reset_index()[['facility_id','complaint_id']].head()
Out[27]:
facility_id complaint_id
0 385008 9
1 385010 8
2 385015 17
3 385018 17
4 385024 40

Let's create a new dataframe.

In [28]:
fac_comp = df2.groupby('facility_id').count().reset_index()[['facility_id','complaint_id']]
In [29]:
fac_comp.head()
Out[29]:
facility_id complaint_id
0 385008 9
1 385010 8
2 385015 17
3 385018 17
4 385024 40

5. JOINING

OK, so now we've got one dataframe with a list of facilities and their characteristics, and another with the number of substantiated complaints per facility.

If we want to do any Journalism with the data, we need to combine the two dataframes. For those familiar with SQL, this is called joining.

The parameters in a pandas join are pretty similar to the ones in SQL: you tell pandas which fields to join on and whether it's a left, right, inner, or outer join.

The syntax is as follows:

dataframe1.merge(dataframe2, left_on='column_in_dataframe1', right_on='column_in_dataframe2', how='left')

In [64]:
df.merge(fac_comp, left_on='facid',right_on='facility_id',how='left').head()
Out[64]:
facid fac_type capacity fac_name fac_address city_state_zip Unnamed: 6 owner operator facility_id complaint_id
0 385008 NF 96.0 Presbyterian Community Care Center 1085 N Oregon St Ontario, OR 97914 NaN Presbyterian Nursing Home, Inc. Presbyterian Nursing Home, Inc. 385008 9.0
1 385010 NF 159.0 Laurelhurst Village Rehabilitation Center 3060 SE Stark St Portland, OR 97214 NaN Laurelhurst Operations, LLC Laurelhurst Operations, LLC 385010 8.0
2 385015 NF 128.0 Regency Gresham Nursing & Rehabilitation Center 5905 SE Powell Valley Rd Gresham, OR 97080 NaN Regency Gresham Nursing & Rehabilitation Cente... Regency Pacific Management, LLC 385015 17.0
3 385018 NF 98.0 Providence Benedictine Nursing Center 540 South Main St Mt. Angel, OR 97362 NaN Providence Health & Services - Oregon Providence Health & Services - Oregon 385018 17.0
4 385024 NF 91.0 Avamere Health Services of Rogue Valley 625 Stevens St Medford, OR 97504 NaN Medford Operations, LLC Medford Operations, LLC 385024 40.0

Recall the beauty of being able to easily create a new dataframe:

In [31]:
m1 = df.merge(fac_comp, left_on='facid',right_on='facility_id',how='left')

Check which facilities had no complaints (filter for null).

In [32]:
null_count = m1[m1.complaint_id.isnull()]
In [33]:
null_count.count()[0]
Out[33]:
77
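An alternative way to spot unmatched rows: merge also accepts an 'indicator=True' flag, which adds a '_merge' column saying which side each row came from. A sketch on toy data (facility IDs invented for illustration):

```python
import pandas as pd

facilities = pd.DataFrame({'facid': ['A', 'B', 'C']})
complaints = pd.DataFrame({'facility_id': ['A', 'C'],
                           'complaint_id': [3, 1]})

m = facilities.merge(complaints, left_on='facid', right_on='facility_id',
                     how='left', indicator=True)

# Rows tagged 'left_only' had no matching complaints
no_complaints = m[m['_merge'] == 'left_only']
print(no_complaints)
```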

Create a new column with an overall complaint rate (complaints per bed).

(NB: Only the df['field'] syntax works when creating a new column.)

In [34]:
m1['comp_rate'] = m1.complaint_id/m1.capacity

Cool. Let's see the 5 facilities with the highest complaint rates.

In [35]:
m1.sort_values('comp_rate',ascending=False).head()
Out[35]:
facid fac_type capacity fac_name fac_address city_state_zip Unnamed: 6 owner operator facility_id complaint_id comp_rate
393 5MA170 RCF 37.0 Brookdale McMinnville Westside 320 SW Hill Road McMinnville, OR 97128 NaN Brookdale Senior Living Communities, Inc. Brookdale Senior Living Communities, Inc. 5MA170 59.0 1.594595
403 5MA233 RCF 30.0 Ashley Manor - Roseburg 427 SE Ramp St. Roseburg, OR 97470 NaN Ashley Manor LLC Ashley Manor LLC 5MA233 47.0 1.566667
390 5MA161 RCF 32.0 Skylark Memory Care 950 Skylark Place Ashland, OR 97520 NaN Ashland Assisted Living, LLC Mission Senior Living, LLC 5MA161 48.0 1.500000
320 50R367 RCF 48.0 Arbor Oaks Terrace Memory Care 317 Werth Blvd. Newberg, OR 97132 NaN Newberg Memory Associates, LLC Frontier Management, LLC 50R367 71.0 1.479167
398 5MA215 RCF 55.0 Baycrest Memory Care 955 Kentucky Avenue Coos Bay, OR 97420 NaN Bay Area Properties, LLC Radiant Senior Living, Inc. 5MA215 77.0 1.400000

Cool Stuff.

The 'fac_type' field tells you if a facility is a nursing home, a residential care facility, or an assisted living facility. Each one is subject to different sets of regulations so who knows, maybe there's a difference.

We've already gone over the technique we'll need for this. Does anybody remember?

That's right, grouping!

6. AND NOW WE PUT IT TOGETHER...

In [65]:
m1.groupby('fac_type').mean()
Out[65]:
capacity Unnamed: 6 complaint_id comp_rate
fac_type
ALF 66.610860 NaN 10.195122 0.154733
NF 81.759124 NaN 14.592000 0.178521
RCF 38.759717 NaN 10.516949 0.290260
In [66]:
mean_comp_rate = m1.groupby('fac_type').mean()

Wow, RCFs are super high! Wonder what could be going on there...

Final final step

What if you want to show your results to someone not as techno-magical as you? Why, simply export your results into a csv. Decide where you're gonna output your dataframe and what you're gonna call it.

In [38]:
mean_comp_rate.to_csv('output/mean_comp_rate.csv')
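By default, 'to_csv' writes the index out as the first column too, which is handy here since 'fac_type' lives in the index after grouping; pass index=False when you don't want that. A quick toy sketch (writing to an in-memory buffer instead of a file, with invented numbers):

```python
import io
import pandas as pd

toy = pd.DataFrame({'fac_type': ['NF', 'RCF'],
                    'comp_rate': [0.18, 0.29]}).set_index('fac_type')

buf = io.StringIO()
toy.to_csv(buf)  # the index ('fac_type') becomes the first CSV column
print(buf.getvalue())
```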

General words of wisdom

Stackoverflow is your friend.

Google is your friend.

Documentation is your friend.

News Nerdery's 'help me' channel is your friend.

Your friends are your friends.

Make a google spreadsheet where you save snippets of code you're likely to reuse.

If you need to, find a good wall to bang your head against. There is no shame in that.

Never give up. (within reason.)