The US government recently released a data set containing all payments (partially aggregated) made by the Medicare program in 2012. For each health care provider, the billings are grouped into service types, and within each service type group, the total payment and standard deviation of payments are provided.

The data are available here

The official documentation for the data is here.

One use of these data is to identify providers who receive very high payments from the Medicare program. In some cases this may reflect fraud or waste. Here is a link to a media article about one such situation.

The dataset is too large to load into SageMathCloud (SMC), so we wrote a script to split the data by state. We ran this script off-line and uploaded a few of the state-level files into SMC.
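As an illustration, the splitting step might look like the following sketch, which reads the national file in chunks and appends each chunk's rows to a per-state file. The state column name `nppes_provider_state` and the output file naming are assumptions here (the actual state files, like `FL-subset.csv.gz`, were also gzip-compressed, which this sketch omits).

```python
import pandas as pd

def split_by_state(fname, chunksize=100000):
    """Split a large CSV into one file per state, reading in chunks
    to limit memory use. Assumes a 'nppes_provider_state' column."""
    seen = set()
    for chunk in pd.read_csv(fname, chunksize=chunksize):
        for state, grp in chunk.groupby("nppes_provider_state"):
            # write the header only on the first write for each state
            grp.to_csv("%s-subset.csv" % state, mode="a",
                       header=state not in seen, index=False)
            seen.add(state)
```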

If you are curious to see how some related analyses of these data can be done in R, check out this blog post.

First we import some libraries that we will be using.

In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

In [2]:

```
fname = "FL-subset.csv.gz"
data = pd.read_csv(fname, compression="gzip")
print(data.shape)
print(data.dtypes)
```

Here we print out a small subset of the data to get a sense of what we are working with.

In [3]:

```
vnames = ["npi", "nppes_provider_last_org_name", "hcpcs_description", "line_srvc_cnt", "average_Medicare_payment_amt", "stdev_Medicare_payment_amt"]
data[vnames].head()
```

Out[3]:

Let's look at the first record in more detail:

In [4]:

```
print(data.iloc[0,:])
```

Each row of the data set is an aggregate of several individual charges. The number of charges that are aggregated is given by `line_srvc_cnt`. The average and standard deviation of these charges are given by `average_Medicare_payment_amt` and `stdev_Medicare_payment_amt`.

The key value of interest here is the total amount of money paid by Medicare to each provider. This is not an explicit variable in the data set, but we can create it by multiplying the average payment by the number of payments, within each record of the dataset.

In [5]:

```
data["total_payment"] = data["line_srvc_cnt"] * data["average_Medicare_payment_amt"]
```

The data set includes both individual providers and organizations, distinguished by the `nppes_entity_code` variable. Next we restrict the analysis to individual providers (entity code "I").

In [6]:

```
data_ind = data.loc[data["nppes_entity_code"] == "I", :]
print(data_ind.shape)
print(data.shape)
```

The overall analysis follows the split-apply-combine model. We split the data by provider, apply a function to aggregate the payments (by summing them), and combine the results into a new Pandas `DataFrame`.
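As a toy illustration of the split-apply-combine pattern (with made-up numbers, not the Medicare data):

```python
import pandas as pd

toy = pd.DataFrame({"npi": [1, 1, 2], "total_payment": [10., 20., 5.]})
# split on npi, apply sum within each group, combine into a new DataFrame
print(toy.groupby("npi").sum())
```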

First we look at a couple of inefficient ways to do this that do not take advantage of the optimised split-apply-combine tools in Pandas. The following two functions are not executed since they are quite slow. Uncomment the function calls if you really want to run them.

In [7]:

```
def total_by_provider_naive(data):
    """
    Returns a dictionary mapping each provider number to the total payment made to
    the provider.
    ** Do not use, very slow **
    """
    from collections import defaultdict
    totals = defaultdict(lambda: 0.)
    for i in data.index:
        totals[data.loc[i, "npi"]] += data.loc[i, "total_payment"]
    return totals

# totals1 = total_by_provider_naive(data_ind)
```

Using position-based indexing doesn't help much:

In [8]:

```
def total_by_provider_naive_pos(data):
    """
    Returns a dictionary mapping each provider number to the total payment made to
    the provider.
    ** Do not use, very slow **
    """
    from collections import defaultdict
    totals = defaultdict(lambda: 0.)
    for i in range(data.shape[0]):
        row = data.iloc[i, :]
        totals[row["npi"]] += row["total_payment"]
    return totals

# totals2 = total_by_provider_naive_pos(data_ind)
```

The efficient approach uses the optimised `groupby`/`agg` methods in Pandas, then sorts the providers by total payment:

In [9]:

```
totals2 = data_ind.groupby("npi").agg({"total_payment": np.sum})
totals2 = totals2.sort_values("total_payment", ascending=False)
print(totals2.head())
```

We can carry along the provider names by also aggregating the name column, using a function that takes the first value within each group:

In [10]:

```
first = lambda x : x.iloc[0]
totals2 = data_ind.groupby("npi").agg({"total_payment": np.sum, "nppes_provider_last_org_name": first})
totals2 = totals2.sort_values("total_payment", ascending=False)
totals2.head()
```

Out[10]:

To see how the payments are distributed over providers, we plot the total payments against provider rank. The order is reversed so the curve increases from left to right, and the vertical axis is capped at one million dollars so the bulk of the distribution remains visible.

In [11]:

```
plt.plot(totals2["total_payment"][::-1])
plt.grid(True)
plt.xlabel("Provider rank", size=15)
plt.ylabel("Total payment", size=15)
plt.ylim(0, 1e6)
```

Out[11]:

Another familiar way to view a collection of numbers is with a histogram.

In [12]:

```
_ = plt.hist(np.asarray(totals2["total_payment"]), bins=100)
```

The histogram is dominated by a small number of very large payments. Restricting to providers whose total payment is under $200,000 gives a clearer view of the typical values:

In [13]:

```
x = np.asarray(totals2["total_payment"])
x = x[x < 2e5]
plt.hist(x, bins=50, color='lightblue')
plt.xlabel("Payment", size=15)
_ = plt.ylabel("Number of providers", size=15)
```

Determine the maximum total payment among providers within each zip code, then plot it as a histogram or quantile plot.

Determine the maximum total payment among providers of each provider type, then plot it as a histogram or quantile plot.