The US government recently released a data set containing all payments (partially aggregated) made by the Medicare program in 2012. For each health care provider, the billings are grouped into service types, and within each service type group, the total payment and standard deviation of payments are provided.

The data are available here

The official documentation for the data is here.

One use of these data is to identify providers who receive very high payments from the Medicare program. In some cases this may reflect fraud or waste. Here is a link to a media article about one such situation.

The dataset is too large to load into SageMathCloud (SMC), so we wrote a script to split the data by state. We ran this script off-line and uploaded a few of the state-level files into SMC.

If you are curious to see how some related analyses of these data can be done in R, check out this blog post.

First we import some libraries that we will be using.

In [1]:

```
import pandas as pd
import numpy as np
```

Next we read in the data from one state. The state-level data is small enough that it can be loaded entirely into memory.

In [2]:

```
fname = "FL-subset.csv.gz"
data = pd.read_csv(fname, compression="gzip")
print(data.shape)
print(data.dtypes)
```

Here we print out a small subset of the data to get a sense of what we are working with.

In [3]:

```
vnames = ["npi", "nppes_provider_last_org_name", "hcpcs_description", "line_srvc_cnt", "average_Medicare_payment_amt", "stdev_Medicare_payment_amt"]
data[vnames].head()
```

Out[3]:

Let's look at the first record in more detail:

In [4]:

```
print(data.iloc[0,:])
```

Each row of the data set is an aggregate of several individual charges. The number of charges that are aggregated is given by `line_srvc_cnt`

. The average and standard deviation of these charges are given by `average_Medicare_payment_amt`

and `stdev_Medicare_payment_amt`

.

The key value of interest here is the total amount of money paid by Medicare to each provider. This is not an explicit variable in the data set, but we can create it by multiplying the average payment by the number of payments, within each record of the dataset.

In [5]:

```
data["total_payment"] = data["line_srvc_cnt"] * data["average_Medicare_payment_amt"]
```

Medicare makes payments to both individual providers, and to organizations. Most of the payments are to individuals and it's not clear that the payment amounts to providers can be directly compared to the payment amounts to individuals. So here we include only the individuals.

In [6]:

```
data_ind = data.loc[data["nppes_entity_code"] == "I", :]
print(data_ind.shape)
print(data.shape)
```

The overall analysis follows the split-apply-combine model. We split the data by provider, apply a function to aggregate the payments (by summing them), and combine the results into a new Pandas `DataFrame`

.

First we look at a couple of inefficient ways to do this that do not take advantage of the optimised split-apply-combine tools in Pandas. The following two functions are not executed since they are quite slow. Uncomment the function calls if you really want to run them.

In [7]:

```
def total_by_provider_naive(data):
"""
Returns a dictionary mapping each provider number to the total payment made to
the provider.
** Do not use, very slow **
"""
from collections import defaultdict
totals = defaultdict(lambda : 0.)
for i in data.index:
totals[data.loc[i, "npi"]] += data.loc[i, "total_payment"]
return totals
# totals1 = total_by_provider_naive(data_ind)
```

Using position-based indexing doesn't help much:

In [8]:

```
def total_by_provider_naive_pos(data):
"""
Returns a dictionary mapping each provider number to the total payment made to
the provider.
** Do not use, very slow **
"""
from collections import defaultdict
totals = defaultdict(lambda : 0.)
for i in data.shape[0]:
row = data.iloc[i, :]
totals[row["npi"]] += row["total_payment"]
return totals
#totals2 = total_by_provider_naive_pos(data_ind)
```

Here is a much faster way to do the calculation. Our goal is to obtain the total payment for each proivider. We first group the data by the provider number, then aggregate within the groups by taking the sum of payment amounts. Finally we sort the result so that the providers with the greatest total payments appear at the top of the list.

In [9]:

```
totals2 = data_ind.groupby("npi").agg({"total_payment": np.sum})
totals2 = totals2.sort("total_payment", ascending=False)
print(totals2.head())
```

Next we have an enhanced version of the code in the previous cell that also includes the provider's last name in the output table. We cannot aggregate names by summing, so we provide two aggregation functions. For the payment data, we aggregate by taking the sum. For the provider name, we aggregate by selecting the first value within each group (assuming that the provider name is constant within groups, which it should be).

In [10]:

```
first = lambda x : x.iloc[0]
totals2 = data_ind.groupby("npi").agg({"total_payment": np.sum, "nppes_provider_last_org_name": first})
totals2 = totals2.sort("total_payment", ascending=False)
totals2.head()
```

Out[10]:

One way to visualize the distribution of payments is to make a quantile plot, which is a plot of the sorted data against the index.

In [11]:

```
plt.plot(totals2["total_payment"][::-1])
plt.grid(True)
plt.xlabel("Provider rank", size=15)
plt.ylabel("Total payment", size=15)
plt.ylim(0, 1e6)
```

Out[11]:

Another familiar way to view a simple collection of numbers is using a histogram.

In [12]:

```
_ = plt.hist(np.asarray(totals2["total_payment"]), bins=100)
```

The distribution is very skewed, so we can limit the histogram to the providers with total payment less than $200,000.

In [13]:

```
x = np.asarray(totals2["total_payment"])
x = x[x < 2e5]
plt.hist(x, bins=50, color='lightblue')
plt.xlabel("Payment", size=15)
_ = plt.ylabel("Number of providers", size=15)
```

Determine the maximum total payment among providers within each zip code, then plot it as a histogram or quantile plot.

Determine the maximum total payment among providers of each provider type, then plot it as a histogram or quantile plot.