Medicaid Dataset - Basic Data Analysis of Benefit Summary Data

CMS.gov has made available a synthetic dataset of Medicare/Medicaid claims data. The objective is to allow data enterpreneurs to figure out interesting ways to use the data.

As noted in the Data Codebook (PDF), the data has been heavily anonymized and imputed, so any conclusions drawn from this dataset may not be reliable.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import operator
import random
%matplotlib inline

We read the benefit summary data in. There are 59 files representing benefit summaries from 2008-2010. There are a total of 6,758,744 records. We copy these into a separate benefit_summary directory so all the files can be loaded into a single dataframe.

In [2]:
dfs = []
for csvfile in os.listdir("benefit_summary"):
    dfs.append(pd.read_csv("benefit_summary/" + csvfile))
df = pd.concat(dfs)
df.shape
Out[2]:
(6758685, 32)
In [3]:
# find the distribution of records by gender
df[["BENE_SEX_IDENT_CD"]].groupby("BENE_SEX_IDENT_CD").count()
Out[3]:
BENE_SEX_IDENT_CD
BENE_SEX_IDENT_CD
1 3002984
2 3755701
In [4]:
# find distribution of records by race
df[["BENE_RACE_CD"]].groupby("BENE_RACE_CD").count()
Out[4]:
BENE_RACE_CD
BENE_RACE_CD
1 5596084
2 719751
3 284654
5 158196

So it looks like the beneficiaries are split almost evenly by gender with men constituting about 45% of the dataset and the women constituting the other 55%.

Race-wise, whites constitute 83%, blacks 11%, other (asians, native americans, etc) 4% and hispanics 2%. This roughly reflects the race mix among older Americans.

In [5]:
# visualize incidence of diseases by gender
df_trunc = df[["BENE_SEX_IDENT_CD", "SP_ALZHDMTA", "SP_CHF", 
    "SP_CHRNKIDN", "SP_CNCR", "SP_COPD", "SP_DEPRESSN", "SP_DIABETES", 
    "SP_ISCHMCHT", "SP_OSTEOPRS", "SP_RA_OA", "SP_STRKETIA"]]
df_grouped = df_trunc.groupby("BENE_SEX_IDENT_CD").agg(lambda x: 100.0 * np.mean(x - 1))
df_grouped.transpose()
Out[5]:
BENE_SEX_IDENT_CD 1 2
SP_ALZHDMTA 81.597904 79.388801
SP_CHF 71.964619 69.425575
SP_CHRNKIDN 83.974040 82.471981
SP_CNCR 93.506725 93.474401
SP_COPD 87.954048 86.703974
SP_DEPRESSN 80.288973 77.570579
SP_DIABETES 65.798020 62.108698
SP_ISCHMCHT 59.353030 56.210438
SP_OSTEOPRS 84.587930 82.606789
SP_RA_OA 86.827069 85.115562
SP_STRKETIA 96.092187 95.756877

Men lead in the incidence of disease in almost all cases - they show higher incidence of Alzheimer's Disease, CHF, Depression, Diabetes, Ischemic Heart Disease, Osteoporosis and Rheumatoid/Osteo-Arthritis. Men and women seem to have an equal chance of contracting Chronic Kidney Disease, Cancer, COPD, and Stroke. The data is shown visually below:

In [6]:
df_grouped.plot(kind="bar", legend=False)
Out[6]:
<matplotlib.axes.AxesSubplot at 0x1f289bd0>
In [7]:
# visualize incidence of diseases by race
df_trunc = df[["BENE_RACE_CD", "SP_ALZHDMTA", "SP_CHF", 
    "SP_CHRNKIDN", "SP_CNCR", "SP_COPD", "SP_DEPRESSN", "SP_DIABETES", 
    "SP_ISCHMCHT", "SP_OSTEOPRS", "SP_RA_OA", "SP_STRKETIA"]]
df_grouped = df_trunc.groupby("BENE_RACE_CD").agg(lambda x: 100.0 * np.mean(x - 1))
df_grouped.transpose()
Out[7]:
BENE_RACE_CD 1 2 3 5
SP_ALZHDMTA 80.010754 81.348689 83.989334 82.127235
SP_CHF 69.962656 72.319316 75.970828 73.681383
SP_CHRNKIDN 82.826670 83.827324 86.614276 84.818200
SP_CNCR 93.244848 94.377639 95.212785 94.970796
SP_COPD 87.006378 87.860107 89.930231 88.671016
SP_DEPRESSN 78.489047 79.644210 81.982688 79.309211
SP_DIABETES 62.998876 66.469237 69.711650 67.132544
SP_ISCHMCHT 56.630261 61.178935 64.912490 62.750638
SP_OSTEOPRS 83.063121 85.223362 86.334287 85.459809
SP_RA_OA 85.466444 87.533744 88.788494 87.581228
SP_STRKETIA 95.832729 95.970690 96.892719 96.422160

As can be seen from the table above, the incidence of various diseases across different races seems to be fairly constant. The only exceptions are CHF which seems to be less prevalent among whites than other races, and Ischemic Heart Disease, which seem t obe more prevalent among whites than other races. The data is shown visually below:

In [8]:
df_grouped.plot(kind="bar", legend=False)
Out[8]:
<matplotlib.axes.AxesSubplot at 0x5c26690>

We now examine observed comorbidities in the population by calculating the correlation between different pairs of diseases. As can be seen from the top 10 pairs below, the correlation between diseases appears to be fairly weak.

In [8]:
colnames = [x for x in df.columns if x.startswith("SP_") and not x.endswith("STATE_CODE")]
colidxs = {x[1]:x[0] for x in enumerate(colnames)}
CM = np.zeros((len(colnames), len(colnames)))
comorbidities = []
for colname1 in colnames:
    for colname2 in colnames:
        i = colidxs[colname1]
        j = colidxs[colname2]
        if i == j:
            CM[i,j] = 1.0
            continue
        if i < j:
            CM[i,j] = np.corrcoef(df.ix[:, colname1], df.ix[:, colname2])[0,1]
            CM[j,i] = CM[i,j]
            comorbidities.append((colname1, colname2, CM[i,j]))
            continue
sorted(comorbidities, key=operator.itemgetter(2), reverse=True)[0:10]
Out[8]:
[('SP_DIABETES', 'SP_ISCHMCHT', 0.47578541650029821),
 ('SP_CHF', 'SP_ISCHMCHT', 0.43615577919472531),
 ('SP_CHF', 'SP_DIABETES', 0.41850270374368082),
 ('SP_CHRNKIDN', 'SP_DIABETES', 0.39713288341420422),
 ('SP_CHF', 'SP_CHRNKIDN', 0.38213196506731339),
 ('SP_CHRNKIDN', 'SP_ISCHMCHT', 0.35058275391195598),
 ('SP_CHF', 'SP_COPD', 0.33059074013020173),
 ('SP_DEPRESSN', 'SP_DIABETES', 0.3199180438991816),
 ('SP_ALZHDMTA', 'SP_DIABETES', 0.31346118720405913),
 ('SP_CHRNKIDN', 'SP_COPD', 0.31327625784275792)]

The heatmap below summarizes the correlations between different chronic diseases graphically. Lighter colors indicate higher correlation. As expected, the lightest block is (7,8) which corresponds to SP_DIABETES and SP_ISCHMCHT.

In [10]:
fig = plt.figure()
plt.hot()
plt.pcolormesh(CM)
plt.colorbar()
Out[10]:
<matplotlib.colorbar.Colorbar instance at 0x1d8cad40>