Medicaid Dataset - Basic Data Analysis of Benefit Summary Data¶

CMS.gov has made available a synthetic dataset of Medicare/Medicaid claims data. The objective is to allow data enterpreneurs to figure out interesting ways to use the data.

As noted in the Data Codebook (PDF), the data has been heavily anonymized and imputed, so any conclusions drawn from this dataset may not be reliable.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import operator
import random
%matplotlib inline

We read the benefit summary data in. There are 59 files representing benefit summaries from 2008-2010. There are a total of 6,758,744 records. We copy these into a separate benefit_summary directory so all the files can be loaded into a single dataframe.

In [2]:

dfs = []
for csvfile in os.listdir("benefit_summary"):
    dfs.append(pd.read_csv("benefit_summary/" + csvfile))
df = pd.concat(dfs)
df.shape

Out[2]:

(6758685, 32)

In [3]:

# find the distribution of records by gender
df[["BENE_SEX_IDENT_CD"]].groupby("BENE_SEX_IDENT_CD").count()

Out[3]:

	BENE_SEX_IDENT_CD
BENE_SEX_IDENT_CD
1	3002984
2	3755701

In [4]:

# find distribution of records by race
df[["BENE_RACE_CD"]].groupby("BENE_RACE_CD").count()

Out[4]:

	BENE_RACE_CD
BENE_RACE_CD
1	5596084
2	719751
3	284654
5	158196

So it looks like the beneficiaries are split almost evenly by gender with men constituting about 45% of the dataset and the women constituting the other 55%.

Race-wise, whites constitute 83%, blacks 11%, other (asians, native americans, etc) 4% and hispanics 2%. This roughly reflects the race mix among older Americans.

In [5]:

# visualize incidence of diseases by gender
df_trunc = df[["BENE_SEX_IDENT_CD", "SP_ALZHDMTA", "SP_CHF", 
    "SP_CHRNKIDN", "SP_CNCR", "SP_COPD", "SP_DEPRESSN", "SP_DIABETES", 
    "SP_ISCHMCHT", "SP_OSTEOPRS", "SP_RA_OA", "SP_STRKETIA"]]
df_grouped = df_trunc.groupby("BENE_SEX_IDENT_CD").agg(lambda x: 100.0 * np.mean(x - 1))
df_grouped.transpose()

Out[5]:

BENE_SEX_IDENT_CD	1	2
SP_ALZHDMTA	81.597904	79.388801
SP_CHF	71.964619	69.425575
SP_CHRNKIDN	83.974040	82.471981
SP_CNCR	93.506725	93.474401
SP_COPD	87.954048	86.703974
SP_DEPRESSN	80.288973	77.570579
SP_DIABETES	65.798020	62.108698
SP_ISCHMCHT	59.353030	56.210438
SP_OSTEOPRS	84.587930	82.606789
SP_RA_OA	86.827069	85.115562
SP_STRKETIA	96.092187	95.756877

Men lead in the incidence of disease in almost all cases - they show higher incidence of Alzheimer's Disease, CHF, Depression, Diabetes, Ischemic Heart Disease, Osteoporosis and Rheumatoid/Osteo-Arthritis. Men and women seem to have an equal chance of contracting Chronic Kidney Disease, Cancer, COPD, and Stroke. The data is shown visually below:

In [6]:

df_grouped.plot(kind="bar", legend=False)

Out[6]:

<matplotlib.axes.AxesSubplot at 0x1f289bd0>

In [7]:

# visualize incidence of diseases by race
df_trunc = df[["BENE_RACE_CD", "SP_ALZHDMTA", "SP_CHF", 
    "SP_CHRNKIDN", "SP_CNCR", "SP_COPD", "SP_DEPRESSN", "SP_DIABETES", 
    "SP_ISCHMCHT", "SP_OSTEOPRS", "SP_RA_OA", "SP_STRKETIA"]]
df_grouped = df_trunc.groupby("BENE_RACE_CD").agg(lambda x: 100.0 * np.mean(x - 1))
df_grouped.transpose()

Out[7]:

BENE_RACE_CD	1	2	3	5
SP_ALZHDMTA	80.010754	81.348689	83.989334	82.127235
SP_CHF	69.962656	72.319316	75.970828	73.681383
SP_CHRNKIDN	82.826670	83.827324	86.614276	84.818200
SP_CNCR	93.244848	94.377639	95.212785	94.970796
SP_COPD	87.006378	87.860107	89.930231	88.671016
SP_DEPRESSN	78.489047	79.644210	81.982688	79.309211
SP_DIABETES	62.998876	66.469237	69.711650	67.132544
SP_ISCHMCHT	56.630261	61.178935	64.912490	62.750638
SP_OSTEOPRS	83.063121	85.223362	86.334287	85.459809
SP_RA_OA	85.466444	87.533744	88.788494	87.581228
SP_STRKETIA	95.832729	95.970690	96.892719	96.422160

As can be seen from the table above, the incidence of various diseases across different races seems to be fairly constant. The only exceptions are CHF which seems to be less prevalent among whites than other races, and Ischemic Heart Disease, which seem t obe more prevalent among whites than other races. The data is shown visually below:

In [8]:

df_grouped.plot(kind="bar", legend=False)

Out[8]:

<matplotlib.axes.AxesSubplot at 0x5c26690>

We now examine observed comorbidities in the population by calculating the correlation between different pairs of diseases. As can be seen from the top 10 pairs below, the correlation between diseases appears to be fairly weak.

In [8]:

colnames = [x for x in df.columns if x.startswith("SP_") and not x.endswith("STATE_CODE")]
colidxs = {x[1]:x[0] for x in enumerate(colnames)}
CM = np.zeros((len(colnames), len(colnames)))
comorbidities = []
for colname1 in colnames:
    for colname2 in colnames:
        i = colidxs[colname1]
        j = colidxs[colname2]
        if i == j:
            CM[i,j] = 1.0
            continue
        if i < j:
            CM[i,j] = np.corrcoef(df.ix[:, colname1], df.ix[:, colname2])[0,1]
            CM[j,i] = CM[i,j]
            comorbidities.append((colname1, colname2, CM[i,j]))
            continue
sorted(comorbidities, key=operator.itemgetter(2), reverse=True)[0:10]

Out[8]:

[('SP_DIABETES', 'SP_ISCHMCHT', 0.47578541650029821),
 ('SP_CHF', 'SP_ISCHMCHT', 0.43615577919472531),
 ('SP_CHF', 'SP_DIABETES', 0.41850270374368082),
 ('SP_CHRNKIDN', 'SP_DIABETES', 0.39713288341420422),
 ('SP_CHF', 'SP_CHRNKIDN', 0.38213196506731339),
 ('SP_CHRNKIDN', 'SP_ISCHMCHT', 0.35058275391195598),
 ('SP_CHF', 'SP_COPD', 0.33059074013020173),
 ('SP_DEPRESSN', 'SP_DIABETES', 0.3199180438991816),
 ('SP_ALZHDMTA', 'SP_DIABETES', 0.31346118720405913),
 ('SP_CHRNKIDN', 'SP_COPD', 0.31327625784275792)]

The heatmap below summarizes the correlations between different chronic diseases graphically. Lighter colors indicate higher correlation. As expected, the lightest block is (7,8) which corresponds to SP_DIABETES and SP_ISCHMCHT.

In [10]:

fig = plt.figure()
plt.hot()
plt.pcolormesh(CM)
plt.colorbar()

Out[10]:

<matplotlib.colorbar.Colorbar instance at 0x1d8cad40>