Munging fixed-width formats


Large data sets, such as publicly available US Government data, are often made available as fixed-width format (FWF) text files. At first glance, each record looks like an undifferentiated string of characters. A data dictionary or companion file usually specifies which columns, or substrings, of each record represent each variable in the data set. Reading this format requires some careful coding to ensure that the mapping from columns to variables is correct.
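As a toy illustration of the idea (the field positions here are made up, not HCUP's), each variable is just a slice of the record at fixed column positions:

```python
# A made-up fixed-width record: suppose columns 1-3 hold a DRG code
# and columns 4-8 a hospital ID (illustrative positions only)
record = "50012345"

drg = record[0:3]      # characters 1-3  -> '500'
hospid = record[3:8]   # characters 4-8  -> '12345'
```

The whole game is knowing, for every one of possibly hundreds of fields, where each slice starts and ends.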

Our current project involves the Healthcare Cost and Utilization Project (HCUP) and their National Inpatient Sample (NIS). We need to munge data from their Core database to obtain the volume of discharges with particular DRG codes, and then classify hospitals by volume. HCUP provides load files for their data in SAS, SPSS and Stata. No open-source software is supported, but the data is just ASCII; how hard can this be?

For this post, I will consider the 2008 NIS Core data and examples from it. I will not provide the data (as it is subject to a data use agreement with HCUP) or the exact analysis we are doing, but will provide surrogate numbers to reflect the process.

Figuring out formats

Well, it's probably not hard, but it might be tedious. The NIS Core data for 2008 has 128 fields, so we need variable names, widths (how many characters for each field) and variable types for each of these. I don't know about you, but having to type this out would be a barrier for me to do anything with it.

Still, I don't want to fork over thousands of dollars for supported software. But maybe I can take advantage of code that HCUP provides? Looking at the load files, I figured that the Stata code seemed to have the cleanest format. Why not just extract the needed information from the load file? With Python, it shouldn't be too hard. I cooked up the following code (which might give my friends who are professional Python coders the shivers):

In [1]:
import re
import pandas as pd
import numpy as np

def parseStataCode(statafile):
    """Extract variable names and field widths from a HCUP Stata load file."""
    with open(statafile, 'r') as f:
        statacode = f.readlines()
    # The column specification sits between the 'infix' and 'using' lines
    indStart = min(i for i, line in enumerate(statacode) if 'infix' in line)
    indEnd = min(i for i, line in enumerate(statacode) if 'using' in line)
    coldata = statacode[indStart:indEnd]
    # The first number on each line is the field's starting column
    startcol = np.array([int(re.findall(r' (\d+)', x)[0]) for x in coldata])
    # The last line also gives the end column of the final field
    endcol = int(re.findall(r'(\d+)\-*(\d*)', coldata[-1])[0][-1])
    # Differences between successive start columns give the field widths;
    # appending the end column closes off the last field
    widths = np.diff(np.append(startcol, endcol + 1))
    # The variable name is the first capitalized word on each line
    varnames = [re.findall(r'[A-Z]\w+', x)[0] for x in coldata]
    return pd.DataFrame({'varnames': varnames, 'widths': widths})
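To see what the regular expressions are picking out, here is a hypothetical line in the style of a Stata `infix` block (not the actual HCUP load file, which is covered by the data use agreement):

```python
import re

# Hypothetical infix line: variable DRG occupying columns 485-487
line = "      int   DRG                        485-487 ///"

# First number preceded by a space -> the starting column
startcol = int(re.findall(r' (\d+)', line)[0])
# First capitalized word -> the variable name
varname = re.findall(r'[A-Z]\w+', line)[0]
```

Here `startcol` comes out as 485 and `varname` as `'DRG'`; applied line by line over the whole `infix` block, this recovers the full column layout.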

So, we can get the information we need from the Stata load file by running

In [2]:
wt = parseStataCode('StataLoad_NIS_2008_Core.Do')

We could extend the function if we also needed the missing-data codes that are provided, but the current project doesn't need them.

Ingesting data

The NIS data is provided as zipped text files. One advantage of Python is the zipfile module, which allows us to access the data without explicitly uncompressing the file. Yes, we'll take a computational hit in exchange for the storage savings, but for this exercise that's probably fine. Also, for some of the years we found that there were problems with the compression metadata that made WinZip fail, and the Python method worked beautifully there.

For 2010 and after, HCUP encrypted the data with AES-256 encryption. Currently the zipfile module cannot handle this, so we used 7-Zip via the subprocess module.
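A sketch of that workaround (the `7z` executable name, flags shown, and password handling are assumptions; adjust for your installation):

```python
import subprocess

def extract_with_7zip(archive, password, dest='.'):
    """Build a 7-Zip command line: 'x' extracts with full paths,
    -p supplies the archive password, -o the output directory."""
    cmd = ['7z', 'x', '-p' + password, '-o' + dest, archive]
    return cmd

# Run it with, e.g.:
# subprocess.call(extract_with_7zip('NIS_2010_Core.exe', 'your-password'))
```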

We will use the struct module to parse the FWF files, following advice given on StackOverflow. For this, we need to encode the field widths in a format string that struct can use to parse each record.

In [3]:
def defineFormat(wt, keep=['DRG','HOSPID']):
    varnames = list(wt.varnames)
    widths = list(wt.widths)
    # Negative widths mark fields to skip; flip the sign on fields to keep
    fieldwidths = [-x for x in widths]
    for s in keep:
        fieldwidths[varnames.index(s)] *= -1
    # 'x' means skip that many characters, 's' means extract them as a string
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    return fmtstring

defineFormat(wt)
Out[3]:
'3x 3x 2x 2x 1x 3x 2x 2x 2x 11x 2x 2x 2x 2x 3s 3x 2x 3x 17x 5x 5x 5x 5x 5x 5x 5x 5x 5x 5x 5x 5x 5x 5x 5x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 5x 5x 5x 5x 2x 4x 4x 4x 4x 2x 3x 2x 5s 2x 14x 5x 6x 2x 2x 5x 5x 2x 2x 3x 2x 4x 2x 2x 2x 10x 2x 10x 3x 4x 4x 4x 4x 4x 4x 4x 4x 4x 4x 4x 4x 4x 4x 4x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 3x 1x 8x 2x 10x 15x 2x 4x 1x'

This allows us to mask variables we don't need (all the x's), and extract variables we do need as strings.
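A minimal demonstration of the idea on a toy record (the field layout here is invented for the example):

```python
import struct

# Keep a 3-char field, skip 2 chars, keep a 5-char field
parse = struct.Struct('3s 2x 5s').unpack_from
fields = parse(b'500xx12345')
# fields == (b'500', b'12345')
```

Whitespace between format codes is ignored by struct, which is why the space-separated string produced above works as-is.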

We can then define a parser using this format:

In [4]:
import struct
fieldstruct = struct.Struct(defineFormat(wt))
parse = fieldstruct.unpack_from

We can now extract data from the compressed data file. We will also filter the data to only keep data with DRG code 500.

In [5]:
import zipfile
dat = []
drgcodes = ['500']
with zipfile.ZipFile('NIS_2008_Core.exe','r') as zf:
    with zf.open('NIS_2008_Core.ASC', 'r') as f:
        for line in f:
            x = list(parse(line))
            if x[0] in drgcodes:
                dat.append(x)

dat = pd.DataFrame(dat, columns = ['DRG','HOSPID'])

We now have to compute the volumes for each hospital, categorize them as Low, Medium and High, and then find the frequency of each category.

In [6]:
def categorize(s, bins=[0,10,20,5000], labels=['Low','Medium','High']):
    """Split hospital volumes into categories."""
    out = pd.cut(s, bins=bins, labels=labels)
    return out

volumes = categorize(dat.HOSPID.value_counts())
freq = volumes.value_counts()
freq
Out[6]:
Low       365
Medium     10
High        1
dtype: int64
In [7]:
%matplotlib inline
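To visualize these frequencies, a quick bar chart via pandas (a sketch; the values are the ones from the frequency table above, and the axis label is my addition):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs as a script
import pandas as pd

# Frequency table from the output above
freq = pd.Series([365, 10, 1], index=['Low', 'Medium', 'High'])
ax = freq.plot(kind='bar')
ax.set_ylabel('Number of hospitals')
```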