MIMIC Notes Pre-Processing¶

Pre-processing MIMIC notes for further use.

Below is a list of redacted items with an example and the replacement token. Replacement tokens are changeable. Check preprocess_notes.py for more details.

Redacted items:

First Name: [**First Name (Titles) 137**], xxname
Last Name: [**Last Name (Titles) **], xxln
Initials: [**Initials (NamePattern4) **], xxinit
Name: [**Name (NI) **], xxname
Doctor First Name: [**Doctor First Name 1266**], xxdocfn
Doctor Last Name: [**Doctor Last Name 1266**], xxdocln
Known Last Name: [**Known lastname 658**], xxln
Hospital: [**Hospital1 **], xxhosp
Hospital Unit Name: **Hospital Unit Name 10**, xxhosp
Company: [**Company 12924**], xxwork
University/College: [**University/College **], xxwork
Date of format YYYY-M-DD: [**2112-4-18**], xxdate
Year: [**Year (4 digits) **], xxyear
Year YYYY format: [**2119**], xxyear - I use a regex \b\d{4}\b that will match any 4 digits which might be problematic, but for the most part 4 digits by itself seems to indicate a year.
Date of format M-DD: [**6-12**], [**12/2151**], xxmmdd
Month/Day: [**Month/Day (2) 509**], xxmmdd
Month (only): [**Month (only) 51**], xxmonth
Holiday: [**Holiday 3470**], xxhols
Date Range: [**Date range (1) 7610**], xxdtrnge
Country: [**Country 9958**], xxcntry
State: [**State 3283**], xxstate
Location: **Location (un) 2432**, xxloc
Telephone/Fax: [**Telephone/Fax (3) 8049**], xxph
Clip Number: [**Clip Number (Radiology) 29923**], xxradclip
Pager Numeric Identifier: [**Numeric Identifier 6403**], xxpager
Pager Number: [**Pager number 13866**], xxpager
Social Security Number: [**Security Number 10198**], xxssn
Serial Number: [**Serial Number 3567**], xxsno
Medical Record Number: [**Medical Record Number **], xxmrno
Provider Number: [**Provider Number 12521**], xxpno
Age over 90: [**Age over 90 **], xxage90
Contact Info: [**Contact Info **], xxcontact
Job Number: [**Job Number **], xxjobno
Dictator Number: [**Dictator Info **], xxdict
Pharmacy MD Number/MD number: [**Pharmacy MD Number **], xxmdno
Time: 12:52 PM, split into 6 segments by the hour and replace with the following tokens: midnight, dawn, forenoon, afternoon, dusk, night
2-digit Numbers: [** 84 **], xx2digit
3-digit Numbers: [** 834 **], xx3digit
Wardname

886 notes are marked incorrect with iserror flag set to 1. Thus, there are total of 2,082,294 notes. I have set up a view called correctnotes in the database, which only includese the correct notes. All the data I grab is from that view.

Imports and Inits¶

In [1]:

import pandas as pd
import psycopg2
import numpy as np
import re
import random
import datetime
from pathlib import Path
import pickle
import numpy as np

Softlink ln -s your data path to a data variable in the current folder. That way we don't need to change the path in the notebook.

In [2]:

PATH = Path('data')
!ls {PATH}

In [3]:

from preprocess_notes import *

Grab Data from MIMIC¶

From Database¶

Here the data is grabbed from the MIMIC database. Data can also be grabbed from other sources

In [4]:

%%time
cats = pd.read_csv('note_categories.csv')
max_limit = -1

queries = []
for category, n_notes in zip(cats['category'], cats['number_of_notes']):
    limit = min(max_limit, n_notes) if max_limit > 0 else n_notes
    if limit == max_limit:
        q = f"""
        select * from correctnotes where category=\'{category}\' order by random() limit {limit};
        """
    else:
        q = f"""
        select * from correctnotes where category=\'{category}\';
        """
    queries.append(q)

dfs = []

con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
for q in queries:
    df = pd.read_sql_query(q, con)
    dfs.append(df)
con.close()
    
df = pd.concat(dfs)
print(df.shape)

(2082294, 10)
CPU times: user 12.1 s, sys: 2.58 s, total: 14.7 s
Wall time: 1min 24s

From Notes File¶

In [ ]:

%%time
df = pd.read_csv(PATH/'NOTEEVENTS.csv.gz')
print(df.shape)

Preprocess¶

In [5]:

df.columns = map(str.lower, df.columns)
df.set_index('row_id', inplace=True)
print(df.shape)

(2082294, 9)

Confirm that the number of notes match the actual number.

In [6]:

df[['category', 'text']].groupby(['category']).agg(['count'])

Out[6]:

	text
	count
category
Case Management	953
Consult	98
Discharge summary	59652
ECG	209051
Echo	45794
General	8236
Nursing	223182
Nursing/other	822497
Nutrition	9400
Pharmacy	101
Physician	141281
Radiology	522279
Rehab Services	5408
Respiratory	31701
Social Work	2661

In [7]:

%%time
df['proc_text'] = df['text'].apply(preprocess_note)

with open(PATH/'preprocessed_noteevents.pkl', 'wb') as f:
    pickle.dump(df, f)

CPU times: user 24min 5s, sys: 3.86 s, total: 24min 9s
Wall time: 24min 55s

Create datasets for Language Modeling¶

To follow the FastAI language modeling lesson, I've created a subset of the original dataframe to sample for the datasets. In particular, I've included the description and preprocessed_text fields in the datasets. The description column is composed of free-text and has 3840 unique descriptions. I consider the description as a unique field which will be marked as such during tokenization as done in the FastAI library.

In [8]:

sub_df = pd.DataFrame({'proc_text': df['proc_text'], 'category': df['category'], 'description': df['description'], 'labels': [0]*len(df)},\
                      columns=['labels', 'category', 'description', 'proc_text'])
sub_df.sample(5)

Out[8]:

	labels	category	description	proc_text
row_id
2056836	0	Nursing/other	Report	NPN\n\n\n#2 Resp: Breathing comfortably on roo...
854474	0	Radiology	CHEST (PA & LAT)	xxdate xxafter\n CHEST (PA & LAT) ...
116729	0	ECG	Report	Baseline artifact is present. Sinus rhythm. Ve...
722400	0	Nursing	Nursing Progress Note	Altered mental status (not Delirium)\n Asses...
921412	0	Radiology	CT HEAD W/O CONTRAST	xxdate xxafter\n CT HEAD W/O CONTRAST ...

Now we can just do a train/test split on the entire dataset for getting a 90/10 training and testing dataset. However, I would like the train/test set have a 90%/10% split in each category. So I chose to iterate over each entry of the category column and create masks to split data with a 90/10 split for training and testing so that I grab 10% of texts in each category for testing instead of a global 10%.

Set random seed for reproducible results.

In [9]:

%%time
np.random.seed(42)

dfs = [sub_df.loc[df['category'] == c] for c in sub_df['category'].unique()]
msks = [np.random.rand(len(d)) < 0.9 for d in dfs]

train_dfs = [None] * len(dfs)
val_dfs = [None] * len(dfs)

for i in range(len(dfs)):
    idf = dfs[i]
    mask = msks[i]
    train_dfs[i] = idf[mask]
    val_dfs[i] = idf[~mask]
    
train_df = pd.concat(train_dfs)
val_df = pd.concat(val_dfs)

print(len(train_df), (len(df) - len(df)//10), len(train_df)-(len(df) - len(df)//10))
print(len(val_df), (len(df)//10), len(val_df)-(len(df)//10))    

1874270 1874065 205
208024 208229 -205
CPU times: user 3.17 s, sys: 104 ms, total: 3.27 s
Wall time: 2.65 s

Sanity check the aggregate count for each category over the 3 dataframes. Then write the train and val dataframes to disk.

In [10]:

val_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

Out[10]:

	proc_text
	count
category
Case Management	96
Consult	11
Discharge summary	5902
ECG	20977
Echo	4446
General	801
Nursing	22307
Nursing/other	82222
Nutrition	963
Pharmacy	4
Physician	13982
Radiology	52337
Rehab Services	522
Respiratory	3194
Social Work	260

In [11]:

train_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

Out[11]:

	proc_text
	count
category
Case Management	857
Consult	87
Discharge summary	53750
ECG	188074
Echo	41348
General	7435
Nursing	200875
Nursing/other	740275
Nutrition	8437
Pharmacy	97
Physician	127299
Radiology	469942
Rehab Services	4886
Respiratory	28507
Social Work	2401

In [12]:

sub_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

Out[12]:

	proc_text
	count
category
Case Management	953
Consult	98
Discharge summary	59652
ECG	209051
Echo	45794
General	8236
Nursing	223182
Nursing/other	822497
Nutrition	9400
Pharmacy	101
Physician	141281
Radiology	522279
Rehab Services	5408
Respiratory	31701
Social Work	2661

In [13]:

%%time
train_df[['labels', 'description', 'proc_text']].to_csv(PATH/'train.csv', header=False, index=False)
val_df[['labels', 'description', 'proc_text']].to_csv(PATH/'test.csv', header=False, index=False)

CPU times: user 1min 7s, sys: 2.64 s, total: 1min 10s
Wall time: 1min 13s