Pre-processing MIMIC notes for further use.
Below is a list of redacted items with an example and the replacement token. Replacement tokens are changeable. Check preprocess_notes.py
for more details.
Redacted items:
[**First Name (Titles) 137**]
, xxname
[**Last Name (Titles) **]
, xxln
[**Initials (NamePattern4) **]
, xxinit
[**Name (NI) **]
, xxname
[**Doctor First Name 1266**]
, xxdocfn
[**Doctor Last Name 1266**]
, xxdocln
[**Known lastname 658**]
, xxln
[**Hospital1 **]
, xxhosp
**Hospital Unit Name 10**
, xxhosp
[**Company 12924**]
, xxwork
[**University/College **]
, xxwork
[**2112-4-18**]
, xxdate
[**Year (4 digits) **]
, xxyear
[**2119**]
, xxyear
- I use a regex \b\d{4}\b
that will match any 4 digits which might be problematic, but for the most part 4 digits by itself seems to indicate a year.[**6-12**]
, [**12/2151**]
, xxmmdd
[**Month/Day (2) 509**]
, xxmmdd
[**Month (only) 51**]
, xxmonth
[**Holiday 3470**]
, xxhols
[**Date range (1) 7610**]
, xxdtrnge
[**Country 9958**]
, xxcntry
[**State 3283**]
, xxstate
**Location (un) 2432**
, xxloc
[**Telephone/Fax (3) 8049**]
, xxph
[**Clip Number (Radiology) 29923**]
, xxradclip
[**Numeric Identifier 6403**]
, xxpager
[**Pager number 13866**]
, xxpager
[**Security Number 10198**]
, xxssn
[**Serial Number 3567**]
, xxsno
[**Medical Record Number **]
, xxmrno
[**Provider Number 12521**]
, xxpno
[**Age over 90 **]
, xxage90
[**Contact Info **]
, xxcontact
[**Job Number **]
, xxjobno
[**Dictator Info **]
, xxdict
[**Pharmacy MD Number **]
, xxmdno
12:52 PM
, split into 6 segments by the hour and replace with the following tokens: midnight, dawn, forenoon, afternoon, dusk, night
[** 84 **]
, xx2digit
[** 834 **]
, xx3digit
886
notes are marked incorrect with iserror
flag set to 1. Thus, there are total of 2,082,294
notes. I have set up a view
called correctnotes
in the database, which only includese the correct notes. All the data I grab is from that view
.
import pandas as pd
import psycopg2
import numpy as np
import re
import random
import datetime
from pathlib import Path
import pickle
import numpy as np
Softlink ln -s
your data path to a data
variable in the current folder. That way we don't need to change the path in the notebook.
PATH = Path('data')
!ls {PATH}
from preprocess_notes import *
Here the data is grabbed from the MIMIC database. Data can also be grabbed from other sources
%%time
cats = pd.read_csv('note_categories.csv')
max_limit = -1
queries = []
for category, n_notes in zip(cats['category'], cats['number_of_notes']):
limit = min(max_limit, n_notes) if max_limit > 0 else n_notes
if limit == max_limit:
q = f"""
select * from correctnotes where category=\'{category}\' order by random() limit {limit};
"""
else:
q = f"""
select * from correctnotes where category=\'{category}\';
"""
queries.append(q)
dfs = []
con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
for q in queries:
df = pd.read_sql_query(q, con)
dfs.append(df)
con.close()
df = pd.concat(dfs)
print(df.shape)
(2082294, 10) CPU times: user 12.1 s, sys: 2.58 s, total: 14.7 s Wall time: 1min 24s
%%time
df = pd.read_csv(PATH/'NOTEEVENTS.csv.gz')
print(df.shape)
df.columns = map(str.lower, df.columns)
df.set_index('row_id', inplace=True)
print(df.shape)
(2082294, 9)
Confirm that the number of notes match the actual number.
df[['category', 'text']].groupby(['category']).agg(['count'])
text | |
---|---|
count | |
category | |
Case Management | 953 |
Consult | 98 |
Discharge summary | 59652 |
ECG | 209051 |
Echo | 45794 |
General | 8236 |
Nursing | 223182 |
Nursing/other | 822497 |
Nutrition | 9400 |
Pharmacy | 101 |
Physician | 141281 |
Radiology | 522279 |
Rehab Services | 5408 |
Respiratory | 31701 |
Social Work | 2661 |
%%time
df['proc_text'] = df['text'].apply(preprocess_note)
with open(PATH/'preprocessed_noteevents.pkl', 'wb') as f:
pickle.dump(df, f)
CPU times: user 24min 5s, sys: 3.86 s, total: 24min 9s Wall time: 24min 55s
To follow the FastAI language modeling lesson, I've created a subset of the original dataframe to sample for the datasets. In particular, I've included the description
and preprocessed_text
fields in the datasets. The description
column is composed of free-text and has 3840
unique descriptions. I consider the description as a unique field
which will be marked as such during tokenization as done in the FastAI library.
sub_df = pd.DataFrame({'proc_text': df['proc_text'], 'category': df['category'], 'description': df['description'], 'labels': [0]*len(df)},\
columns=['labels', 'category', 'description', 'proc_text'])
sub_df.sample(5)
labels | category | description | proc_text | |
---|---|---|---|---|
row_id | ||||
2056836 | 0 | Nursing/other | Report | NPN\n\n\n#2 Resp: Breathing comfortably on roo... |
854474 | 0 | Radiology | CHEST (PA & LAT) | xxdate xxafter\n CHEST (PA & LAT) ... |
116729 | 0 | ECG | Report | Baseline artifact is present. Sinus rhythm. Ve... |
722400 | 0 | Nursing | Nursing Progress Note | Altered mental status (not Delirium)\n Asses... |
921412 | 0 | Radiology | CT HEAD W/O CONTRAST | xxdate xxafter\n CT HEAD W/O CONTRAST ... |
Now we can just do a train/test split on the entire dataset for getting a 90/10 training and testing dataset. However, I would like the train/test set have a 90%/10% split in each category. So I chose to iterate over each entry of the category
column and create masks to split data with a 90/10 split for training and testing so that I grab 10% of texts in each category for testing instead of a global 10%.
Set random seed for reproducible results.
%%time
np.random.seed(42)
dfs = [sub_df.loc[df['category'] == c] for c in sub_df['category'].unique()]
msks = [np.random.rand(len(d)) < 0.9 for d in dfs]
train_dfs = [None] * len(dfs)
val_dfs = [None] * len(dfs)
for i in range(len(dfs)):
idf = dfs[i]
mask = msks[i]
train_dfs[i] = idf[mask]
val_dfs[i] = idf[~mask]
train_df = pd.concat(train_dfs)
val_df = pd.concat(val_dfs)
print(len(train_df), (len(df) - len(df)//10), len(train_df)-(len(df) - len(df)//10))
print(len(val_df), (len(df)//10), len(val_df)-(len(df)//10))
1874270 1874065 205 208024 208229 -205 CPU times: user 3.17 s, sys: 104 ms, total: 3.27 s Wall time: 2.65 s
Sanity check the aggregate count for each category over the 3 dataframes. Then write the train
and val
dataframes to disk.
val_df[['category', 'proc_text']].groupby(['category']).agg(['count'])
proc_text | |
---|---|
count | |
category | |
Case Management | 96 |
Consult | 11 |
Discharge summary | 5902 |
ECG | 20977 |
Echo | 4446 |
General | 801 |
Nursing | 22307 |
Nursing/other | 82222 |
Nutrition | 963 |
Pharmacy | 4 |
Physician | 13982 |
Radiology | 52337 |
Rehab Services | 522 |
Respiratory | 3194 |
Social Work | 260 |
train_df[['category', 'proc_text']].groupby(['category']).agg(['count'])
proc_text | |
---|---|
count | |
category | |
Case Management | 857 |
Consult | 87 |
Discharge summary | 53750 |
ECG | 188074 |
Echo | 41348 |
General | 7435 |
Nursing | 200875 |
Nursing/other | 740275 |
Nutrition | 8437 |
Pharmacy | 97 |
Physician | 127299 |
Radiology | 469942 |
Rehab Services | 4886 |
Respiratory | 28507 |
Social Work | 2401 |
sub_df[['category', 'proc_text']].groupby(['category']).agg(['count'])
proc_text | |
---|---|
count | |
category | |
Case Management | 953 |
Consult | 98 |
Discharge summary | 59652 |
ECG | 209051 |
Echo | 45794 |
General | 8236 |
Nursing | 223182 |
Nursing/other | 822497 |
Nutrition | 9400 |
Pharmacy | 101 |
Physician | 141281 |
Radiology | 522279 |
Rehab Services | 5408 |
Respiratory | 31701 |
Social Work | 2661 |
%%time
train_df[['labels', 'description', 'proc_text']].to_csv(PATH/'train.csv', header=False, index=False)
val_df[['labels', 'description', 'proc_text']].to_csv(PATH/'test.csv', header=False, index=False)
CPU times: user 1min 7s, sys: 2.64 s, total: 1min 10s Wall time: 1min 13s