This notebook walks you through how to load the NIH ChestX-ray14 dataset!
First, we'll download the data. Then, we'll load the data into FiftyOne.
Note: You can also browse this dataset for free at try.fiftyone.ai!
To run this code, you will need to install the FiftyOne open source library for dataset curation.
!pip install fiftyone
We will import all of the necessary modules:
from glob import glob
import os
import subprocess
import urllib.request
import numpy as np
import pandas as pd
from PIL import Image
from tqdm.notebook import tqdm
import fiftyone as fo
from fiftyone import ViewField as F
All of the raw data is hosted by the NIH on its Box site.
Download the following files:
Data_Entry_2017_v2020.csv
BBox_List_2017.csv
train_val_list.txt
test_list.txt
Run the following cell to batch download the .tar.gz archives containing the X-ray images:
# URLs for the zip files
links = [
'https://nihcc.box.com/shared/static/vfk49d74nhbxq3nqjg0900w5nvkorp5c.gz',
'https://nihcc.box.com/shared/static/i28rlmbvmfjbl8p2n3ril0pptcmcu9d1.gz',
'https://nihcc.box.com/shared/static/f1t00wrtdk94satdfb9olcolqx20z2jp.gz',
'https://nihcc.box.com/shared/static/0aowwzs5lhjrceb3qp67ahp0rd1l1etg.gz',
'https://nihcc.box.com/shared/static/v5e3goj22zr6h8tzualxfsqlqaygfbsn.gz',
'https://nihcc.box.com/shared/static/asi7ikud9jwnkrnkj99jnpfkjdes7l6l.gz',
'https://nihcc.box.com/shared/static/jn1b4mw4n6lnh74ovmcjb8y48h8xj07n.gz',
'https://nihcc.box.com/shared/static/tvpxmn7qyrgl0w8wfh9kqfjskv6nmm1j.gz',
'https://nihcc.box.com/shared/static/upyy3ml7qdumlgk2rfcvlb9k6gvqq2pj.gz',
'https://nihcc.box.com/shared/static/l6nilvfa9cg3s28tqv1qc1olm3gnz54p.gz',
'https://nihcc.box.com/shared/static/hhq8fkdgvcari67vfhs7ppg2w6ni4jze.gz',
'https://nihcc.box.com/shared/static/ioqwiy20ihqwyr8pf4c24eazhh281pbu.gz'
]
for idx, link in enumerate(links):
    fn = 'images_%02d.tar.gz' % (idx + 1)
    print('downloading ' + fn + '...')
    urllib.request.urlretrieve(link, fn)  # download the archive
print("Download complete. Please verify the checksums")
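The NIH Box folder also publishes MD5 checksums for these archives. A minimal sketch for verifying a download, assuming you copy the expected digest from the NIH's checksum file (the hash value shown below is a placeholder, not a real checksum):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

## compare against the digest published by the NIH (placeholder value):
# expected = "0123456789abcdef0123456789abcdef"
# assert md5sum("images_01.tar.gz") == expected
```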
Then extract the archives:
for file in glob('*.tar.gz'):
    directory = file.rsplit('.', 2)[0]
    os.makedirs(directory, exist_ok=True)
    subprocess.run(['tar', '-xzf', file, '-C', directory])
And move all of the images into a common images folder:
os.makedirs("images", exist_ok=True)
for image_dir in glob('images_*/'):
    os.system(f"mv {image_dir}images/* images/")
    os.system(f"rm -r {image_dir}")
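The shell commands above assume a Unix-like system with `mv` and `rm` available. A portable alternative using Python's `shutil` (a sketch assuming the same `images_XX/images/` layout produced by extraction) could look like:

```python
import os
import shutil
from glob import glob

def consolidate_images(src_pattern="images_*/", dest="images"):
    """Move every image from each extracted archive into one common folder."""
    os.makedirs(dest, exist_ok=True)
    for image_dir in glob(src_pattern):
        inner = os.path.join(image_dir, "images")
        for path in glob(os.path.join(inner, "*")):
            shutil.move(path, os.path.join(dest, os.path.basename(path)))
        shutil.rmtree(image_dir)  # remove the now-empty archive folder
```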
By the end of this notebook, we will have built a dataset that looks like this:
import fiftyone as fo

dataset = fo.load_dataset("CXR8")
dataset
Name:        CXR8
Media type:  image
Num samples: 112120
Persistent:  True
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    patient_id:       fiftyone.core.fields.StringField
    view_position:    fiftyone.core.fields.StringField
    patient_age:      fiftyone.core.fields.IntField
    patient_gender:   fiftyone.core.fields.StringField
    follow_up_number: fiftyone.core.fields.IntField
    findings:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classifications)
    detection:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detection)
dataset.distinct("findings.classifications.label")
['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Effusion', 'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration', 'Mass', 'No Finding', 'Nodule', 'Pleural_Thickening', 'Pneumonia', 'Pneumothorax']
Now we can create a dataset from this image directory:
dataset = fo.Dataset.from_images_dir("images")
dataset.name = "ChestX-ray14"
dataset.persistent = True
Now let's add in the split information ("train" vs "test") as tags:
dirpath = os.path.dirname(dataset.first().filepath)

## read the official split files downloaded above
with open("train_val_list.txt") as f:
    train_filenames = f.read().splitlines()
with open("test_list.txt") as f:
    test_filenames = f.read().splitlines()

test_filepaths = [
    os.path.join(dirpath, f) for f in test_filenames
]
train_filepaths = [
    os.path.join(dirpath, f) for f in train_filenames
]

for fp in tqdm(train_filepaths):
    sample = dataset[fp]
    sample.tags.append("train")
    sample.save()

for fp in tqdm(test_filepaths):
    sample = dataset[fp]
    sample.tags.append("test")
    sample.save()
Next, let's add in basic attributes:
## load as pandas dataframe
attributes_df = pd.read_csv("Data_Entry_2017_v2020.csv")
## add fields to dataset
dataset.add_sample_field("follow_up_number", fo.IntField)
dataset.add_sample_field("patient_id", fo.StringField)
dataset.add_sample_field("view_position", fo.StringField)
dataset.add_sample_field("patient_age", fo.IntField)
dataset.add_sample_field("patient_gender", fo.StringField)
## iterate through rows of the dataframe
for _, row in tqdm(attributes_df.iterrows(), total=len(attributes_df)):
    age, gender, view_pos = row[['Patient Age', 'Patient Gender', 'View Position']]
    pid, fup = row[['Patient ID', 'Follow-up #']]
    findings = row['Finding Labels'].split('|')
    filename = row['Image Index']
    fp = os.path.join(dirpath, filename)
    classifs = fo.Classifications(
        classifications=[
            fo.Classification(label=l) for l in findings
        ]
    )
    sample = dataset[fp]
    sample["patient_age"] = age
    sample["patient_gender"] = gender
    sample["view_position"] = view_pos
    sample["patient_id"] = str(pid)
    sample["follow_up_number"] = int(fup)
    sample["findings"] = classifs
    sample.save()
Finally, let's add in the detection bounding boxes. There are fewer than 1,000 of them:
## compute metadata so we have width and height
dataset.compute_metadata()
## load the bounding box data
bbox_df = pd.read_csv('BBox_List_2017.csv')
## create a new field called "detection" that contains the bounding box
for _, row in bbox_df.iterrows():
    fp = os.path.join(dirpath, row["Image Index"])
    sample = dataset[fp]

    ## the CSV's header is malformed, so the coordinate columns are
    ## literally named "Bbox [x", "y", "w", and "h]"
    box_x = row["Bbox [x"]
    box_y = row["y"]
    box_w = row["w"]
    box_h = row["h]"]
    label = row["Finding Label"]

    ## FiftyOne expects [x, y, w, h] in relative (0-1) coordinates
    image_w, image_h = sample.metadata.width, sample.metadata.height
    bounding_box = [box_x / image_w, box_y / image_h, box_w / image_w, box_h / image_h]

    sample["detection"] = fo.Detection(
        label=label,
        bounding_box=bounding_box,
    )
    sample.save()
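Since `fo.Detection` expects relative `[x, y, w, h]` coordinates in `[0, 1]` while `BBox_List_2017.csv` stores absolute pixels, the conversion above can also be factored into a small standalone helper (pure Python; the function name is illustrative):

```python
def to_relative_box(x, y, w, h, image_w, image_h):
    """Convert an absolute-pixel box to relative [x, y, w, h] in [0, 1]."""
    return [x / image_w, y / image_h, w / image_w, h / image_h]

# A 256x256 box whose top-left corner is (512, 256) in a 1024x1024 X-ray:
# to_relative_box(512, 256, 256, 256, 1024, 1024) -> [0.5, 0.25, 0.25, 0.25]
```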
Now we can visualize the data in the FiftyOne App:
session = fo.launch_app(dataset, auto=False)
Session launched. Run `session.show()` to open the App in a cell output.
Depending on what analysis we are performing, it may be helpful to look at the results for each patient individually. We can achieve this by dynamically grouping by patient_id and ordering by follow_up_number.
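In recent FiftyOne versions this can also be done programmatically with a `group_by()` view stage, e.g. `dataset.group_by("patient_id", order_by="follow_up_number")`. Conceptually, the operation behaves like Python's `itertools.groupby` over sorted records, sketched here with hypothetical stand-ins for the samples:

```python
from itertools import groupby

# Hypothetical records standing in for FiftyOne samples
records = [
    {"patient_id": "00000002", "follow_up_number": 1},
    {"patient_id": "00000001", "follow_up_number": 2},
    {"patient_id": "00000001", "follow_up_number": 0},
]

# Sort by patient, then by follow-up, and group consecutive rows per patient
records.sort(key=lambda r: (r["patient_id"], r["follow_up_number"]))
groups = {
    pid: [r["follow_up_number"] for r in rows]
    for pid, rows in groupby(records, key=lambda r: r["patient_id"])
}
# groups == {"00000001": [0, 2], "00000002": [1]}
```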