This notebook walks you through how to load the NIH ChestX-ray14 dataset!
First, we'll download the data. Then, we'll load the data into FiftyOne.
Note: You can also browse this dataset for free at try.fiftyone.ai!
To run this code, you will need to install the FiftyOne open source library for dataset curation.
!pip install fiftyone
We will import all of the necessary modules:
from glob import glob
import os
import subprocess
import urllib.request
import numpy as np
import pandas as pd
from PIL import Image
from tqdm.notebook import tqdm
import fiftyone as fo
from fiftyone import ViewField as F
All of the raw data is hosted by the NIH on its Box site.
Download the following files:
Data_Entry_2017_v2020.csv
BBox_List_2017.csv
train_val_list.txt
test_list.txt
Run the following cell to batch download the .tar.gz archives containing the X-ray images:
# URLs for the zip files
links = [
'https://nihcc.box.com/shared/static/vfk49d74nhbxq3nqjg0900w5nvkorp5c.gz',
'https://nihcc.box.com/shared/static/i28rlmbvmfjbl8p2n3ril0pptcmcu9d1.gz',
'https://nihcc.box.com/shared/static/f1t00wrtdk94satdfb9olcolqx20z2jp.gz',
'https://nihcc.box.com/shared/static/0aowwzs5lhjrceb3qp67ahp0rd1l1etg.gz',
'https://nihcc.box.com/shared/static/v5e3goj22zr6h8tzualxfsqlqaygfbsn.gz',
'https://nihcc.box.com/shared/static/asi7ikud9jwnkrnkj99jnpfkjdes7l6l.gz',
'https://nihcc.box.com/shared/static/jn1b4mw4n6lnh74ovmcjb8y48h8xj07n.gz',
'https://nihcc.box.com/shared/static/tvpxmn7qyrgl0w8wfh9kqfjskv6nmm1j.gz',
'https://nihcc.box.com/shared/static/upyy3ml7qdumlgk2rfcvlb9k6gvqq2pj.gz',
'https://nihcc.box.com/shared/static/l6nilvfa9cg3s28tqv1qc1olm3gnz54p.gz',
'https://nihcc.box.com/shared/static/hhq8fkdgvcari67vfhs7ppg2w6ni4jze.gz',
'https://nihcc.box.com/shared/static/ioqwiy20ihqwyr8pf4c24eazhh281pbu.gz'
]
for idx, link in enumerate(links):
    fn = 'images_%02d.tar.gz' % (idx + 1)
    print('downloading ' + fn + '...')
    urllib.request.urlretrieve(link, fn)  # download the archive
print("Download complete. Please verify the checksums")
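The NIH Box folder also publishes MD5 checksums for these archives. A minimal sketch for verifying a download, assuming you copy the expected digest from the NIH's checksum file (the hash value shown below is a placeholder, not a real checksum):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

## compare against the digest published by the NIH (placeholder value):
# expected = "0123456789abcdef0123456789abcdef"
# assert md5sum("images_01.tar.gz") == expected
```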
Then extract the archives:
for file in glob('*.tar.gz'):
    directory = file.rsplit('.', 2)[0]
    os.makedirs(directory, exist_ok=True)
    subprocess.run(['tar', '-xzf', file, '-C', directory])
And move all of the images into a common images folder:
os.makedirs("images", exist_ok=True)
for image_dir in glob('images_*/'):
    os.system(f"mv {image_dir}images/* images/")
    os.system(f"rm -r {image_dir}")
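The shell commands above assume a Unix-like system with `mv` and `rm` available. A portable alternative using Python's `shutil` (a sketch assuming the same `images_XX/images/` layout produced by extraction) could look like:

```python
import os
import shutil
from glob import glob

def consolidate_images(src_pattern="images_*/", dest="images"):
    """Move every image from each extracted archive into one common folder."""
    os.makedirs(dest, exist_ok=True)
    for image_dir in glob(src_pattern):
        inner = os.path.join(image_dir, "images")
        for path in glob(os.path.join(inner, "*")):
            shutil.move(path, os.path.join(dest, os.path.basename(path)))
        shutil.rmtree(image_dir)  # remove the now-empty archive folder
```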
By the end of this notebook, we will have built a dataset that looks like this:
import fiftyone as fo

dataset = fo.load_dataset("CXR8")
dataset
Name:        CXR8
Media type:  image
Num samples: 112120
Persistent:  True
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    patient_id:       fiftyone.core.fields.StringField
    view_position:    fiftyone.core.fields.StringField
    patient_age:      fiftyone.core.fields.IntField
    patient_gender:   fiftyone.core.fields.StringField
    follow_up_number: fiftyone.core.fields.IntField
    findings:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classifications)
    detection:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detection)
dataset.distinct("findings.classifications.label")
['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Effusion', 'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration', 'Mass', 'No Finding', 'Nodule', 'Pleural_Thickening', 'Pneumonia', 'Pneumothorax']
Now we can create a dataset from this image directory:
dataset = fo.Dataset.from_images_dir("images")
dataset.name = "ChestX-ray14"
dataset.persistent = True
Now let's add in the split information ("train" vs "test") as tags:
dirpath = os.path.dirname(dataset.first().filepath)

## read the official split files downloaded above
with open("train_val_list.txt") as f:
    train_filenames = f.read().splitlines()
with open("test_list.txt") as f:
    test_filenames = f.read().splitlines()

test_filepaths = [
    os.path.join(dirpath, f) for f in test_filenames
]
train_filepaths = [
    os.path.join(dirpath, f) for f in train_filenames
]

for fp in tqdm(train_filepaths):
    sample = dataset[fp]
    sample.tags.append("train")
    sample.save()

for fp in tqdm(test_filepaths):
    sample = dataset[fp]
    sample.tags.append("test")
    sample.save()
Next, let's add in basic attributes:
## load as pandas dataframe
attributes_df = pd.read_csv("Data_Entry_2017_v2020.csv")
## add fields to dataset
dataset.add_sample_field("follow_up_number", fo.IntField)
dataset.add_sample_field("patient_id", fo.StringField)
dataset.add_sample_field("view_position", fo.StringField)
dataset.add_sample_field("patient_age", fo.IntField)
dataset.add_sample_field("patient_gender", fo.StringField)
## iterate through rows of the dataframe
for _, row in tqdm(attributes_df.iterrows(), total=len(attributes_df)):
    age, gender, view_pos = row[['Patient Age', 'Patient Gender', 'View Position']]
    pid, fup = row[['Patient ID', 'Follow-up #']]
    findings = row['Finding Labels'].split('|')
    filename = row['Image Index']
    fp = os.path.join(dirpath, filename)
    classifs = fo.Classifications(
        classifications=[
            fo.Classification(label=l) for l in findings
        ]
    )
    sample = dataset[fp]
    sample["patient_age"] = age
    sample["patient_gender"] = gender
    sample["view_position"] = view_pos
    sample["patient_id"] = str(pid)
    sample["follow_up_number"] = int(fup)
    sample["findings"] = classifs
    sample.save()
Finally, let's add in the detection bounding boxes. There are fewer than 1,000 of them:
## compute metadata so we have width and height
dataset.compute_metadata()
## load the bounding box data
bbox_df = pd.read_csv('BBox_List_2017.csv')
## create a new field called "detection" that contains the bounding box
for _, row in bbox_df.iterrows():
    fp = os.path.join(dirpath, row["Image Index"])
    sample = dataset[fp]

    ## the CSV's header is malformed, so the coordinate columns are
    ## literally named "Bbox [x", "y", "w", and "h]"
    box_x = row["Bbox [x"]
    box_y = row["y"]
    box_w = row["w"]
    box_h = row["h]"]
    label = row["Finding Label"]

    ## FiftyOne expects [x, y, w, h] in relative (0-1) coordinates
    image_w, image_h = sample.metadata.width, sample.metadata.height
    bounding_box = [box_x / image_w, box_y / image_h, box_w / image_w, box_h / image_h]

    sample["detection"] = fo.Detection(
        label=label,
        bounding_box=bounding_box,
    )
    sample.save()
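Since `fo.Detection` expects relative `[x, y, w, h]` coordinates in `[0, 1]` while `BBox_List_2017.csv` stores absolute pixels, the conversion above can also be factored into a small standalone helper (pure Python; the function name is illustrative):

```python
def to_relative_box(x, y, w, h, image_w, image_h):
    """Convert an absolute-pixel box to relative [x, y, w, h] in [0, 1]."""
    return [x / image_w, y / image_h, w / image_w, h / image_h]

# A 256x256 box whose top-left corner is (512, 256) in a 1024x1024 X-ray:
# to_relative_box(512, 256, 256, 256, 1024, 1024) -> [0.5, 0.25, 0.25, 0.25]
```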
Now we can visualize the data in the FiftyOne App:
session = fo.launch_app(dataset, auto=False)
Session launched. Run `session.show()` to open the App in a cell output.
Depending on what analysis we are performing, it may be helpful to look at the results for each patient individually. We can achieve this by dynamically grouping by patient_id and ordering by follow_up_number.
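In recent FiftyOne versions this can also be done programmatically with a `group_by()` view stage, e.g. `dataset.group_by("patient_id", order_by="follow_up_number")`. Conceptually, the operation behaves like Python's `itertools.groupby` over sorted records, sketched here with hypothetical stand-ins for the samples:

```python
from itertools import groupby

# Hypothetical records standing in for FiftyOne samples
records = [
    {"patient_id": "00000002", "follow_up_number": 1},
    {"patient_id": "00000001", "follow_up_number": 2},
    {"patient_id": "00000001", "follow_up_number": 0},
]

# Sort by patient, then by follow-up, and group consecutive rows per patient
records.sort(key=lambda r: (r["patient_id"], r["follow_up_number"]))
groups = {
    pid: [r["follow_up_number"] for r in rows]
    for pid, rows in groupby(records, key=lambda r: r["patient_id"])
}
# groups == {"00000001": [0, 2], "00000002": [1]}
```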