This is a Jupyter Notebook that demonstrates how Python can be used to explore the content of a publicly available DICOM dataset stored on The Cancer Imaging Archive (TCIA) and described here: https://wiki.cancerimagingarchive.net/display/Public/QIN-HEADNECK.
This notebook was created in preparation for the DICOM4MICCAI tutorial at the MICCAI 2017 conference on Sept 10, 2017.
The tutorial was organized by the Quantitative Image Informatics for Cancer Research (QIICR) project funded by the Informatics Technology for Cancer Research (ITCR) program of the National Cancer Institute, award U24 CA180918.
More pointers related to the material covered in this notebook:
Questions, comments, suggestions, and corrections are welcome! Please email andrey.fedorov@gmail.com, or join the discussion on gitter!
The goal of this tutorial is to demonstrate how Python can be used to work with the data produced by quantitative image analysis and stored using the DICOM format.
You don't need to know much about DICOM to follow along, but you will need to learn more if you want to use DICOM in your work. You will find pointers in the Further reading section.
The dataset used in this tutorial is discussed in detail in this publication:
Fedorov A., Clunie D., Ulrich E., Bauer C., Wahle A., Brown B., Onken M., Riesmeier J., Pieper S., Kikinis R., Buatti J., Beichel RR. DICOM for quantitative imaging biomarker development: a standards based approach to sharing clinical data and structured PET/CT analysis results in head and neck cancer research. PeerJ 4:e2057, 2016. DOI: 10.7717/peerj.2057
Here is a bird's eye view of the QIN-HEADNECK dataset:
The DICOM dataset was converted into a collection of tables using this converter script: https://github.com/QIICR/dcm2tables. The script extracts data elements from the DICOM files and stores them as a collection of tab-delimited text files that follow this schema.
You can download the collection of the extracted tables here: https://github.com/fedorov/dicom4miccai-handson/releases/download/miccai2017/QIN-HEADNECK-Tables.tgz. Uncompress the file, note the location of the resulting directory, and set the value of the variable below to that location.
tablesPath = '/home/jovyan/data/QIN-HEADNECK-Tables'
# set this to your location of the tables if running locally
#tablesPath = '/Users/fedorov/github/dcm2tables/Tables'
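If you prefer, all of the extracted tables can be loaded in one pass. The load_tables helper below is a hypothetical convenience function (it is not part of dcm2tables); it simply reads every .tsv file in the tables directory:

```python
import glob
import os

import pandas as pd

def load_tables(tables_path):
    """Load every extracted .tsv table into a dict keyed by table name."""
    tables = {}
    for tsv in glob.glob(os.path.join(tables_path, '*.tsv')):
        name = os.path.splitext(os.path.basename(tsv))[0]
        tables[name] = pd.read_csv(tsv, sep='\t', low_memory=False)
    return tables

# e.g.: tables = load_tables(tablesPath); sorted(tables) lists the table names
```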
We will discuss the contents of the specific tables generated by this script later in the notebook, in the context where they are used.
In this demonstration we will use the following Python packages:
NOTE: there appears to be an issue using the (as of writing) latest 0.12.7 version of bokeh for some of the plotting operations in this notebook. If you are using a local installation of bokeh, you will need to make sure you are using bokeh 0.12.6!
If you are working with this notebook on your own system, you will need to install these packages before they can be imported!
Run the cell below to confirm that all prerequisite packages are installed properly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.models import ColumnDataSource, OpenURL, TapTool
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook
from bokeh.colors import RGB
output_notebook()
In this section we will explore the segmentation-derived measurements originally stored in DICOM SR documents. Specifically, we will create an interactive plot that summarizes the variability of segmentation across different users, sessions and segmentation tools involved.
First, we will load various tables that will be needed for this task. You can always check out the schema of the tables extracted from the DICOM dataset at this link.
The flowchart below summarizes the sequence of steps and the tools involved in the processing throughout this tutorial.
In this specific dataset, all SR documents correspond to the segmentation-based measurement reports. To be more specific, all of these SR documents follow the same Structured Reporting template TID 1500. The overall relationship between segmentations and the content of the SR documents is illustrated in the figure below.
Each segmentation identifies a finding, which can be a primary neoplasm, a secondary neoplasm, or one of the reference regions. For each segmentation of the neoplasm done using each combination of {User, Segmentation tool, Segmentation session}, there is an SR document containing the measurements extracted from these segmentations.
Each SR document contains one or more measurement groups. Within each SR document, measurements are organized hierarchically into groups, such that measurements derived from the segmentation of a single finding are located together. Some of the information about the measurements, common across all measurements in the group, is defined at the group level.
The conversion script generated a separate table, SR1500_MeasurementGroups, for the measurement groups, where each row corresponds to a single measurement group.
SR1500_MeasurementGroups = pd.read_csv(tablesPath+'/SR1500_MeasurementGroups.tsv', sep='\t', low_memory=False)
SR1500_MeasurementGroups.columns
Index(['DeviceObserverName', 'FindingSite_CodeMeaning', 'FindingSite_CodeValue', 'FindingSite_CodingSchemeDesignator', 'Finding_CodeMeaning', 'Finding_CodeValue', 'Finding_CodingSchemeDesignator', 'ObserverType', 'PersonObserverName', 'SOPInstanceUID', 'TrackingIdentifier', 'TrackingUniqueIdentifier', 'activitySession', 'measurementMethod_CodeMeaning', 'measurementMethod_CodeValue', 'measurementMethod_CodingSchemeDesignator', 'timePoint'], dtype='object')
... and another table, SR1500_Measurements, for storing the individual measurements, one row per measurement.
SR1500_Measurements = pd.read_csv(tablesPath+'/SR1500_Measurements.tsv', sep='\t', low_memory=False)
SR1500_Measurements.columns
Index(['SOPInstanceUID', 'TrackingUniqueIdentifier', 'derivationModifier_CodeMeaning', 'derivationModifier_CodeValue', 'derivationModifier_CodingSchemeDesignator', 'quantity_CodeMeaning', 'quantity_CodeValue', 'quantity_CodingSchemeDesignator', 'units_CodeMeaning', 'units_CodeValue', 'units_CodingSchemeDesignator', 'value'], dtype='object')
For our task, it is important to associate the group-level properties of the measurements (e.g., activitySession and PersonObserverName) with the individual measurements. We can accomplish this with a pandas merge operation, using the combination of SOPInstanceUID and TrackingUniqueIdentifier as the merge keys.
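The mechanics of this merge can be illustrated with a toy example (the UIDs and values below are made up; only the column names follow the real schema):

```python
import pandas as pd

# Group-level attributes: one row per measurement group.
groups = pd.DataFrame({
    "SOPInstanceUID": ["1.2.3", "1.2.3"],
    "TrackingUniqueIdentifier": ["uid.a", "uid.b"],
    "PersonObserverName": ["User1", "User2"],
})

# Individual measurements: several rows can belong to one group.
measurements = pd.DataFrame({
    "SOPInstanceUID": ["1.2.3", "1.2.3", "1.2.3"],
    "TrackingUniqueIdentifier": ["uid.a", "uid.a", "uid.b"],
    "value": [10.0, 20.0, 30.0],
})

# Merging on the composite key attaches the group-level attributes
# to every individual measurement; the number of rows is unchanged.
merged = pd.merge(measurements, groups,
                  on=["SOPInstanceUID", "TrackingUniqueIdentifier"])
print(merged.shape)  # (3, 4)
```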
SR1500_Measurements.shape
(60531, 12)
Measurements_merged = pd.merge(SR1500_Measurements,SR1500_MeasurementGroups,on=["SOPInstanceUID","TrackingUniqueIdentifier"])
Measurements_merged.shape
(60531, 27)
Measurements_merged.columns
Index(['SOPInstanceUID', 'TrackingUniqueIdentifier', 'derivationModifier_CodeMeaning', 'derivationModifier_CodeValue', 'derivationModifier_CodingSchemeDesignator', 'quantity_CodeMeaning', 'quantity_CodeValue', 'quantity_CodingSchemeDesignator', 'units_CodeMeaning', 'units_CodeValue', 'units_CodingSchemeDesignator', 'value', 'DeviceObserverName', 'FindingSite_CodeMeaning', 'FindingSite_CodeValue', 'FindingSite_CodingSchemeDesignator', 'Finding_CodeMeaning', 'Finding_CodeValue', 'Finding_CodingSchemeDesignator', 'ObserverType', 'PersonObserverName', 'TrackingIdentifier', 'activitySession', 'measurementMethod_CodeMeaning', 'measurementMethod_CodeValue', 'measurementMethod_CodingSchemeDesignator', 'timePoint'], dtype='object')
There are different types of measurements, so let's first see what they are. Each measurement is defined by a combination of code tuples (read more about coding measurement quantities on p. 18 of this preprint article). We can look at all combinations of these codes to get a comprehensive list of the available measurements.
(Measurements_merged["quantity_CodeMeaning"].map(str)+"_"+Measurements_merged["derivationModifier_CodeMeaning"].map(str)).unique()
array(['SUVbw_Mean', 'SUVbw_Minimum', 'SUVbw_Maximum', 'Volume_nan', 'SUVbw_Standard Deviation', 'SUVbw_25th Percentile Value', 'SUVbw_Median', 'SUVbw_75th Percentile Value', 'SUVbw_Peak Value Within ROI', 'Total Lesion Glycolysis_nan', 'SUVbw_Upper Adjacent Value', 'SUVbw_RMS', 'Glycolysis Within First Quarter of Intensity Range_nan', 'Glycolysis Within Second Quarter of Intensity Range_nan', 'Glycolysis Within Third Quarter of Intensity Range_nan', 'Glycolysis Within Fourth Quarter of Intensity Range_nan', 'Percent Within First Quarter of Intensity Range_nan', 'Percent Within Second Quarter of Intensity Range_nan', 'Percent Within Third Quarter of Intensity Range_nan', 'Percent Within Fourth Quarter of Intensity Range_nan', 'Standardized Added Metabolic Activity_nan', 'Standardized Added Metabolic Activity Background_nan'], dtype=object)
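The _nan suffixes above are not an error: measurements such as Volume have no derivation modifier, so derivationModifier_CodeMeaning is missing (NaN) for them, and map(str) renders the missing value as the string "nan". A minimal sketch with two made-up rows:

```python
import numpy as np
import pandas as pd

# One measurement with a derivation modifier, one without.
df = pd.DataFrame({
    "quantity_CodeMeaning": ["SUVbw", "Volume"],
    "derivationModifier_CodeMeaning": ["Mean", np.nan],
})

combos = (df["quantity_CodeMeaning"].map(str) + "_" +
          df["derivationModifier_CodeMeaning"].map(str)).unique()
print(combos)  # ['SUVbw_Mean' 'Volume_nan']
```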
Let's start with the basics and look at the variability of the primary lesion volume measurement!
We know that there are multiple lesions for many of the subjects. All possible values for the finding are listed below; let's first consider only the primary lesion.
Measurements_merged["Finding_CodeMeaning"].unique()
array(['Reference Region', 'Neoplasm, Primary', 'Neoplasm, Secondary'], dtype=object)
We are almost ready to make the plot, but we are missing information about the patient for the individual measurements! This information is available in the CompositeContext table, which contains attributes related to the patient, study, and series, and which should be present in every (valid!) DICOM file. This table contains one row per DICOM instance. Let's load it and merge it with the individual measurements!
CompositeContext=pd.read_csv(tablesPath+'/CompositeContext.tsv', sep='\t',low_memory=False)
CompositeContext.columns
Index(['BodyPartExamined', 'ManufacturerModelName', 'Modality', 'PatientAge', 'PatientID', 'PatientName', 'PatientSex', 'PatientWeight', 'SOPClassUID', 'SOPInstanceUID', 'SeriesDate', 'SeriesDescription', 'SeriesInstanceUID', 'SeriesTime', 'SoftwareVersions', 'StudyDate', 'StudyDescription', 'StudyInstanceUID', 'StudyTime'], dtype='object')
Measurements_merged.shape
(60531, 27)
Measurements_merged = pd.merge(Measurements_merged, CompositeContext, on="SOPInstanceUID")
Measurements_merged.shape
(60531, 45)
Measurements_merged.columns
Index(['SOPInstanceUID', 'TrackingUniqueIdentifier', 'derivationModifier_CodeMeaning', 'derivationModifier_CodeValue', 'derivationModifier_CodingSchemeDesignator', 'quantity_CodeMeaning', 'quantity_CodeValue', 'quantity_CodingSchemeDesignator', 'units_CodeMeaning', 'units_CodeValue', 'units_CodingSchemeDesignator', 'value', 'DeviceObserverName', 'FindingSite_CodeMeaning', 'FindingSite_CodeValue', 'FindingSite_CodingSchemeDesignator', 'Finding_CodeMeaning', 'Finding_CodeValue', 'Finding_CodingSchemeDesignator', 'ObserverType', 'PersonObserverName', 'TrackingIdentifier', 'activitySession', 'measurementMethod_CodeMeaning', 'measurementMethod_CodeValue', 'measurementMethod_CodingSchemeDesignator', 'timePoint', 'BodyPartExamined', 'ManufacturerModelName', 'Modality', 'PatientAge', 'PatientID', 'PatientName', 'PatientSex', 'PatientWeight', 'SOPClassUID', 'SeriesDate', 'SeriesDescription', 'SeriesInstanceUID', 'SeriesTime', 'SoftwareVersions', 'StudyDate', 'StudyDescription', 'StudyInstanceUID', 'StudyTime'], dtype='object')
Now we are finally ready to make the plot that summarizes the variability of segmentation across different users and sessions! Let's recap: we will prepare the plot based on the contents of the Measurements_merged pandas data frame, and will use the following items (all of these details of the dataset are explained in the accompanying manuscript mentioned in the opening of this notebook):
- PatientID associates the measurements with the subjects in the dataset
- the measurement values are stored in the value column
- volume measurements are identified by the Volume quantity_CodeMeaning
- primary lesions are identified by Neoplasm, Primary in the Finding_CodeMeaning column
- the reader is identified by the PersonObserverName column
- the segmentation session is identified by the activitySession column

All we have to do now is to subset the data frame and make the plot!
from bokeh.models import ColumnDataSource, OpenURL, TapTool
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook
from bokeh.colors import RGB
from bokeh.models import HoverTool, PanTool, WheelZoomTool, BoxZoomTool, ResetTool, TapTool
output_notebook()
volume = []
user = []
method = []
session = []
subject = []
#SR_merged = pd.merge(SR_merged, segReferences)
#subset = SR_merged[SR_merged["PersonObserverName"]=="User1"]
subset = Measurements_merged[Measurements_merged["Finding_CodeMeaning"]=="Neoplasm, Primary"]
subset = subset[subset["quantity_CodeMeaning"]=="Volume"]
print("Identifiers of the users: "+str(subset["PersonObserverName"].unique()))
print("Identifiers of the activity sessions: "+str(subset["activitySession"].unique()))
#subset = subset[subset["activitySession"]==1]
#subset = subset[subset["segmentationToolType"]=="SemiAuto"]
#subset.sort_values("value", inplace=True)
#subset=subset[subset["PatientID"]=="QIN-HEADNECK-01-0003"]
volumes = subset["value"].values
observers = subset["PersonObserverName"].values
subjects = subset["PatientID"].values
#subset["segmentationToolType"].unique()
colormap = {'User1': 'red', 'User2': 'green', 'User3': 'blue'}
colors = [colormap[x] for x in subset['PersonObserverName'].tolist()]
source = ColumnDataSource(data=dict(
x=volumes,
y=subjects,
color=colors,
labels = subset["PersonObserverName"].tolist()
))
hover = HoverTool(tooltips=[
("(Volume, Subject)", "($x, $y)")
])
wZoom = WheelZoomTool()
bZoom = BoxZoomTool()
reset = ResetTool()
pan = PanTool()
p = figure(x_range=[np.min(volumes),np.max(volumes)], y_range=subjects.tolist(), \
tools = [hover, wZoom, bZoom, reset, pan], \
title="Variability of primary neoplasm volume by reader")
p.yaxis.axis_label = "PatientID"
p.xaxis.axis_label = subset["quantity_CodeMeaning"].values[0]+', '+subset['units_CodeMeaning'].values[0]
p.circle('x','y',color='color',source=source, legend='labels')
p.legend.location = "bottom_right"
show(p)
Mouse over the dots in the scatter plot to see the PatientID and the volume measurement for the specific segmentation. Check out the tools on the right hand side of the plot.
Note that there are 4, not 2, circles for each reader. The reason is that each lesion was segmented on 2 occasions using both manual and automated segmentation tools.
You can figure out the type of tool used and highlight the segmentations produced by the automated tools as a challenge - it is all in DICOM ;-)
One of the cool things about DICOM is that the storage objects are inherently cross-linked with each other. There are attributes that allow us to keep track of the relationships between the objects in a study and series, to identify the objects that belong to the same patient, and to track the evolution of the disease over time using dates.
Derived DICOM objects, such as segmentations and measurements, also have the capability to store pointers to the images that were used to derive those segmentations/measurements. Given the measurements SR document, we can trace the related evidence via the unique identifiers it contains!
For the dataset at hand, information about the references was stored by the dcm2tables conversion script in the References table.
References=pd.read_csv(tablesPath+'/References.tsv', sep='\t', low_memory=False)
References.columns
Index(['ReferencedSOPClassUID', 'ReferencedSOPInstanceUID', 'SOPInstanceUID', 'SeriesInstanceUID'], dtype='object')
This table allows us to identify all DICOM instances (ReferencedSOPInstanceUID) referenced by a given instance (SOPInstanceUID), along with their class UIDs, which uniquely identify the type of DICOM object. Therefore, given the SOPInstanceUID of the DICOM Structured Report containing the individual measurements, we can find all DICOM Segmentations it references.
NOTE: these references are not always mandated! You may not find them in the DICOM objects you encounter "in the wild"! :-(
# 1.2.840.10008.5.1.4.1.1.66.4 is the SOPClassUID corresponding to the DICOM Segmentation image object
segReferences = References[References["ReferencedSOPClassUID"]=='1.2.840.10008.5.1.4.1.1.66.4']
segReferences = segReferences[["SOPInstanceUID","SeriesInstanceUID"]].rename(columns={"SeriesInstanceUID":"ReferencedSeriesInstanceUID"})
# I am not a pandas expert, so just to be safe, I check that the dimensions of the data frame
# do not change after the merge operation ...
Measurements_merged.shape
(60531, 45)
Measurements_merged = pd.merge(Measurements_merged, segReferences)
Measurements_merged.shape
(60531, 46)
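The manual shape check above can also be delegated to pandas: the validate argument of merge raises an error if the right-hand keys are not unique, i.e. if the merge would duplicate measurement rows. A small sketch with made-up UIDs (not the real dataset; validate requires pandas 0.21 or later):

```python
import pandas as pd

# Several measurements can reference the same SR instance...
measurements = pd.DataFrame({
    "SOPInstanceUID": ["sr.1", "sr.1", "sr.2"],
    "value": [1.0, 2.0, 3.0],
})
# ...but each SR instance references a single segmentation series here.
seg_refs = pd.DataFrame({
    "SOPInstanceUID": ["sr.1", "sr.2"],
    "ReferencedSeriesInstanceUID": ["seg.a", "seg.b"],
})

# validate="many_to_one" would raise MergeError if seg_refs had duplicate
# keys, which would silently multiply the measurement rows.
merged = pd.merge(measurements, seg_refs,
                  on="SOPInstanceUID", validate="many_to_one")
print(merged.shape)  # (3, 3)
```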
Now, for each measurement, we have a pointer to the SeriesInstanceUID of the segmentation used to calculate it!
To complete the integration, we will use two more magic ingredients:
A web application based on the open-source, zero-footprint Cornerstone web viewer developed by the OHIF project: https://pieper.github.io/dcmjs/examples/qiicr/index-dev.html. This application allows you to browse the content of the QIN-HEADNECK dataset and render the PET images with the segmentation overlay. A version of this application takes the SeriesInstanceUID of the DICOM Segmentation, dereferences it to get the PET series, downloads everything to your browser, and does the rendering. Kudos to Steve Pieper, Erik Ziegler, OHIF and the Cornerstone team for developing this app!
"Tap tool" provided by Bokeh out of the box that can be configured to redirect clicks on the plot to open a URL: http://bokeh.pydata.org/en/latest/docs/user_guide/interaction/callbacks.html#openurl. I am very impressed by Bokeh!
subset = Measurements_merged[Measurements_merged["Finding_CodeMeaning"]=="Neoplasm, Primary"]
subset = subset[subset["quantity_CodeMeaning"]=="Volume"]
volumes = subset["value"].values
observers = subset["PersonObserverName"].values
subjects = subset["PatientID"].values
colormap = {'User1': 'red', 'User2': 'green', 'User3': 'blue'}
colors = [colormap[x] for x in subset['PersonObserverName'].tolist()]
source = ColumnDataSource(data=dict(
x=volumes,
y=subjects,
color=colors,
labels = subset["PersonObserverName"].tolist(),
seriesUID=subset["ReferencedSeriesInstanceUID"]
))
hover = HoverTool(tooltips=[
("(Volume, Subject)", "($x, $y)")
])
wZoom = WheelZoomTool()
bZoom = BoxZoomTool()
reset = ResetTool()
tap = TapTool()
pan = PanTool()
p = figure(x_range=[np.min(volumes),np.max(volumes)], \
y_range=subjects.tolist(), \
tools = [hover, wZoom, bZoom, reset, tap, pan],
title="Variability of primary neoplasm volume by reader")
p.circle('x','y',color='color',source=source, legend='labels')
url = "http://pieper.github.com/dcmjs/examples/qiicr/?seriesUID=@seriesUID"
taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)
p.xaxis.axis_label = subset["quantity_CodeMeaning"].values[0]+', '+subset['units_CodeMeaning'].values[0]
p.legend.location = "bottom_right"
show(p)
As before, you can mouse over the individual points of the plot, but you can also click on any of the dots, which will open a new browser window showing the overlay of the segmented structures over the PET image for the specific segmentation you clicked!
You can check how the segmentation you did in the first part of the tutorial for subject QIN-HEADNECK-01-0024 agrees with those done by the domain experts!