This is a Jupyter Notebook that demonstrates how Python can be used to explore the content of a publicly available DICOM dataset stored on The Cancer Imaging Archive (TCIA) and described here: https://wiki.cancerimagingarchive.net/display/Public/QIN-HEADNECK.
This notebook was created in preparation for the DICOM4MICCAI tutorial at the MICCAI 2017 conference on Sept 10, 2017.
The tutorial was organized by the Quantitative Image Informatics for Cancer Research (QIICR) project funded by the Informatics Technology for Cancer Research (ITCR) program of the National Cancer Institute, award U24 CA180918.
More pointers related to the material covered in this notebook:
Questions, comments, suggestions, and corrections are welcome! Please email andrey.fedorov@gmail.com, or join the discussion on gitter!
The goal of this tutorial is to demonstrate how Python can be used to work with the data produced by quantitative image analysis and stored using the DICOM format.
You don't need to know much about DICOM to follow along, but you will need to learn more if you want to use DICOM in your work. You will find pointers in the Further reading section.
The dataset used in this tutorial is discussed in detail in this publication:
Fedorov A., Clunie D., Ulrich E., Bauer C., Wahle A., Brown B., Onken M., Riesmeier J., Pieper S., Kikinis R., Buatti J., Beichel RR. DICOM for quantitative imaging biomarker development: a standards based approach to sharing clinical data and structured PET/CT analysis results in head and neck cancer research. PeerJ 4:e2057, 2016. DOI: 10.7717/peerj.2057
Here is a bird's eye view of the QIN-HEADNECK dataset:
The DICOM dataset was converted into a collection of tables using this converter script: https://github.com/QIICR/dcm2tables. The script extracts data elements from the DICOM files and stores them as a collection of tab-delimited text files that follow this schema.
You can download the collection of the extracted tables here: https://github.com/fedorov/dicom4miccai-handson/releases/download/miccai2017/QIN-HEADNECK-Tables.tgz. Uncompress the file, note the location of the resulting directory, and set the value of the variable below to that location.
tablesPath = '/home/jovyan/data/QIN-HEADNECK-Tables'
# set this to your location of the tables if running locally
#tablesPath = '/Users/fedorov/github/dcm2tables/Tables'
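If you prefer, all of the extracted tables can be loaded in one pass. The load_tables helper below is a hypothetical convenience function (it is not part of dcm2tables); it simply reads every .tsv file in the tables directory:

```python
import glob
import os

import pandas as pd

def load_tables(tables_path):
    """Load every extracted .tsv table into a dict keyed by table name."""
    tables = {}
    for tsv in glob.glob(os.path.join(tables_path, '*.tsv')):
        name = os.path.splitext(os.path.basename(tsv))[0]
        tables[name] = pd.read_csv(tsv, sep='\t', low_memory=False)
    return tables

# e.g.: tables = load_tables(tablesPath); sorted(tables) lists the table names
```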
We will discuss the contents of the specific tables generated by this script later in the notebook, in the context where they are used.
In this demonstration we will use the following Python packages:
NOTE: there appears to be an issue using the (as of writing) latest 0.12.7 version of bokeh for some of the plotting operations in this notebook. If you are using a local installation of bokeh, you will need to make sure you are using bokeh 0.12.6!
If you are working with this notebook on your own system, you will need to install these packages before they can be imported!
Run the cell below to confirm that all prerequisite packages are installed properly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.models import ColumnDataSource, OpenURL, TapTool
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook
from bokeh.colors import RGB
output_notebook()
In this section we will explore the segmentation-derived measurements originally stored in DICOM SR documents. Specifically, we will create an interactive plot that summarizes the variability of segmentation across different users, sessions and segmentation tools involved.
First, we will load various tables that will be needed for this task. You can always check out the schema of the tables extracted from the DICOM dataset at this link.
The flowchart below summarizes the sequence of steps and the tools involved in the processing throughout this tutorial.
In this specific dataset, all SR documents correspond to the segmentation-based measurement reports. To be more specific, all of these SR documents follow the same Structured Reporting template TID 1500. The overall relationship between segmentations and the content of the SR documents is illustrated in the figure below.
Each segmentation identifies a finding, which can be a primary neoplasm, a secondary neoplasm, or one of the reference regions. For each segmentation of the neoplasm done using each combination of {User, Segmentation tool, Segmentation session}, there is an SR document containing the measurements extracted from these segmentations.
Each SR document contains one or more measurement groups. Within each SR document, measurements are organized hierarchically into groups, such that measurements derived from the segmentation of a single finding are located together. Some of the information about the measurements, common across all measurements in the group, is defined at the group level.
The conversion script generated a separate table, SR1500_MeasurementGroups, for the measurement groups, where each row corresponds to a single measurement group.
SR1500_MeasurementGroups = pd.read_csv(tablesPath+'/SR1500_MeasurementGroups.tsv', sep='\t', low_memory=False)
SR1500_MeasurementGroups.columns
Index(['DeviceObserverName', 'FindingSite_CodeMeaning', 'FindingSite_CodeValue', 'FindingSite_CodingSchemeDesignator', 'Finding_CodeMeaning', 'Finding_CodeValue', 'Finding_CodingSchemeDesignator', 'ObserverType', 'PersonObserverName', 'SOPInstanceUID', 'TrackingIdentifier', 'TrackingUniqueIdentifier', 'activitySession', 'measurementMethod_CodeMeaning', 'measurementMethod_CodeValue', 'measurementMethod_CodingSchemeDesignator', 'timePoint'], dtype='object')
... and another table, SR1500_Measurements, for storing the individual measurements, one row per measurement.
SR1500_Measurements = pd.read_csv(tablesPath+'/SR1500_Measurements.tsv', sep='\t', low_memory=False)
SR1500_Measurements.columns
Index(['SOPInstanceUID', 'TrackingUniqueIdentifier', 'derivationModifier_CodeMeaning', 'derivationModifier_CodeValue', 'derivationModifier_CodingSchemeDesignator', 'quantity_CodeMeaning', 'quantity_CodeValue', 'quantity_CodingSchemeDesignator', 'units_CodeMeaning', 'units_CodeValue', 'units_CodingSchemeDesignator', 'value'], dtype='object')
For our task, it is important to associate the group-level properties of the measurements (e.g., activitySession and PersonObserverName) with the individual measurements. We can accomplish this with a pandas merge operation, using the combination of SOPInstanceUID and TrackingUniqueIdentifier as the merge keys.
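The mechanics of this merge can be illustrated with a toy example (the UIDs and values below are made up; only the column names follow the real schema):

```python
import pandas as pd

# Group-level attributes: one row per measurement group.
groups = pd.DataFrame({
    "SOPInstanceUID": ["1.2.3", "1.2.3"],
    "TrackingUniqueIdentifier": ["uid.a", "uid.b"],
    "PersonObserverName": ["User1", "User2"],
})

# Individual measurements: several rows can belong to one group.
measurements = pd.DataFrame({
    "SOPInstanceUID": ["1.2.3", "1.2.3", "1.2.3"],
    "TrackingUniqueIdentifier": ["uid.a", "uid.a", "uid.b"],
    "value": [10.0, 20.0, 30.0],
})

# Merging on the composite key attaches the group-level attributes
# to every individual measurement; the number of rows is unchanged.
merged = pd.merge(measurements, groups,
                  on=["SOPInstanceUID", "TrackingUniqueIdentifier"])
print(merged.shape)  # (3, 4)
```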
SR1500_Measurements.shape
(60531, 12)
Measurements_merged = pd.merge(SR1500_Measurements,SR1500_MeasurementGroups,on=["SOPInstanceUID","TrackingUniqueIdentifier"])
Measurements_merged.shape
(60531, 27)
Measurements_merged.columns
Index(['SOPInstanceUID', 'TrackingUniqueIdentifier', 'derivationModifier_CodeMeaning', 'derivationModifier_CodeValue', 'derivationModifier_CodingSchemeDesignator', 'quantity_CodeMeaning', 'quantity_CodeValue', 'quantity_CodingSchemeDesignator', 'units_CodeMeaning', 'units_CodeValue', 'units_CodingSchemeDesignator', 'value', 'DeviceObserverName', 'FindingSite_CodeMeaning', 'FindingSite_CodeValue', 'FindingSite_CodingSchemeDesignator', 'Finding_CodeMeaning', 'Finding_CodeValue', 'Finding_CodingSchemeDesignator', 'ObserverType', 'PersonObserverName', 'TrackingIdentifier', 'activitySession', 'measurementMethod_CodeMeaning', 'measurementMethod_CodeValue', 'measurementMethod_CodingSchemeDesignator', 'timePoint'], dtype='object')
There are different types of measurements, so let's first see what they are. Each measurement is defined by a combination of code tuples (read more about coding measurement quantities on p. 18 of this preprint article). We can look at all combinations of these codes to get a comprehensive list of the available measurements.
(Measurements_merged["quantity_CodeMeaning"].map(str)+"_"+Measurements_merged["derivationModifier_CodeMeaning"].map(str)).unique()
array(['SUVbw_Mean', 'SUVbw_Minimum', 'SUVbw_Maximum', 'Volume_nan', 'SUVbw_Standard Deviation', 'SUVbw_25th Percentile Value', 'SUVbw_Median', 'SUVbw_75th Percentile Value', 'SUVbw_Peak Value Within ROI', 'Total Lesion Glycolysis_nan', 'SUVbw_Upper Adjacent Value', 'SUVbw_RMS', 'Glycolysis Within First Quarter of Intensity Range_nan', 'Glycolysis Within Second Quarter of Intensity Range_nan', 'Glycolysis Within Third Quarter of Intensity Range_nan', 'Glycolysis Within Fourth Quarter of Intensity Range_nan', 'Percent Within First Quarter of Intensity Range_nan', 'Percent Within Second Quarter of Intensity Range_nan', 'Percent Within Third Quarter of Intensity Range_nan', 'Percent Within Fourth Quarter of Intensity Range_nan', 'Standardized Added Metabolic Activity_nan', 'Standardized Added Metabolic Activity Background_nan'], dtype=object)
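The _nan suffixes above are not an error: measurements such as Volume have no derivation modifier, so derivationModifier_CodeMeaning is missing (NaN) for them, and map(str) renders the missing value as the string "nan". A minimal sketch with two made-up rows:

```python
import numpy as np
import pandas as pd

# One measurement with a derivation modifier, one without.
df = pd.DataFrame({
    "quantity_CodeMeaning": ["SUVbw", "Volume"],
    "derivationModifier_CodeMeaning": ["Mean", np.nan],
})

combos = (df["quantity_CodeMeaning"].map(str) + "_" +
          df["derivationModifier_CodeMeaning"].map(str)).unique()
print(combos)  # ['SUVbw_Mean' 'Volume_nan']
```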
Let's start with the basics and look at the variability of the primary lesion volume measurement!
We know that there are multiple lesions for many of the subjects. All possible values for the finding are listed below; let's first consider only the primary lesion.
Measurements_merged["Finding_CodeMeaning"].unique()
array(['Reference Region', 'Neoplasm, Primary', 'Neoplasm, Secondary'], dtype=object)
We are almost ready to make the plot, but we are missing information about the patient for the individual measurements! This information is available in the CompositeContext table, which contains attributes related to the patient, study, and series, and which should be present in every (valid!) DICOM file. This table contains one row per DICOM instance. Let's load it and merge it with the individual measurements!
CompositeContext=pd.read_csv(tablesPath+'/CompositeContext.tsv', sep='\t',low_memory=False)
CompositeContext.columns
Index(['BodyPartExamined', 'ManufacturerModelName', 'Modality', 'PatientAge', 'PatientID', 'PatientName', 'PatientSex', 'PatientWeight', 'SOPClassUID', 'SOPInstanceUID', 'SeriesDate', 'SeriesDescription', 'SeriesInstanceUID', 'SeriesTime', 'SoftwareVersions', 'StudyDate', 'StudyDescription', 'StudyInstanceUID', 'StudyTime'], dtype='object')
Measurements_merged.shape
(60531, 27)
Measurements_merged = pd.merge(Measurements_merged, CompositeContext, on="SOPInstanceUID")
Measurements_merged.shape
(60531, 45)
Measurements_merged.columns
Index(['SOPInstanceUID', 'TrackingUniqueIdentifier', 'derivationModifier_CodeMeaning', 'derivationModifier_CodeValue', 'derivationModifier_CodingSchemeDesignator', 'quantity_CodeMeaning', 'quantity_CodeValue', 'quantity_CodingSchemeDesignator', 'units_CodeMeaning', 'units_CodeValue', 'units_CodingSchemeDesignator', 'value', 'DeviceObserverName', 'FindingSite_CodeMeaning', 'FindingSite_CodeValue', 'FindingSite_CodingSchemeDesignator', 'Finding_CodeMeaning', 'Finding_CodeValue', 'Finding_CodingSchemeDesignator', 'ObserverType', 'PersonObserverName', 'TrackingIdentifier', 'activitySession', 'measurementMethod_CodeMeaning', 'measurementMethod_CodeValue', 'measurementMethod_CodingSchemeDesignator', 'timePoint', 'BodyPartExamined', 'ManufacturerModelName', 'Modality', 'PatientAge', 'PatientID', 'PatientName', 'PatientSex', 'PatientWeight', 'SOPClassUID', 'SeriesDate', 'SeriesDescription', 'SeriesInstanceUID', 'SeriesTime', 'SoftwareVersions', 'StudyDate', 'StudyDescription', 'StudyInstanceUID', 'StudyTime'], dtype='object')
Now we are finally ready to make the plot that summarizes the variability of segmentation across different users and sessions! Let's recap: we will prepare the plot based on the contents of the Measurements_merged pandas data frame, and will use the following items (all of these details of the dataset are explained in the accompanying manuscript mentioned in the opening of this notebook):
- PatientID associates the measurements with the subjects in the dataset
- the measurement values are stored in the value column
- volume measurements are identified by the Volume quantity_CodeMeaning
- primary lesions are identified by Neoplasm, Primary in the Finding_CodeMeaning column
- the reader is identified by the PersonObserverName column
- the segmentation session is identified by the activitySession column

All we have to do now is to subset the data frame and make the plot!
from bokeh.models import ColumnDataSource, OpenURL, TapTool
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook
from bokeh.colors import RGB
from bokeh.models import HoverTool, PanTool, WheelZoomTool, BoxZoomTool, ResetTool, TapTool
output_notebook()
volume = []
user = []
method = []
session = []
subject = []
#SR_merged = pd.merge(SR_merged, segReferences)
#subset = SR_merged[SR_merged["PersonObserverName"]=="User1"]
subset = Measurements_merged[Measurements_merged["Finding_CodeMeaning"]=="Neoplasm, Primary"]
subset = subset[subset["quantity_CodeMeaning"]=="Volume"]
print("Identifiers of the users: "+str(subset["PersonObserverName"].unique()))
print("Identifiers of the activity sessions: "+str(subset["activitySession"].unique()))
#subset = subset[subset["activitySession"]==1]
#subset = subset[subset["segmentationToolType"]=="SemiAuto"]
#subset.sort_values("value", inplace=True)
#subset=subset[subset["PatientID"]=="QIN-HEADNECK-01-0003"]
volumes = subset["value"].values
observers = subset["PersonObserverName"].values
subjects = subset["PatientID"].values
#subset["segmentationToolType"].unique()
colormap = {'User1': 'red', 'User2': 'green', 'User3': 'blue'}
colors = [colormap[x] for x in subset['PersonObserverName'].tolist()]
source = ColumnDataSource(data=dict(
x=volumes,
y=subjects,
color=colors,
labels = subset["PersonObserverName"].tolist()
))
hover = HoverTool(tooltips=[
("(Volume, Subject)", "($x, $y)")
])
wZoom = WheelZoomTool()
bZoom = BoxZoomTool()
reset = ResetTool()
pan = PanTool()
p = figure(x_range=[np.min(volumes),np.max(volumes)], y_range=subjects.tolist(), \
tools = [hover, wZoom, bZoom, reset, pan], \
title="Variability of primary neoplasm volume by reader")
p.yaxis.axis_label = "PatientID"
p.xaxis.axis_label = subset["quantity_CodeMeaning"].values[0]+', '+subset['units_CodeMeaning'].values[0]
p.circle('x','y',color='color',source=source, legend='labels')
p.legend.location = "bottom_right"
show(p)
Mouse over the dots in the scatter plot to see the PatientID and the volume measurement for the specific segmentation. Check out the tools on the right hand side of the plot.
Note that there are 4, not 2, circles for each reader. The reason is that each lesion was segmented on 2 occasions using both manual and automated segmentation tools.
You can figure out the type of tool used and highlight the segmentations produced by the automated tools as a challenge - it is all in DICOM ;-)
One of the cool things about DICOM is that the storage objects are inherently cross-linked with each other. There are attributes that allow us to keep track of the relationships between the objects in a study and series, to identify the objects that belong to the same patient, and to track the evolution of the disease over time using dates.
Derived DICOM objects, such as segmentations and measurements, also have the capability to store pointers to the images that were used to derive those segmentations/measurements. Given the measurements SR document, we can trace the related evidence via the unique identifiers it contains!
For the dataset at hand, information about the references was stored by the dcm2tables conversion script in the References table.
References=pd.read_csv(tablesPath+'/References.tsv', sep='\t', low_memory=False)
References.columns
Index(['ReferencedSOPClassUID', 'ReferencedSOPInstanceUID', 'SOPInstanceUID', 'SeriesInstanceUID'], dtype='object')
This table allows us to identify all DICOM instances (ReferencedSOPInstanceUID) referenced by a given instance (SOPInstanceUID), along with their class UIDs, which uniquely identify the type of DICOM object. Therefore, given the SOPInstanceUID of the DICOM Structured Report containing the individual measurements, we can find all DICOM Segmentations it references.
NOTE: these references are not always mandated! You may not find them in the DICOM objects you encounter "in the wild"! :-(
# 1.2.840.10008.5.1.4.1.1.66.4 is the SOPClassUID corresponding to the DICOM Segmentation image object
segReferences = References[References["ReferencedSOPClassUID"]=='1.2.840.10008.5.1.4.1.1.66.4']
segReferences = segReferences[["SOPInstanceUID","SeriesInstanceUID"]].rename(columns={"SeriesInstanceUID":"ReferencedSeriesInstanceUID"})
# I am not a pandas expert, so just to be safe, I check that the dimensions of the data frame
# do not change after the merge operation ...
Measurements_merged.shape
(60531, 45)
Measurements_merged = pd.merge(Measurements_merged, segReferences)
Measurements_merged.shape
(60531, 46)
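The manual shape check above can also be delegated to pandas: the validate argument of merge raises an error if the right-hand keys are not unique, i.e. if the merge would duplicate measurement rows. A small sketch with made-up UIDs (not the real dataset; validate requires pandas 0.21 or later):

```python
import pandas as pd

# Several measurements can reference the same SR instance...
measurements = pd.DataFrame({
    "SOPInstanceUID": ["sr.1", "sr.1", "sr.2"],
    "value": [1.0, 2.0, 3.0],
})
# ...but each SR instance references a single segmentation series here.
seg_refs = pd.DataFrame({
    "SOPInstanceUID": ["sr.1", "sr.2"],
    "ReferencedSeriesInstanceUID": ["seg.a", "seg.b"],
})

# validate="many_to_one" would raise MergeError if seg_refs had duplicate
# keys, which would silently multiply the measurement rows.
merged = pd.merge(measurements, seg_refs,
                  on="SOPInstanceUID", validate="many_to_one")
print(merged.shape)  # (3, 3)
```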
Now, for each measurement, we have a pointer to the SeriesInstanceUID of the segmentation used to calculate it!
To complete the integration, we will use two more magic ingredients:
A web application based on the open-source, zero-footprint Cornerstone web viewer developed by the OHIF project: https://pieper.github.io/dcmjs/examples/qiicr/index-dev.html. This application allows you to browse the content of the QIN-HEADNECK dataset and render the PET images with the segmentation overlay. A version of this application takes the SeriesInstanceUID of the DICOM Segmentation, dereferences it to get the PET series, downloads everything to your browser, and does the rendering. Kudos to Steve Pieper, Erik Ziegler, OHIF and the Cornerstone team for developing this app!
"Tap tool" provided by Bokeh out of the box that can be configured to redirect clicks on the plot to open a URL: http://bokeh.pydata.org/en/latest/docs/user_guide/interaction/callbacks.html#openurl. I am very impressed by Bokeh!
subset = Measurements_merged[Measurements_merged["Finding_CodeMeaning"]=="Neoplasm, Primary"]
subset = subset[subset["quantity_CodeMeaning"]=="Volume"]
volumes = subset["value"].values
observers = subset["PersonObserverName"].values
subjects = subset["PatientID"].values
colormap = {'User1': 'red', 'User2': 'green', 'User3': 'blue'}
colors = [colormap[x] for x in subset['PersonObserverName'].tolist()]
source = ColumnDataSource(data=dict(
x=volumes,
y=subjects,
color=colors,
labels = subset["PersonObserverName"].tolist(),
seriesUID=subset["ReferencedSeriesInstanceUID"]
))
hover = HoverTool(tooltips=[
("(Volume, Subject)", "($x, $y)")
])
wZoom = WheelZoomTool()
bZoom = BoxZoomTool()
reset = ResetTool()
tap = TapTool()
pan = PanTool()
p = figure(x_range=[np.min(volumes),np.max(volumes)], \
y_range=subjects.tolist(), \
tools = [hover, wZoom, bZoom, reset, tap, pan],
title="Variability of primary neoplasm volume by reader")
p.circle('x','y',color='color',source=source, legend='labels')
url = "http://pieper.github.com/dcmjs/examples/qiicr/?seriesUID=@seriesUID"
taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)
p.xaxis.axis_label = subset["quantity_CodeMeaning"].values[0]+', '+subset['units_CodeMeaning'].values[0]
p.legend.location = "bottom_right"
show(p)
As before, you can mouse over the individual points of the plot, but you can also click on any of the dots, which will open a new browser window showing the overlay of the segmented structures over the PET image for the specific segmentation you clicked!
You can check how the segmentation you did in the first part of the tutorial for subject QIN-HEADNECK-01-0024 agrees with those done by the domain experts!