The following examples illustrate some CDA Python
features while providing insights into the underlying data structure (Getting started). To demonstrate those features, we provide a few relevant text queries along with step-by-step explanations on how to translate those into the CDA Python
queries (Example queries). Finally, there are a few additional queries intended for the test users to play around with and send feedback to the CDA team (Test queries).
from cdapython import Q, columns, unique_terms,query
import cdapython
print(cdapython.__file__)
print(cdapython.__version__)
Print out the list of available fields with columns()
:
columns(verify=False)
/Users/dboles/Documents/python/working/cda-python/venv/lib/python3.7/site-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cda.cda-dev.broadinstitute.org'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings InsecureRequestWarning,
['id', 'identifier', 'identifier.system', 'identifier.value', 'sex', 'race', 'ethnicity', 'days_to_birth', 'subject_associated_project', 'File', 'File.id', 'File.identifier', 'File.identifier.system', 'File.identifier.value', 'File.label', 'File.data_category', 'File.data_type', 'File.file_format', 'File.associated_project', 'File.drs_uri', 'File.byte_size', 'File.checksum', 'ResearchSubject', 'ResearchSubject.id', 'ResearchSubject.identifier', 'ResearchSubject.identifier.system', 'ResearchSubject.identifier.value', 'ResearchSubject.associated_project', 'ResearchSubject.primary_disease_type', 'ResearchSubject.primary_disease_site', 'ResearchSubject.Diagnosis', 'ResearchSubject.Diagnosis.id', 'ResearchSubject.Diagnosis.identifier', 'ResearchSubject.Diagnosis.identifier.system', 'ResearchSubject.Diagnosis.identifier.value', 'ResearchSubject.Diagnosis.primary_diagnosis', 'ResearchSubject.Diagnosis.age_at_diagnosis', 'ResearchSubject.Diagnosis.morphology', 'ResearchSubject.Diagnosis.stage', 'ResearchSubject.Diagnosis.grade', 'ResearchSubject.Diagnosis.Treatment', 'ResearchSubject.Diagnosis.Treatment.id', 'ResearchSubject.Diagnosis.Treatment.identifier', 'ResearchSubject.Diagnosis.Treatment.identifier.system', 'ResearchSubject.Diagnosis.Treatment.identifier.value', 'ResearchSubject.Diagnosis.Treatment.treatment_type', 'ResearchSubject.Diagnosis.Treatment.treatment_outcome', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start', 'ResearchSubject.Diagnosis.Treatment.days_treatment_end', 'ResearchSubject.File', 'ResearchSubject.File.id', 'ResearchSubject.File.identifier', 'ResearchSubject.File.identifier.system', 'ResearchSubject.File.identifier.value', 'ResearchSubject.File.label', 'ResearchSubject.File.data_category', 'ResearchSubject.File.data_type', 'ResearchSubject.File.file_format', 'ResearchSubject.File.associated_project', 'ResearchSubject.File.drs_uri', 'ResearchSubject.File.byte_size', 'ResearchSubject.File.checksum', 'ResearchSubject.Specimen', 'ResearchSubject.Specimen.id', 'ResearchSubject.Specimen.identifier', 'ResearchSubject.Specimen.identifier.system', 'ResearchSubject.Specimen.identifier.value', 'ResearchSubject.Specimen.associated_project', 'ResearchSubject.Specimen.age_at_collection', 'ResearchSubject.Specimen.primary_disease_type', 'ResearchSubject.Specimen.anatomical_site', 'ResearchSubject.Specimen.source_material_type', 'ResearchSubject.Specimen.specimen_type', 'ResearchSubject.Specimen.derived_from_specimen', 'ResearchSubject.Specimen.derived_from_subject', 'ResearchSubject.Specimen.File', 'ResearchSubject.Specimen.File.id', 'ResearchSubject.Specimen.File.identifier', 'ResearchSubject.Specimen.File.identifier.system', 'ResearchSubject.Specimen.File.identifier.value', 'ResearchSubject.Specimen.File.label', 'ResearchSubject.Specimen.File.data_category', 'ResearchSubject.Specimen.File.data_type', 'ResearchSubject.Specimen.File.file_format', 'ResearchSubject.Specimen.File.associated_project', 'ResearchSubject.Specimen.File.drs_uri', 'ResearchSubject.Specimen.File.byte_size', 'ResearchSubject.Specimen.File.checksum']
All of the above fields are what describes the highest entity in the data structure hierarchy – Patient
entity. The first five fields represent Patient
demographic information, while the ResearchSubject
entity contains details that we are used to seeing within the nodes' Case
record.
One of the contributions of the CDA is aggregated ResearchSubject
information. This means that all ResearchSubject
records coming from the same subject are now gathered under the Patient entity. As we know, certain specimens are studied in multiple projects (being part of a single data node or multiple nodes) as different ResearchSubject
entries. Those ResearchSubject
entries are collected as a list under the ResearchSubject
entity. One example of this is the patient record with id = TCGA-E2-A10A
which contains two ResearchSubject
entries, one from GDC and the other from PDC.
Note that the ResearchSubject
entity is a list of records, as many other entities above are. There are certain considerations that should be made when creating the queries by using the fields that come from lists, but more about that will follow in examples below.
The names in the list may look familiar to you, but they may have been renamed or restructured in the CDA. The field name mappings are described in the CDA Schema Field Mapping document that is linked in the Testing Guide. A more direct way to explore and understand the fields is to use the unique_terms()
function:
unique_terms("ResearchSubject.Specimen.source_material_type",limit=10)
Additionally, you can specify a particular data node by using the system
argument:
unique_terms("ResearchSubject.Specimen.source_material_type", system="PDC")
Now, let's dive into the querying!
We can start by getting the record for id = TCGA-E2-A10A
that we mentioned earlier:
q = Q('id = "TCGA-E2-A10A"') # note the double quotes for the string value
r = q.run()
We see that we've got a single patient record as a result, which is what we expect.
Let's see how the result looks like:
r[0]
The record is pretty large, so we'll print out identifier
values for each ResearchSubject
to confirm that we have one ResearchSubject
that comes from GDC, and one that comes from PDC:
for research_subject in r[0]['ResearchSubject']:
print(research_subject['identifier'])
The values represent ResearchSubject
IDs and are equivalent to case_id
values in data nodes.
Now that we can create a query with Q()
function, let's see how we can combine multiple conditions.
There are three operators available:
And()
Or()
From()
The following examples show how those operators work in practice.
Find data for subjects who were diagnosed after the age of 50 and who were investigated as part of the TCGA-OV project.
q1 = Q('ResearchSubject.Diagnosis.age_at_diagnosis > 50*365')
q2 = Q('ResearchSubject.associated_project = "TCGA-OV"')
q = q1.AND(q2)
r = q.run()
print(r)
Find data for donors with melanoma (Nevi and Melanomas) diagnosis and who were diagnosed before the age of 30.
q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')
q = q1.AND(q2)
r = q.run()
print(r)
In addition, we can check how many records come from particular systems by adding one more condition to the query:
q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')
q3 = Q('ResearchSubject.Specimen.identifier.system = "GDC"')
q = q1.AND(q2.AND(q3))
r = q.run()
print(r)
By comparing the Count
value of the two results we can see that all the patients returned in the initial query are coming from the GDC.
To explore the results further, we can fetch the patient JSON objects by iterating through the results:
projects = set()
for patient in r:
research_subjects = patient['ResearchSubject']
for rs in research_subjects:
projects.add(rs['associated_project'])
print(projects)
The output shows the projects where Nevi and Melanomas cases appear.
Identify all samples that meet the following conditions:
tumor_type = Q('ResearchSubject.Specimen.source_material_type = "Primary Tumor"')
disease1 = Q('ResearchSubject.primary_disease_site = "Ovary"')
disease2 = Q('ResearchSubject.primary_disease_site = "Breast"')
demographics1 = Q('sex = "female"')
demographics2 = Q('days_to_birth > -60*365') # note that days_to_birth is a negative value
q1 = tumor_type.AND(demographics1.AND(demographics2))
q2 = disease1.OR(disease2)
q = q1.AND(q2)
r = q.run()
print(r)
In this case, we have a result that contains more than 1000 records which is the default page size. To load the next 1000 records, we can use the next_page()
method:
r2 = r.next_page()
print(r2)
Alternatively, we can use the offset
argument to specify the record to start from:
...
r = q.run(offset=1000)
print(r)
Find data for donors with "Ovarian Serous Cystadenocarcinoma" with proteomic and genomic data.
Note that disease type value denoting the same disease groups can be completely different within different systems. This is where CDA features come into play. We first start by exploring the values available for this particular field in both systems.
unique_terms('ResearchSubject.primary_disease_type', system="GDC",limit=10,verify=False)
/Users/dboles/Documents/python/working/cda-python/cdapython/decorators_cache.py:18: ResourceWarning: unclosed <ssl.SSLSocket fd=75, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.33.24', 63143), raddr=('35.244.236.227', 443)> func.cache_clear() ResourceWarning: Enable tracemalloc to get the object allocation traceback
[None, 'Acinar Cell Neoplasms', 'Adenomas and Adenocarcinomas', 'Adnexal and Skin Appendage Neoplasms', 'Basal Cell Neoplasms', 'Blood Vessel Tumors', 'Chronic Myeloproliferative Disorders', 'Complex Epithelial Neoplasms', 'Complex Mixed and Stromal Neoplasms', 'Cystic, Mucinous and Serous Neoplasms']
Since “Ovarian Serous Cystadenocarcinoma” doesn’t appear in GDC values we decide to look into the PDC:
unique_terms('ResearchSubject.primary_disease_type', system="PDC")
After examining the output, we see that it does come from the PDC. Hence, if we could first identify the data that has research subjects found within the PDC that have this particular disease type, and then further narrow down the results to include only the portion of the data that is present in GDC, we could get the records that we are looking for.
q1 = Q('ResearchSubject.primary_disease_type = "Ovarian Serous Cystadenocarcinoma"')
q2 = Q('ResearchSubject.identifier.system = "PDC"')
q3 = Q('ResearchSubject.identifier.system = "GDC"')
q = q3.FROM(q1.AND(q2))
r = q.run(verify=False)
print(r)
Getting results from database Total execution time: 27307 ms QueryID: bd084bbd-33bd-4339-b034-b620192922b1 Query: SELECT all_v2.* FROM (SELECT all_v2.* FROM gdc-bq-sample.integration.all_v2 AS all_v2, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE ((_ResearchSubject.primary_disease_type = 'Ovarian Serous Cystadenocarcinoma') AND (_identifier.system = 'PDC'))) AS all_v2, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE (_identifier.system = 'GDC') Offset: 0 Count: 100 Total Row Count: 275 More pages: True
/Users/dboles/Documents/python/working/cda-python/venv/lib/python3.7/site-packages/ipykernel_launcher.py:6: ResourceWarning: unclosed <ssl.SSLSocket fd=76, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.33.24', 63050), raddr=('35.244.236.227', 443)> ResourceWarning: Enable tracemalloc to get the object allocation traceback
As you can see, this is achieved by utilizing From
operator. The From
operator allows us to create queries from results of other queries. This is particularly useful when working with conditions that involve a single field which can take multiple different values for different items in a list that is being part of, e.g. we need ResearchSubject.identifier.system
to be both “PDC” and “GDC” for a single patient. In such cases, And
operator can’t help because it will return those entries where the field takes both values, which is zero entries.
for i in Q.sql("SELECT * FROM `gdc-bq-sample.cda_mvp.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` WHERE table_name = 'v3' Limit 5",verify=False):
print(i)
{'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'days_to_birth', 'field_path': 'days_to_birth', 'data_type': 'INT64', 'description': None} {'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'race', 'field_path': 'race', 'data_type': 'STRING', 'description': None} {'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'sex', 'field_path': 'sex', 'data_type': 'STRING', 'description': None} {'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'ethnicity', 'field_path': 'ethnicity', 'data_type': 'STRING', 'description': None} {'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'id', 'field_path': 'id', 'data_type': 'STRING', 'description': None}
q1 = query('ResearchSubject.identifier.system = "GDC" FROM ResearchSubject.primary_disease_type = "Ovarian Serous Cystadenocarcinoma" AND ResearchSubject.identifier.system = "PDC"')
result = q1.run(async_call=True)
print(result)
# If you are interested in the extraction dates or data release versions of GDC, PDC, or IDC that is in a table or view, execute this code
for i in Q.sql("SELECT option_value FROM `gdc-bq-sample.integration.INFORMATION_SCHEMA.TABLE_OPTIONS` WHERE table_name = 'all_v1'"):
print(i)
Now that we've successfully run and analyzed a few queries, here are a few additional ones you can try out on your own.
Solutions can be shared with the CDA team as indicated in the Testing Guide document.
Find data from TCGA-BRCA project, with donors over the age of 50 with Stage IIIC cancer.
# Solution
# ...
# print(r)
Find data from all patients who have been treated with "Radiation Therapy, NOS" and have both genomic and proteomic data.
# Solution
# ...
# print(r)
Find data from all subjects with lung adenocarcinomas that have both primary and recurrent tumors.
# Solution
# ...
# print(r)
Q('ResearchSubject.id = "c5421e34-e5c7-4ba5-aed9-146a5575fd8d"').run().pretty_print(-1)