Notebook

CDA Python: Features & Examples¶

The following examples illustrate some CDA Python features while providing insights into the underlying data structure (Getting started). To demonstrate those features, we provide a few relevant text queries along with step-by-step explanations on how to translate those into the CDA Python queries (Example queries). Finally, there are a few additional queries intended for the test users to play around with and send feedback to the CDA team (Test queries).

In [ ]:

from cdapython import Q, columns, unique_terms,query
import cdapython
print(cdapython.__file__)
print(cdapython.__version__)

Getting started¶

Print out the list of available fields with columns():

In [2]:

columns(verify=False)

/Users/dboles/Documents/python/working/cda-python/venv/lib/python3.7/site-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cda.cda-dev.broadinstitute.org'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,

Out[2]:

['id',
 'identifier',
 'identifier.system',
 'identifier.value',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'File',
 'File.id',
 'File.identifier',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size',
 'File.checksum',
 'ResearchSubject',
 'ResearchSubject.id',
 'ResearchSubject.identifier',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.associated_project',
 'ResearchSubject.primary_disease_type',
 'ResearchSubject.primary_disease_site',
 'ResearchSubject.Diagnosis',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.Treatment',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_treatment_end',
 'ResearchSubject.File',
 'ResearchSubject.File.id',
 'ResearchSubject.File.identifier',
 'ResearchSubject.File.identifier.system',
 'ResearchSubject.File.identifier.value',
 'ResearchSubject.File.label',
 'ResearchSubject.File.data_category',
 'ResearchSubject.File.data_type',
 'ResearchSubject.File.file_format',
 'ResearchSubject.File.associated_project',
 'ResearchSubject.File.drs_uri',
 'ResearchSubject.File.byte_size',
 'ResearchSubject.File.checksum',
 'ResearchSubject.Specimen',
 'ResearchSubject.Specimen.id',
 'ResearchSubject.Specimen.identifier',
 'ResearchSubject.Specimen.identifier.system',
 'ResearchSubject.Specimen.identifier.value',
 'ResearchSubject.Specimen.associated_project',
 'ResearchSubject.Specimen.age_at_collection',
 'ResearchSubject.Specimen.primary_disease_type',
 'ResearchSubject.Specimen.anatomical_site',
 'ResearchSubject.Specimen.source_material_type',
 'ResearchSubject.Specimen.specimen_type',
 'ResearchSubject.Specimen.derived_from_specimen',
 'ResearchSubject.Specimen.derived_from_subject',
 'ResearchSubject.Specimen.File',
 'ResearchSubject.Specimen.File.id',
 'ResearchSubject.Specimen.File.identifier',
 'ResearchSubject.Specimen.File.identifier.system',
 'ResearchSubject.Specimen.File.identifier.value',
 'ResearchSubject.Specimen.File.label',
 'ResearchSubject.Specimen.File.data_category',
 'ResearchSubject.Specimen.File.data_type',
 'ResearchSubject.Specimen.File.file_format',
 'ResearchSubject.Specimen.File.associated_project',
 'ResearchSubject.Specimen.File.drs_uri',
 'ResearchSubject.Specimen.File.byte_size',
 'ResearchSubject.Specimen.File.checksum']

All of the above fields are what describes the highest entity in the data structure hierarchy – Patient entity. The first five fields represent Patient demographic information, while the ResearchSubject entity contains details that we are used to seeing within the nodes' Case record.

One of the contributions of the CDA is aggregated ResearchSubject information. This means that all ResearchSubject records coming from the same subject are now gathered under the Patient entity. As we know, certain specimens are studied in multiple projects (being part of a single data node or multiple nodes) as different ResearchSubject entries. Those ResearchSubject entries are collected as a list under the ResearchSubject entity. One example of this is the patient record with id = TCGA-E2-A10A which contains two ResearchSubject entries, one from GDC and the other from PDC.

Note that the ResearchSubject entity is a list of records, as many other entities above are. There are certain considerations that should be made when creating the queries by using the fields that come from lists, but more about that will follow in examples below.

The names in the list may look familiar to you, but they may have been renamed or restructured in the CDA. The field name mappings are described in the CDA Schema Field Mapping document that is linked in the Testing Guide. A more direct way to explore and understand the fields is to use the unique_terms() function:

In [ ]:

unique_terms("ResearchSubject.Specimen.source_material_type",limit=10)

Additionally, you can specify a particular data node by using the system argument:

In [ ]:

unique_terms("ResearchSubject.Specimen.source_material_type", system="PDC")

Now, let's dive into the querying!

We can start by getting the record for id = TCGA-E2-A10A that we mentioned earlier:

In [ ]:

q = Q('id = "TCGA-E2-A10A"') # note the double quotes for the string value

r = q.run()

In [ ]:

We see that we've got a single patient record as a result, which is what we expect.

Let's see how the result looks like:

In [ ]:

r[0]

The record is pretty large, so we'll print out identifier values for each ResearchSubject to confirm that we have one ResearchSubject that comes from GDC, and one that comes from PDC:

In [ ]:

for research_subject in r[0]['ResearchSubject']:
    print(research_subject['identifier'])

The values represent ResearchSubject IDs and are equivalent to case_id values in data nodes.

Example queries¶

Now that we can create a query with Q() function, let's see how we can combine multiple conditions.

There are three operators available:

And()
Or()
From()

The following examples show how those operators work in practice.

Query 1¶

Find data for subjects who were diagnosed after the age of 50 and who were investigated as part of the TCGA-OV project.

In [ ]:

q1 = Q('ResearchSubject.Diagnosis.age_at_diagnosis > 50*365')
q2 = Q('ResearchSubject.associated_project = "TCGA-OV"')

q = q1.AND(q2)
r = q.run()

print(r)

Query 2¶

Find data for donors with melanoma (Nevi and Melanomas) diagnosis and who were diagnosed before the age of 30.

In [ ]:

q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')

q = q1.AND(q2)
r = q.run()

print(r)

In addition, we can check how many records come from particular systems by adding one more condition to the query:

In [ ]:

q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')
q3 = Q('ResearchSubject.Specimen.identifier.system = "GDC"')

q = q1.AND(q2.AND(q3))
r = q.run()

print(r)

By comparing the Count value of the two results we can see that all the patients returned in the initial query are coming from the GDC.

To explore the results further, we can fetch the patient JSON objects by iterating through the results:

In [ ]:

projects = set()

for patient in r:
    research_subjects = patient['ResearchSubject']
    for rs in research_subjects:
        projects.add(rs['associated_project'])

print(projects)

The output shows the projects where Nevi and Melanomas cases appear.

Query 3¶

Identify all samples that meet the following conditions:

Sample is from primary tumor
Disease is ovarian or breast cancer
Subjects are females under the age of 60 years

In [ ]:

tumor_type = Q('ResearchSubject.Specimen.source_material_type = "Primary Tumor"')
disease1 = Q('ResearchSubject.primary_disease_site = "Ovary"')
disease2 = Q('ResearchSubject.primary_disease_site = "Breast"')
demographics1 = Q('sex = "female"')
demographics2 = Q('days_to_birth > -60*365') # note that days_to_birth is a negative value

q1 = tumor_type.AND(demographics1.AND(demographics2))
q2 = disease1.OR(disease2)
q = q1.AND(q2)

r = q.run()
print(r)

In this case, we have a result that contains more than 1000 records which is the default page size. To load the next 1000 records, we can use the next_page() method:

In [ ]:

r2 = r.next_page()

In [ ]:

print(r2)

Alternatively, we can use the offset argument to specify the record to start from:

...
r = q.run(offset=1000)
print(r)

Query 4¶

Find data for donors with "Ovarian Serous Cystadenocarcinoma" with proteomic and genomic data.

Note that disease type value denoting the same disease groups can be completely different within different systems. This is where CDA features come into play. We first start by exploring the values available for this particular field in both systems.

In [3]:

unique_terms('ResearchSubject.primary_disease_type', system="GDC",limit=10,verify=False)

/Users/dboles/Documents/python/working/cda-python/cdapython/decorators_cache.py:18: ResourceWarning: unclosed <ssl.SSLSocket fd=75, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.33.24', 63143), raddr=('35.244.236.227', 443)>
  func.cache_clear()
ResourceWarning: Enable tracemalloc to get the object allocation traceback

Out[3]:

[None,
 'Acinar Cell Neoplasms',
 'Adenomas and Adenocarcinomas',
 'Adnexal and Skin Appendage Neoplasms',
 'Basal Cell Neoplasms',
 'Blood Vessel Tumors',
 'Chronic Myeloproliferative Disorders',
 'Complex Epithelial Neoplasms',
 'Complex Mixed and Stromal Neoplasms',
 'Cystic, Mucinous and Serous Neoplasms']

Since “Ovarian Serous Cystadenocarcinoma” doesn’t appear in GDC values we decide to look into the PDC:

In [ ]:

unique_terms('ResearchSubject.primary_disease_type', system="PDC")

After examining the output, we see that it does come from the PDC. Hence, if we could first identify the data that has research subjects found within the PDC that have this particular disease type, and then further narrow down the results to include only the portion of the data that is present in GDC, we could get the records that we are looking for.

In [8]:

q1 = Q('ResearchSubject.primary_disease_type = "Ovarian Serous Cystadenocarcinoma"')
q2 = Q('ResearchSubject.identifier.system = "PDC"')
q3 = Q('ResearchSubject.identifier.system = "GDC"')

q = q3.FROM(q1.AND(q2))
r = q.run(verify=False)

print(r)

Getting results from database

Total execution time: 27307 ms

        QueryID: bd084bbd-33bd-4339-b034-b620192922b1
        Query: SELECT all_v2.* FROM (SELECT all_v2.* FROM gdc-bq-sample.integration.all_v2 AS all_v2, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE ((_ResearchSubject.primary_disease_type = 'Ovarian Serous Cystadenocarcinoma') AND (_identifier.system = 'PDC'))) AS all_v2, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE (_identifier.system = 'GDC')
        Offset: 0
        Count: 100
        Total Row Count: 275
        More pages: True

/Users/dboles/Documents/python/working/cda-python/venv/lib/python3.7/site-packages/ipykernel_launcher.py:6: ResourceWarning: unclosed <ssl.SSLSocket fd=76, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.33.24', 63050), raddr=('35.244.236.227', 443)>
  
ResourceWarning: Enable tracemalloc to get the object allocation traceback

As you can see, this is achieved by utilizing From operator. The From operator allows us to create queries from results of other queries. This is particularly useful when working with conditions that involve a single field which can take multiple different values for different items in a list that is being part of, e.g. we need ResearchSubject.identifier.system to be both “PDC” and “GDC” for a single patient. In such cases, And operator can’t help because it will return those entries where the field takes both values, which is zero entries.

In [10]:

for i in Q.sql("SELECT * FROM `gdc-bq-sample.cda_mvp.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` WHERE table_name = 'v3' Limit 5",verify=False):
    print(i)

{'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'days_to_birth', 'field_path': 'days_to_birth', 'data_type': 'INT64', 'description': None}
{'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'race', 'field_path': 'race', 'data_type': 'STRING', 'description': None}
{'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'sex', 'field_path': 'sex', 'data_type': 'STRING', 'description': None}
{'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'ethnicity', 'field_path': 'ethnicity', 'data_type': 'STRING', 'description': None}
{'table_catalog': 'gdc-bq-sample', 'table_schema': 'cda_mvp', 'table_name': 'v3', 'column_name': 'id', 'field_path': 'id', 'data_type': 'STRING', 'description': None}

In [ ]:

q1 = query('ResearchSubject.identifier.system = "GDC" FROM ResearchSubject.primary_disease_type = "Ovarian Serous Cystadenocarcinoma" AND ResearchSubject.identifier.system = "PDC"')
result = q1.run(async_call=True)
print(result)

Data extraction and release information¶

In [ ]:

# If you are interested in the extraction dates or data release versions of GDC, PDC, or IDC that is in a table or view, execute this code

for i in Q.sql("SELECT option_value FROM `gdc-bq-sample.integration.INFORMATION_SCHEMA.TABLE_OPTIONS` WHERE table_name = 'all_v1'"):
    print(i)

Test queries¶

Now that we've successfully run and analyzed a few queries, here are a few additional ones you can try out on your own.

Solutions can be shared with the CDA team as indicated in the Testing Guide document.

Test Query 1¶

Find data from TCGA-BRCA project, with donors over the age of 50 with Stage IIIC cancer.

In [ ]:

# Solution

# ...

# print(r)

Test Query 2¶

Find data from all patients who have been treated with "Radiation Therapy, NOS" and have both genomic and proteomic data.

In [ ]:

# Solution

# ...

# print(r)

Test Query 3¶

Find data from all subjects with lung adenocarcinomas that have both primary and recurrent tumors.

In [ ]:

# Solution

# ...

# print(r)

In [ ]:

Q('ResearchSubject.id = "c5421e34-e5c7-4ba5-aed9-146a5575fd8d"').run().pretty_print(-1)