ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to Find GDC File Locations
Author:  Lauren Hagen
Created: 2019-08-13
Purpose: Demonstrate how to find GDC file locations using manifests available in BigQuery
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_Find_GDC_File_Locations.ipynb
Notes:

How to Find GDC File Locations

In this notebook, we will explore the data sets available in the Genomic Data Commons using the GDC metadata tables in BigQuery and find the file location within GDC. The metadata tables are useful because several of the available data sets in GDC are not yet available in the ISB-CGC WebApp or as BigQuery tables. This also means that the data sets can't be used with the ISB-CGC API's. The metadata tables can help you find which data sets are available in GDC along with their locations and available file and sequencing types. The available metadata tables along with other data sets and tables from ISB-CGC can be explored without login with the ISB-CGC BigQuery Table Searcher.

This notebook has been designed to keep itself up to date when new metadata tables releases as new data sets are added to GDC or updated every few months though we will be using the tables from release 14 in the examples.

But first things first is to load the BigQuery module, authenticate ourselves, and create a client variable.

In [0]:
# Load the BigQuery Module
from google.cloud import bigquery
In [2]:
# Authenticate ourselves
!gcloud auth application-default login
Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&prompt=select_account&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&access_type=offline


Enter verification code: 4/pAFKsa-5H-VSrDxsHGX_MbCErcPGRkCOLK1KVm1fEJpAUueWjjGBEg4

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token


To take a quick anonymous survey, run:
  $ gcloud alpha survey

In [0]:
# Create a variable for which client to use with BigQuery
client = bigquery.Client('isb-cgc-02-0001') # Replace isb-cgc-02-0001 with your project ID

Let us first explore the available metadata tables in BigQuery that are hosted by ISB-CGC. The data set in BigQuery is isb-cgc.GDC_metadata and tables are listed with the release of data that they are from. For example: isb-cgc.GDC_metadata.rel14_fileData_current is from release 14 of data in GDC. An explanation of what data is associated with each release can be found on the GDC Data Release and ISB-CGC Data Releases and Future Plans pages. The manifests that are list with each release is what is used to create the metadata within BigQuery that is hosted by ISB-CGC, so that you do not need to download and tidy the release manifests thus saving you time!

Now let us view which releases are available currently in BigQuery:

In [5]:
# For each table in the dataset print the number of rows,
# number of bytes and the name of the table
print("GDC Metadata:")
tables = list()
for t in list(client.list_tables('isb-cgc.GDC_metadata')):
  print(t.table_id)
  tables.append(t.table_id)
GDC Metadata:
DLBC_affected_files
GDC_sync_active_20190104
GDC_sync_active_20190115
GDC_sync_legacy_20190104
GDC_sync_legacy_20190115
GDC_sync_obsolete_20190115
PanCanAtlas_manifest
rel12_GDCfileID_to_GCSurl
rel12_aliquot2caseIDmap
rel12_caseData
rel12_fileData_current
rel12_fileData_legacy
rel12_slide2caseIDmap
rel13_GDCfileID_to_GCSurl
rel13_aliquot2caseIDmap
rel13_caseData
rel13_fileData_current
rel13_fileData_legacy
rel13_slide2caseIDmap
rel14_GDCfileID_to_GCSurl
rel14_GDCfileID_to_GCSurl_NEW
rel14_aliquot2caseIDmap
rel14_caseData
rel14_fileData_current
rel14_fileData_legacy
rel14_slide2caseIDmap
rel15_aliquot2caseIDmap
rel15_caseData
rel15_fileData_current
rel15_fileData_legacy
rel15_slide2caseIDmap
rel16_aliquot2caseIDmap
rel16_caseData
rel16_fileData_active
rel16_fileData_legacy
rel16_slide2caseIDmap
rel17_aliquot2caseIDmap
rel17_caseData
rel17_fileData_active
rel17_fileData_legacy
rel17_slide2caseIDmap

Let us get the most recent data release available in the isb-cgc.GDC_metadata data set using the python module re.

In [0]:
import re
In [7]:
releases = list()
print("All Available Releases:")
for table in tables:
  r = re.search(r"(rel\d*)_", table)
  if r:
    rel = r.group(1)
    if rel not in releases:
      releases.append(rel)
      print("\t" +rel)
curr_release = releases[len(releases)-1]
print("\nCurrent Available Release:\n\t" + curr_release)
All Available Releases:
	rel12
	rel13
	rel14
	rel15
	rel16
	rel17

Current Available Release:
	rel17

The next code block can be turned off to use other release sets but the examples in the rest of the notebook may need to updated for the approapriate release and column names.

In [0]:
# We are going to use release 14 as our example
# If this code block is turn off, please check that the examples are using the
# correct column names.
curr_release = "rel14"

Each release is split up into several tables based on the data that they help to faciltiate finding.

Table Description
rel#_caseData List of all of the cases in GDC
rel#_fileData_current or rel#_fileData_active List of the currently active cases in GDC along with information related to those cases
rel#_fileData_legacy Same as the previous table but with legacy data instead
rel#_aliquot2caseIDmap “helper” table to help map between identifiers at different levels of aliquot data. The intrinsic hierarchy is program > project > case > sample > portion > analyte > aliquot
rel#_slide2caseIDmap “helper” table to help map between identifiers at different levels of tissue slide data. The intrinsic hierarchy is program > project > case > sample > portion > slide
rel#_GDCfileID_to_GCSurl Gives the Google Cloud Storage location for each file

Let us see which data sets are available within the GDC.

In [18]:
datasets_query = ("SELECT program_name, COUNT(program_name) AS Num "
                "FROM `isb-cgc.GDC_metadata." + curr_release + "_fileData_current` "
                "GROUP BY program_name "
                "ORDER BY Num DESC")
# Query the table
datasets_result = client.query(datasets_query)
# Put the results into a pandas dataframe
datasets_result = datasets_result.result().to_dataframe()
datasets_result
Out[18]:
program_name Num
0 TCGA 314354
1 FM 36134
2 TARGET 7138
3 NCICCR 957
4 CTSP 89
5 VAREPOP 7

We are going to explore the Foundation Medicine Adult Cancer Clinical Data set (FM-AD), one of the newer data sets to the GDC. More information can be found on the 'FM-AD Data Set' page in the ISB-CGC documentation.

Let us first take a look at the schema for the isb-cgc.GDC_metadata.rel14_fileData_current table as it has the most interesting data for figuring out which files are available in the GDC.

In [21]:
fields_query = ("SELECT column_name "
                "FROM `isb-cgc.GDC_metadata.INFORMATION_SCHEMA.COLUMNS` "
                "WHERE table_name = '" + curr_release + "_fileData_current'")
# Query the table
fields_result = client.query(fields_query)
# Put the results into a pandas dataframe
fields_result = fields_result.result().to_dataframe()
fields_result
Out[21]:
column_name
0 dbName
1 file_gdc_id
2 access
3 acl
4 analysis_input_file_gdc_ids
5 analysis_workflow_link
6 analysis_workflow_type
7 associated_entities__case_gdc_id
8 associated_entities__entity_gdc_id
9 associated_entities__entity_submitter_id
10 associated_entities__entity_type
11 case_gdc_id
12 project_dbgap_accession_number
13 project_disease_type
14 project_name
15 program_dbgap_accession_number
16 program_name
17 project_short_name
18 created_datetime
19 data_category
20 data_format
21 data_type
22 downstream_analyses__output_file_gdc_ids
23 downstream_analyses__workflow_link
24 downstream_analyses__workflow_type
25 experimental_strategy
26 file_name
27 file_size
28 file_id
29 index_file_gdc_id
30 index_file_name
31 index_file_size
32 md5sum
33 platform
34 file_state
35 file_submitter_id
36 file_type
37 updated_datetime

Then we will find out which files formats are available on GDC with a SQL query. Since, we might want to look at different fields and their counts, we will write a function that will take the field, table, release, and program name that will then tell us the count of the field as a pandas data frame using the Google Cloud Client Libraries.

In [0]:
def variable_count(field, table_type, release, program):
  # Create a variable for the whole table path
  table_path = "isb-cgc.GDC_metadata." + release + table_type
  # Create a  variable with the SQL query
  count_query = ("SELECT " + field + ", "
                "COUNT(" + field + ") AS Num FROM `" + table_path + "` "
                "WHERE program_name = '" + program + "' "
                "GROUP BY " + field)
  # Query the table
  query = client.query(count_query)
  # Put the results into a pandas dataframe
  result = query.result().to_dataframe()
  # return the dataframe
  return result

Let us now test the function to view the available data formats for the FM-AD data set.

In [23]:
# Set the field that we are interested in
field = "data_format"

# Set the program that we are interested in
program = "FM"

# Set which table we are going to query
table_type = "_fileData_current"

# run the function with the defined variables
query_result = variable_count(field, table_type, curr_release, program)

# display the query results
query_result
Out[23]:
data_format Num
0 MAF 42
1 TSV 84
2 VCF 36008

It seems that most of the files for FM-AD are VCF files. Next we are going to see how many of the files available are controlled access vs open access.

In [24]:
# Change the field to access type
field = "access"

# run the function with the defined variables
query_result = variable_count(field, table_type, curr_release, program)

# display the query results
query_result
Out[24]:
access Num
0 controlled 36050
1 open 84

Wow, the majority of the files are controlled access files. You'll want to review that you have the correct permissions such as dbGaP authorization to access controlled data with GDC before proceeding with attempting to use the data. More information can be found in the ISB-CGC documentation on the 'Accessing Controlled Data' page.

A few other fields that we might be interested are data_category, data_type, experimental_strategy, file_size, and platform. We can modify the query by modifying the variables field, dataset, curr_release, and table_type to view different tables and field counts. Let's update the query to look at the data_type field:

In [25]:
# Change the field to data type
field = "data_type"

# run the function with the defined variables
query_result = variable_count(field, table_type, curr_release, program)

# display the query results
query_result
Out[25]:
data_type Num
0 Aggregated Somatic Mutation 42
1 Biospecimen Supplement 42
2 Clinical Supplement 42
3 Raw Simple Somatic Mutation 18004
4 Annotated Somatic Mutation 18004

Now that we have looked over different fields in the table, let us create a set of files that we are interested in. For this example, we are going to create a query that will find the GDC file id associated with the FM-AD data set that are VCF files and have the file type of 'simple_somatic_mutation'.

In [0]:
FM_gdc_file_query = ("SELECT file_gdc_id "
                     "FROM `isb-cgc.GDC_metadata." + curr_release + "_fileData_current` "
                     "WHERE program_name = 'FM' AND data_format = 'VCF' AND file_type =  'simple_somatic_mutation'")
# Query the table
gdc_file_id = client.query(FM_gdc_file_query)
# Put the results into a pandas dataframe
gdc_file_id = gdc_file_id.result().to_dataframe()
In [29]:
gdc_file_id.head(5)
Out[29]:
file_gdc_id
0 7c9a2c2b-4401-47c6-918f-c0032f9c72a7
1 2c9d1e76-1c74-41dd-aa38-43d2f2f55e4e
2 a7e3fd83-49be-4633-8e46-9fbdfe7ba741
3 ac26d85c-efa2-42a2-b63a-58b89f0c9c1c
4 a1f089d2-d5dc-496a-8ec2-faaac1225d86
In [30]:
len(gdc_file_id)
Out[30]:
18004

Now that we have a list of the GDC file ids, we can join it with the GCP urls from the rel14_GDCfileID_to_GCSurl table.

In [0]:
url_query = ("WITH id AS (SELECT file_gdc_id "
             "FROM `isb-cgc.GDC_metadata."+ curr_release + "_fileData_current` "
             "WHERE program_name = '" + program + "' "
             "AND data_format = 'VCF' AND file_type =  'simple_somatic_mutation') "
             "SELECT t2.file_gdc_url "
             "FROM id AS t1 "
             "INNER JOIN `isb-cgc.GDC_metadata." + curr_release +"_GDCfileID_to_GCSurl_NEW` AS t2 "
             "ON t1.file_gdc_id = t2.file_gdc_id")

# Query the table
result = client.query(url_query)
# Put the results into a pandas dataframe
gdc_file_url = result.result().to_dataframe()
In [0]:
gdc_file_url.head(5)
Out[0]:
file_gdc_url
0 gs://gdc-fm-phs001179-controlled/a9130aed-b49d...
1 gs://gdc-fm-phs001179-controlled/b1c15a06-12d2...
2 gs://gdc-fm-phs001179-controlled/01f4eddf-15df...
3 gs://gdc-fm-phs001179-controlled/45bbf05a-3ad2...
4 gs://gdc-fm-phs001179-controlled/646fb13b-6471...
In [0]:
len(gdc_file_url)
Out[0]:
18004

Now that we have some basics about the metadata tables, it would be good to go over a more complicated SQL query to combine a set of cohort case_barcodes and then find the associated GDC urls by building and joining many tables. The first portion of the query below is from the How to Create Cohorts Notebook in the ISB-CGC Community Notebook Repository.

In [0]:
# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe
# Change isb-cgc-02-0001 to your project ID
%%bigquery cohort_to_url --project isb-cgc-02-0001
-- SQL Query from the How to Create Cohorts Notebook
WITH
  select_on_annotations AS (
  SELECT
    case_barcode,
    category AS categoryName,
    classification AS classificationName
  FROM
    `isb-cgc.TCGA_bioclin_v0.Annotations`
  WHERE
    ( entity_type="Patient"
      AND (category="History of unacceptable prior treatment related to a prior/other malignancy"
        OR classification="Redaction" ) )
  GROUP BY
    case_barcode,
    categoryName,
    classificationName ),
  select_on_clinical AS (
  SELECT
    case_barcode,
    vital_status,
    days_to_last_known_alive,
    ethnicity,
    histological_type,
    menopause_status,
    race
  FROM
    `isb-cgc.TCGA_bioclin_v0.Clinical`
  WHERE
    ( disease_code = "BRCA"
      AND age_at_diagnosis<=50
      AND gender="FEMALE" ) ),
-- Combine the cohort with the metadata tables to create a list of GDC urls
    cohort AS (
  SELECT
    case_barcode
  FROM (
    SELECT
      a.categoryName,
      a.classificationName,
      c.case_barcode
    FROM
      select_on_annotations AS a
    FULL JOIN
      select_on_clinical AS c
    ON
      a.case_barcode = c.case_barcode
    WHERE
      a.case_barcode IS NOT NULL
      OR c.case_barcode IS NOT NULL
    ORDER BY
      a.classificationName,
      a.categoryName,
      c.case_barcode )
  WHERE
    categoryName IS NULL
    AND classificationName IS NULL
    AND case_barcode IS NOT NULL
  ORDER BY
    case_barcode),
  gdc AS (SELECT a.case_barcode, b.case_gdc_id
  FROM cohort AS a
  INNER JOIN `isb-cgc.GDC_metadata.rel14_caseData` AS b
  ON a.case_barcode = b.case_barcode),
  curr AS (SELECT c.case_barcode, c.case_gdc_id, d.file_gdc_id
  FROM gdc as c
  INNER JOIN `isb-cgc.GDC_metadata.rel14_fileData_current` AS d
  ON c.case_gdc_id = d.case_gdc_id),
  url AS ( SELECT e.case_barcode, e.case_gdc_id, e.file_gdc_id, f.file_gdc_url
  FROM curr AS e
  INNER JOIN `isb-cgc.GDC_metadata.rel14_GDCfileID_to_GCSurl_NEW` AS f
  ON e.file_gdc_id = f.file_gdc_id)
SELECT case_barcode, file_gdc_url FROM url ORDER BY case_barcode
In [0]:
cohort_to_url.head(5)
Out[0]:
case_barcode file_gdc_url
0 TCGA-3C-AALI gs://gdc-tcga-phs000178-open/c993c79e-ac58-423...
1 TCGA-3C-AALI gs://gdc-tcga-phs000178-open/fe3ceb38-aecc-4ef...
2 TCGA-3C-AALI gs://gdc-tcga-phs000178-controlled/a736c2d0-2c...
3 TCGA-3C-AALI gs://gdc-tcga-phs000178-open/00737aa1-d1da-438...
4 TCGA-3C-AALI gs://gdc-tcga-phs000178-open/ad8e6a02-4483-404...