Check out more notebooks at our Community Notebooks Repository!
Title: How to Find GDC File Locations
Author: Lauren Hagen
Created: 2019-08-13
Purpose: Demonstrate how to find GDC file locations using manifests available in BigQuery
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_Find_GDC_File_Locations.ipynb
Notes:
In this notebook, we will explore the data sets available in the Genomic Data Commons using the GDC metadata tables in BigQuery and find the file location within GDC. The metadata tables are useful because several of the available data sets in GDC are not yet available in the ISB-CGC WebApp or as BigQuery tables. This also means that the data sets can't be used with the ISB-CGC API's. The metadata tables can help you find which data sets are available in GDC along with their locations and available file and sequencing types. The available metadata tables along with other data sets and tables from ISB-CGC can be explored without login with the ISB-CGC BigQuery Table Searcher.
This notebook has been designed to keep itself up to date when new metadata tables releases as new data sets are added to GDC or updated every few months though we will be using the tables from release 14 in the examples.
But first things first is to load the BigQuery module, authenticate ourselves, and create a client variable.
# Load the BigQuery Module
from google.cloud import bigquery
# Authenticate ourselves
!gcloud auth application-default login
Go to the following link in your browser: https://accounts.google.com/o/oauth2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&prompt=select_account&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&access_type=offline Enter verification code: 4/pAFKsa-5H-VSrDxsHGX_MbCErcPGRkCOLK1KVm1fEJpAUueWjjGBEg4 Credentials saved to file: [/content/.config/application_default_credentials.json] These credentials will be used by any library that requests Application Default Credentials. To generate an access token for other uses, run: gcloud auth application-default print-access-token To take a quick anonymous survey, run: $ gcloud alpha survey
# Create a variable for which client to use with BigQuery
client = bigquery.Client('isb-cgc-02-0001') # Replace isb-cgc-02-0001 with your project ID
Let us first explore the available metadata tables in BigQuery that are hosted by ISB-CGC. The data set in BigQuery is isb-cgc.GDC_metadata
and tables are listed with the release of data that they are from. For example: isb-cgc.GDC_metadata.rel14_fileData_current
is from release 14 of data in GDC. An explanation of what data is associated with each release can be found on the GDC Data Release and ISB-CGC Data Releases and Future Plans pages. The manifests that are list with each release is what is used to create the metadata within BigQuery that is hosted by ISB-CGC, so that you do not need to download and tidy the release manifests thus saving you time!
Now let us view which releases are available currently in BigQuery:
# For each table in the dataset print the number of rows,
# number of bytes and the name of the table
print("GDC Metadata:")
tables = list()
for t in list(client.list_tables('isb-cgc.GDC_metadata')):
print(t.table_id)
tables.append(t.table_id)
GDC Metadata: DLBC_affected_files GDC_sync_active_20190104 GDC_sync_active_20190115 GDC_sync_legacy_20190104 GDC_sync_legacy_20190115 GDC_sync_obsolete_20190115 PanCanAtlas_manifest rel12_GDCfileID_to_GCSurl rel12_aliquot2caseIDmap rel12_caseData rel12_fileData_current rel12_fileData_legacy rel12_slide2caseIDmap rel13_GDCfileID_to_GCSurl rel13_aliquot2caseIDmap rel13_caseData rel13_fileData_current rel13_fileData_legacy rel13_slide2caseIDmap rel14_GDCfileID_to_GCSurl rel14_GDCfileID_to_GCSurl_NEW rel14_aliquot2caseIDmap rel14_caseData rel14_fileData_current rel14_fileData_legacy rel14_slide2caseIDmap rel15_aliquot2caseIDmap rel15_caseData rel15_fileData_current rel15_fileData_legacy rel15_slide2caseIDmap rel16_aliquot2caseIDmap rel16_caseData rel16_fileData_active rel16_fileData_legacy rel16_slide2caseIDmap rel17_aliquot2caseIDmap rel17_caseData rel17_fileData_active rel17_fileData_legacy rel17_slide2caseIDmap
Let us get the most recent data release available in the isb-cgc.GDC_metadata
data set using the python module re
.
import re
releases = list()
print("All Available Releases:")
for table in tables:
r = re.search(r"(rel\d*)_", table)
if r:
rel = r.group(1)
if rel not in releases:
releases.append(rel)
print("\t" +rel)
curr_release = releases[len(releases)-1]
print("\nCurrent Available Release:\n\t" + curr_release)
All Available Releases: rel12 rel13 rel14 rel15 rel16 rel17 Current Available Release: rel17
The next code block can be turned off to use other release sets but the examples in the rest of the notebook may need to updated for the approapriate release and column names.
# We are going to use release 14 as our example
# If this code block is turn off, please check that the examples are using the
# correct column names.
curr_release = "rel14"
Each release is split up into several tables based on the data that they help to faciltiate finding.
Table | Description |
---|---|
rel#_caseData | List of all of the cases in GDC |
rel#_fileData_current or rel#_fileData_active | List of the currently active cases in GDC along with information related to those cases |
rel#_fileData_legacy | Same as the previous table but with legacy data instead |
rel#_aliquot2caseIDmap | “helper” table to help map between identifiers at different levels of aliquot data. The intrinsic hierarchy is program > project > case > sample > portion > analyte > aliquot |
rel#_slide2caseIDmap | “helper” table to help map between identifiers at different levels of tissue slide data. The intrinsic hierarchy is program > project > case > sample > portion > slide |
rel#_GDCfileID_to_GCSurl | Gives the Google Cloud Storage location for each file |
Let us see which data sets are available within the GDC.
datasets_query = ("SELECT program_name, COUNT(program_name) AS Num "
"FROM `isb-cgc.GDC_metadata." + curr_release + "_fileData_current` "
"GROUP BY program_name "
"ORDER BY Num DESC")
# Query the table
datasets_result = client.query(datasets_query)
# Put the results into a pandas dataframe
datasets_result = datasets_result.result().to_dataframe()
datasets_result
program_name | Num | |
---|---|---|
0 | TCGA | 314354 |
1 | FM | 36134 |
2 | TARGET | 7138 |
3 | NCICCR | 957 |
4 | CTSP | 89 |
5 | VAREPOP | 7 |
We are going to explore the Foundation Medicine Adult Cancer Clinical Data set (FM-AD), one of the newer data sets to the GDC. More information can be found on the 'FM-AD Data Set' page in the ISB-CGC documentation.
Let us first take a look at the schema for the isb-cgc.GDC_metadata.rel14_fileData_current
table as it has the most interesting data for figuring out which files are available in the GDC.
fields_query = ("SELECT column_name "
"FROM `isb-cgc.GDC_metadata.INFORMATION_SCHEMA.COLUMNS` "
"WHERE table_name = '" + curr_release + "_fileData_current'")
# Query the table
fields_result = client.query(fields_query)
# Put the results into a pandas dataframe
fields_result = fields_result.result().to_dataframe()
fields_result
column_name | |
---|---|
0 | dbName |
1 | file_gdc_id |
2 | access |
3 | acl |
4 | analysis_input_file_gdc_ids |
5 | analysis_workflow_link |
6 | analysis_workflow_type |
7 | associated_entities__case_gdc_id |
8 | associated_entities__entity_gdc_id |
9 | associated_entities__entity_submitter_id |
10 | associated_entities__entity_type |
11 | case_gdc_id |
12 | project_dbgap_accession_number |
13 | project_disease_type |
14 | project_name |
15 | program_dbgap_accession_number |
16 | program_name |
17 | project_short_name |
18 | created_datetime |
19 | data_category |
20 | data_format |
21 | data_type |
22 | downstream_analyses__output_file_gdc_ids |
23 | downstream_analyses__workflow_link |
24 | downstream_analyses__workflow_type |
25 | experimental_strategy |
26 | file_name |
27 | file_size |
28 | file_id |
29 | index_file_gdc_id |
30 | index_file_name |
31 | index_file_size |
32 | md5sum |
33 | platform |
34 | file_state |
35 | file_submitter_id |
36 | file_type |
37 | updated_datetime |
Then we will find out which files formats are available on GDC with a SQL query. Since, we might want to look at different fields and their counts, we will write a function that will take the field, table, release, and program name that will then tell us the count of the field as a pandas data frame using the Google Cloud Client Libraries.
def variable_count(field, table_type, release, program):
# Create a variable for the whole table path
table_path = "isb-cgc.GDC_metadata." + release + table_type
# Create a variable with the SQL query
count_query = ("SELECT " + field + ", "
"COUNT(" + field + ") AS Num FROM `" + table_path + "` "
"WHERE program_name = '" + program + "' "
"GROUP BY " + field)
# Query the table
query = client.query(count_query)
# Put the results into a pandas dataframe
result = query.result().to_dataframe()
# return the dataframe
return result
Let us now test the function to view the available data formats for the FM-AD data set.
# Set the field that we are interested in
field = "data_format"
# Set the program that we are interested in
program = "FM"
# Set which table we are going to query
table_type = "_fileData_current"
# run the function with the defined variables
query_result = variable_count(field, table_type, curr_release, program)
# display the query results
query_result
data_format | Num | |
---|---|---|
0 | MAF | 42 |
1 | TSV | 84 |
2 | VCF | 36008 |
It seems that most of the files for FM-AD are VCF files. Next we are going to see how many of the files available are controlled access vs open access.
# Change the field to access type
field = "access"
# run the function with the defined variables
query_result = variable_count(field, table_type, curr_release, program)
# display the query results
query_result
access | Num | |
---|---|---|
0 | controlled | 36050 |
1 | open | 84 |
Wow, the majority of the files are controlled access files. You'll want to review that you have the correct permissions such as dbGaP authorization to access controlled data with GDC before proceeding with attempting to use the data. More information can be found in the ISB-CGC documentation on the 'Accessing Controlled Data' page.
A few other fields that we might be interested are data_category
, data_type
, experimental_strategy
, file_size
, and platform
. We can modify the query by modifying the variables field
, dataset
, curr_release
, and table_type
to view different tables and field counts. Let's update the query to look at the data_type
field:
# Change the field to data type
field = "data_type"
# run the function with the defined variables
query_result = variable_count(field, table_type, curr_release, program)
# display the query results
query_result
data_type | Num | |
---|---|---|
0 | Aggregated Somatic Mutation | 42 |
1 | Biospecimen Supplement | 42 |
2 | Clinical Supplement | 42 |
3 | Raw Simple Somatic Mutation | 18004 |
4 | Annotated Somatic Mutation | 18004 |
Now that we have looked over different fields in the table, let us create a set of files that we are interested in. For this example, we are going to create a query that will find the GDC file id associated with the FM-AD data set that are VCF files and have the file type of 'simple_somatic_mutation'.
FM_gdc_file_query = ("SELECT file_gdc_id "
"FROM `isb-cgc.GDC_metadata." + curr_release + "_fileData_current` "
"WHERE program_name = 'FM' AND data_format = 'VCF' AND file_type = 'simple_somatic_mutation'")
# Query the table
gdc_file_id = client.query(FM_gdc_file_query)
# Put the results into a pandas dataframe
gdc_file_id = gdc_file_id.result().to_dataframe()
gdc_file_id.head(5)
file_gdc_id | |
---|---|
0 | 7c9a2c2b-4401-47c6-918f-c0032f9c72a7 |
1 | 2c9d1e76-1c74-41dd-aa38-43d2f2f55e4e |
2 | a7e3fd83-49be-4633-8e46-9fbdfe7ba741 |
3 | ac26d85c-efa2-42a2-b63a-58b89f0c9c1c |
4 | a1f089d2-d5dc-496a-8ec2-faaac1225d86 |
len(gdc_file_id)
18004
Now that we have a list of the GDC file ids, we can join it with the GCP urls from the rel14_GDCfileID_to_GCSurl
table.
url_query = ("WITH id AS (SELECT file_gdc_id "
"FROM `isb-cgc.GDC_metadata."+ curr_release + "_fileData_current` "
"WHERE program_name = '" + program + "' "
"AND data_format = 'VCF' AND file_type = 'simple_somatic_mutation') "
"SELECT t2.file_gdc_url "
"FROM id AS t1 "
"INNER JOIN `isb-cgc.GDC_metadata." + curr_release +"_GDCfileID_to_GCSurl_NEW` AS t2 "
"ON t1.file_gdc_id = t2.file_gdc_id")
# Query the table
result = client.query(url_query)
# Put the results into a pandas dataframe
gdc_file_url = result.result().to_dataframe()
gdc_file_url.head(5)
file_gdc_url | |
---|---|
0 | gs://gdc-fm-phs001179-controlled/a9130aed-b49d... |
1 | gs://gdc-fm-phs001179-controlled/b1c15a06-12d2... |
2 | gs://gdc-fm-phs001179-controlled/01f4eddf-15df... |
3 | gs://gdc-fm-phs001179-controlled/45bbf05a-3ad2... |
4 | gs://gdc-fm-phs001179-controlled/646fb13b-6471... |
len(gdc_file_url)
18004
Now that we have some basics about the metadata tables, it would be good to go over a more complicated SQL query to combine a set of cohort case_barcodes and then find the associated GDC urls by building and joining many tables. The first portion of the query below is from the How to Create Cohorts Notebook in the ISB-CGC Community Notebook Repository.
# Magic command of bigquery with the project id as isb-cgc-02-0001 and create a Pandas Dataframe
# Change isb-cgc-02-0001 to your project ID
%%bigquery cohort_to_url --project isb-cgc-02-0001
-- SQL Query from the How to Create Cohorts Notebook
WITH
select_on_annotations AS (
SELECT
case_barcode,
category AS categoryName,
classification AS classificationName
FROM
`isb-cgc.TCGA_bioclin_v0.Annotations`
WHERE
( entity_type="Patient"
AND (category="History of unacceptable prior treatment related to a prior/other malignancy"
OR classification="Redaction" ) )
GROUP BY
case_barcode,
categoryName,
classificationName ),
select_on_clinical AS (
SELECT
case_barcode,
vital_status,
days_to_last_known_alive,
ethnicity,
histological_type,
menopause_status,
race
FROM
`isb-cgc.TCGA_bioclin_v0.Clinical`
WHERE
( disease_code = "BRCA"
AND age_at_diagnosis<=50
AND gender="FEMALE" ) ),
-- Combine the cohort with the metadata tables to create a list of GDC urls
cohort AS (
SELECT
case_barcode
FROM (
SELECT
a.categoryName,
a.classificationName,
c.case_barcode
FROM
select_on_annotations AS a
FULL JOIN
select_on_clinical AS c
ON
a.case_barcode = c.case_barcode
WHERE
a.case_barcode IS NOT NULL
OR c.case_barcode IS NOT NULL
ORDER BY
a.classificationName,
a.categoryName,
c.case_barcode )
WHERE
categoryName IS NULL
AND classificationName IS NULL
AND case_barcode IS NOT NULL
ORDER BY
case_barcode),
gdc AS (SELECT a.case_barcode, b.case_gdc_id
FROM cohort AS a
INNER JOIN `isb-cgc.GDC_metadata.rel14_caseData` AS b
ON a.case_barcode = b.case_barcode),
curr AS (SELECT c.case_barcode, c.case_gdc_id, d.file_gdc_id
FROM gdc as c
INNER JOIN `isb-cgc.GDC_metadata.rel14_fileData_current` AS d
ON c.case_gdc_id = d.case_gdc_id),
url AS ( SELECT e.case_barcode, e.case_gdc_id, e.file_gdc_id, f.file_gdc_url
FROM curr AS e
INNER JOIN `isb-cgc.GDC_metadata.rel14_GDCfileID_to_GCSurl_NEW` AS f
ON e.file_gdc_id = f.file_gdc_id)
SELECT case_barcode, file_gdc_url FROM url ORDER BY case_barcode
cohort_to_url.head(5)
case_barcode | file_gdc_url | |
---|---|---|
0 | TCGA-3C-AALI | gs://gdc-tcga-phs000178-open/c993c79e-ac58-423... |
1 | TCGA-3C-AALI | gs://gdc-tcga-phs000178-open/fe3ceb38-aecc-4ef... |
2 | TCGA-3C-AALI | gs://gdc-tcga-phs000178-controlled/a736c2d0-2c... |
3 | TCGA-3C-AALI | gs://gdc-tcga-phs000178-open/00737aa1-d1da-438... |
4 | TCGA-3C-AALI | gs://gdc-tcga-phs000178-open/ad8e6a02-4483-404... |