ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   Quick Start Guide to ISB-CGC
Author:  Lauren Hagen
Created: 2019-06-20
Purpose: Painless intro to working in the cloud
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb
Notes:

Quick Start Guide to ISB-CGC

This Quick Start Guide is intended to give an overview of the data available, walk you through the steps of setting up your accounts, and get you started with a basic example in Python. If you have read the R version, you can skip to the Example section.

Access Requirements

Access Suggestions

  • Favored Programming Language (R or Python)
  • Favored IDE (RStudio or Jupyter)

Outline for this Notebook

  • Quick Overview of ISB-CGC
  • About the Data on ISB-CGC
  • Overview of How to Access Data
  • Account Set-up
  • ISB-CGC Web Interface
  • Google Cloud Platform (GCP) and BigQuery Overview
  • Example of Accessing Data with Python
  • Where to go next

Overview of ISB-CGC

The ISB-CGC provides both interactive and programmatic access to data hosted by institutes such as the Genomic Data Commons (GDC) of the National Cancer Institute (NCI) and the Wellcome Trust Sanger Institute while leveraging many aspects of the Google Cloud Platform. You can also import your own data to analyze it side by side with the datasets and share your data when you see fit.

In [0]:
#@title Introduction to ISB-CGC Video
#@markdown This 12 minute video goes over an introduction to ISB-CGC
from IPython.display import YouTubeVideo
YouTubeVideo('RQsLKDTciWk', width=600, height=400)
#@markdown For more videos check out: [ISB-CGC Video Tutorial Series](https://isb-cgc.appspot.com/videotutorials/)
Out[0]:

About the Data in the Cloud

The main data hosted on the cloud is from The Cancer Genome Atlas (TCGA), a large-scale multi-disciplinary collaboration started by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). Hosted data types and files include RNA-Seq FASTQ files, DNA-Seq and RNA-Seq BAM files, Genome-Wide SNP6 array CEL files, and variant calls in VCF files. A number of other datasets are also hosted, including data from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) and Cancer Cell Line Encyclopedia (CCLE) programs.

ISB-CGC hosts several tables in BigQuery with data from TCGA, TARGET, and CCLE, along with reference tables and Catalogue Of Somatic Mutations In Cancer (COSMIC) datasets from the Wellcome Trust Sanger Institute. ISB-CGC is adding more datasets all the time, so if you have suggestions for a dataset to be added, please email: [email protected]

For more information, please visit: Programs and Data Sets and Data in BigQuery

Overview of How to Access Data

There are several ways to access the data hosted by ISB-CGC.

Account Set-up

Complete these steps if you have not already done so before reading this guide:

  1. Log in to or create a Gmail account
    • You can use your institutional email if it is a Google Identity
  2. Create a GCP Project using a Gmail account
  3. Authorize your account for dbGaP in the ISB-CGC WebApp (required for viewing controlled access data)
    • To access controlled data, users must first be authenticated by NIH (via the ISB-CGC WebApp). Upon successful authentication, the user's dbGaP authorization will be verified. These two steps are required before the user's Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.
    • Please view Accessing Controlled-Access Data if you need help with this step.
  4. Register your GCP project in the ISB-CGC WebApp
  5. Enable the following required Google Cloud APIs:
    • Google Compute Engine
    • Google Genomics
    • Google BigQuery
    • Google Cloud Logging
    • Google Cloud Pub/Sub
  6. Install optional software such as:

ISB-CGC Web Interface

The ISB-CGC Web Interface is an interactive web-based application to access and explore the rich TCGA, TARGET, and CCLE datasets with more datasets being added regularly. Through the WebApp you can create Cohorts, lists of Favorite Genes, miRNA, and Variables. The Cohorts and Variables can be used in Workbooks to allow you to quickly analyze and export datasets by mixing and matching the selections. The ISB-CGC Web Interface also allows you to view and analyze available pathology and radiology images associated with selected cohort data.

Google Cloud Platform and BigQuery Overview

The Google Cloud Platform Console is the web-based interface to your GCP Project. From the Console, you can check the overall status of your project, create and delete Cloud Storage buckets, upload and download files, spin up and shut down VMs, add members to your project, access the Cloud Shell command line, etc. Click here to download a quick tour from ISB-CGC of the GCP Console. Remember that any costs you incur are charged to your current project, so make sure you are on the correct one if you are part of multiple projects. Here is how to check which project is your current project.

"BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and machine learning built in." (Source) ISB-CGC has uploaded multiple cancer genomic datasets into BigQuery tables that are open-source, such as TCGA and TARGET Clinical, Biospecimen, and Molecular Data, along with dataset metadata. This data can be accessed from the Google Cloud Platform Console web UI, programmatically with R, and programmatically with Python through Cloud Datalab or Colab.

More in-depth walkthroughs:

Example of Accessing Data with Python

Log into Google Cloud Storage and Authenticate ourselves

  1. Authenticate yourself with your Google Cloud Login
  2. A second tab will open or follow the link provided
  3. Follow prompts to Authorize your account to use Google Cloud SDK
  4. Copy code provided and paste into the box under the Command
  5. Press Enter

Alternatives for Authentication can be found here

In [0]:
# Run a command line command with the bang (!) and gcloud
!gcloud auth application-default login 
Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&prompt=select_account&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&access_type=offline


Enter verification code: 4/XQGk8wtHV404M8mfwbkdcZjmj-DpxkeKCnUvD3hh4y8XCWa00jfNoww

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token


To take a quick anonymous survey, run:
  $ gcloud alpha survey
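When the notebook runs inside Colab, one alternative to the `gcloud` prompt above is Colab's own authentication helper. The sketch below assumes the `google.colab` package is only importable inside Colab; `authenticate` is a hypothetical wrapper name used for illustration, and outside Colab it simply reports that the `gcloud` flow shown above is still needed.

```python
def authenticate():
    """Authenticate via Colab when available; otherwise report the fallback.

    Returns "colab" if Colab's helper ran, or "gcloud" if the caller should
    use `gcloud auth application-default login` instead.
    """
    try:
        # This import is expected to fail outside of Colab.
        from google.colab import auth
    except ImportError:
        return "gcloud"
    # Opens Colab's sign-in prompt for the current runtime.
    auth.authenticate_user()
    return "colab"


method = authenticate()
print("Authentication path:", method)
```

Either path should leave credentials in place for the BigQuery client used in the following cells.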

View Datasets and Tables in BigQuery

Let us look at the datasets available through ISB-CGC that are in BigQuery. You will need to load the BigQuery API and set the client (click here for more information).

In [0]:
# Load BigQuery API
from google.cloud import bigquery

# Create a client to access the data within BigQuery
client = bigquery.Client('isb-cgc')

# Create a variable of datasets 
datasets = list(client.list_datasets())
# Create a variable for the name of the project
project = client.project

# If there are datasets available then print their names,
# else print that there are no data sets available
if datasets:
    print("Datasets in project {}:".format(project))
    for dataset in datasets:  # API request(s)
        print("\t{}".format(dataset.dataset_id))
else:
    print("{} project does not contain any datasets.".format(project))
Datasets in project isb-cgc:
	CCLE_bioclin_v0
	GDC_metadata
	GTEx_v7
	QotM
	TARGET_bioclin_v0
	TARGET_hg38_data_v0
	TCGA_bioclin_v0
	TCGA_hg19_data_v0
	TCGA_hg38_data_v0
	Toil_recompute
	ccle_201602_alpha
	genome_reference
	hg19_data_previews
	hg38_data_previews
	metadata
	platform_reference
	tcga_201607_beta
	tcga_cohorts
	tcga_seq_metadata
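With this many datasets, it can be handy to narrow the list to one program. The following is a minimal sketch using plain Python on the dataset names; `datasets_with_prefix` is a hypothetical helper, and the hard-coded list is a subset of the names printed above so that the example runs without a BigQuery connection.

```python
def datasets_with_prefix(dataset_ids, prefix):
    """Return the dataset IDs that start with `prefix`, ignoring case."""
    prefix = prefix.lower()
    return sorted(d for d in dataset_ids if d.lower().startswith(prefix))


# A subset of the dataset names listed in the output above
dataset_ids = [
    "CCLE_bioclin_v0",
    "GDC_metadata",
    "TARGET_bioclin_v0",
    "TCGA_bioclin_v0",
    "TCGA_hg19_data_v0",
    "TCGA_hg38_data_v0",
    "tcga_201607_beta",
    "tcga_cohorts",
    "tcga_seq_metadata",
]

for name in datasets_with_prefix(dataset_ids, "tcga"):
    print(name)
```

With a live client, the same helper could be fed `[d.dataset_id for d in client.list_datasets()]` instead of the hard-coded list.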

Let us see which tables are under the TCGA_bioclin_v0 dataset.

In [0]:
print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables('isb-cgc.TCGA_bioclin_v0'))
# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")
Tables:
	Annotations
	Biospecimen
	Clinical

Access BigQuery to call a table

First, call BigQuery with a magic command; then you can write your query in Standard SQL. Click here for more on IPython Magic Commands for BigQuery. The result will be a pandas DataFrame.

In [0]:
%%bigquery --project isb-cgc-02-0001
# The %%bigquery cell magic must be the first line of the cell.
# Replace isb-cgc-02-0001 with your own project ID.
SELECT # Select a few columns to view
  program_name,
  case_barcode,
  project_short_name
FROM # From the TCGA Clinical table
  `isb-cgc.TCGA_bioclin_v0.Clinical`
LIMIT # Limit to 5 rows as the table is very large and we only want to see a few results
  5

# General form of the query above:
# SELECT column_name_1, column_name_2
# FROM `project_name.dataset_name.table_name`
# LIMIT 5
Out[0]:
  program_name  case_barcode project_short_name
0         TCGA  TCGA-01-0628            TCGA-OV
1         TCGA  TCGA-01-0630            TCGA-OV
2         TCGA  TCGA-01-0631            TCGA-OV
3         TCGA  TCGA-01-0633            TCGA-OV
4         TCGA  TCGA-01-0636            TCGA-OV
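The same kind of query can also be run without the cell magic, through the client library directly. The following is a sketch under a few assumptions: `client` is a `google.cloud.bigquery.Client` created with your own project ID (so that charges land on your project), and `build_preview_query`, `run_preview`, and `estimate_bytes` are hypothetical helper names. The dry-run helper asks BigQuery how many bytes a query would scan without actually running it, which is one way to keep an eye on cost.

```python
def build_preview_query(table_id, columns, limit=5):
    """Build a Standard SQL query that previews a few columns of a table."""
    column_list = ",\n  ".join(columns)
    return "SELECT\n  {}\nFROM\n  `{}`\nLIMIT\n  {}".format(
        column_list, table_id, int(limit)
    )


def run_preview(client, table_id, columns, limit=5):
    """Run the preview query and return the result as a pandas DataFrame."""
    sql = build_preview_query(table_id, columns, limit)
    return client.query(sql).to_dataframe()  # API request


def estimate_bytes(client, sql):
    """Dry-run a query and return the number of bytes it would process."""
    from google.cloud import bigquery  # imported lazily; needs the library

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=config).total_bytes_processed


sql = build_preview_query(
    "isb-cgc.TCGA_bioclin_v0.Clinical",
    ["program_name", "case_barcode", "project_short_name"],
)
print(sql)
```

With an authenticated client, `run_preview(client, 'isb-cgc.TCGA_bioclin_v0.Clinical', ['program_name', 'case_barcode'])` would return a DataFrame similar to the output above.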

Now that wasn't so difficult! Have fun exploring and analyzing the ISB-CGC Data!