Open In Colab

ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   Intro to COSMIC in BigQuery
Author:  Akshay Balaji
Created: 2020-07-02
Purpose: Painless intro to working with COSMIC data in the cloud
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_cohorts.ipynb
Notes:

Intro to COSMIC in BigQuery

This notebook serves as an introduction to the COSMIC dataset and also a review of the basics of working with BigQuery in Python. Much of the non-COSMIC information can also be found in our Quick Start Guide to ISB-CGC notebook.

Note: You will need to get extra permissions to work with COSMIC. You must sign up with the Sanger Institute's COSMIC page before you can access ISB's COSMIC BigQuery tables. You can find more information here.

Goals:

  • Successfully establish a client with the BigQuery COSMIC data set to begin querying it from the notebook itself
  • Perform basic SQL commands and analyses on the COSMIC data set
  • Understand the kinds of data stored in each table of COSMIC in the CGC

Login to Google Cloud and authenticate yourself

The very first step is to login to Google Cloud so that BigQuery can be accessed.

  1. Run the code bit below.
  2. Go to the link returned in the command line.
  3. Follow the prompts to authorize your account to use the Google Cloud SDK.
  4. Copy the code provided and paste it into the box in the command line.
  5. Press Enter.

Note: You may encounter a warning message such as WARNING: Cannot find a quota project to add to ADC.... You can ignore this message; it should not impact your ability to use the BigQuery client.

In [ ]:
from google.colab import auth
try:
  auth.authenticate_user()
  print('You have been successfully authenticated!')
except:
  print('You have not been authenticated.')
You have been successfully authenticated!

Create notebook client to BigQuery

The command to create the client to the ISB-CGC BigQuery data is:

In [ ]:
# Load BigQuery API <-- If you skip this, notebook throws a NameError
from google.cloud import bigquery

# Establish the client
# Ignore the UserWarning message that results
client = bigquery.Client('isb-cgc')

Identify the COSMIC datasets available in BigQuery

In BigQuery, datasets are stored within projects. Projects define access to data in two important respects:

  1. The ability to view metadata
  2. The ability to query data

In addition, each project defines which users can access it. ISB-CGC maintains isb-cgc as a project to allow users to view datasets, but we use it to view the metadata about the datasets we're talking about, such as what tables exist within each dataset. However to query the data, your own Google Cloud Project must be used.

In order for outside researchers to be able to query the ISB-CGC data, you will need to use your own project for the queries to bill to. Google offers a GCP free tier that allows for a user to query up to 1TB a month. ISB-CGC also offers $300 in free cloud credits for cancer researchers. More information can be found in our documentation here.


Now, knowing that the COSMIC datasets are stored within isb-cgc, we can identify all the COSMIC datasets available in the ISB-CGC resource (a form of metadata):

In [ ]:
# Create a variable of datasets 
datasets = list(client.list_datasets())
# Create a variable for the name of the project
project = client.project

# If there are datasets available then check whether they're COSMIC datasets
# else print that either: 1) there are no COSMIC data sets available; or 2) there are
# no data sets available at all

cosmic_count = 0 # number of COSMIC datasets

if datasets:
    print("COSMIC datasets in project {}:".format(project))
    for dataset in datasets:  # API request(s)
        if "COSMIC" in dataset.dataset_id.upper():
          cosmic_count += 1
          print("\t{}".format(dataset.dataset_id))
    if cosmic_count == 0:
      print("No COSMIC datasets, but other datasets are available.")
else:
    print("{} project does not contain any datasets.".format(project))
COSMIC datasets in project isb-cgc:
	COSMIC_v85_grch37
	COSMIC_v85_grch38
	COSMIC_v86_grch37
	COSMIC_v86_grch38
	COSMIC_v87_grch37
	COSMIC_v87_grch38
	COSMIC_v89_grch37
	COSMIC_v89_grch38
	COSMIC_v90_grch37
	COSMIC_v90_grch38
	COSMIC_v91_grch37
	COSMIC_v91_grch38

As you can see, ISB-CGC hosts the last several versions of the COSMIC database. The latest version is the last one, version 91, as indicated by its dataset name. There are two options for v91: grch37 and grch38. These correspond to different versions of the reference human genome, and for the purposes of our analyses we want to be working with the most recent human genome version, grch38.

Next, we can look at the tables within COSMIC_v91_grch38. We do this using a SQL query -- we'll go over querying very soon. For now, what's important to know is that the table names and descriptions are contained in the dataset's metadata, accessible with the view INFORMATION_SCHEMA.

A metadata view is a collection of fields containing metadata on a dataset.

We can write a SQL query to query INFORMATION_SCHEMA.TABLE_OPTIONS, which will provide table descriptions, among other information which can be found here. We can then save the query result as a Pandas dataframe and display it nicely:

In [ ]:
import pandas as pd #import the pandas library and refer to it as 'pd', instead of 'pandas'


# We want to select the rows which contain descriptions for the tables in COSMIC_v91_grch38
# and display the table name with the description, which will be contained in the 
# 'option_value' column. 'option_name' column = 'description' lets us know that 
# the 'option_value' column is a description.
QUERY = """
SELECT table_name AS Table_Name, TRIM(option_value, '\"') as Description
FROM `isb-cgc.COSMIC_v91_grch38.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE option_name = 'description'
ORDER BY table_name ASC;
"""
# obtain the query result and convert it to a Pandas dataframe
tables = client.query(QUERY).result().to_dataframe() 

pd.set_option('display.max_colwidth', None) # make sure the descriptions display in full
display(tables) # IPython's display() command displays tables more nicely than print()
Table_Name Description
0 ASCAT_Purity_Ploidy This table contains information regarding the ploidy and aberrant cell fraction (purity estimate) for TCGA samples re-analysed using ASCAT.
1 Breakpoints This table contains a list of breakpoints from the current release.
2 Cancer_Gene_Census This table contains a list of cancer census genes from the current release.
3 Classification This table contains COSMIC cancer classification information.
4 Complete_CNA All copy number abberations from the current release in a tab separated table. For more information on copy number data, please see http://cancer.sanger.ac.uk/cosmic/help/cnv/overview.
5 Complete_Differential_Methylation This table contains all TCGA level 3 methylation data from the ICGC portal for the current release
6 Complete_Gene_Expression This table contains all gene-expression level 3 data from the TCGA portal.
7 Complete_Targeted_Screens_Mutant This table contains information regarding the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set.
8 Fusion This table contains all gene fusion mutation data from the current release.
9 Genome_Screens_Mutant This table contains information regarding coding point mutations from genome wide screens (including whole exome sequencing).
10 HGNC This table contains information regarding the relationship between the Cancer Gene Census, COSMIC ID, Gene Name, HGNC ID, and Entrez ID.
11 Mutant This table contains all COSMIC coding point mutations from targeted and genome wide screens from the current release.
12 Mutant_Census This table contains all coding mutations in genes listed in the Cancer Gene Census.
13 Mutation_Tracking This table contains the listing the mapping all of COSMIC's legacy mutations(COSMs) to the new genomic identifiers(COSVs). This file also helps to identify the transcripts and the accession numbers on which the current mutation is annotated on, along with the mutation type.
14 NCV This table contains all non-coding mutations from the current release.
15 Resistance_Mutations This table contains the details of all mutations in COSMIC which are known to confer drug resistance.
16 Sample This table contains all of the features related to a sample from the current release.
17 Structural_Variants This table contains all structural variants from the current release.
18 Transcripts This table contains the gene name and transcript accession for each gene ID.

Here, we can read about each table's purpose within COSMIC. As we discuss querying the dataset in the remainder of this notebook, we'll be interested in looking at two tables of clincal importance: Cancer_Gene_Census, a list of "high confidence" genes that have been strongly causally linked to certain types of tumors, and Sample, which contains all the features related to the hand-curated samples which comprise COSMIC.

Make a Standard SQL call to COSMIC in BigQuery

To make calls to BigQuery from an IPython notebook, we first need to direct the behavior of the IPython command line to accept SQL commands. This is done using a magic command, which is a kind of under-the-hood command that controls the behavior of the IPython line.

There are several magic commands devoted to BigQuery, which can be found here.

The magic command to initiate a SQL query to BigQuery is

%%bigquery --project <PROJECT_ID>

Because we're now performing a query, you will need to use your own project ID, as was previously discussed. You will need to update all code blocks with your project id in place of the PROJECT_ID in the magic commmand.

Querying example

After the magic statement, you can then follow with the SQL command. For example, we can query the Cancer_Gene_Census table within the COSMIC_v91_grch38 dataset to view a few genes and their associated somatic tumor types and mutation types:

In [ ]:
# Call to BigQuery with IPython magic command using the appropriate project
# Because we're querying, you'll need to update the magic command with
# your project id
%%bigquery --project PROJECT_ID

# Standard SQL code
SELECT # Select a few columns to view
    Gene_Symbol,
    Name,
    Tumour_Types_Somatic,
    Mutation_Types
  
FROM # From the Cancer_Gene_Census table within the COSMIC_v91_grch38 dataset
  `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`

LIMIT 5 # Limit to 5 rows; the table is huge and we only want a quick look

# Syntax for the above query
# SELECT * 
# FROM `project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS`
# Limit to the first 5 fields
Out[ ]:
Gene_Symbol Name Tumour_Types_Somatic Mutation_Types
0 ANK1 ankyrin 1 CCRCC Mis
1 BCLAF1 BCL2 associated transcription factor 1 melanoma, SCC Mis
2 CD209 CD209 molecule prostate carcinoma Mis
3 CNBD1 cyclic nucleotide binding domain containing 1 gastric cancer, colon cancer Mis
4 CRNKL1 crooked neck pre-mRNA splicing factor 1 base cell carcinoma Mis

But suppose you wanted to save the query result as an object so that you could do further analysis on them in Python. The table displayed above could be saved as a Pandas dataframe. There are two ways to go about this:

  1. You could include a variable name in the %%bigquery magic command, which will create a new variable with that name and save the query results in it as a dataframe. The object now exists in Python and you can reference it in subsequent code. Note, though, that the subsequent Python code has to be run separate from the SQL code, otherwise the magics will interpret the Python as SQL and throw an error. Here, we save the results in a new dataframe called my_table:
In [ ]:
%%bigquery my_table --project PROJECT_ID
#--------------------------------- SQL start
SELECT
    Gene_Symbol,
    Name,
    Tumour_Types_Somatic,
    Mutation_Types
  
FROM
  `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`

LIMIT 5 
#--------------------------------- SQL end
In [ ]:
my_table.head() #pandas command to view the first couple of rows of a dataframe
Out[ ]:
Gene_Symbol Name Tumour_Types_Somatic Mutation_Types
0 ANK1 ankyrin 1 CCRCC Mis
1 BCLAF1 BCL2 associated transcription factor 1 melanoma, SCC Mis
2 CD209 CD209 molecule prostate carcinoma Mis
3 CNBD1 cyclic nucleotide binding domain containing 1 gastric cancer, colon cancer Mis
4 CRNKL1 crooked neck pre-mRNA splicing factor 1 base cell carcinoma Mis

Or,

  1. You could query COSMIC by first storing the SQL query as a string, and then passing the stored query into a function that the client executes. The results are then returned as a dataframe. The benefit of this method is that you can insert a variable value in queries using the String.format() method. When we want to automate queries, you'll see this method used. The same query above would be written in this manner as:
In [ ]:
# construct the query as a SINGLE string, which can be formatted over \
# several lines for readability
QUERY = """
  SELECT Gene_Symbol, Name, Tumour_Types_Somatic, Mutation_Types
  FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`
  LIMIT 5
"""

# execute the query
query_job = client.query(QUERY) # creates client job from your query
df = query_job.result().to_dataframe() # .result() executes the job;
                                       # .to_dataframe() converts the result, 
                                       # a RowIterator object, to a pandas dataframe

# print the dataframe to compare to tabular output above
df.head()
Out[ ]:
Gene_Symbol Name Tumour_Types_Somatic Mutation_Types
0 ANK1 ankyrin 1 CCRCC Mis
1 BCLAF1 BCL2 associated transcription factor 1 melanoma, SCC Mis
2 CD209 CD209 molecule prostate carcinoma Mis
3 CNBD1 cyclic nucleotide binding domain containing 1 gastric cancer, colon cancer Mis
4 CRNKL1 crooked neck pre-mRNA splicing factor 1 base cell carcinoma Mis

And there you go! We've successfully queried the COSMIC data stored in BigQuery.

Visualizing COSMIC query results

Once we have a dataframe, we can perform a visualization of the query results. For this example, let's explore a different table within COSMIC -- the Sample table.

Sample contains all the information associated with a curated sample within the COSMIC database. A sample is an instance of a portion of a tumour being examined for mutations.

  • A number of samples can be taken from a single tumour
  • A number of tumours can be obtained from one individual

Supposed we wanted to visualize the distribution of the average ploidy of a sample. The ploidy of a cell is the number of sets of chromosomes it contains. Normal human cells are diploid, meaning they contain two sets of chromosomes. However, because cancer cells rapidly divide and often have errors in cell cycle regulation that have allowed them to become tumorous, cancer cells can be aneuploid, meaning they contain more or less than two sets of chromosomes. Aneuploid cancer cells can be more aggressive than diploid cancer cells, so knowing a tumor sample's average ploidy -- an average of the ploidies of all of the cancer cells in the sample -- can be clinically useful for determining the nature of a tumor, the patient's prognosis, and treatment plans.

Hence, it would be interesting and useful to visualize the distribution of average ploidies of all the samples in COSMIC. We can do this by creating a histogram which bins on average ploidy values.

Plotly

To create our visualizations, we'll be using a Python graphics library called Plotly. Plotly is easy to use and produces descriptive, versatile, and interactive charts that are visually pleasing as well.

This is what the first few rows of the average_ploidy column in the Sample table looks like:

In [ ]:
# SQL which gets the name and average ploidy of the first 10 samples for which \
# there is an average ploidy value (some of the rows are NULL)
QUERY = """
  SELECT sample_name, average_ploidy
  FROM `isb-cgc.COSMIC_v91_grch38.Sample`
  WHERE average_ploidy IS NOT NULL
  LIMIT 10
"""
ap_preview = client.query(QUERY).result().to_dataframe() # fetch the result dataframe
ap_preview.head(10) # view the dataframe
Out[ ]:
sample_name average_ploidy
0 JEG-3 3.195344
1 NCCIT 2.500285
2 HUTU-80 1.960884
3 MSTO-211H 3.489185
4 NCI-H2052 3.875562
5 NCI-H28 3.800372
6 A431 3.301722
7 A431 3.301722
8 HT-29 3.044550
9 NCI-H28 3.800372

Our strategy for creating a histogram of these average ploidies is not going to be as straightforward as simply taking the column and plotting it. We're only viewing 10 of the 1,457,206 total rows. It would be incredibly taxing, and perhaps impossible, to feed all 1.4 million rows into Plotly's histogram function.

Instead, we have to write SQL code to sort the ploidy column into value bins (0-1, 1-2, 2-3, etc.):

In [ ]:
%%bigquery avg_ploi_1 --project PROJECT_ID

SELECT
  CASE # create a different bin for each range that ploidy falls between
    WHEN average_ploidy >= 0 and average_ploidy < 0.5 THEN '0.0-0.5' 
    WHEN average_ploidy >= 0.5 and average_ploidy < 1 THEN '0.5-1.0'
    WHEN average_ploidy >= 1 and average_ploidy < 1.5 THEN '1.0-1.5'
    WHEN average_ploidy >= 1.5 and average_ploidy < 2 THEN '1.5-2.0'
    WHEN average_ploidy >= 2 and average_ploidy < 2.5 THEN '2.0-2.5'
    WHEN average_ploidy >= 2.5 and average_ploidy < 3 THEN '2.5-3.0'
    WHEN average_ploidy >= 3 and average_ploidy < 3.5 THEN '3.0-3.5'
    WHEN average_ploidy >= 3.5 and average_ploidy < 4 THEN '3.5-4.0'
    WHEN average_ploidy >= 4 and average_ploidy < 4.5 THEN '4.0-4.5'
    WHEN average_ploidy >= 4.5 and average_ploidy < 5 THEN '4.5-5.0'
    WHEN average_ploidy >= 5 and average_ploidy < 5.5 THEN '5.0-5.5'
    WHEN average_ploidy >= 5.5 and average_ploidy < 6 THEN '5.5-6.0'
    ELSE 'NA'
  END as ploidy_bin, COUNT(*) as count # count the number of rows per each ploidy bin
FROM `isb-cgc.COSMIC_v91_grch38.Sample` 
WHERE average_ploidy IS NOT NULL # don't count any rows without values 
GROUP BY ploidy_bin # necessary if doing count for each ploidy bin
ORDER BY ploidy_bin ASC; # display bins in order of ascending ploidy value

Note: We can then view the resulting table as we have before, with head(), or we can alternatively use a Python library called Tabulate, which offers several different styles of nicely formatted outputs for tabular data!

In [ ]:
from tabulate import tabulate # import the tabulate command from the tabulate library

# view the new table of bins, avg_ploi_1
# headers='keys' means that the column headers for the table will be the 
# column names of the Pandas dataframe
print(tabulate(avg_ploi_1, headers='keys', tablefmt='simple'))
    ploidy_bin      count
--  ------------  -------
 0  1.0-1.5            10
 1  1.5-2.0          4972
 2  2.0-2.5          4227
 3  2.5-3.0          2622
 4  3.0-3.5          2357
 5  3.5-4.0          2186
 6  4.0-4.5           799
 7  4.5-5.0           291
 8  5.0-5.5           157
 9  5.5-6.0            15

Then, we can use this new table to create a bar plot of the bins vs. counts using Plotly. We can format it so that it reads as a histogram:

In [ ]:
import plotly.express as px # import the plotly library

# first, we create the figure object as a barplot
# 'labels' maps the column names of our dataframe to the label we'd like to display 
fig1 = px.bar(avg_ploi_1, x='ploidy_bin', y='count', 
              labels={'ploidy_bin':'Average Ploidy of Sample', 'count':'Number of Samples'})

# next, we make sure there is no gap between bars to make it more histogram-like
# and we give the plot a title
fig1.update_layout(bargap=0, 
                   title_text='Distribution of Average Ploidies Across All COSMIC Samples')

# finally, we change the fill color and outline color of the bars, and set the 
# outline width and fill opacity
fig1.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)

# 'show()' displays the final figure
fig1.show()

Now we can visualize the distribution of average ploidies across the samples which make up the COSMIC database! As we can see, almost all samples have an avg. ploidy of at least 1.5, with the median being somewhere between 1.5 and 2.5. We might expect this knowing that somatic human cells are diploid (a ploidy of 2.0). It's also interesting to see how high the ploidy can go with some samples, even up to at least 5.5.

Most importantly, we see that aneuploid tumor cells almost always have too many sets of chromosomes, rather than too little, as indicated by the right skew of the histogram.

If we wanted greater resolution on the histogram, we could've also chosen another method to construct the bins:

In [ ]:
# Instead of sorting into predefined value ranges, we'll round the average ploidy column \
# to the first decimal using the ROUND function. \
# Then, we'll group the rows by their rounded value, producing many more bins, and count \
# the rows in each bin the same as before.
%%bigquery avg_ploi_2 --project PROJECT_ID

SELECT ROUND(average_ploidy, 1) AS ploidy_bin, COUNT(*) AS count 
FROM `isb-cgc.COSMIC_v91_grch38.Sample` 
WHERE average_ploidy IS NOT NULL
GROUP BY ploidy_bin
ORDER BY ploidy_bin;
In [ ]:
# view avg_ploi_2 using tabulate
print(tabulate(avg_ploi_2.head(10), headers='keys', tablefmt='simple'))
print("\n ({} more rows)".format(avg_ploi_2.size-10))
      ploidy_bin    count
--  ------------  -------
 0           1.2        1
 1           1.3        1
 2           1.4        4
 3           1.5       27
 4           1.6      104
 5           1.7      270
 6           1.8      681
 7           1.9     1866
 8           2       3686
 9           2.1     1221

 (84 more rows)
In [ ]:
# just like the last figure, create a barplot to represent avg_ploi_2
fig2 = px.bar(avg_ploi_2, x='ploidy_bin', y='count', 
              labels={'ploidy_bin':'Average Ploidy of Sample', 'count':'Number of Samples'})
fig2.update_layout(bargap=0, 
                   title_text='Distribution of Average Ploidies Across All COSMIC Samples')
fig2.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)
fig2.show()

Here, we can see that the data is really much more modal than we thought. A significant portion of samples do indeed have average ploidies at or around 2.0. We also get a better idea of the true shape of the distribution, and we see that there is perhaps a second, smaller mode at 3.0. This might indicate that in aneuploid tumors, there is a slight tendency to retain the ploidy as tumor cells proliferate, since an individual cell's ploidy is usually a whole number set of chromosomes and the next aneuploidy greater than 2 is 3. If one of the original tumor cells was triploid, then the daughter tumor cells may carry the triploidy throughout the growth of the tumor.

These are the kinds of insights we get with using smaller bins, i.e. taking the visualization one step further, which is really easy to do with Plotly.

Following COSMIC notebooks will explore the COSMIC data further and highlight some interesting analyses and visualizations that can be produced using Python and BigQuery!

Where to Go Next

Explore, Discover, and Analyze the Data provided by ISB-CGC along with side by side with your own :)

More COSMIC Notebooks coming soon to ISB-CGC! Check the Github repo for the latest notebooks.

ISB-CGC Links:

Google Tutorials: