Open In Colab

ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   Exploring the COSMIC Cancer Gene Census
Author:  Akshay Balaji
Created: 2020-07-08
Purpose: Understand the information stored in COSMIC's Cancer Gene Census dataset and ways it can be analyzed.
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Exploring_COSMICs_Cancer_Gene_Census_table.ipynb
Notes:

Exploring COSMIC's Cancer Gene Census table

This notebook introduces users to the Cancer_Gene_Census table within the COSMIC dataset in the ISB Cancer Genomics Cloud resource.

Goals:

  • Understand the table's schema - its fields (columns) and the types of data they hold
  • Answer interesting questions about the data by performing analyses in SQL & Python
  • Produce visualizations of our analyses

Note: Because our cloud resource is often abbreviated to "CGC", we will avoid using such an abbreviation for the Cancer Gene Census, which is COSMIC's product.

Login to Google Cloud and create notebook client to BigQuery

As usual, we'll start by authenticating ourselves in the Google Cloud. Then, we'll load the BigQuery API, import the Pandas library to work with dataframes, and create our client.

In [ ]:
from google.colab import auth
try:
  auth.authenticate_user()
  print('You have been successfully authenticated!')
except:
  print('You have not been authenticated.')
In [ ]:
from google.cloud import bigquery
import pandas as pd
from IPython.display import display, display_pretty

PROJECT_ID = 'your project id'
client = bigquery.Client(PROJECT_ID)

Viewing the Cancer Gene Census table schema

Next, we'll introduce ourselves to the Cancer_Gene_Census table by taking a look at its schema -- the names and data types of its fields. To do this, we'll actually pull the table directly into Python using the BigQuery client function get_table(). Next, we'll obtain the field names and types from the schema attribute of the table object using Python's lambda and map() functions.

lambda creates a mini function of our own choosing using the format

lambda <INPUT>: <OUTPUT>

For example,

lambda str: len(str)

takes a string and outputs its length.

map() takes a function and applies it to a list of inputs using the format

map(<FUNCTION>, <LIST OF INPUTS>)

Building on the previous example,

map(lambda str: len(str), str_list[])

takes the string function we defined above and applies it to a list of strings, returning the lengths of every string in the list.

Similarly, we can apply a function that fetches name and field type to every element of the table schema, which consists of field objects:

In [ ]:
table_ref = 'isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census'
table = client.get_table(table_ref)

print('The current table is {}'.format(table.table_id))
print('-------------------------------------')

# investigate the table
fieldNames = list(map(lambda tsf: tsf.name, table.schema))
fieldTypes = list(map(lambda tsf: tsf.field_type, table.schema))

view_num = 10 # view the first 10 fields of the table
print("This table has {} fields. ".format(len(fieldNames)))
print("The first {} field names and types are: ".format(view_num))
for i in range(view_num):
  print("{} - {}".format(fieldNames[i], fieldTypes[i]))

Investigating the tumor type column Tumour_Types_Somatic

First, let's visualize the actual table data to see what the Tumour_Types_Somatic column looks like.

Note: you will need to update 'your_project_number' with your project number before continuing with the notebook

In [ ]:
%%bigquery --project YOUR_PROJECT_ID

SELECT 
  Gene_Symbol,
  Tumour_Types_Somatic
FROM
  `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`
LIMIT 5;

As we can see, each gene gets a single entry containing all of its somatic tumor associations. For example, the 5 rows above tell us that the gene BCLAF1 is associated with melanoma and SCC (squamous cell carcinoma).

Selecting all genes for a specific tumor type

  • Selecting for values within string data

Our first example will be to identify all the genes in the Cancer Gene Census that are associated with a specific tumor type. For example, if we wanted to know all the genes that have been linked to gliomas, we could view the table to see that gliomas are represented in the column as 'glioma', and search on that. The challenge is that in this table, like future tables you may encounter, the tumor associations for a single gene are all written out in a single table cell as one long string. How do we search for specifically 'glioma' within each gene's string in SQL? The trick is to use SQL's LIKE operator, which can match strings within other strings using the % symbol as a wildcard for other characters.

WHERE string1 LIKE "%pattern%"

will identify all strings in the string1 column which contain pattern, since the first % means that any characters can come before pattern, and the last % means that any characters can come after pattern.

We have to be careful using this on Tumour_Types_Somatic, however, because some tumor types could be found in the names of other more specific tumor types. For example, there's SCC, and then there's oral SCC. If we do "%SCC%", we'll match the oral SCC rows as well, which we don't want to do. So instead, we need to match for individual list entries, which will be delineated by either a following comma, if the type is at the end of the list, a leading space. We use this logic to produce the following query for 'glioma', and turn the tumor type into a variable so that users can simply change the variable name to search for other tumors.

Since we use a variable to represent the tumor type, we have to write the query as a string in Python and format the string afterwards, rather than writing a direct SQL script.

In [ ]:
# Select all genes for a specific tumor type
# Take user input for what kind of tumor type interested in

from tabulate import tabulate # import tabulate for compact table output

tumor_type = 'glioma' # this variable can be changed to the tumor type you want to search for!

QUERY = '''
    SELECT Gene_Symbol, Name 
    FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census` 
    WHERE Tumour_Types_Somatic LIKE "% {}%"
    OR Tumour_Types_Somatic LIKE "%{},%" 
'''.format(tumor_type, tumor_type)

df = client.query(QUERY).result().to_dataframe()

print(tabulate(df.head(10), headers="keys", tablefmt = 'simple'))
print("\n ({} more rows)".format(df.size-10))

This is nice, but you might have thought to yourself -- "I wish I could see a list of all the tumor type options in the column so I'd know exactly what to search for!" In other words, we wish that we could obtain a list of all the unique tumor types in the column, which leads us to our next section.

Isolating the unique tumor types and counting their associations in the table

  • Splitting a list-like string entry into its individual items
  • Working with table entries that are arrays

The significance of seeing the unique tumor types is not only logistical, it's clinical. When we isolate the tumor types, we should also be able to count the number of times they appear in the Cancer Gene Census -- in other words, count the number of associations they have to high-confidence cancer genes. This can give us an indication of which kinds of tumors are most thoroughly linked to specific genetic causes thus far.

The fact that the tumor types are all listed out together in strings makes obtaining a unique set of them complicated. We can't simply use GROUP BY or even DISTINCT() on the column, because two entries like "Melanoma, SCC" and "SCC, Melanoma" will show up as different values, when in reality they both contain two tumor types: "Melanoma" and "SCC". Instead, we have split each string entry into the individual tumors, and then group on these.

BigQuery SQL has two functions that can help us out here: SPLIT() and UNNEST().

SPLIT(string, delimiter) will take a string and split it into an array of smaller strings, splitting by whatever character(s) you provide as the delimiter. For example, if you have a list "A, B, C", you would want to use ", " as your delimiter to obtain ["A","B","C"]. Using just "," with no space will result in ["A"," B"," C"], which has those extra white spaces that we don't want.

UNNEST(array) will take an array and return a table containing one row for each element in the array. Thus, if a column test contains entries that are arrays of strings, running UNNEST(test) will return a table with one column containing every single string from every single array, each in its own row.

So the idea here is that we run SPLIT() on the tumor type column, and then UNNEST() this column to obtain a single column with every individual tumor type listed in the Cancer Gene Census. From here, we can GROUP BY to obtain a unique set of all the tumor types, and in the process we can run COUNT() as well to get counts of how many times each tumor type appears, i.e. how many gene associations each tumor type has! This leads us to the following script:

In [ ]:
%%bigquery unique_tumor_types --project YOUR_PROJECT_ID

-- Getting the unique somatic tumor types and the number of genes each is associated with
-- the Tumor_Types_Somatic column

WITH
split_tb AS (
  SELECT SPLIT(Tumour_Types_Somatic, ", ") AS tts # MUST have delimiter as ", "
  FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census` 
) 
# Even though we don't have a column named'tumor_type' yet, we're going to create it 
#   in the 'FROM' statement and select directly from it
# To use UNNEST properly, we have to also select the data table that we're unnesting from
SELECT tumor_type, COUNT(*) AS gene_count 
FROM split_tb, UNNEST(split_tb.tts) AS tumor_type
GROUP BY tumor_type
ORDER BY gene_count DESC; # order the table from most gene associations to least

And now, we display the table of results with the option to view any number of rows you want:

In [ ]:
# Displaying the first 15 tumor types and their gene counts

view_num = 15;  # set how many rows you want to view
print(tabulate(unique_tumor_types.head(view_num), headers='keys', tablefmt='simple'))
if len(unique_tumor_types) <= view_num: 
  print("\n Showing all rows")
else:
  print("\n ({} more rows)".format(len(unique_tumor_types.index) - view_num)) # display no. of unseen rows

And voila! We have now isolated the unique set of tumor types and have counts for their numbers of associations as well. We might want to plot this data to visualize how the gene associations are distributed across tumor types, leading us to our next section.

Visualizing the gene association counts for the different tumor types

  • Using Plotly to create a barplot from a Pandas dataframe

This last step is fairly straightforward, in comparison to the last two sections. We can use the Plotly module plotly.express to create clean, beautiful barplot of our table of tumor type association results above.

In [ ]:
# Plotting the above results as a bar plot, using Plotly
import plotly.express as px # Plotly Express module within Plotly is good for 

view_num=15
fig1 = px.bar(unique_tumor_types.head(view_num), x='tumor_type', y='gene_count', 
              labels={'tumor_type':'Tumor Type', 'gene_count':'Number of Genes'}, 
              text='gene_count')
fig1.update_layout(title_text='Number of Gene Associations for Different Types of \
Tumors in the Cancer Gene Census') # title of the plot
fig1.update_traces(marker_color='rgb(0, 92, 215)', marker_line_color='rgb(0, 92, 215)',
                  marker_line_width=1.5, opacity=1)
fig1.show() # command to display plot

Clinical Discussion

The barplot above shows the 15 most associated types of tumors in the Cancer Gene Census. AML, acute myeloid leukemias, are the most associated by a significant margin. There's a steep drop until NSCLC, non-squamous cell lung cancer, and then a more gradual decline from there. This could possibly be due to the many kinds of acute myeloid leukemias that are known, allowing it more thoroughly researched and associated with genes. In addition, there's significant overlap between these top 15 tumor types and this list of common cancer types, as reported by the NCI. What this graph probably shows, then, is how research resources are distributed across different kinds of tumors.

Some other tumors in the barplot:

  • T-ALL: T-cell acute lymphoblastic leukemia
  • ALL: Acute lymphoblastic leukemia
  • Spitzoid tumor: a type of rare skin growth that typically impacts young patients
  • MDS: Myelodysplastic syndrome, a form of bone marrow cancer
  • NHL: Non-Hodgkin lymphoma
  • CLL: Chronic lymphocytic leukemia
  • DLBCL: Diffuse large B-Cell lymphoma

Investigating the mutation type column Mutation_Types

Like with the last exploration, let's first see what Mutation_Types actually looks like.

In [ ]:
%%bigquery --project YOUR_PROJECT_ID

SELECT 
  Gene_Symbol,
  Mutation_Types
FROM
  `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`
LIMIT 10;

As we can see, once again the mutation types associated with each gene are listed out as longer strings, so working with this data is going to be very similar to working with Tumour_Types_Somatic. The one catch is that Mutation_Types is not consistent in separating each of its mutations with ", ", perhaps because of how the data was entered during the curation process. We have to still be able to deal with this data effectively, so we'll attempt to repeat the unique value analysis we did last time with this mutation datat.

Isolating the unique mutation types and counting their associations in the table

  • Using regular expressions to split string lists

We know the Mutation_Types data isn't consistent with its delimiters because in the creation of this notebook, we tried to perform the exact same analysis as we did with Tumour_Types_Somatic, and we failed. This is what happens:

In [ ]:
%%bigquery --project YOUR_PROJECT_ID

WITH
split_tb AS (
  SELECT SPLIT(Mutation_Types, ", ") AS mt
  FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census` 
) 

SELECT mutation_type, COUNT(*) AS gene_count 
FROM split_tb, UNNEST(split_tb.mt) AS mutation_type
GROUP BY mutation_type
ORDER BY gene_count DESC; 

You can see that some of the entries have been separated by , with no space, or ;, or even .. What this means is we can't rely on using a delimiter to extract the individual mutation types. Instead, we have to select them using a more advanced string-matching technique: regular expressions (regexes). Once you learn regexes, they're an incredibly powerful tool for working with strings. Until then, however, they can be quite confusing and frustrating. We don't want to turn this into a regex tutorial, but a great website to learn quickly and thoroughly about using regexes is RegExr, which allows you to even run regex code on your own text in real-time. It was even used to test the regex used in the script that you'll see shortly!

BigQuery SQL has a function called REGEXP_EXTRACT_ALL() which allows you to search for specific strings within other strings, sans-delimiter.

REGEXP_EXTRACT_ALL(string, regex) takes a string and returns an array containing all the matches in that string to the regex.

Hence, if we can write a regex that will identify specifically the mutation names, which are alphabetical, and leave out the separating characters between the names, which are non-alphabetical, then we can continue with our analysis successfully!

The appropriate regex to extract just the words, i.e. the alphabetical parts, of a string is:

r'[A-Za-z]+'

# r -- indicates that the regex string is raw, so every 
#   character is just that - a raw character
# [] -- indicate that we're searching for characters 
#    meeting the criteria in the brackets
# A-Za-z -- indicates that our criteria is any character 
#        that is alphabetical, uppercase or lowercase
# + -- indicates we want to group several consecutive 
#   matches, i.e. letters, together into a single string 
#   to retrieve the whole word

So running REGEXP_EXTRACT_ALL(Mutation_Types, r'[A-Za-z]+'), will return the Mutation_Types column with every entry converted into an array of the individual mutations, just as SPLIT() did, and then we can perform the same UNNEST() function with a GROUP BY to get the unique values and their counts:

In [ ]:
# Frequencies of mutation types + what these mutations mean (same thing like with tumor types)

%%bigquery unique_mutation_types --project YOUR_PROJECT_ID

WITH
split_tb AS (
  SELECT REGEXP_EXTRACT_ALL(Mutation_Types, r'[A-Za-z]+') AS mt 
  FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census` 
) 
SELECT mutation_type, COUNT(*) AS gene_count
FROM split_tb, UNNEST(split_tb.mt) AS mutation_type
GROUP BY mutation_type
ORDER BY gene_count DESC;
In [ ]:
view_num = 15;  # set how many rows you want to view
print(tabulate(unique_mutation_types.head(view_num), headers='keys', tablefmt='simple'))
if len(unique_mutation_types) <= view_num: 
  print("\n Showing all rows")
else:
  print("\n ({} more rows)".format(len(unique_mutation_types.index) - view_num))

There we go! We've successfully dealt with the inconsistencies in the database, which may be useful for future data you work with, especially if it is manually curated and naturally prone to some human error. The next step is to visualize the results.

Visualizing the gene association counts for the different mutation types

In [ ]:
view_num=15
if view_num > len(unique_mutation_types.index):
  print("Fewer than {} categories; showing all".format(view_num))

fig2 = px.bar(unique_mutation_types.head(view_num), x='mutation_type', y='gene_count', 
              labels={'mutation_type':'Mutation Type', 'gene_count':'Number of Genes'}, 
              text='gene_count')
fig2.update_layout(title_text='Number of Gene Associations for Each Mutation Type in the Cancer Gene Census') # title of the plot
fig2.update_traces(marker_color='rgb(0, 92, 215)', marker_line_color='rgb(0, 92, 215)',
                  marker_line_width=1.5, opacity=1)
fig2.show() # command to display plot

Clinical Discussion

Before discussing the clinical implications of the plot above, it would probably be helpful to know what each of these mutation abbreviations means!

Abbreviation Mutation Type
T Translocation
Mis Missense
N Nonsense
F Frameshift
S Splice site
D Large deletion
O Other
A Amplification
M Mesenchymal
(Promoter) Mutation in promoter site

The large number of missense mutations might be expected, since those are the most likely deleterious mutation to occur. However, it's interesting that so many of the genes also have been observed to have translocation mutations causing certain cancers, and perhaps this makes sense, since translocation mutations move entire sections of DNA around and thus tend to be quite serious.

For further reference, COSMIC's documentation explains all of its acronyms here.

The presence of 3 "mesenchymal" mutations is surprising, since that's not a typical mutation classification. After checking the COSMIC Mutant table, we found that the census genes which had these mesenchymal mutations -- CBLC, PRF1, and SDHAF2 -- actually had a majority missense mutations but no documented mesenchymal ones, meaning that the "M" could have been an accidental mistypying of "Mis". Hence, it's useful to know how to identify a gene's specific mutations recorded in COSMIC, a skill we'll go over in our notebook about the Mutant data table!

Filtering census genes by location using the Genome_Location column

Note: This skill is explained in much more detail in the "Working with Genomic Coordinates" notebook!

As always, let's first see what Genome_Location looks like.

In [ ]:
%%bigquery --project YOUR_PROJECT_ID

SELECT
  Gene_Symbol,
  Genome_Location
FROM
  `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`
LIMIT
  5;

Extracting the location data from the Genomic Location string

As we can see, the genomic location is stored as a string containing "chromosome:start base-stop base". If we want to be able to filter for location, it will be helpful to break this string up into its chromosome, start, and stop components as separate columns that can be queried. To do that, we use two SQL functions: SUBSTR, which substrings the string to extract the information which is separated by either : or -; and STRPOS, which helps SUBSTR identify the location of the ; and - so it can count the appropriate number of characters to extract.

The following code gets the chromosome, which is everything before the :; start, which is everything between : and -; and stop, which is everything after -.

After that, the chr, start, and stop columns will be almost useable -- the only issue is that start and stop are still string types, since SUBSTR only creates a smaller string. It's fine for chr to be a string, since we'd want to search for a coordinate range within a single chromosome and can simply match on the chromosome string. However, if we want to compare start and stop coordinates, they have to be integer type, int64, so that the numeric comparisons can be performed correctly (e.g. the string "200" would come after the string "10000"). Hence, we have to also cast the start and stop columns as int64 using CAST and filter out any NULL or blank string rows first.

Once we've casted, we can then filter the genes first based on their chromosome, then on the range that either their start or stop coordinate fall between. In this strategy, we include genes that have any overlap with our region of interest, not exclusively genes with complete overlap.

Here, we have a script which does all of the above and searches for genes within the region of chromosome 1, base pairs 0 to 10,000,000 (i.e. 1:0-10000000):

In [ ]:
%%bigquery --project YOUR_PROJECT_ID

# searching for all genes within predefined range, such as 1:0-3000000
WITH loc_split AS(
    SELECT Gene_Symbol, 
      Genome_Location, 
      SUBSTR(Genome_Location, 1, STRPOS(Genome_Location, ":")-1) as chr,  # subtring from the first char to the char before the ":"
      # substring from char after ":" to the char right before "-" as the first, or 'start', coordinate
      SUBSTR(Genome_Location, STRPOS(Genome_Location, ":")+1, 
        STRPOS(Genome_Location, "-")-STRPOS(Genome_Location, ":")-1) as start,
      # substring from char after "-" to the very end as the second, or 'stop', coordinate
      SUBSTR(Genome_Location, STRPOS(Genome_Location, "-")+1, 
        LENGTH(Genome_Location)-STRPOS(Genome_Location, "-")) as stop
    FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`
  ),

  loc_split_final AS(
      SELECT Gene_Symbol, Genome_Location as Genome_Loc_str, chr, 
        SAFE_CAST(start as int64) as start, SAFE_CAST(stop as int64) as stop  # cast values as appropriate data types
      FROM loc_split 
      WHERE SAFE_CAST(start as int64) IS NOT NULL and SAFE_CAST(stop as int64) IS NOT NULL  # select rows where cast works properly
  )

SELECT Gene_Symbol, Genome_Loc_str, chr, start, stop
FROM loc_split_final 
WHERE (chr LIKE "1")  # chr column matches chromosome 1
    AND ((0 <= start AND start <= 10000000) OR (0 <= stop AND stop <= 10000000)); # start or stop values fall within region of interest

Allowing user input to search by custom location

We can see that we've successfully filtered for the genes within this region, returning two census genes, TNFRSF14 and SKI. But what if we didn't want to have to keep manually editing the script to select for different regions? What if we could enter in our region of interest as user input instead?

We can incorporate user input into our filtering process by working in Python. After using input() to request a region of interest, we can separate this "region of interest" string into its own chr, start, and stop components using String.split().

Once we have this information, we can use String.format() to insert these values into a SQL query string of the exact same form as the one above, just with placeholders for the chr and the start/stop range. This yields the following script:

In [ ]:
# Genomic coordiantes must be in format chr:base_pair
# Eg. 1:1000-2000

# Ask for user input
roi = input("Region of interest (must be in the format chr:start_basepair-stop_basepair): \n")
region = roi.split(sep=":") # split input into chromosome and start-stop sections

try:
  chr_interest = str(region[0]) # extract chromosome from the split
  # take the start-stop portion of split and split further by the "-" character to separate start and stop values
  start_interest = int(region[1].split(sep="-")[0])
  stop_interest = int(region[1].split(sep="-")[1])  

  QUERY = '''
    with loc_split as(
      SELECT Gene_Symbol, 
        Genome_Location, 
        SUBSTR(Genome_Location, 1, STRPOS(Genome_Location, ":")-1) as chr,
        SUBSTR(Genome_Location, STRPOS(Genome_Location, ":")+1, STRPOS(Genome_Location, "-")-STRPOS(Genome_Location, ":")-1) as start,
        SUBSTR(Genome_Location, STRPOS(Genome_Location, "-")+1, LENGTH(Genome_Location)-STRPOS(Genome_Location, "-")) as stop
      FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`
    ),

    loc_final as(
      SELECT Gene_Symbol, Genome_Location as Genome_Loc_str, chr, 
        SAFE_CAST(start as int64) as start, SAFE_CAST(stop as int64) as stop 
      FROM loc_split 
      WHERE SAFE_CAST(start as int64) IS NOT NULL and SAFE_CAST(stop as int64) IS NOT NULL
    )

    SELECT Gene_Symbol, Genome_Loc_str, chr, start, stop
    FROM loc_final 
    WHERE (chr LIKE "{}") AND (({} <= start AND start <= {}) OR ({} <= stop AND stop <= {}))
    ORDER BY start ASC;

  '''.format(chr_interest, start_interest, stop_interest, start_interest, stop_interest)

  genes_in_range = client.query(QUERY).result().to_dataframe()

  display(genes_in_range)
  #print(tabulate(genes_in_range, headers='keys', tablefmt='simple'))

except:
  print("Please enter a location in the appropriate format.")

Making the gene-filtering process interactive

Now we've successfully added user input! As a last extension, what if we wanted to make this user input continuous and fast? It would be nice, for example, to simply be able to select a chromosome from a drop-down list and have the script display all the genes from that chromosome.

To add this level of interactivity, we can use an IPython feature called widgets. Widgets can be drop-downs, radio buttons, text boxes, and a whole other array of interactive features that can be linked to functions. Changing the value of a widget can change the output of a function, allowing users to create quick, interactive outputs. The widgets documentation can be found here.

Hence, we can create a drop-down widget for the value of the chromosome we're interested in, and link this widget to a function which performs the genomic coordinate filtering query using the chromsome value selected.

In fact, we can make the genomic coordinate filter as interactive as we want! For example, if we want to take advantage of other information in the Cancer_Gene_Census table, we can create another widget linked to the filtering function which will apply a second criteria to the WHERE statement to select for only those genes that are hallmark (Hallmark value is 'Yes').

It might also be helpful to be able to view as few or as many rows as we want. We can create a third widget, this one a text box, which will allow us to set a row view number on the query result. This widget, also linked to the filtering function, would act on the Python code following the SQL query.

Hence, the resulting script incorporating these widgets is:

In [ ]:
import ipywidgets as widgets  # import the widgets library
from ipywidgets import interact, interactive, interact_manual, fixed  # import certain widgets functions to use directly

def get_genes(chr_of_interest, hallmark, view_num):
  """
  This function is going to take a user-input chromosome via a dropdown widget and run 
  a modified version of the genomic coordinate-splitting SQL query which simply selects 
  all genes whose 'chr' values match that given chromosome (and thus, we omit the whole 
  start and stop comparison).
  """

  QUERY = '''
    with loc_split as(
          SELECT 
            Gene_Symbol,  
            Genome_Location,
            Chr_Band,
            Hallmark,
            SUBSTR(Genome_Location, 1, STRPOS(Genome_Location, ":")-1) as chr,
            SUBSTR(Genome_Location, STRPOS(Genome_Location, ":")+1, STRPOS(Genome_Location, "-")-STRPOS(Genome_Location, ":")-1) as start,
            SUBSTR(Genome_Location, STRPOS(Genome_Location, "-")+1, LENGTH(Genome_Location)-STRPOS(Genome_Location, "-")) as stop
          FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census`
        ),

        loc_final as(
          SELECT 
            Gene_Symbol,  
            Genome_Location as Genome_Loc_str, 
            Chr_Band,
            Hallmark,
            chr,
            SAFE_CAST(start as int64) as start, 
            SAFE_CAST(stop as int64) as stop 
          FROM loc_split 
          WHERE SAFE_CAST(start as int64) IS NOT NULL and SAFE_CAST(stop as int64) IS NOT NULL
        )

        SELECT Gene_Symbol, chr, start, stop, Genome_Loc_str, Chr_Band, Hallmark
        FROM loc_final 
        WHERE (chr LIKE "{}") {}
        ORDER BY start ASC;

    '''
  if hallmark == True:  # create the option to view just the hallmark genes
    QUERY = QUERY.format(chr_of_interest, 'AND Hallmark LIKE "Yes"')
  else:
    QUERY = QUERY.format(chr_of_interest, '')


  genes_in_chr = client.query(QUERY).result().to_dataframe()  

  try:  # we display using try-except in case view_num happens to be NULL
    display(genes_in_chr.head(int(view_num))) 
  except:
    display("Please enter a number of rows to view") # in case of an error, will tell user to define view_num

style = {'description_width': 'initial'}  # so that the descriptions for each widget are fully displayed
chr_chooser = widgets.Dropdown( # create the chromosome drop-down widget
    options = ['1','2','3','4','5','6','7','8','9','10',
              '11','12','13','14','15','16','17','18','19','20',
              '21','22','X'], # our options are the 23 chromosomes in the Cancer Gene Census
    
    description = 'Chromosome:',
    style = style 
)

show_rows = widgets.Text( # create the row display text box widget
    value='10', # program will auto-display 10 rows at the start, as default
    placeholder='Enter number of rows', # text that shows when text box is empty
    description='Show how many rows:', 
    style = style
)

hallmark_toggle = widgets.Dropdown( # create the hallmark option widget as a 2-option drop-down
    options=[('Yes', True), ('No, show all', False)], # associate the Yes and No options with True and False values
    # when 'Yes' selected, hallmark_toggle widget will input a True value into the filtering function
    description='Select hallmark genes?',
    style=style,
    value=False # program will auto-display all rows at the start, as default
)

# interactive(f, args) is the command used to link the widgets to the function! 
# it will auto-initiate the widgets and run the function with the widgets' default values
interactive(get_genes, chr_of_interest = chr_chooser, hallmark = hallmark_toggle, view_num = show_rows)

We have now taken a static filtering query and made it much more interactive, creating a tool that can be used to browse the census genes much more freely. This technique can be very useful for future explorations of the COSMIC data!

Looking up HGNC ID's for census genes

A last application of the Cancer Gene Census actually integrates another table, HGNC. Suppose we wanted to do further research into a cohort of genes we had assembled through a query, such as the one above querying for all the genes on a specific chromosome. It might be useful to know the genes' IDs in multiple databases, and COSMIC contains the table HGNC which stores the HUGO Gene Nomenclature Committee (HGNC) ID for every gene in COSMIC.

How can we quickly match the HGNC IDs to the census genes in our query result? We can perform the SQL JOIN command, which joins the rows of two tables together based on a matching column between the two tables. If we can identify a column in the Cancer_Gene_Census table which matches a column in HGNC, then we can perform a join and bring in the HGNC.HGNC_ID column's info to the census genes. Luckily, Cancer_Gene_Census has a column called Entrez_GeneId which corresponds to the Entrez_id column in HGNC, so we can perform the join and match the appropriate HGNC rows to the census genes. The following is a script which demonstrates the JOIN between HGNC and the whole Cancer_Gene_Census table. In addition to bringing in the HGNC ID's, we also bring in a column from HGNC which contains whether a gene is a census gene or not, so that we can verify the JOIN is matching appropriately:

In [ ]:
%%bigquery --project YOUR_PROJECT_ID

SELECT t1.Gene_Symbol as Census_Gene, t1.Entrez_GeneId as Entrez_ID, t2.HGNC_ID, t2.Cancer_census 
FROM `isb-cgc.COSMIC_v91_grch38.Cancer_Gene_Census` as t1
JOIN `isb-cgc.COSMIC_v91_grch38.HGNC` as t2
ON t1.Entrez_GeneId = t2.Entrez_id;

As you can see, we now have an HGNC_ID column for each of the census genes in the table. In addition, we see that every visible row in the Cancer_census column displays "y", indicating that our JOIN performed correctly!

Where to Go Next

Explore, Discover, and Analyze the Data provided by ISB-CGC along with side by side with your own :)

More COSMIC Notebooks coming soon to ISB-CGC! Check the Github repo for the latest notebooks.

ISB-CGC Links:

Google Tutorials: