This notebook will use the Clustergrammer-Widget to visualize the Cancer cell line Encyclopedia gene expression data (Broad-Institute CCLE). The CCLE project measured genetic data from over 1000 cancer cell lines. We'lll use Clustergrammer-Widget to visualize the data. We will start by importing required libraries and initializing the Clustergrammer Network object:
import pandas as pd from clustergrammer_widget import * net = Network(clustergrammer_widget)
We are using a slightly reformatted version of the CCLE gene expression data with modified cell line meta-data (category) formatting. You can see below how cell-line categorical information (e.g. tissue) information is encoded as column tuples. The matrix has 18,874 rows (genes) and 1,037 columns (cell-lines).
net.load_file('../original_data/CCLE.txt') ccle = net.export_df() print(ccle.shape)
The CCLE dataset is too large to directly visualize using the Clustergrammer-Widget, but we can directly visualize tissue-specific subsets of the data. Here we will visualize the top 250 most variable genes in bone cancers, which will help us identify subsets within bone cancer. We will also initialize the visualization with enrichment for Gene Ontology terms using Enrichr and the
# filter for columns with category 1 bone net.filter_cat('col', 1, 'tissue: bone') # keep top 250 most variable genes net.filter_N_top('row', 250, 'var') # normalize gene expression across bone cancers net.normalize(axis='row', norm_type='zscore') # perform enrichment analysis on genes net.enrichrgram('GO_Biological_Process_2015') net.cluster() net.widget()
Osteosarcoma, Giant Cell Tumor, and Chondrosarcoma cluster separately from Ewings sarcoma-peripheral-primitive-neuroectodermal-tumors.
We can see that the up-regulated genes in Chondrosarcomas and Osteosarcomas are enriched in extracellular-matrix related terms, which makes sense for bone cancers. These ontology terms are associated with genes that are up-regulated in osteosarcomas and chondrosarcomas, but not Ewings-sarcoma. To investigate the up-regulated gnes in Ewings-sarcoma, we can use the interactive dendrogram to select these up-regulated genes and perform enrichment analysis. After updating our Gene Ontology enrichment (using Enrichrgram we see that these genes are enriched for behavior- and neuronal-related terms, which seems to agree with the neuroectodermal association of this tumor type.
There are 180 haematopoietic and lymphoid tissue cell lines. Below we will visualize the top 250 variably expressed genes across this tissue type and initialize the visualization with enrichment for Gene Ontology for biological process.
net.load_df(ccle) net.filter_cat('col', 1, 'tissue: haematopoietic_and_lymphoid_tissue') # setting some colors manually net.set_cat_color('col', 2, 'histology: lymphoid_neoplasm', 'blue') net.set_cat_color('col', 3, 'sub-histology: acute_myeloid_leukaemia', 'yellow') net.filter_N_top('row', 250, 'var') net.normalize(axis='row', norm_type='zscore') net.enrichrgram('GO_Biological_Process_2015') net.cluster() net.widget()
We see clear clustering of these cell lines based on their histology and sub-histology. We can also link up-regulated genes to specific histologies. For instance, there are two clusters of up-regulated genes (two top clusters) associated with cell lines with the histology haematopoietic_neoplasm.
These differntially expressed genes across haematopoietic and lymphoid tissue are enriched for Gene Ontology biological process terms associated with cell activation, leukocyte activation, and immune related terms.
We would like to get an overview of the entire CCLE gene expression data, but the dataset is too large to visualize direcly using Clustergrammer. Also, we are probably not interested in the expression data of all 18,000 genes, but only in a subset of genes; e.g. those that are 'differentially expressed' across some subset of tissues.
We will use downsampling and filtering to get a more managable dataset, which we can visualize using Clustergrammer. First, we will use K-means to downsample the 1,037 cell lines down to 100 clusters and then we will filter for the top 1,000 differentially expressed genes.
We'll do the downsampling first and save it to
net.load_df(ccle) net.downsample(ds_type='kmeans', axis='col', num_samples=100) ccle_ds = net.export_df() ccle_ds.shape
Now our downsampled data,
ccle_ds, only has 100 columns. We have also dropped some column categories and are only keeping track of the maojrity tissue in each cell-line-cluster and the number of cell-lines in each cluster. We can see how this is encoded in the column names as tuples below:
Next, we will filter out genes based on variance -- we will only keep the top 2,000 genes based on their variance. After this, we will Z-score normalize the genes across all cell-line-clusters to more easily compare their differential expression. We will perform these operations within the
net.load_df(ccle_ds) net.filter_N_top('row', 1000, rank_type='var') net.normalize(axis='row', norm_type='zscore', keep_orig=True) print('now our matrix has 1000 rows and 100 columns') net.dat['mat'].shape
now our matrix has 1000 rows and 100 columns
Finally, we will use Clustergrammer to hierarchically cluster and visualize the downsampled and filtered dataset.
net.set_cat_color('col', 1, 'Majority-tissue: haematopoietic_and_lymphoid_tissue', 'yellow') net.set_cat_color('col', 1, 'Majority-tissue: breast', 'red') net.set_cat_color('col', 1, 'Majority-tissue: upper_aerodigestive_tract', 'orange') net.set_cat_color('col', 1, 'Majority-tissue: central_nervous_system', 'blue') net.set_cat_color('col', 1, 'Majority-tissue: lung', 'purple') net.set_cat_color('col', 1, 'Majority-tissue: central_nervous_system', 'blue') net.set_cat_color('col', 1, 'Majority-tissue: kidney', '#32cd32') net.set_cat_color('col', 3, 'Majority-sub-histology: adenocarcinoma', 'yellow')
net.cluster(enrichrgram=True, views=) net.widget()
We see that cell-line-clusters (refered to as cell lines) cluster based on their tissue (majority tissue). We can also see that the cell lines fall into three large clusters (dendrogram level 6) that primarily consist of the following cell lines:
Cluster 2 contains lung and upper aerodigestive tract tissues, which are physiologically related tissues. Cluster 3 contains haematopoietic and lymphoid tissue and bone, which are also physiologically related.
This overview shows us that we have four large gene clusters and three cell line clusters. We can zoom into the clusters to find out which genes are differentially expressed and mouseover specific gene name to bring up their full names and descriptions (via Harmonizome). Users can crop into clusters of genes and run enrichment analysis to investigate their functions.