CCLE Tissue Expression Clustergrammer Visualizations

This notebook uses the Clustergrammer-Widget to visualize the Cancer Cell Line Encyclopedia (CCLE) gene expression data from the Broad Institute. The CCLE project measured genetic data from over 1000 cancer cell lines. We will start by importing the required libraries and initializing the Clustergrammer Network object:

In [1]:
from clustergrammer_widget import *
import pandas as pd
import numpy as np
net = Network()

Reformatted CCLE data

We are using a slightly reformatted version of the CCLE gene expression data with modified cell line meta-data (category) formatting. You can see below how cell-line categorical information (e.g. tissue) is encoded as column tuples. The matrix has 18,874 rows (genes) and 1,037 columns (cell lines).

In [2]:
net.load_file('../original_data/CCLE.txt')
ccle = net.export_df()
ccle.head()
Out[2]:
(cell line: LN18, tissue: central_nervous_system, histology: glioma, sub-histology: astrocytoma_Grade_IV, gender: M) (cell line: 769P, tissue: kidney, histology: carcinoma, sub-histology: clear_cell_renal_cell_carcinoma, gender: F) (cell line: 786O, tissue: kidney, histology: carcinoma, sub-histology: clear_cell_renal_cell_carcinoma, gender: M) (cell line: CAOV3, tissue: ovary, histology: carcinoma, sub-histology: adenocarcinoma, gender: F) (cell line: HEPG2, tissue: liver, histology: carcinoma, sub-histology: hepatocellular_carcinoma, gender: M) (cell line: MOLT4, tissue: haematopoietic_and_lymphoid_tissue, histology: lymphoid_neoplasm, sub-histology: acute_lymphoblastic_T_cell_leukaemia, gender: M) (cell line: NCIH524, tissue: lung, histology: carcinoma, sub-histology: small_cell_carcinoma, gender: M) (cell line: NCIH209, tissue: lung, histology: carcinoma, sub-histology: small_cell_carcinoma, gender: M) (cell line: MIAPACA2, tissue: pancreas, histology: carcinoma, sub-histology: ductal_carcinoma, gender: M) (cell line: MCAS, tissue: ovary, histology: carcinoma, sub-histology: adenocarcinoma, gender: F) ... 
(cell line: SLR21, tissue: kidney, histology: carcinoma, sub-histology: renal_cell_carcinoma, gender: NA) (cell line: LNZ308, tissue: central_nervous_system, histology: glioma, sub-histology: astrocytoma_Grade_IV, gender: NA) (cell line: LN340, tissue: central_nervous_system, histology: glioma, sub-histology: astrocytoma_Grade_IV, gender: NA) (cell line: HCC827GR5, tissue: lung, histology: carcinoma, sub-histology: adenocarcinoma, gender: NA) (cell line: SLR20, tissue: kidney, histology: carcinoma, sub-histology: renal_cell_carcinoma, gender: NA) (cell line: HK2, tissue: kidney, histology: other, sub-histology: immortalized_epithelial, gender: NA) (cell line: EW8, tissue: bone, histology: Ewings_sarcoma-peripheral_primitive_neuroectodermal_tumour, sub-histology: NS, gender: NA) (cell line: UOK101, tissue: kidney, histology: carcinoma, sub-histology: clear_cell_renal_cell_carcinoma, gender: NA) (cell line: JHESOAD1, tissue: oesophagus, histology: carcinoma, sub-histology: barrett_associated_adenocarcinoma, gender: NA) (cell line: CH157MN, tissue: central_nervous_system, histology: meningioma, sub-histology: NS, gender: NA)
LOC100009676 5.987545 5.444892 5.838828 6.074743 5.788600 5.459675 5.755560 7.190493 5.449818 5.801820 ... 5.473156 5.517208 5.858379 5.196033 5.831437 5.362021 5.799747 5.865606 5.463812 5.720593
AKT3 6.230233 7.544216 7.328450 4.270720 4.478293 6.212102 7.562398 8.642669 5.556191 6.808673 ... 6.375324 6.119814 6.561409 4.521773 6.830904 7.031690 4.881235 6.914640 5.313795 5.757825
MED6 9.363550 8.715909 8.410834 9.845271 9.761157 10.532820 10.393960 9.478429 9.112954 9.815614 ... 8.849773 8.767192 8.521635 8.224544 9.325785 8.362727 8.990524 8.958629 9.748100 9.758431
NR2E3 3.803069 4.173643 3.776557 3.934091 3.822202 3.949198 3.807546 3.930186 4.161937 4.028581 ... 3.717506 3.977377 3.659459 3.933996 4.515748 4.434658 4.127832 3.942736 4.062648 4.074257
NAALAD2 3.586430 3.663081 4.047007 3.817250 6.444302 4.081071 5.462774 4.252446 3.932451 3.835827 ... 3.520843 4.036661 4.168351 3.535915 4.445632 3.622032 5.436580 3.666404 3.556565 3.728828

5 rows × 1037 columns

In [3]:
ccle.shape
Out[3]:
(18874, 1037)
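
The column tuples shown in Out[2] can be unpacked into ordinary dictionaries, which makes the categories easier to inspect. The sketch below is only an illustration: it assumes every tuple element is a string of the form 'key: value' (as the header suggests), and the helper name parse_column is hypothetical, not part of Clustergrammer.

```python
from collections import Counter

def parse_column(col):
    """Turn a tuple of 'key: value' strings into a dict of categories."""
    parsed = {}
    for entry in col:
        # Split on the first ': ' only, so values may contain further text
        key, _, value = entry.partition(': ')
        parsed[key] = value
    return parsed

# Four column labels copied from the Out[2] header above
example_cols = [
    ('cell line: LN18', 'tissue: central_nervous_system', 'histology: glioma',
     'sub-histology: astrocytoma_Grade_IV', 'gender: M'),
    ('cell line: 769P', 'tissue: kidney', 'histology: carcinoma',
     'sub-histology: clear_cell_renal_cell_carcinoma', 'gender: F'),
    ('cell line: 786O', 'tissue: kidney', 'histology: carcinoma',
     'sub-histology: clear_cell_renal_cell_carcinoma', 'gender: M'),
    ('cell line: CAOV3', 'tissue: ovary', 'histology: carcinoma',
     'sub-histology: adenocarcinoma', 'gender: F'),
]

parsed_cols = [parse_column(col) for col in example_cols]

# Tally how many of the example cell lines come from each tissue
tissue_counts = Counter(p['tissue'] for p in parsed_cols)
print(parsed_cols[0]['cell line'], tissue_counts['kidney'])
```

Applied to the full list from ccle.columns.tolist(), the same tally would give the number of cell lines per tissue.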

Tissue-Specific Expression

The 'Overview of the Entire CCLE' section of this notebook gives a coarse-grained overview of the CCLE gene expression data. That overview shows that cell lines cluster according to their tissue, and Enrichr confirms that clusters of differentially expressed genes are informative about their associated tissues. Here we will look at the expression in specific tissues and see whether we can identify sub-clusters of tissues (e.g. based on their histology).

We will do this using Pandas to filter for specific cell lines of interest.

In [4]:
# get a list of cell line names (tuples)
cols = ccle.columns.tolist()

Bone Tissue Expression

Next, we will visualize gene expression across bone tissue. We will process the data to highlight genes with variable expression across bone cancers, which will help us characterize subsets of bone cancers.

In [5]:
# gather bone cells
bone_names = [i for i in cols if i[1] == 'tissue: bone']
bone_data = ccle[bone_names]
In [6]:
net.load_df(bone_data)
net.filter_N_top('row', 500, 'var')
net.normalize(axis='row', norm_type='zscore')
net.dat['mat'].shape
Out[6]:
(500, 29)
In [7]:
net.make_clust()
clustergrammer_widget(network=net.widget())
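
To make the filtering and normalization in In [6] more concrete, here is a minimal NumPy sketch of the two ideas: keep the rows with the largest variance, then Z-score each remaining row. It is not Clustergrammer's implementation. The function names are hypothetical, the population variance (NumPy's default) is an assumption, and preserving the original row order is a choice made for this sketch.

```python
import numpy as np

def filter_top_variance_rows(mat, n_top):
    """Keep the n_top rows with the largest variance, preserving row order."""
    variances = mat.var(axis=1)
    top_idx = np.argsort(variances)[-n_top:]
    return mat[np.sort(top_idx)]

def zscore_rows(mat):
    """Subtract each row's mean and divide by its standard deviation."""
    means = mat.mean(axis=1, keepdims=True)
    stds = mat.std(axis=1, keepdims=True)
    stds[stds == 0] = 1.0  # avoid dividing by zero for constant rows
    return (mat - means) / stds

# Toy matrix: 4 "genes" (rows) x 4 "cell lines" (columns)
toy = np.array([
    [1.0, 1.0, 1.0, 1.0],    # constant row, variance 0
    [0.0, 10.0, 0.0, 10.0],  # most variable row
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.0, 6.0, 5.0],
])

filtered = filter_top_variance_rows(toy, 2)
normalized = zscore_rows(filtered)
print(normalized.shape)
```

After the Z-score step every retained row has mean 0 and unit standard deviation, so the heatmap colors show each gene's deviation from its own average rather than its absolute expression level.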

Bone Cancer Cell Lines Cluster Based on their Histology

Osteosarcoma, Giant Cell Tumor, and Chondrosarcoma cluster separately from Ewings sarcoma-peripheral-primitive-neuroectodermal-tumors.

Gene Ontology: Biological Process

We can see that the up-regulated genes in Chondrosarcomas and Osteosarcomas are enriched in extracellular-matrix related terms, which makes sense for bone cancers. These extracellular-matrix terms do not 'point' to the up-regulated genes in Ewings-Sarcoma, so we will use the row dendrogram crop button to investigate these genes further. We see that they are enriched for behavior and neuronal-related terms, which seems to agree with the neuroectodermal association of this tumor type.

Lung Tissue Expression

Here we will visualize the expression in lung tissue and we will process the data such that we highlight the variability of gene expression across all lung cell lines. This will help us understand the differences between the lung cell lines and hopefully identify subpopulations.

There are 187 lung tissue cell lines. We will start by selecting lung tissue cell lines, then we will filter for the top 500 differentially expressed genes (ranked by variance), and finally we will normalize the genes across all cell lines to easily compare their differential expression across the cell lines.

In [8]:
# gather lung cells
lung_names = [i for i in cols if i[1] == 'tissue: lung']
lung_data = ccle[lung_names]

Note that we can reuse the same net object as before, since loading a new DataFrame clears out the old data.

In [9]:
net.load_df(lung_data)
net.filter_N_top('row', 500, 'var')
net.normalize(axis='row', norm_type='zscore')
net.dat['mat'].shape
Out[9]:
(500, 187)
In [10]:
net.make_clust()
clustergrammer_widget(network=net.widget())

Lung Cell Lines Cluster According to Sub-Histology

Lung cell lines almost all have the same histology, carcinoma, but have several sub-histologies. We see that cell lines cluster according to these sub-histologies and using Enrichrgram we can see the biological processes occurring in these clusters.
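
Before reading the clustergram, it can help to know which sub-histologies are present and which cell lines belong to each. The sketch below groups cell line names by sub-histology. It uses three lung column labels copied from the Out[2] header, and it assumes the sub-histology is always the fourth element of the column tuple.

```python
from collections import defaultdict

# Three lung column labels copied from the Out[2] header above
lung_cols = [
    ('cell line: NCIH524', 'tissue: lung', 'histology: carcinoma',
     'sub-histology: small_cell_carcinoma', 'gender: M'),
    ('cell line: NCIH209', 'tissue: lung', 'histology: carcinoma',
     'sub-histology: small_cell_carcinoma', 'gender: M'),
    ('cell line: HCC827GR5', 'tissue: lung', 'histology: carcinoma',
     'sub-histology: adenocarcinoma', 'gender: NA'),
]

by_sub_histology = defaultdict(list)
for col in lung_cols:
    # Strip the 'key: ' prefix from the name and sub-histology entries
    name = col[0].split(': ', 1)[1]
    sub_histology = col[3].split(': ', 1)[1]
    by_sub_histology[sub_histology].append(name)

for sub_histology, names in sorted(by_sub_histology.items()):
    print(sub_histology, len(names), names)
```

Running the same loop over lung_names would list every sub-histology among the 187 lung cell lines, which can then be compared with the column clusters in the visualization.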

Enrichr Results

Non-Small Cell Carcinoma: endopeptidase, ECM, etc

If we do GO Biological Function enrichment we get terms related to: regulation of endopeptidase activity, response to acid and inorganic substances, extracellular matrix organization, etc. These terms point to up-regulated genes in adenocarcinoma and large cell carcinomas (NSCLC), but not to the genes up-regulated in small cell carcinomas (SCLC).

Small Cell Carcinoma: Neuron differentiation and function

To investigate the function of these up-regulated SCLC genes we can enrich for just this cluster of up-regulated genes. We see with the same enrichment analysis that the most specific functions are commonly neuron related. SCLC are known to display characteristics of neuronal cells (Onganer et al. 2005).

Overview of the Entire CCLE

We would like to get an overview of the entire CCLE gene expression data, but the dataset is too large to visualize directly using Clustergrammer. Also, we are probably not interested in the expression data of all 18,000 genes, but only in a subset of genes; e.g. those that are 'differentially expressed' across some subset of tissues.

We will use downsampling and filtering to get a more manageable dataset, which we can visualize using Clustergrammer. First, we will use K-means to downsample the 1,037 cell lines down to 100 clusters and then we will filter for the top 2,000 differentially expressed genes.

Cell-line Downsampling

We'll do the downsampling first and save it to ccle_ds:

In [11]:
net.load_df(ccle)
net.downsample(ds_type='kmeans', axis='col', num_samples=100)
ccle_ds = net.export_df()
ccle_ds.shape
Out[11]:
(18874, 100)

Now our downsampled data, ccle_ds, only has 100 columns. We have also dropped some column categories and are only keeping track of the majority tissue in each cell-line-cluster and the number of cell lines in each cluster. We can see how this is encoded in the column names as tuples below:

In [12]:
ccle_ds.head()
Out[12]:
(Cluster: cluster-0, Majority-tissue: haematopoietic_and_lymphoid_tissue, number in clust: 2) (Cluster: cluster-1, Majority-tissue: lung, number in clust: 2) (Cluster: cluster-2, Majority-tissue: upper_aerodigestive_tract, number in clust: 47) (Cluster: cluster-3, Majority-tissue: autonomic_ganglia, number in clust: 11) (Cluster: cluster-4, Majority-tissue: skin, number in clust: 50) (Cluster: cluster-5, Majority-tissue: lung, number in clust: 26) (Cluster: cluster-6, Majority-tissue: haematopoietic_and_lymphoid_tissue, number in clust: 31) (Cluster: cluster-7, Majority-tissue: large_intestine, number in clust: 2) (Cluster: cluster-8, Majority-tissue: lung, number in clust: 15) (Cluster: cluster-9, Majority-tissue: liver, number in clust: 16) ... (Cluster: cluster-90, Majority-tissue: stomach, number in clust: 1) (Cluster: cluster-91, Majority-tissue: breast, number in clust: 1) (Cluster: cluster-92, Majority-tissue: haematopoietic_and_lymphoid_tissue, number in clust: 1) (Cluster: cluster-93, Majority-tissue: central_nervous_system, number in clust: 17) (Cluster: cluster-94, Majority-tissue: autonomic_ganglia, number in clust: 2) (Cluster: cluster-95, Majority-tissue: thyroid, number in clust: 1) (Cluster: cluster-96, Majority-tissue: stomach, number in clust: 1) (Cluster: cluster-97, Majority-tissue: oesophagus, number in clust: 1) (Cluster: cluster-98, Majority-tissue: kidney, number in clust: 2) (Cluster: cluster-99, Majority-tissue: liver, number in clust: 8)
LOC100009676 5.665593 5.284315 5.673244 5.363738 6.057420 5.840425 5.841230 5.685825 5.680019 5.610437 ... 6.599907 5.425919 6.363237 5.773302 4.631328 5.855566 6.583266 5.077296 5.626497 5.672558
AKT3 6.435427 6.952967 5.605452 8.122674 7.267956 6.011819 5.094038 4.558152 7.085084 6.115605 ... 7.688763 4.401639 7.052142 6.299563 8.859103 7.864197 4.341758 5.051193 5.370956 4.430994
MED6 9.518722 8.762060 9.502653 9.341522 8.839631 9.507497 9.699576 9.673724 8.788507 8.855337 ... 9.539184 8.672265 9.594567 8.579562 9.669472 9.377676 9.125494 10.045597 8.782129 9.287517
NR2E3 3.989407 3.901817 4.051622 3.875381 3.804977 3.931573 3.993905 3.990742 3.864638 3.927532 ... 4.132590 4.118239 3.706916 3.829621 3.748747 4.071493 3.678901 3.869146 4.106636 3.887731
NAALAD2 4.389125 4.678008 3.844582 7.318395 4.123212 4.142046 3.873645 3.872935 3.751903 4.138593 ... 3.763763 3.873275 3.599781 3.703137 6.849774 3.681915 3.948763 3.678124 3.535314 6.393138

5 rows × 100 columns
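
The idea behind this kind of downsampling can be sketched with a small hand-written K-means (Lloyd's algorithm) in NumPy. This is not the code Clustergrammer runs: the function name is hypothetical, the evenly spaced deterministic initialization is a simplification (K-means implementations usually start from random or k-means++ seeds), and the toy matrix and tissue labels are invented for illustration. Each cluster of columns is replaced by its centroid and labeled with its majority tissue and size, mirroring the tuples in Out[12].

```python
from collections import Counter
import numpy as np

def kmeans_downsample_columns(mat, tissues, n_clusters, n_iter=20):
    """Replace columns by K-means centroids; label each with its majority tissue."""
    points = np.asarray(mat, dtype=float).T  # one row per cell line
    # Simplified deterministic start: evenly spaced cell lines as centroids
    init_idx = np.linspace(0, len(points) - 1, n_clusters).astype(int)
    centroids = points[init_idx].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every cell line to every centroid
        # (fine for a toy matrix, too memory-hungry for the full CCLE)
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = points[assign == k]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    labels = []
    for k in range(n_clusters):
        member_tissues = [t for t, a in zip(tissues, assign) if a == k]
        majority = Counter(member_tissues).most_common(1)[0][0] if member_tissues else 'none'
        labels.append(('Cluster: cluster-%d' % k,
                       'Majority-tissue: %s' % majority,
                       'number in clust: %d' % len(member_tissues)))
    return centroids.T, labels

# Toy data: 3 "genes" x 5 "cell lines" forming two well-separated groups
toy = np.array([
    [1.0, 1.2, 0.8, 9.0, 9.4],
    [2.0, 2.1, 1.9, 7.0, 7.2],
    [0.5, 0.4, 0.6, 8.0, 8.2],
])
toy_tissues = ['lung', 'lung', 'lung', 'bone', 'bone']

downsampled, cluster_labels = kmeans_downsample_columns(toy, toy_tissues, n_clusters=2)
print(downsampled.shape)
print(cluster_labels)
```

Each downsampled column is an average expression profile, so a cluster's values describe a typical member rather than any single cell line, and the 'number in clust' entry records how many cell lines that average stands for.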

Gene-Filtering and Normalization

Next, we will filter out genes based on variance -- we will only keep the top 2,000 genes based on their variance. After this, we will Z-score normalize the genes across all cell-line-clusters to more easily compare their differential expression. We will perform these operations within the net object:

In [13]:
net.load_df(ccle_ds)
net.filter_N_top('row', 2000, rank_type='var')
net.normalize(axis='row', norm_type='zscore', keep_orig=True)
print('now our matrix has 2000 rows and 100 columns')
net.dat['mat'].shape
now our matrix has 2000 rows and 100 columns
Out[13]:
(2000, 100)

Visualizing Downsampled Filtered CCLE Dataset

Finally, we will use Clustergrammer to hierarchically cluster and visualize the downsampled and filtered dataset.

In [14]:
net.make_clust()
clustergrammer_widget(network=net.widget())

Cell Line Tissue Clustering

We can immediately see that cell-line-clusters (referred to as cell lines) cluster based on their tissue (specifically the majority tissue in the cluster) -- note the colored bands under the column labels. The second category 'number in clust' gives the number of cell lines represented by the K-means cluster -- the darker the category the more cell lines in each cluster.

From the column dendrogram (gray trapezoids under the heatmap) we can see that cell lines cluster into six large clusters. On the right, 'Haematopoietic and Lymphoid Tissue' forms a big cluster with a set of highly expressed genes on the bottom right.

Cell-line and Gene Clusters

This overview shows us that we have four big gene clusters and six cell line 'clusters' (made up of K-means clusters). We can zoom into the clusters to find out which genes are differentially expressed and mouse over specific gene names to bring up their full names and descriptions (via Harmonizome).

Haematopoietic and Lymphoid Tissue

Clustergrammer leverages other Ma'ayan lab web-tools (e.g. Enrichr) to facilitate the exploration of biological gene-level data. We can use the Enrichrgram functionality to find biological information specific to our genes of interest. We will do this by first selecting our cluster of interest, the up-regulated genes in Haematopoietic/Lymphoid Tissue, using the dendrogram crop button (the triangle pointing at the dendrogram cluster). Once we have filtered for only these genes we can either export our genes to Enrichr or import Enrichr results into our visualization.

We'll enrich for Gene Ontology Biological Process and see many immune-related processes that 'point' right at the cluster of up-regulated genes, which makes sense. We can also enrich for upstream transcription factors that might be responsible for the expression of these genes. We see that the transcription factor IRF1 targets a subset of the up-regulated genes and is known to be involved in the immune response. We also see that the transcription factor MECOM targets these up-regulated genes; this transcription factor is known to be involved in hematopoiesis.

We can also use Enrichr to investigate other subsets.
