#!/usr/bin/env python
# coding: utf-8

# # CCLE Tissue Expression Clustergrammer Visualizations
# This notebook will use the [Clustergrammer-Widget](http://clustergrammer.readthedocs.io/clustergrammer_widget.html) to visualize the Cancer cell line Encyclopedia gene expression data ([Broad-Institute CCLE](https://software.broadinstitute.org/software/cprg/?q=node/11)). The CCLE project measured genetic data from over 1000 cancer cell lines. We'lll use Clustergrammer-Widget to visualize the data. We will start by importing required libraries and initializing the Clustergrammer [Network](http://clustergrammer.readthedocs.io/clustergrammer_py.html#clustergrammer-py-api) object: 
# 

# In[1]:


import pandas as pd
from clustergrammer_widget import *
net = Network(clustergrammer_widget)


# ### Load CCLE data
# We are using a slightly reformatted version of the CCLE gene expression data with modified cell line meta-data (category) formatting. You can see below how cell-line categorical information (e.g. tissue) information is encoded as column tuples. The matrix has 18,874 rows (genes) and 1,037 columns (cell-lines).

# In[2]:


# load primary data (hidden from git repo)
net.load_file('../original_data/CCLE.txt')
ccle = net.export_df()
print(ccle.shape)


# # Bone Tissue Expression
# The CCLE dataset is too large to directly visualize using the Clustergrammer-Widget, but we can directly visualize tissue-specific subsets of the data. Here we will visualize the top 250 most variable genes in bone cancers, which will help us identify subsets within bone cancer. We will also initialize the visualization with enrichment for Gene Ontology terms using [Enrichr](http://amp.pharm.mssm.edu/Enrichr/) and the ``enrichrgram`` method.

# In[3]:


# filter for columns with category 1 bone
net.filter_cat('col', 1, 'tissue: bone')
# keep top 250 most variable genes
net.filter_N_top('row', 250, 'var')
# normalize gene expression across bone cancers
net.normalize(axis='row', norm_type='zscore')
# perform enrichment analysis on genes
net.enrichrgram('GO_Biological_Process_2015')
net.cluster()
net.widget()


# ### Bone Cancer Cell Lines Cluster Based on their Histology
# Osteosarcoma, Giant Cell Tumor, and Chondrosarcoma cluster separately from Ewings sarcoma-peripheral-primitive-neuroectodermal-tumors.
# 
# ### Gene Ontology: Biological Process
# We can see that the up-regulated genes in Chondrosarcomas and Osteosarcomas are enriched in extracellular-matrix related terms, which makes sense for bone cancers. These ontology terms are associated with genes that are up-regulated in osteosarcomas and chondrosarcomas, but not Ewings-sarcoma. To investigate the up-regulated gnes in Ewings-sarcoma, we can use the [interactive dendrogram](http://clustergrammer.readthedocs.io/interacting_with_viz.html#interactive-dendrogram) to select these up-regulated genes and perform enrichment analysis. After updating our Gene Ontology enrichment (using [Enrichrgram](http://clustergrammer.readthedocs.io/biology_specific_features.html#enrichment-analysis) we see that these genes are enriched for behavior- and neuronal-related terms, which seems to agree with the neuroectodermal association of this tumor type. 

# # Haematopoietic and Lymphoid Tissue
# There are 180 haematopoietic and lymphoid tissue cell lines. Below we will visualize the top 250 variably expressed genes across this tissue type and initialize the visualization with enrichment for Gene Ontology for biological process. 

# In[4]:


net.load_df(ccle)
net.filter_cat('col', 1, 'tissue: haematopoietic_and_lymphoid_tissue')

# setting some colors manually
net.set_cat_color('col', 2, 'histology: lymphoid_neoplasm', 'blue')
net.set_cat_color('col', 3, 'sub-histology: acute_myeloid_leukaemia', 'yellow')

net.filter_N_top('row', 250, 'var')
net.normalize(axis='row', norm_type='zscore')
net.enrichrgram('GO_Biological_Process_2015')
net.cluster()
net.widget()


# ### Haematopoietic and Lymphoid Tissue Cell Lines Cluster Based on Histology
# We see clear clustering of these cell lines based on their histology and sub-histology. We can also link up-regulated genes to specific histologies. For instance, there are two clusters of up-regulated genes (two top clusters) associated with cell lines with the histology haematopoietic_neoplasm. 
# 
# ### Gene Ontology: Biological Process
# These differntially expressed genes across haematopoietic and lymphoid tissue are enriched for Gene Ontology biological process terms associated with cell activation, leukocyte activation, and immune related terms. 

# ## Overview of the Entire CCLE
# We would like to get an overview of the entire CCLE gene expression data, but the dataset is too large to visualize direcly using Clustergrammer. Also, we are probably not interested in the expression data of all 18,000 genes, but only in a subset of genes; e.g. those that are 'differentially expressed' across some subset of tissues. 
# 
# We will use downsampling and filtering to get a more managable dataset, which we can visualize using Clustergrammer. First, we will use K-means to downsample the 1,037 cell lines down to 100 clusters and then we will filter for the top 1,000 differentially expressed genes. 
# 
# ### Cell-line Downsampling
# We'll do the downsampling first and save it to ``ccle_ds``:

# In[5]:


net.load_df(ccle)
net.downsample(ds_type='kmeans', axis='col', num_samples=100)
ccle_ds = net.export_df()
ccle_ds.shape


# Now our downsampled data, ``ccle_ds``, only has 100 columns. We have also dropped some column categories and are only keeping track of the maojrity tissue in each cell-line-cluster and the number of cell-lines in each cluster. We can see how this is encoded in the column names as tuples below:

# ### Gene-Filtering and Normalization
# Next, we will filter out genes based on variance -- we will only keep the top 2,000 genes based on their variance. After this, we will Z-score normalize the genes across all cell-line-clusters to more easily compare their differential expression. We will perform these operations within the ``net`` object:

# In[6]:


net.load_df(ccle_ds)
net.filter_N_top('row', 1000, rank_type='var')
net.normalize(axis='row', norm_type='zscore', keep_orig=True)
print('now our matrix has 1000 rows and 100 columns')
net.dat['mat'].shape


# ## Visualizing Downsampled Filtered CCLE Dataset
# Finally, we will use Clustergrammer to hierarchically cluster and visualize the downsampled and filtered dataset.

# In[7]:


net.set_cat_color('col', 1, 'Majority-tissue: haematopoietic_and_lymphoid_tissue', 'yellow')
net.set_cat_color('col', 1, 'Majority-tissue: breast', 'red')
net.set_cat_color('col', 1, 'Majority-tissue: upper_aerodigestive_tract', 'orange')
net.set_cat_color('col', 1, 'Majority-tissue: central_nervous_system', 'blue')
net.set_cat_color('col', 1, 'Majority-tissue: lung', 'purple')
net.set_cat_color('col', 1, 'Majority-tissue: central_nervous_system', 'blue')
net.set_cat_color('col', 1, 'Majority-tissue: kidney', '#32cd32')

net.set_cat_color('col', 3, 'Majority-sub-histology: adenocarcinoma', 'yellow')


# In[8]:


net.cluster(enrichrgram=True, views=[])
net.widget()


# ### Cell Line Tissue Clustering
# We see that cell-line-clusters (refered to as cell lines) cluster based on their tissue (majority tissue). We can also see that the cell lines fall into three large clusters (dendrogram level 6) that primarily consist of the following cell lines:
# 
# ![cell_type_comparisons](img/CCLE_ds_cluster_breakdowns.png)
# 
# Cluster 2 contains lung and upper aerodigestive tract tissues, which are physiologically related tissues. Cluster 3 contains haematopoietic and lymphoid tissue and bone, which are also physiologically related.
# 
# ### Cell-line and Gene Clusters
# This overview shows us that we have four large gene clusters and three cell line clusters. We can zoom into the clusters to find out which genes are differentially expressed and mouseover specific gene name to bring up their full names and descriptions (via [Harmonizome](http://amp.pharm.mssm.edu/Harmonizome/)). Users can crop into clusters of genes and run enrichment analysis to investigate their functions. 

# In[ ]: