#!/usr/bin/env python # coding: utf-8 # In[1]: # Initialize Notebook get_ipython().run_line_magic('run', '../join-collection-1/library/init.ipy') HTML(''' ''') # # RNA Binding Protein Profiling in UACC62 p53-Mutated Metastatic Melanoma Cell Line | BioJupies # --- # # Introduction # This notebook contains an analysis of GEO dataset GSE88741 (https://www.ncbi.nlm.nih.gov/gds/?term=GSE88741) created using the BioJupies Generator. # ### Table of Contents # The notebook is divided into the following sections: #

Load Dataset - Loads and previews the input dataset in the notebook environment.
PCA - Linear dimensionality reduction technique to visualize similarity between samples
Clustergrammer - Interactive hierarchical clustering heatmap visualization
Library Size Analysis - Analysis of readcount distribution for the samples within the dataset
Differential Expression Table - Differential expression analysis between two groups of samples
Volcano Plot - Plot the logFC and logP values resulting from a differential expression analysis
MA Plot - Plot the logFC and average expression values resulting from a differential expression analysis
Enrichr Links - Links to enrichment analysis results of the differentially expressed genes via Enrichr
Gene Ontology Enrichment Analysis - Identifies Gene Ontology terms which are enriched in the differentially expressed genes
Pathway Enrichment Analysis - Identifies biological pathways which are enriched in the differentially expressed genes
Transcription Factor Enrichment Analysis - Identifies transcription factors whose targets are enriched in the differentially expressed genes
Kinase Enrichment Analysis - Identifies protein kinases whose substrates are enriched in the differentially expressed genes
miRNA Enrichment Analysis - Identifies miRNAs whose targets are enriched in the differentially expressed genes
L1000CDS² Query - Identifies small molecules which mimic or reverse a given differential gene expression signature
L1000FWD Query - Projects signatures on a 2-dimensional visualization of the L1000 signature database

# --- # # Results # ## 1. Load Dataset # Here, the GEO dataset GSE88741 is loaded into the notebook. Expression data was quantified as gene-level counts using the ARCHS⁴ pipeline (Lachmann et al., 2017), available at http://amp.pharm.mssm.edu/archs4/. # In[2]: # Load dataset dataset = load_dataset(source='archs4', gse='GSE88741', platform='GPL16791') # Preview expression data preview_data(dataset) # **Table 1 | RNA-seq expression data.** The table displays the first 5 rows of the quantified RNA-seq expression dataset. Rows represent genes, columns represent samples, and values show the number of mapped reads. # In[3]: # Display metadata display_metadata(dataset) # **Table 2 | Sample metadata.** The table displays the metadata associated with the samples in the RNA-seq dataset. Rows represent RNA-seq samples, columns represent metadata categories. # In[4]: # Configure signatures dataset['signature_metadata'] = { 'Normal vs Perturbation': { 'A': ['GSM2344965', 'GSM2344966', 'GSM2344967'], 'B': ['GSM2344974', 'GSM2344975', 'GSM2344976'] } } # Generate signatures for label, groups in dataset['signature_metadata'].items(): signatures[label] = generate_signature(group_A=groups['A'], group_B=groups['B'], method='limma', dataset=dataset) # --- # ## 2. PCA # Principal Component Analysis (PCA) is a statistical technique used to identify global patterns in high-dimensional datasets. It is commonly used to explore the similarity of biological samples in RNA-seq datasets. To achieve this, gene expression values are transformed into Principal Components (PCs), a set of linearly uncorrelated features which represent the most relevant sources of variance in the data, and subsequently visualized using a scatter plot. # In[5]: # Run analysis results['pca'] = analyze(dataset=dataset, tool='pca', nr_genes=2500, normalization='logCPM', z_score='True') # Display results plot(results['pca']) # --- # ## 3. Clustergrammer # Clustergrammer is a web-based tool for visualizing and analyzing high-dimensional data as interactive and hierarchically clustered heatmaps. It is commonly used to explore the similarity between samples in an RNA-seq dataset. In addition to identifying clusters of samples, it also allows to identify the genes which contribute to the clustering. # In[6]: # Run analysis results['clustergrammer'] = analyze(dataset=dataset, tool='clustergrammer', nr_genes=2500, normalization='logCPM', z_score='True') # Display results plot(results['clustergrammer']) # --- # ## 4. Library Size Analysis # In order to quantify gene expression in an RNA-seq dataset, reads generated from the sequencing step are mapped to a reference genome and subsequently aggregated into numeric gene counts. Due to experimental variations and random technical noise, samples in an RNA-seq datasets often have variable amounts of the total RNA. Library size analysis calculates and displays the total number of reads mapped for each sample in the RNA-seq dataset, facilitating the identification of outlying samples and the assessment of the overall quality of the data. # In[7]: # Run analysis results['library_size_analysis'] = analyze(dataset=dataset, tool='library_size_analysis') # Display results plot(results['library_size_analysis']) # --- # ## 5. Differential Expression Table # Gene expression signatures are alterations in the patterns of gene expression that occur as a result of cellular perturbations such as drug treatments, gene knock-downs or diseases. They can be quantified using differential gene expression (DGE) methods, which compare gene expression between two groups of samples to identify genes whose expression is significantly altered in the perturbation. The signature table is used to interactively display the results of such analyses. # In[8]: # Initialize results results['signature_table'] = {} # Loop through signatures for label, signature in signatures.items(): # Run analysis results['signature_table'][label] = analyze(signature=signature, tool='signature_table', signature_label=label) # Display results plot(results['signature_table'][label]) # --- # ## 6. Volcano Plot # Volcano plots are a type of scatter plot commonly used to display the results of a differential gene expression analysis. They can be used to quickly identify genes whose expression is significantly altered in a perturbation, and to assess the global similarity of gene expression in two groups of biological samples. Each point in the scatter plot represents a gene; the axes display the significance versus fold-change estimated by the differential expression analysis. # In[9]: # Initialize results results['volcano_plot'] = {} # Loop through signatures for label, signature in signatures.items(): # Run analysis results['volcano_plot'][label] = analyze(signature=signature, tool='volcano_plot', signature_label=label, pvalue_threshold=0.05, logfc_threshold=1.5) # Display results plot(results['volcano_plot'][label]) # --- # ## 7. MA Plot # Volcano plots are a type of scatter plot commonly used to display the results of a differential gene expression analysis. They can be used to quickly identify genes whose expression is significantly altered in a perturbation, and to assess the global similarity of gene expression in two groups of biological samples. Each point in the scatter plot represents a gene; the axes display the average gene expression versus fold-change estimated by the differential expression analysis. # In[10]: # Initialize results results['ma_plot'] = {} # Loop through signatures for label, signature in signatures.items(): # Run analysis results['ma_plot'][label] = analyze(signature=signature, tool='ma_plot', signature_label=label, pvalue_threshold=0.05, logfc_threshold=1.5) # Display results plot(results['ma_plot'][label]) # --- # ## 8. Enrichr Links # Enrichment analysis is a statistical procedure used to identify biological terms which are over-represented in a given gene set. These include signaling pathways, molecular functions, diseases, and a wide variety of other biological terms obtained by integrating prior knowledge of gene function from multiple resources. Enrichr is a web-based application which allows to perform enrichment analysis using a large collection of gene-set libraries and various interactive approaches to display enrichment results. # In[11]: # Initialize results results['enrichr'] = {} # Loop through signatures for label, signature in signatures.items(): # Run analysis results['enrichr'][label] = analyze(signature=signature, tool='enrichr', signature_label=label, geneset_size=500) # Display results plot(results['enrichr'][label]) # --- # ## 9. Gene Ontology Enrichment Analysis # Gene Ontology (GO) is a major bioinformatics initiative aimed at unifying the representation of gene attributes across all species. It contains a large collection of experimentally validated and predicted associations between genes and biological terms. This information can be leveraged by Enrichr to identify the biological processes, molecular functions and cellular components which are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples. # In[12]: # Initialize results results['go_enrichment'] = {} # Loop through results for label, enrichr_results in results['enrichr'].items(): # Run analysis results['go_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='go_enrichment', signature_label=label) # Display results plot(results['go_enrichment'][label]) # --- # ## 10. Pathway Enrichment Analysis # Biological pathways are sequences of interactions between biochemical compounds which play a key role in determining cellular behavior. Databases such as KEGG, Reactome and WikiPathways contain a large number of associations between such pathways and genes. This information can be leveraged by Enrichr to identify the biological pathways which are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples. # In[13]: # Initialize results results['pathway_enrichment'] = {} # Loop through results for label, enrichr_results in results['enrichr'].items(): # Run analysis results['pathway_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='pathway_enrichment', signature_label=label) # Display results plot(results['pathway_enrichment'][label]) # --- # ## 11. Transcription Factor Enrichment Analysis # Transcription Factors (TFs) are proteins involved in the transcriptional regulation of gene expression. Databases such as ChEA and ENCODE contain a large number of associations between TFs and their transcriptional targets. This information can be leveraged by Enrichr to identify the transcription factors whose targets are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples. # In[14]: # Initialize results results['tf_enrichment'] = {} # Loop through results for label, enrichr_results in results['enrichr'].items(): # Run analysis results['tf_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='tf_enrichment', signature_label=label) # Display results plot(results['tf_enrichment'][label]) # --- # ## 12. Kinase Enrichment Analysis # Protein kinases are enzymes that modify other proteins by chemically adding phosphate groups. Databases such as KEA contain a large number of associations between kinases and their substrates. This information can be leveraged by Enrichr to identify the protein kinases whose substrates are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples. # In[15]: # Initialize results results['kinase_enrichment'] = {} # Loop through results for label, enrichr_results in results['enrichr'].items(): # Run analysis results['kinase_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='kinase_enrichment', signature_label=label) # Display results plot(results['kinase_enrichment'][label]) # --- # ## 13. miRNA Enrichment Analysis # microRNAs (miRNAs) are small non-coding RNA molecules which play a key role in the post-transcriptional regulation of gene expression. Databases such as TargetScan and MiRTarBase contain a large number of associations between miRNAs and their targets. This information can be leveraged by Enrichr to identify the miRNAs whose targets are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples. # In[16]: # Initialize results results['mirna_enrichment'] = {} # Loop through results for label, enrichr_results in results['enrichr'].items(): # Run analysis results['mirna_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='mirna_enrichment', signature_label=label) # Display results plot(results['mirna_enrichment'][label]) # --- # ## 14. L1000CDS² Query # L1000CDS2 is a web-based tool for querying gene expression signatures against signatures created from human cell lines treated with over 20,000 small molecules and drugs for the LINCS project. It is commonly used to identify small molecules which mimic or reverse the effects of a gene expression signature generated from a differential gene expression analysis. # In[17]: # Initialize results results['l1000cds2'] = {} # Loop through signatures for label, signature in signatures.items(): # Run analysis results['l1000cds2'][label] = analyze(signature=signature, tool='l1000cds2', signature_label=label) # Display results plot(results['l1000cds2'][label]) # --- # ## 15. L1000FWD Query # L1000FWD is a web-based tool for querying gene expression signatures against signatures created from human cell lines treated with over 20,000 small molecules and drugs for the LINCS project. # In[18]: # Initialize results results['l1000fwd'] = {} # Loop through signatures for label, signature in signatures.items(): # Run analysis results['l1000fwd'][label] = analyze(signature=signature, tool='l1000fwd', signature_label=label) # Display results plot(results['l1000fwd'][label]) # --- # # Methods # ### Data # ##### Data Source # Raw RNA-seq data for GEO dataset GSE88741 was downloaded from the SRA database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE88741) and quantified to gene-level counts using the ARCHS4 pipeline (Lachmann et al., 2017). Gene counts were downloaded from the ARCHS4 gene expression matrix v1.1. For more information about ARCHS4, as well as free access to the quantified gene expression matrix, visit the project home page at the following URL: http://amp.pharm.mssm.edu/archs4/download.html. # # ##### Data Normalization # ##### logCPM # Raw counts were normalized to log10-Counts Per Million (logCPM) by dividing each column by the total sum of its counts, multiplying it by 10⁶, followed by the application of a log10-transform. # # ### Signature Generation # The gene expression signature was generated by comparing gene expression levels between the control group and the experimental group using the limma R package (Ritchie et al., Nucleic Acids Research 2015), available on Bioconductor: http://bioconductor.org/packages/release/bioc/html/limma.html. # # ### PCA # Principal Component Analysis was performed using the PCA function from in the sklearn Python module. Prior to performing PCA, the raw gene counts were normalized using the logCPM method, filtered by selecting the 2500 genes with most variable expression, and finally transformed using the Z-score method. # # ### Clustergrammer # The interactive heatmap was generated using Clustergrammer (Fernandez et al., 2017) which is freely available at http://amp.pharm.mssm.edu/clustergrammer/. Prior to displaying the heatmap, the raw gene counts were normalized using the logCPM method, filtered by selecting the 2500 genes with most variable expression, and finally transformed using the Z-score method. # # ### Library Size Analysis # Read counts were calculated by performing the sum for each column in the raw gene count matrix. Total counts were subsequently divided by 106 and displayed as million reads. # # ### Differential Expression Table # The gene expression signature was generated by performing differential gene expression analysis using the methods described in the Differential Gene Expression section. # # ### Volcano Plot # Gene fold changes were transformed using log2 and displayed on the x axis; P-values were corrected using the Benjamini-Hochberg method, transformed using –log10, and displayed on the y axis. See the Differential Gene Expression section for more information on the methods used to generate these values. # # ### MA Plot # Average gene expression was identified by calculating the mean of the normalized gene expression values and displayed on the x axis; P-values were corrected using the Benjamini-Hochberg method, transformed using –log10, and displayed on the y axis. For more information on the methods used to generate the signature, see the Differential Gene Expression section. # # ### Enrichr Links # The up-regulated and down-regulated gene sets were generated by extracting the 500 genes with the respectively highest and lowest values from the gene expression signature. The gene sets were subsequently submitted to Enrichr (Kuleshov et al., 2016), which is freely available at http://amp.pharm.mssm.edu/Enrichr/, using the gene set upload API. For more information on the methods used to generate the signature, see the Differential Gene Expression section. # # ### Gene Ontology Enrichment Analysis # Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: GO_Biological_Process_2017b, GO_Molecular_Function_2017b, GO_Cellular_Component_2017b. Significant terms are determined by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr section. # # ### Pathway Enrichment Analysis # Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: KEGG_2016, Reactome_2016, WikiPathways_2016. Significant terms are determined by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr section. # # ### Transcription Factor Enrichment Analysis # Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: ChEA_2016, ENCODE_TF_ChIP-seq_2015, ARCHS4_TFs_Coexp. Significant results are determined by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr section. # # ### Kinase Enrichment Analysis # Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: KEA_2015, ARCHS4_Kinases_Coexp. Significant results are determined by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr section. # # ### miRNA Enrichment Analysis # Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: TargetScan_microRNA_2017, miRTarBase_2017. Significant results are determined by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr section. # # ### L1000CDS² Query # The L1000CDS2 analysis (Duan et al., 2016) was performed by submitting the top 2000 genes in the gene expression signature to the L1000CDS2 signature search API. For more information on the methods used to generate the signature, see the Differential Gene Expression section. # # ### L1000FWD Query # The L1000FWD analysis (Wang et al., 2018) was performed by submitting the top 2000 genes in the gene expression signature to the L1000FWD signature search API. For more information on the methods used to generate the signature, see the Differential Gene Expression section. # --- # ## References # Duan, Q., Reid, S.P., Clark, N.R., Wang, Z., Fernandez, N.F., Rouillard, A.D., Readhead, B., Tritsch, S.R., Hodos, R., Hafner, M., et al. (2016). L1000CDS2: LINCS L1000 characteristic direction signatures search engine. Npj Systems Biology and Applications 2. doi: https://doi.org/10.1038/npjsba.2016.15

Fernandez, N.F., Gundersen, G.W., Rahman, A., Grimes, M.L., Rikova, K., Hornbeck, P., and Ma'ayan, A. (2017). Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data. Scientific Data 4, 170151. doi: http://dx.doi.org/10.1038/sdata.2017.151

Kuleshov, M.V., Jones, M.R., Rouillard, A.D., Fernandez, N.F., Duan, Q., Wang, Z., Koplev, S., Jenkins, S.L., Jagodnik, K.M., Lachmann, A., et al. (2016). Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research 44, W90ÐW97. doi: https://dx.doi.org/10.1093/nar/gkw377

Lachmann, A., Torre, D., Keenan, A.B., Jagodnik, K.M., Lee, H.J., Silverstein, M.C., Wang, L., and Ma’ayan, A. (2017). Massive Mining of Publicly Available RNA-seq Data from Human and Mouse (Cold Spring Harbor Laboratory). doi: https://doi.org/10.1101/189092

Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559Ð572. doi: https://doi.org/10.1080/14786440109462720

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47–e47. doi: https://doi.org/10.1093/nar/gkv007

Wang, Z., Lachmann, A., Keenan, A.B., and Ma’ayan, A. (2018). L1000FWD: fireworks visualization of drug-induced transcriptomic signatures. Bioinformatics. doi: https://doi.org/10.1093/bioinformatics/bty060
# --- #

The Jupyter Notebook Generator is being developed by the Ma'ayan Lab at the Icahn School of Medicine at Mount Sinai
and is an open source project available on GitHub.