In [ ]:
# Initialize Notebook
%run ../join-collection-1/library/init.ipy
HTML('''<script> code_show=true;  function code_toggle() {  if (code_show){  $('div.input').hide();  } else {  $('div.input').show();  }  code_show = !code_show }  $( document ).ready(code_toggle); </script> <form action="javascript:code_toggle()"><input type="submit" value="Toggle Code"></form>''')

Polycomb Protein BMI1 Overexpression Analysis in A375 Melanoma Cell Line | BioJupies


Introduction

This notebook contains an analysis of GEO dataset GSE71890 (https://www.ncbi.nlm.nih.gov/gds/?term=GSE71890) created using the BioJupies Generator.

Table of Contents

The notebook is divided into the following sections:

  1. Load Dataset - Loads and previews the input dataset in the notebook environment
  2. PCA - Linear dimensionality reduction technique to visualize similarity between samples
  3. Clustergrammer - Interactive hierarchical clustering heatmap visualization
  4. Library Size Analysis - Analysis of the read count distribution across the samples in the dataset
  5. Differential Expression Table - Differential expression analysis between two groups of samples
  6. Volcano Plot - Plot the logFC and logP values resulting from a differential expression analysis
  7. MA Plot - Plot the logFC and average expression values resulting from a differential expression analysis
  8. Enrichr Links - Links to enrichment analysis results of the differentially expressed genes via Enrichr
  9. Gene Ontology Enrichment Analysis - Identifies Gene Ontology terms which are enriched in the differentially expressed genes
  10. Pathway Enrichment Analysis - Identifies biological pathways which are enriched in the differentially expressed genes
  11. Transcription Factor Enrichment Analysis - Identifies transcription factors whose targets are enriched in the differentially expressed genes
  12. Kinase Enrichment Analysis - Identifies protein kinases whose substrates are enriched in the differentially expressed genes
  13. miRNA Enrichment Analysis - Identifies miRNAs whose targets are enriched in the differentially expressed genes
  14. L1000CDS2 Query - Identifies small molecules which mimic or reverse a given differential gene expression signature
  15. L1000FWD Query - Projects signatures on a 2-dimensional visualization of the L1000 signature database

Results

1. Load Dataset

Here, the GEO dataset GSE71890 is loaded into the notebook. Expression data was quantified as gene-level counts using the ARCHS4 pipeline (Lachmann et al., 2017), available at http://amp.pharm.mssm.edu/archs4/.

In [2]:
# Load dataset
dataset = load_dataset(source='archs4', gse='GSE71890', platform='GPL11154')

# Preview expression data
preview_data(dataset)
        GSM1847225  GSM1847223  GSM1847221  GSM1847222  GSM1847224  GSM1847226
A1BG           163         245         183         206         218         243
A1CF             1           2           2           3           1           3
A2M           5450        4181        2720        3236        6152        6359
A2ML1            5           9           3           7           7           7
A2MP1            5          10           4           8          10          15

Table 1 | RNA-seq expression data. The table displays the first 5 rows of the quantified RNA-seq expression dataset. Rows represent genes, columns represent samples, and values show the number of mapped reads.

In [3]:
# Display metadata
display_metadata(dataset)
Sample_geo_accession  Sample Title  cell line
GSM1847225            BMI2          A375
GSM1847223            GFP3          A375
GSM1847221            GFP1          A375
GSM1847222            GFP2          A375
GSM1847224            BMI1          A375
GSM1847226            BMI3          A375

Table 2 | Sample metadata. The table displays the metadata associated with the samples in the RNA-seq dataset. Rows represent RNA-seq samples, columns represent metadata categories.

In [4]:
# Configure signatures
dataset['signature_metadata'] = {
    'Control vs Perturbation': {
        'A': ['GSM1847221', 'GSM1847222', 'GSM1847223'],
        'B': ['GSM1847224', 'GSM1847225', 'GSM1847226']
    }
}

# Generate signatures
for label, groups in dataset['signature_metadata'].items():
    signatures[label] = generate_signature(group_A=groups['A'], group_B=groups['B'], method='limma', dataset=dataset)

2. PCA

Principal Component Analysis (PCA) is a statistical technique used to identify global patterns in high-dimensional datasets. It is commonly used to explore the similarity of biological samples in RNA-seq datasets. To achieve this, gene expression values are transformed into Principal Components (PCs), a set of linearly uncorrelated features which represent the most relevant sources of variance in the data, and subsequently visualized using a scatter plot.

In [5]:
# Run analysis
results['pca'] = analyze(dataset=dataset, tool='pca', nr_genes=2500, normalization='logCPM', z_score='True')

# Display results
plot(results['pca'])

Figure 1 | Principal Component Analysis results. The figure displays an interactive, three-dimensional scatter plot of the first three Principal Components (PCs) of the data. Each point represents an RNA-seq sample. Samples with similar gene expression profiles are closer in the three-dimensional space. If provided, sample groups are indicated using different colors, allowing for easier interpretation of the results.
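
The transformation described above can be sketched outside the BioJupies helpers. The block below is a minimal illustration, not the implementation behind `analyze(tool='pca')`: it assumes that `normalization='logCPM'` means log2 counts per million with a pseudocount of 1, that `nr_genes` keeps the most variable genes, and that `z_score` standardizes each gene across samples. The functions `log_cpm` and `pca_scores` are hypothetical names, and the input is only the five genes previewed in Table 1, so the resulting components say nothing about the real dataset.

```python
import numpy as np

# First five genes of Table 1 (rows: A1BG, A1CF, A2M, A2ML1, A2MP1).
# Columns follow the Table 1 order: GSM1847225, GSM1847223, GSM1847221,
# GSM1847222, GSM1847224, GSM1847226.
counts = np.array([
    [163, 245, 183, 206, 218, 243],
    [1, 2, 2, 3, 1, 3],
    [5450, 4181, 2720, 3236, 6152, 6359],
    [5, 9, 3, 7, 7, 7],
    [5, 10, 4, 8, 10, 15],
], dtype=float)


def log_cpm(counts, pseudocount=1.0):
    """Convert a genes x samples count matrix to log2 counts per million.

    The pseudocount of 1 is an assumption made for this sketch.
    """
    library_sizes = counts.sum(axis=0)  # total counts per sample
    cpm = counts / library_sizes * 1e6
    return np.log2(cpm + pseudocount)


def pca_scores(counts, n_genes=2500, n_components=3):
    """Return sample coordinates on the leading PCs and their variance share."""
    expression = log_cpm(counts)

    # Keep the most variable genes (assumed meaning of nr_genes).
    top = np.argsort(expression.var(axis=1))[::-1][:n_genes]
    selected = expression[top]

    # Z-score each gene across samples; constant genes are left at zero.
    mean = selected.mean(axis=1, keepdims=True)
    std = selected.std(axis=1, keepdims=True)
    std[std == 0] = 1.0
    z = (selected - mean) / std

    # PCA via SVD of the samples x genes matrix (columns are already centered).
    u, s, _ = np.linalg.svd(z.T, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, explained


scores, explained = pca_scores(counts)
print(scores.shape)  # one row per sample, one column per component
print(explained)     # fraction of variance captured by each component
```

Each row of `scores` corresponds to one sample, so the three columns could be used as the x, y, and z coordinates of a scatter plot like the one in Figure 1.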


3. Clustergrammer

Clustergrammer is a web-based tool for visualizing and analyzing high-dimensional data as interactive, hierarchically clustered heatmaps. It is commonly used to explore the similarity between samples in an RNA-seq dataset. In addition to identifying clusters of samples, it allows users to identify the genes that contribute to the clustering.

In [6]:
# Run analysis
results['clustergrammer'] = analyze(dataset=dataset, tool='clustergrammer', nr_genes=2500, normalization='logCPM', z_score='True')

# Display results
plot(results['clustergrammer'])

Figure 2 | Clustergrammer analysis. The figure contains an interactive heatmap displaying gene expression for each sample in the RNA-seq dataset. Every row of the heatmap represents a gene, every column represents a sample, and every cell displays normalized gene expression values. The heatmap additionally features color bars beside each column which represent prior knowledge of each sample, such as the tissue of origin or experimental treatment.
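
To make the idea of hierarchical clustering concrete, the sketch below performs agglomerative clustering with average linkage on cosine distances. It is a generic illustration and not Clustergrammer's code; the choice of cosine distance and average linkage is an assumption here. The function names and the four toy profiles (`s1` to `s4`) are hypothetical and are not taken from GSE71890.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two expression profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(profiles):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[name] for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for left, right in combinations(clusters, 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            pairs = [cosine_distance(profiles[a], profiles[b])
                     for a in left for b in right]
            dist = sum(pairs) / len(pairs)
            if best is None or dist < best[0]:
                best = (dist, left, right)
        dist, left, right = best
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(left + right)
        merges.append((left, right, dist))
    return merges


# Hypothetical toy profiles: s1/s2 are similar, s3/s4 are similar.
profiles = {
    "s1": [1.0, 0.0, 0.0],
    "s2": [0.9, 0.1, 0.0],
    "s3": [0.0, 1.0, 0.0],
    "s4": [0.0, 0.8, 0.2],
}

merges = average_linkage(profiles)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 4))
```

The merge history is what a dendrogram draws: early merges at small distances join similar samples, and the final merge joins the two most dissimilar groups. A heatmap such as Figure 2 orders its rows and columns according to such a tree.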


4. Library Size Analysis

To quantify gene expression in an RNA-seq dataset, reads generated in the sequencing step are mapped to a reference genome and then aggregated into numeric gene counts. Due to experimental variation and random technical noise, samples in an RNA-seq dataset often have variable amounts of total RNA. Library size analysis calculates and displays the total number of reads mapped for each sample in the RNA-seq dataset, which helps identify outlying samples and assess the overall quality of the data.

In [7]:
# Run analysis
results['library_size_analysis'] = analyze(dataset=dataset, tool='library_size_analysis')

# Display results
plot(results['library_size_analysis'])

Figure 3 | Library Size Analysis results. The figure contains an interactive bar chart which displays the total number of reads mapped to each RNA-seq sample in the dataset. Additional information for each sample is available by hovering over the bars. If provided, sample groups are indicated using different colors, allowing for easier interpretation of the results.
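
The calculation behind a library size chart is a per-sample column sum. The sketch below shows it on the five genes previewed in Table 1, so the totals are partial sums for illustration only and not the true library sizes of these samples. The helper names `library_sizes` and `flag_low_depth`, and the cutoff of half the median, are assumptions made for this example rather than part of BioJupies.

```python
from statistics import median

# Sample order and counts as previewed in Table 1 (first five genes only).
samples = ["GSM1847225", "GSM1847223", "GSM1847221",
           "GSM1847222", "GSM1847224", "GSM1847226"]
rows = {
    "A1BG": [163, 245, 183, 206, 218, 243],
    "A1CF": [1, 2, 2, 3, 1, 3],
    "A2M": [5450, 4181, 2720, 3236, 6152, 6359],
    "A2ML1": [5, 9, 3, 7, 7, 7],
    "A2MP1": [5, 10, 4, 8, 10, 15],
}


def library_sizes(samples, rows):
    """Sum the counts of every gene for each sample."""
    totals = dict.fromkeys(samples, 0)
    for gene_counts in rows.values():
        for sample, count in zip(samples, gene_counts):
            totals[sample] += count
    return totals


def flag_low_depth(sizes, min_fraction=0.5):
    """List samples whose total falls below a fraction of the median total.

    The 0.5 default is an arbitrary illustrative cutoff.
    """
    cutoff = min_fraction * median(sizes.values())
    return [sample for sample, total in sizes.items() if total < cutoff]


sizes = library_sizes(samples, rows)
for sample, total in sizes.items():
    print(sample, total)
print("Low-depth samples:", flag_low_depth(sizes))
```

On a full count matrix, the same sums give the bar heights shown in Figure 3, and a sample far below the others would be a candidate for closer inspection.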


5. Differential Expression Table

Gene expression signatures are alterations in the patterns of gene expression that occur as a result of cellular perturbations such as drug treatments, gene knock-downs or diseases. They can be quantified using differential gene expression (DGE) methods, which compare gene expression between two groups of samples to identify genes whose expression is significantly altered in the perturbation. The signature table is used to interactively display the results of such analyses.
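
As a rough illustration of the quantities a signature table reports, the sketch below computes a naive log2 fold change between two groups and a Benjamini-Hochberg adjustment of a list of p-values. It is not limma: the notebook's `generate_signature(method='limma')` fits moderated linear models on normalized data, whereas this example uses raw counts with a pseudocount and no normalization, so its numbers will not match the table. Treating the FDR column as a Benjamini-Hochberg adjustment is an assumption, and both function names are hypothetical.

```python
import math


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """Mean log2 expression of group B minus mean log2 expression of group A."""
    mean_a = sum(math.log2(c + pseudocount) for c in group_a) / len(group_a)
    mean_b = sum(math.log2(c + pseudocount) for c in group_b) / len(group_b)
    return mean_b - mean_a


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# A2M counts from Table 1, grouped as in dataset['signature_metadata'].
a2m_group_a = [2720, 3236, 4181]  # GSM1847221, GSM1847222, GSM1847223
a2m_group_b = [6152, 5450, 6359]  # GSM1847224, GSM1847225, GSM1847226

print("A2M naive log2 fold change:", log2_fold_change(a2m_group_a, a2m_group_b))

# Hypothetical p-values, used only to exercise the adjustment.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A positive fold change means higher expression in group B (the BMI samples) than in group A (the GFP samples), which is the sign convention assumed in this sketch.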

In [8]:
# Initialize results
results['signature_table'] = {}

# Loop through signatures
for label, signature in signatures.items():

    # Run analysis
    results['signature_table'][label] = analyze(signature=signature, tool='signature_table', signature_label=label)

    # Display results
    plot(results['signature_table'][label])
Gene logFC AveExpr P-value FDR
*FN1 3.23 7.88 2.197613e-12 7.743947e-08
*PXDN 4.32 4.85 1.817462e-11 3.202186e-07
*IGFBP7 2.07 5.67 2.511680e-10 2.950219e-06
*EEF1A2 3.22 7.18 1.124937e-09 9.700912e-06
*TSPAN18 4.43 2.09 1.410753e-09 9.700912e-06
*CRLF1 4.02 5.19 1.651781e-09 9.700912e-06
*SLC14A1 2.61 2.74 3.106124e-09 1.468173e-05
*TMEM59L 3.50 4.21 3.333158e-09 1.468173e-05
*CPE 1.51 5.34 4.790491e-09 1.612316e-05
*AKR1C2 2.02 4.91 4.997100e-09 1.612316e-05
*IGFBP3 1.27 6.02 5.372942e-09 1.612316e-05
*SEL1L3 1.29 6.26 5.490605e-09 1.612316e-05
*FAM20A 2.26 3.61 6.318753e-09 1.625332e-05
*ACAN 2.49 4.78 6.457417e-09 1.625332e-05
*APLP2 0.81 7.93 8.917144e-09 1.924028e-05
*KIAA0040 1.68 5.63 9.281409e-09 1.924028e-05
*EFEMP1 2.11 3.56 9.282161e-09 1.924028e-05
*CCDC88C 3.37 2.18 1.015388e-08 1.987791e-05
*NFASC 2.10 3.75 1.082482e-08 2.007605e-05
*PMEPA1 1.83 6.97 1.200677e-08 2.079352e-05
*BMP7 2.95 3.87 1.239185e-08 2.079352e-05
*NXN 2.98 2.80 1.308980e-08 2.096629e-05
*TIMP3 0.78 9.49 1.645655e-08 2.521287e-05
*PTPRU 1.97 4.24 1.947034e-08 2.858732e-05
*FSTL1 1.56 5.23 2.118124e-08 2.985538e-05
*UNC13A 2.14 5.06 2.650973e-08 3.446232e-05
*EMP1 0.88 7.32 2.730127e-08 3.446232e-05
*TP53I11 2.49 4.66 2.738364e-08 3.446232e-05
*PCSK6 2.41 2.92 3.185111e-08 3.737467e-05
*CEMIP 1.18 9.07 3.187430e-08 3.737467e-05
*DAAM2 1.31 5.10 3.309893e-08 3.737467e-05
*CXCL1 2.55 3.29 3.449975e-08 3.737467e-05
*EGR3 1.28 4.91 3.500097e-08 3.737467e-05
*CRABP2 3.42 1.64 3.810556e-08 3.949305e-05
*SPARC 0.60 10.88 4.231747e-08 4.184310e-05
*HTRA3 2.59 2.32 4.274793e-08 4.184310e-05
*KCNN1 2.16 2.13 4.843415e-08 4.506690e-05
*TGFBI 0.89 8.37 4.899408e-08 4.506690e-05
*HTRA1 0.84 6.38 4.987824e-08 4.506690e-05
*MRPS6 1.01 6.39 5.120067e-08 4.510523e-05
*ABLIM1 1.20 5.49 5.384404e-08 4.627698e-05
*VIT 3.18 1.83 6.219207e-08 4.899394e-05
*NPTX2 3.50 3.01 6.221977e-08 4.899394e-05
*PRRX2 2.54 2.40 6.463037e-08 4.899394e-05
*ALX4 2.65 3.18 6.477435e-08 4.899394e-05
*RAB3B 2.13 2.03 6.528314e-08 4.899394e-05
*VAT1L 1.24 4.17 6.534750e-08 4.899394e-05
*LMO7 1.31 5.44 6.955946e-08 5.106534e-05
*MYRF 1.81 4.14 7.223741e-08 5.194902e-05
*COL5A1 1.75 5.41 7.831904e-08 5.507334e-05
*SYNJ2 1.07 5.69 8.003316e-08 5.507334e-05
*PIEZO2 2.71 0.93 8.127061e-08 5.507334e-05
*PODXL2 1.76 4.06 8.492576e-08 5.646441e-05
*NTNG2 2.75 1.84 9.040039e-08 5.899128e-05
*SOX8 2.79 3.42 9.522594e-08 6.029799e-05
*SEPT3 1.88 3.42 9.582517e-08 6.029799e-05
*NES 1.15 7.68 1.005729e-07 6.217520e-05
*CXCL8 4.11 4.25 1.127233e-07 6.848526e-05
*ELFN2 3.96 1.28 1.222553e-07 6.940196e-05
*KIF1A 1.42 4.77 1.226007e-07 6.940196e-05
*A2M 1.00 7.43 1.229127e-07 6.940196e-05
*FST 1.07 7.45 1.265494e-07 6.940196e-05
*GFPT2 0.68 6.29 1.268634e-07 6.940196e-05
*MICAL2 1.88 2.62 1.273329e-07 6.940196e-05
*MGP 1.59 6.57 1.288063e-07 6.940196e-05
*MYO1D 1.12 4.74 1.301556e-07 6.940196e-05
*SUSD5 0.92 4.82 1.350380e-07 6.940196e-05
*ADD2 1.35 5.18 1.362029e-07 6.940196e-05
*DNAH9 -1.84 4.09 1.377430e-07 6.940196e-05
*FOXF1 2.23 2.19 1.378664e-07 6.940196e-05
*EPAS1 1.85 2.58 1.411283e-07 6.985712e-05
*RCN3 2.03 3.45 1.444392e-07 6.985712e-05
*SCARA3 3.07 2.16 1.447179e-07 6.985712e-05
*LAMC3 3.35 1.31 1.581740e-07 7.424966e-05
*CD82 1.95 3.06 1.583064e-07 7.424966e-05
*MYO10 1.10 4.55 1.601389e-07 7.424966e-05
*STC1 1.65 2.91 1.759590e-07 8.052524e-05
*INHBA 1.44 7.91 1.832732e-07 8.279718e-05
*PDZD4 1.94 3.03 1.933798e-07 8.481446e-05
*STMN3 1.70 4.54 1.950581e-07 8.481446e-05
*TRPA1 2.24 1.84 2.008146e-07 8.481446e-05
*GPRC5C 3.44 0.32 2.015463e-07 8.481446e-05
*GNAS -0.64 11.80 2.019302e-07 8.481446e-05
*FGFR1 1.09 5.49 2.035567e-07 8.481446e-05
*LINGO1 2.70 1.89 2.045868e-07 8.481446e-05
*LAD1 1.62 4.74 2.111162e-07 8.492161e-05
*SORCS2 2.44 2.85 2.132867e-07 8.492161e-05
*GNE -0.73 7.97 2.142520e-07 8.492161e-05
*CA12 2.69 1.16 2.144850e-07 8.492161e-05
*ST8SIA4 -1.33 3.60 2.218706e-07 8.492422e-05
*NNMT 0.96 5.92 2.246819e-07 8.492422e-05
*FLI1 1.75 3.18 2.274192e-07 8.492422e-05
*TPM1 0.83 6.15 2.278768e-07 8.492422e-05
*TUBB4A 1.79 4.12 2.283140e-07 8.492422e-05
*ABI3BP 1.74 3.10 2.289517e-07 8.492422e-05
*ID1 1.42 6.23 2.404384e-07 8.739073e-05
*SSTR5 6.10 -2.74 2.407644e-07 8.739073e-05
*PROCR 0.80 5.57 2.430414e-07 8.739073e-05
*PLPP4 2.44 1.37 2.457365e-07 8.746729e-05
*NRCAM 1.64 6.07 2.521544e-07 8.820500e-05