#!/usr/bin/env python
# coding: utf-8

# # An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study
#
# _Zichen Wang_¹ and _Avi Ma'ayan_¹*
#
# ¹Department of Pharmacology and Systems Therapeutics;
# BD2K-LINCS Data Coordination and Integration Center;
# Mount Sinai Knowledge Management Center for Illuminating the Druggable Genome;
# Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029 USA
#
# *Correspondence: [avi.maayan@mssm.edu](mailto:avi.maayan@mssm.edu)
#
# ---
# ## Abstract
#
# RNA-seq analysis is becoming a standard method for global gene expression profiling. However, performing RNA-seq analysis with open, standard pipelines remains challenging for non-experts because of the large size of the raw data files and the hardware requirements of the alignment step. Here we introduce a reproducible open source RNA-seq pipeline delivered as an IPython notebook and a Docker image. The pipeline uses state-of-the-art tools and can run on various platforms with minimal configuration overhead. The pipeline enables the extraction of knowledge from typical RNA-seq studies by generating interactive principal component analysis (PCA) and hierarchical clustering (HC) plots, performing enrichment analyses against over 90 gene set libraries, and obtaining lists of small molecules that are predicted to either mimic or reverse the observed changes in mRNA expression. We apply the pipeline to a recently published RNA-seq dataset collected from human neuronal progenitors infected with the Zika virus (ZIKV). In addition to confirming the presence of cell cycle genes among the genes that are downregulated by ZIKV, our analysis uncovers a significant overlap between the upregulated genes and genes that, when knocked out in mice, induce defects in brain morphology.
# This result potentially points to the molecular processes associated with the microcephaly phenotype observed in newborns of mothers infected with the virus during pregnancy. In addition, our analysis predicts small molecules that can either mimic or reverse the expression changes induced by ZIKV. The IPython notebook and Docker image are freely available at: http://nbviewer.jupyter.org/github/maayanlab/Zika-RNAseq-Pipeline/blob/master/Zika.ipynb and https://hub.docker.com/r/maayanlab/zika/
# ### Keywords
#
# Systems biology, bioinformatics pipeline, gene expression analysis, RNA-seq
#
# ---
#
# ### Introduction
#
# The increase in awareness about the irreproducibility of scientific research requires the development of methods that make experimental and computational protocols easily repeatable and transparent [[1]](#ref1). The advent of interactive notebooks for data analysis pipelines significantly enhances the recording and sharing of data, source code, and figures [[2]](#ref2). In a subset of recent publications, an interactive notebook was published alongside the customary manuscript [[3]](#ref3). Similarly, here we present an interactive IPython notebook (http://nbviewer.jupyter.org/github/maayanlab/Zika-RNAseq-Pipeline/blob/master/Zika.ipynb) that serves as a tutorial for performing a standard RNA-seq pipeline. The IPython notebook pipeline provides scripts (http://dx.doi.org/10.5281/zenodo.56311) that process the raw data into interactive figures and permit other downstream analyses, so that others can quickly and properly repeat our analysis as well as extract knowledge from their own data. As an example, we applied the pipeline to RNA-seq data from a recent publication where human induced pluripotent stem cells were differentiated to neuronal progenitors and then infected with Zika virus (ZIKV) [[4]](#ref4).
# The aim of the study was to begin to understand the molecular mechanisms underlying the devastating microcephaly phenotype observed in newborns of mothers infected with the virus during pregnancy.
#
# ### Methods and results
#
# The first publicly available study profiling gene expression changes after ZIKV infection of human cells was deposited into NCBI's Gene Expression Omnibus (GEO) in March 2016 under accession number GSE78711. The raw data are available from the Sequence Read Archive (SRA) under study accession SRP070895 (ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP070/SRP070895/). In this study, gene expression was measured by RNA-seq in duplicate on two platforms: MiSeq and NextSeq [[4]](#ref4). The total number of samples is eight, with four untreated samples and four infected samples. We first downloaded the raw sequencing files from SRA and then converted them to FASTQ files. The quality of the RNA-Seq reads was assessed using FastQC [[5]](#ref5). The reports generated by FastQC are in HTML format and can be accessed through hyperlinks from the IPython notebook. The reads in the FASTQ files were aligned to the human genome with Spliced Transcripts Alignment to a Reference (STAR) [[6]](#ref6). STAR is a leading aligner that accomplishes the alignment step faster and more accurately than other current alternatives [[6]](#ref6). We next applied featureCounts [[7]](#ref7) to assign reads to genes, and then applied the edgeR Bioconductor package [[8]](#ref8) to compute counts per million (CPM) and reads per kilobase per million mapped reads (RPKM). The next steps are performed in Python within the IPython notebook. We first filtered out genes that are not expressed or lowly expressed. Subsequently, we performed principal component analysis (PCA) (Fig. 1). The PCA plots show that the samples cluster by infected vs. control cells, but also by platform. Next, we visualized the 800 genes with the largest variance using an interactive hierarchical clustering (HC) plot (Fig. 2).
# This analysis separates the groups of genes that are differentially expressed by infected vs. control from those that are differential by platform. The visualization of the clusters is implemented with an interactive external web-based data visualization tool called clustergrammer (http://amp.pharm.mssm.edu/clustergrammer/). Clustergrammer provides interactive searching, sorting and zoom capabilities.
#
# The next step is to identify the differentially expressed genes (DEG) between the two conditions. This is achieved with a unique method we developed called the Characteristic Direction (CD) [[9]](#ref9). The CD method is a multivariate method that we have previously demonstrated to outperform other leading methods that compute differential expression between two conditions [[9]](#ref9). Once we have ranked the lists of DEG, we submit these for signature analysis using two tools: Enrichr [[10]](#ref10) and L1000CDS2 [[11]](#ref11). Enrichr queries the up and down gene sets against over 180,000 annotated gene sets belonging to 90 gene set libraries covering pathway databases, ontologies, disease databases, and more [[10]](#ref10). The results from this enrichment analysis confirm that the downregulated genes after ZIKV infection are enriched for genes involved in cell cycle-related processes (Fig. 3a). These genes are enriched for targets of the transcription factors E2F4 and FOXM1 (Fig. 3b). Both transcription factors are known to regulate cell proliferation and play central roles in many cancers. The downregulation of cell cycle genes was already reported in the original publication; nevertheless, we obtained more interesting results for the enriched terms that appeared most significant for the upregulated genes. In particular, the top two terms from the Mouse Genome Informatics (MGI) Mammalian Phenotype Level 4 library are abnormal nervous system (MP0003861) and abnormal brain morphology (MP0002152) (Table S1).
# This library associates gene knockouts in mice with mammalian phenotypes. These enriched terms highlight a short list of genes that potentially link ZIKV infection with the observed microcephaly phenotype. Finally, to identify small molecules that can potentially either reverse or mimic ZIKV-induced gene expression changes, we query the ZIKV-induced signatures against the LINCS L1000 data. For this, we utilize L1000CDS2 [[11]](#ref11), a search engine that prioritizes small molecules given a gene expression signature as input. L1000CDS2 contains 30,000 significant signatures that were processed from the LINCS L1000 data with the CD method. The results suggest small molecules that could be tested in follow-up studies in human cells for potential efficacy against ZIKV (Table S2).
#
# To ensure the reproducibility of the computational environment used for the whole RNA-Seq pipeline, we packaged all the software components used in this tutorial, including the command line tools, R packages, and Python packages, into a Docker image. This Docker image is made publicly available at https://hub.docker.com/r/maayanlab/zika/. The Docker image was created based on the specifications outlined in the official IPython Scipy Stack image (https://hub.docker.com/r/IPython/scipystack/). The additional command line tools, R scripts, and Python packages together with their dependencies were compiled and installed into the Docker image. The RNA-Seq pipeline Docker image was deployed onto our Mesos cluster, which allows users to run the IPython notebook interactively. The Docker image can also be downloaded and executed on local computers and servers, or deployed in the cloud if users have access to cloud provider services with a Docker Toolbox installed (https://www.docker.com/products/docker-toolbox). We also provide detailed instructions on how to download and execute the Docker image (https://hub.docker.com/r/maayanlab/zika/).
#
# The ‘Dockerization’ of the RNA-Seq pipeline facilitates reproducibility of the pipeline at the software level because the Docker image ensures that all versions of the software components are consistent and static. Dockerization also helps users to handle the complex installation of many dependencies required for the computational pipeline. Moreover, the Docker image can be executed on a single computer, on clusters/servers, and in the cloud. The only limitation of having a Docker image is that it prevents users from adding or altering steps that require additional software components and packages. However, advanced users can build their own Docker images based on our initial image to customize it for their needs.

# In[1]:

import os
import numpy as np
import pandas as pd

# Below we assign some global variables that will be used across the rest of the notebook.
# **Please change these variables accordingly if you intend to use this for other studies.**

# In[2]:

# The URL for the SRA study (project), usually in a SRPxxxxx folder including several SRRxxxxx folders (samples)
os.environ['FTP_URL'] = 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP070/SRP070895/'
# The working directory to store all the sequencing data; it will be created if it does not exist
os.environ['WORKDIR'] = 'data/Zika/'
# The directory for the reference genome
os.environ['GENOMEDIR'] = 'genomes/Homo_sapiens/UCSC/hg19'

# In[3]:

# Download the SRA files
# Comment out these two commands after the first run to avoid downloading the files every time.
get_ipython().system('mkdir -p $WORKDIR')
get_ipython().system('wget -r $FTP_URL --no-parent -nH --cut-dirs=8 -P $WORKDIR')

# In[4]:

## Examine the downloaded SRA files
get_ipython().system('ls -lh $WORKDIR/SRR*/*.sra')

# ### The downloaded SRA files are next processed by following these steps:
# 1. `fastq-dump` in the SRA-toolkit to generate .fastq files
# 2.
#    `FastQC` [[5]](#ref5) to perform quality control and generate a QC report for the input RNA-seq data
# 3. `STAR` [[6]](#ref6) for the read alignment
# 4. `featureCounts` [[7]](#ref7) for assigning reads to genes
# 5. `edgeR` Bioconductor package [[8]](#ref8) to compute CPM and RPKM
# ---
# Steps 1-4 are processed by this bash script [`analyze_sra.sh`](https://github.com/MaayanLab/Zika-RNAseq-Pipeline/blob/master/analyze_sra.sh). This bash script can take command line arguments specifying the location of the reference genome and working directory for the SRA files.

# In[5]:

get_ipython().system('bash analyze_sra.sh -h')

# We run the bash script by specifying the working directory and the genome directory, and pipe the log into an `analyze_sra.log` file

# In[6]:

get_ipython().system('bash analyze_sra.sh -w $WORKDIR -g $GENOMEDIR | tee analyze_sra.log')

# Step 5 is done with this R script [`normalize.R`](https://github.com/MaayanLab/Zika-RNAseq-Pipeline/blob/master/normalize.R)

# In[7]:

get_ipython().system('Rscript normalize.R $WORKDIR')

# In[8]:

## We can examine the QC reports from the FastQC program to evaluate the quality of the data
from IPython.display import FileLinks
FileLinks(os.path.join(os.environ['WORKDIR'], 'fastQC_output'), included_suffixes=['.html'])

# In[9]:

## Check the alignment stats
## This will output the first 10 lines of all summary files from the featureCounts folder
get_ipython().system('head $WORKDIR/featureCount_output/*.summary')

# After you have successfully completed the above steps, you can start to analyze the processed gene expression matrix in Python

# In[10]:

## Load the expression matrix
expr_df = pd.read_csv(os.path.join(os.environ['WORKDIR'], 'repCpmMatrix_featureCounts.csv'))
expr_df = expr_df.set_index(expr_df.columns[0])
expr_df.head()

# In[11]:

print(expr_df.shape)

# In[12]:

## Filter out non-expressed genes
expr_df = expr_df.loc[expr_df.sum(axis=1) > 0, :]
print(expr_df.shape)

## Filter out lowly expressed genes
mask_low_vals = (expr_df > 0.3).sum(axis=1) > 2
expr_df = expr_df.loc[mask_low_vals, :]
print(expr_df.shape)

# Obtain more metadata about the samples by clicking on the `RunInfo Table` button on the SRP page available at this URL: [http://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP070895](http://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP070895). Clicking on this button downloads a spreadsheet with additional metadata. Next, read this file and extract the relevant variables from it.

# In[13]:

meta_df = pd.read_csv(os.path.join(os.environ['WORKDIR'], 'SraRunTable.txt'), sep='\t').set_index('Run_s')
print(meta_df.shape)
# re-order the index to make it the same as the columns of expr_df
meta_df = meta_df.loc[expr_df.columns]
meta_df

# Now that we have everything set up, the first thing to do is to generate PCA plots to observe whether the samples cluster
# as expected: controls with controls, and treatments with treatments.

# In[14]:

import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 ## Output Type 3 (Type3) or Type 42 (TrueType)
rcParams['font.sans-serif'] = 'Arial'
# ignore FutureWarning that may pop up when plotting
import warnings
warnings.filterwarnings("ignore")
import urllib3
urllib3.disable_warnings()

# In[15]:

from IPython.display import HTML, display
# to display a hyperlink as an <a> tag in output cells
def display_link(url):
    raw_html = '<a href="%s" target="_blank">%s</a>' % (url, url)
    return display(HTML(raw_html))

# You can obtain the script [`RNAseq`](https://github.com/MaayanLab/Zika-RNAseq-Pipeline/blob/master/RNAseq.py) from this repo.
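
# To give a sense of what a PCA step on an expression matrix involves, here is a minimal sketch using NumPy only. It is not the implementation in `RNAseq.py`: the helper name `pca_sketch`, the toy count matrix, and the choice to log-transform and then z-score each gene before the decomposition are all assumptions made for illustration.

```python
import numpy as np

def pca_sketch(expr, n_components=3):
    """Hypothetical helper: PCA of a genes x samples matrix.

    Returns the percent variance explained by the leading components and
    the coordinates of each sample on those components.
    """
    logged = np.log1p(expr)
    # z-score each gene (row); guard against genes with zero variance
    mean = logged.mean(axis=1, keepdims=True)
    std = logged.std(axis=1, keepdims=True)
    std[std == 0] = 1.0
    z = (logged - mean) / std
    # Treat samples as observations (samples x genes) and center each gene
    X = z.T
    X = X - X.mean(axis=0)
    # Singular value decomposition gives the principal axes and scores
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    pct_variance = 100.0 * S ** 2 / np.sum(S ** 2)
    coords = U * S
    return pct_variance[:n_components], coords[:, :n_components]

# Toy data (invented): 50 genes x 8 samples, with 10 genes shifted in the last 4 samples
rng = np.random.RandomState(0)
toy = rng.poisson(5, size=(50, 8)).astype(float)
toy[:10, 4:] += 100
pct, coords = pca_sketch(toy)
print(pct)
print(coords.shape)
```

# The returned coordinates have one row per sample, so they can be attached to a metadata table and colored by any sample attribute.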

# In[16]:

import RNAseq

# In[17]:

# plot PCA
get_ipython().run_line_magic('matplotlib', 'inline')
RNAseq.PCA_plot(expr_df.values, meta_df['infection_status_s'],
                standardize=2, log=True,
                show_text=False, sep=' ', legend_loc='upper right')

# The PCA plot below is the same as above, except that we color the samples by platform (two Illumina sequencing machines were used: [Illumina MiSeq](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL15520) for paired-end and [Illumina NextSeq 500](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL18573) for single-end).

# In[18]:

d_layout_platform = {'PAIRED': 'MiSeq', 'SINGLE': 'NextSeq 500'}
meta_df['platform'] = [d_layout_platform[l] for l in meta_df['LibraryLayout_s']]
RNAseq.PCA_plot(expr_df.values, meta_df['platform'],
                standardize=2, log=True,
                show_text=False, sep=' ', legend_loc='upper right')

# We can also plot a 3D interactive PCA plot using [plotly](https://plot.ly/), which has a nice integration with Jupyter notebooks.

# In[19]:

# Compute the coordinates of samples in the PCA space
variance_explained, pca_transformed = RNAseq.perform_PCA(expr_df.values, standardize=2, log=True)
# Bind x, y, z coordinates to meta_df
meta_df['x'] = pca_transformed[:,0]
meta_df['y'] = pca_transformed[:,1]
meta_df['z'] = pca_transformed[:,2]

# In[20]:

import plotly
plotly.offline.init_notebook_mode() # To embed plots in the output cell of the notebook
import plotly.graph_objs as go

conditions = meta_df['infection_status_s'].unique().tolist()
platforms = meta_df['platform'].unique().tolist()
SYMBOLS = ['circle', 'square']
COLORS = RNAseq.COLORS10

data = [] # To collect all Scatter3d instances
for (condition, platform), meta_df_sub in meta_df.groupby(['infection_status_s', 'platform']):
    # Iterate through samples grouped by condition and platform
    display_name = '%s, %s' % (condition, platform)
    # Initiate a Scatter3d instance for each group of samples specifying their coordinates
    # and display attributes including color, shape, size,
    # etc.
    trace = go.Scatter3d(
        x=meta_df_sub['x'],
        y=meta_df_sub['y'],
        z=meta_df_sub['z'],
        text=meta_df_sub.index,
        mode='markers',
        marker=dict(
            size=10,
            color=COLORS[conditions.index(condition)], # Color by infection status
            symbol=SYMBOLS[platforms.index(platform)], # Shaped by sequencing platforms
            opacity=.8,
        ),
        name=display_name,
    )
    data.append(trace)

# Configs for layout and axes
layout = dict(height=1000, width=1000,
              title='3D PCA plot for samples in Zika study',
              scene=dict(
                  xaxis=dict(title='PC1 (%.2f%% variance)' % variance_explained[0]),
                  yaxis=dict(title='PC2 (%.2f%% variance)' % variance_explained[1]),
                  zaxis=dict(title='PC3 (%.2f%% variance)' % variance_explained[2])
              )
)
fig = dict(data=data, layout=layout)
plotly.offline.iplot(fig)

# Alternatively, we can visualize the gene expression matrix using [Clustergrammer](http://amp.pharm.mssm.edu/clustergrammer/). Clustergrammer is a visualization tool that we developed to enable users and web-based applications to easily generate interactive and shareable clustergram-heatmap visualizations from matrices of data. In the following code, we display a subset of the expression matrix using the genes with the largest variance. We then log transform and z-score the expression matrix so that each gene (row) has a mean of zero and a standard deviation of one. We write the subset of the expression matrix into a text file, and then send this file in an HTTP `POST` request to the [API of Clustergrammer](http://amp.pharm.mssm.edu/clustergrammer/help#api). The API then responds with a URL to the interactive clustergram.
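
# The log transform and row-wise z-score described above can be checked on a toy matrix. This is a minimal sketch: the gene names, sample names, and values are invented for illustration, and using the population standard deviation (`ddof=0`) is a choice made here.

```python
import numpy as np
import pandas as pd

# Toy expression matrix (invented values): 3 genes x 4 samples
toy = pd.DataFrame(
    [[0.0, 1.0, 10.0, 100.0],
     [5.0, 5.0, 50.0, 50.0],
     [2.0, 4.0, 8.0, 16.0]],
    index=['geneA', 'geneB', 'geneC'],
    columns=['s1', 's2', 's3', 's4'])

# log(1 + x) keeps zeros finite and compresses large values
logged = np.log1p(toy)

# z-score each gene (row): subtract the row mean, divide by the row standard deviation
row_mean = logged.mean(axis=1)
row_std = logged.std(axis=1, ddof=0)
zscored = logged.sub(row_mean, axis=0).div(row_std, axis=0)

# Each row should now have mean ~0 and standard deviation ~1
print(zscored.mean(axis=1).round(6))
print(zscored.std(axis=1, ddof=0).round(6))
```

# After this step every gene contributes on the same scale, so the heatmap colors reflect relative changes across samples rather than absolute expression levels.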

# In[21]:

# Subset the expression DataFrame using top 800 genes with largest variance
variances = np.var(expr_df, axis=1)
# positional indices of genes sorted by decreasing variance
srt_idx = variances.values.argsort()[::-1]
expr_df_sub = expr_df.iloc[srt_idx].iloc[:800]
print(expr_df_sub.shape)
expr_df_sub.head()

# In[25]:

# Log transform and z-score standardize the data and write to a .txt file
expr_df_sub.index.name = ''
expr_df_sub = np.log1p(expr_df_sub)
expr_df_sub = expr_df_sub.apply(lambda x: (x-x.mean())/x.std(ddof=0), axis=1)
# prettify sample names
sample_names = ['-'.join([x, d_layout_platform[y], z]) for x,y,z in
                zip(meta_df['infection_status_s'], meta_df['LibraryLayout_s'], expr_df_sub.columns)]
expr_df_sub.columns = sample_names
expr_df_sub_file = os.path.join(os.environ['WORKDIR'], 'expression_matrix_top800_genes.txt')
expr_df_sub.to_csv(expr_df_sub_file, sep='\t')

# In[26]:

# POST the expression matrix to Clustergrammer and get the URL
import requests, json
clustergrammer_url = 'http://amp.pharm.mssm.edu/clustergrammer/matrix_upload/'
# open the file in a context manager so that it is closed after the upload
with open(expr_df_sub_file, 'rb') as f:
    r = requests.post(clustergrammer_url, files={'file': f})
link = r.text
display_link(link)

# We can also display the result in this notebook using `