Reference transcripts are updated on Jan 2018
## Download reference cDNA sequences
wget ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
mv Mus_musculus.GRCm38.cdna.all.fa MM_cDNA.fa
## Download reference ncRNA sequences
wget ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/ncrna/Mus_musculus.GRCm38.ncrna.fa.gz
mv Mus_musculus.GRCm38.ncrna.fa MM_ncRNA.fa
## Create Salmon index (k=31) [27922648.bc]
salmon index -t MM_all_RNA.fa -i ~/MM_all_RNA_Jun28_k31_salmon_index --type quasi -k 31
Illumina Universal Adaptor sequence: AGATCGGAAGAGC
## FastQC
fastqc -o . -f fastq --extract SRR5412186.fastq.gz -t 8
## rCorrector
perl ~/install/Rcorrector-master/run_rcorrector.pl -t 8 -s ./SRR5412186.fastq.gz
## Filter
python ~/FilterUncorrectabledSEfastq.py -i SRR5412186.cor.fq.gz -o filtered
## fastp
fastp -i filtered_SRR5412186.cor.fq -o filtered_SRR5412186_fastp.cor.fq \
-a AGATCGGAAGAGC -q 5 -c -p -w 10 \
-h filtered_SRR5412186_fastp.html -j filtered_SRR5412186_fastp.json \
-R "filtered_SRR5412186_fastp report"
## FastQC
fastqc -o . -f fastq --extract filtered_SRR5412186_fastp.cor.fq -t 10
## Mapping (k=31) ---> 88.5444% reads mapped [27923028.bc
salmon quant -i ~/MM_all_RNA_Jun28_k31_salmon_index -l A \
-r filtered_SRR5412186_fastp.cor.fq \
-o ~/filtered_SRR5412186_fastp_MM_all_RNA_Jun29_k31_salmon_quant
# FastQC
fastqc -o . -f fastq --extract SRR5412187.fastq.gz -t 10
## rCorrector
perl ~/install/Rcorrector-master/run_rcorrector.pl -t 10 -s ./SRR5412187.fastq.gz
## Filter
python ~/FilterUncorrectabledSEfastq.py -i SRR5412187.cor.fq.gz -o filtered
## fastp
fastp -i filtered_SRR5412187.cor.fq -o filtered_SRR5412187_fastp.cor.fq \
-a AGATCGGAAGAGC -q 5 -c -p -w 10 \
-h filtered_SRR5412187_fastp.html -j filtered_SRR5412187_fastp.json \
-R "filtered_SRR5412187_fastp report"
## FastQC
fastqc -o . -f fastq --extract filtered_SRR5412187_fastp.cor.fq -t 10
## Mapping (k=31) ---> 89.0693% reads mapped [27923029.bc]
salmon quant -i ~/MM_all_RNA_Jun28_k31_salmon_index -l A \
-r filtered_SRR5412187_fastp.cor.fq \
-o ~/filtered_SRR5412187_fastp_MM_all_RNA_Jun29_k31_salmon_quant
## FastQC
fastqc -o . -f fastq --extract SRR5412188.fastq.gz -t 10
## rCorrector
perl ~/install/Rcorrector-master/run_rcorrector.pl -t 10 -s ./SRR5412188.fastq.gz
## Filter
python ~/FilterUncorrectabledSEfastq.py -i SRR5412188.cor.fq.gz -o filtered
## fastp
fastp -i filtered_SRR5412188.cor.fq -o filtered_SRR5412188_fastp.cor.fq \
-a AGATCGGAAGAGC -q 5 -c -p -w 10 \
-h filtered_SRR5412188_fastp.html -j filtered_SRR5412188_fastp.json \
-R "filtered_SRR5412188_fastp report"
## FastQC
fastqc -o . -f fastq --extract filtered_SRR5412188_fastp.cor.fq -t 10
## Mapping (k=31) ---> 89.256% reads mapped
salmon quant -i ~/MM_all_RNA_Jun28_k31_salmon_index -l A \
-r filtered_SRR5412188_fastp.cor.fq \
-o ~/filtered_SRR5412188_fastp_MM_all_RNA_Jun29_k31_salmon_quant
## FastQC
fastqc -o . -f fastq --extract SRR5412189.fastq.gz -t 10
## rCorrector
perl ~/install/Rcorrector-master/run_rcorrector.pl -t 10 -s ./SRR5412189.fastq.gz
## Filter
python ~/FilterUncorrectabledSEfastq.py -i SRR5412189.cor.fq.gz -o filtered
## fastp
fastp -i filtered_SRR5412189.cor.fq -o filtered_SRR5412189_fastp.cor.fq \
-a AGATCGGAAGAGC -q 5 -c -p -w 10 \
-h filtered_SRR5412189_fastp.html -j filtered_SRR5412189_fastp.json \
-R "filtered_SRR5412189_fastp report"
## FastQC
fastqc -o . -f fastq --extract filtered_SRR5412189_fastp.cor.fq -t 10
## Mapping (k=31) ---> 89.2007% reads mapped
salmon quant -i ~/MM_all_RNA_Jun28_k31_salmon_index -l A \
-r filtered_SRR5412189_fastp.cor.fq \
-o ~/filtered_SRR5412189_fastp_MM_all_RNA_Jun29_k31_salmon_quant
## MultiQC command to summarize all FastQC and Salmon output logs in the current folder
multiqc .
## Import libraries
import pandas as pd
import os
os.chdir("/home/zhanglab1/ndong/Lymnaea_CNS_transcriptome_files/7_Interspecies_comparison/7c_Mouse")
## Define function
def extract_non0(salmon_output_filename, library_ID):
with open(salmon_output_filename, "r") as infile:
lib = pd.read_csv(infile, sep='\t')
lib_non0 = lib.loc[lib["TPM"]>0]
lib_non0_counts = lib_non0[["Name", "TPM"]]
print("There are", lib_non0_counts.shape[0], "transcripts with TPM>0 in the reads library", library_ID)
return(lib_non0_counts)
## Analyses
lib86 = extract_non0("./filtered_SRR5412186_fastp_MM_all_RNA_Jun29_k31_salmon_quant/quant.sf", "SRR5412186")
lib87 = extract_non0("./filtered_SRR5412187_fastp_MM_all_RNA_Jun29_k31_salmon_quant/quant.sf", "SRR5412187")
lib88 = extract_non0("./filtered_SRR5412188_fastp_MM_all_RNA_Jun29_k31_salmon_quant/quant.sf", "SRR5412188")
lib89 = extract_non0("./filtered_SRR5412189_fastp_MM_all_RNA_Jun29_k31_salmon_quant/quant.sf", "SRR5412189")
There are 72036 transcripts with TPM>0 in the reads library SRR5412186 There are 74997 transcripts with TPM>0 in the reads library SRR5412187 There are 73521 transcripts with TPM>0 in the reads library SRR5412188 There are 77844 transcripts with TPM>0 in the reads library SRR5412189
%get lib86 --from Python3
%get lib87 --from Python3
head(lib86, 10)
head(lib87, 10)
Name | TPM | |
---|---|---|
33 | ENSMUST00000103269.2 | 0.36996300 |
43 | ENSMUST00000185651.1 | 1.25688000 |
46 | ENSMUST00000103569.2 | 0.73992700 |
79 | ENSMUST00000177703.2 | 0.51339300 |
89 | ENSMUST00000103600.2 | 0.23959200 |
128 | ENSMUST00000103627.2 | 0.23959200 |
144 | ENSMUST00000181768.2 | 1.10703000 |
145 | ENSMUST00000183835.1 | 0.38827200 |
156 | ENSMUST00000103670.3 | 0.24074200 |
226 | ENSMUST00000189634.6 | 0.00700598 |
Name | TPM | |
---|---|---|
33 | ENSMUST00000103269.2 | 0.36996300 |
43 | ENSMUST00000185651.1 | 1.25688000 |
46 | ENSMUST00000103569.2 | 0.73992700 |
79 | ENSMUST00000177703.2 | 0.51339300 |
89 | ENSMUST00000103600.2 | 0.23959200 |
128 | ENSMUST00000103627.2 | 0.23959200 |
144 | ENSMUST00000181768.2 | 1.10703000 |
145 | ENSMUST00000183835.1 | 0.38827200 |
156 | ENSMUST00000103670.3 | 0.24074200 |
226 | ENSMUST00000189634.6 | 0.00700598 |
## Define function
def sorted_merged_table(df1, df2, df3, df4,
lib_name1, lib_name2, lib_name3, lib_name4):
non0_2 = df1.merge(df2, on='Name') # Create lookup merged table of all libraries
non0_3 = non0_2.merge(df3, on='Name')
non0_4 = non0_3.merge(df4, on='Name')
non0_4['Median'] = non0_4.median(axis=1) # Calculate median read count of each transcript across all libraries
non0_4_median_sorted = non0_4.sort_values(by="Median", ascending=False) # Sort transcripts by median read count
non0_4_median_sorted.columns = ("Name", lib_name1, lib_name2, lib_name3, lib_name4, "Median") # Rename columns
return non0_4_median_sorted
## Create median-sorted lookup table
MM_sorted_merged = sorted_merged_table(lib86, lib87, lib88, lib89,
"SRR5412186", "SRR5412187", "SRR5412188", "SRR5412189")
print("Dataframe dimensions:", MM_sorted_merged.shape)
## Output non-0 transcript IDs to file
MM_sorted_merged["Name"].to_csv("./MM_6789_non0_Nov10.txt", index=None)
Dataframe dimensions: (54775, 6)
%get MM_sorted_merged --from Python3
head(MM_sorted_merged, 10)
Name | SRR5412186 | SRR5412187 | SRR5412188 | SRR5412189 | Median | |
---|---|---|---|---|---|---|
40 | ENSMUST00000082407.1 | 149567.00 | 158290.0 | 124023.00 | 127249.0 | 138408.00 |
50383 | ENSMUST00000082403.1 | 17884.90 | 36408.9 | 47544.30 | 33898.1 | 35153.50 |
50097 | ENSMUST00000187117.1 | 22286.90 | 21194.7 | 20448.70 | 23908.8 | 21740.80 |
36 | ENSMUST00000084013.1 | 18408.00 | 16610.2 | 17065.10 | 16259.7 | 16837.65 |
38 | ENSMUST00000082409.1 | 14061.50 | 15864.6 | 8649.28 | 14477.3 | 14269.40 |
50092 | ENSMUST00000189941.1 | 13124.50 | 12909.7 | 14215.90 | 11344.8 | 13017.10 |
52000 | ENSMUST00000227063.1 | 61348.20 | 13308.1 | 4878.67 | 12047.6 | 12677.85 |
50096 | ENSMUST00000188231.1 | 3812.45 | 12408.2 | 16769.20 | 12561.9 | 12485.05 |
42 | ENSMUST00000082402.1 | 12343.70 | 11872.5 | 12274.10 | 12106.7 | 12190.40 |
39 | ENSMUST00000082408.1 | 11906.00 | 10750.8 | 6576.71 | 11196.3 | 10973.55 |
MM_median_sorted_IDs = MM_sorted_merged["Name"]
## Load zebrafish Biomart dataset
library(biomaRt)
MM_ensembl = useMart("ensembl", dataset="mmusculus_gene_ensembl")
%get MM_median_sorted_IDs --from Python3
## Create getBM() query for converting transcript IDs to gene IDs
MM_biomart_output <- getBM(attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id', "description",
'external_gene_name', 'transcript_biotype', 'gene_biotype'),
filters = 'ensembl_transcript_id_version',
values = MM_median_sorted_IDs,
mart = MM_ensembl)
Batch submitting query [>------------------------------] 2% eta: 1m Batch submitting query [>------------------------------] 3% eta: 2m Batch submitting query [>------------------------------] 4% eta: 2m Batch submitting query [>------------------------------] 5% eta: 2m Batch submitting query [=>-----------------------------] 5% eta: 2m Batch submitting query [=>-----------------------------] 6% eta: 2m Batch submitting query [=>-----------------------------] 7% eta: 2m Batch submitting query [==>----------------------------] 8% eta: 2m Batch submitting query [==>----------------------------] 9% eta: 2m Batch submitting query [==>----------------------------] 10% eta: 2m Batch submitting query [==>----------------------------] 11% eta: 2m Batch submitting query [===>---------------------------] 12% eta: 2m Batch submitting query [===>---------------------------] 13% eta: 2m Batch submitting query [===>---------------------------] 14% eta: 2m Batch submitting query [====>--------------------------] 15% eta: 2m Batch submitting query [====>--------------------------] 16% eta: 2m Batch submitting query [====>--------------------------] 17% eta: 2m Batch submitting query [=====>-------------------------] 18% eta: 2m Batch submitting query [=====>-------------------------] 19% eta: 2m Batch submitting query [=====>-------------------------] 20% eta: 2m Batch submitting query [=====>-------------------------] 21% eta: 1m Batch submitting query [======>------------------------] 22% eta: 1m Batch submitting query [======>------------------------] 23% eta: 1m Batch submitting query [======>------------------------] 24% eta: 1m Batch submitting query [=======>-----------------------] 25% eta: 1m Batch submitting query [=======>-----------------------] 26% eta: 1m Batch submitting query [=======>-----------------------] 27% eta: 1m Batch submitting query [========>----------------------] 28% eta: 1m Batch submitting query [========>----------------------] 29% eta: 1m Batch submitting query [========>----------------------] 30% eta: 1m Batch submitting query [=========>---------------------] 31% eta: 1m Batch submitting query [=========>---------------------] 32% eta: 1m Batch submitting query [=========>---------------------] 33% eta: 1m Batch submitting query [=========>---------------------] 34% eta: 1m Batch submitting query [==========>--------------------] 35% eta: 1m Batch submitting query [==========>--------------------] 36% eta: 1m Batch submitting query [===========>-------------------] 37% eta: 1m Batch submitting query [===========>-------------------] 38% eta: 1m Batch submitting query [===========>-------------------] 39% eta: 1m Batch submitting query [===========>-------------------] 40% eta: 1m Batch submitting query [============>------------------] 41% eta: 1m Batch submitting query [============>------------------] 42% eta: 1m Batch submitting query [============>------------------] 43% eta: 1m Batch submitting query [=============>-----------------] 44% eta: 1m Batch submitting query [=============>-----------------] 45% eta: 1m Batch submitting query [=============>-----------------] 46% eta: 1m Batch submitting query [==============>----------------] 47% eta: 1m Batch submitting query [==============>----------------] 48% eta: 1m Batch submitting query [==============>----------------] 49% eta: 1m Batch submitting query [===============>---------------] 50% eta: 1m Batch submitting query [===============>---------------] 51% eta: 1m Batch submitting query [===============>---------------] 52% eta: 1m Batch submitting query [===============>---------------] 53% eta: 1m Batch submitting query [================>--------------] 54% eta: 1m Batch submitting query [================>--------------] 55% eta: 1m Batch submitting query [================>--------------] 56% eta: 49s Batch submitting query [=================>-------------] 57% eta: 48s Batch submitting query [=================>-------------] 58% eta: 47s Batch submitting query [=================>-------------] 59% eta: 46s Batch submitting query [==================>------------] 60% eta: 45s Batch submitting query [==================>------------] 61% eta: 44s Batch submitting query [==================>------------] 62% eta: 43s Batch submitting query [==================>------------] 63% eta: 42s Batch submitting query [===================>-----------] 64% eta: 41s Batch submitting query [===================>-----------] 65% eta: 40s Batch submitting query [===================>-----------] 65% eta: 39s Batch submitting query [====================>----------] 66% eta: 38s Batch submitting query [====================>----------] 67% eta: 37s Batch submitting query [====================>----------] 68% eta: 36s Batch submitting query [====================>----------] 69% eta: 35s Batch submitting query [=====================>---------] 70% eta: 34s Batch submitting query [=====================>---------] 71% eta: 33s Batch submitting query [=====================>---------] 72% eta: 32s Batch submitting query [======================>--------] 73% eta: 30s Batch submitting query [======================>--------] 74% eta: 29s Batch submitting query [======================>--------] 75% eta: 29s Batch submitting query [======================>--------] 75% eta: 28s Batch submitting query [=======================>-------] 76% eta: 27s Batch submitting query [=======================>-------] 77% eta: 26s Batch submitting query [=======================>-------] 78% eta: 25s Batch submitting query [========================>------] 79% eta: 24s Batch submitting query [========================>------] 80% eta: 23s Batch submitting query [========================>------] 81% eta: 22s Batch submitting query [========================>------] 82% eta: 21s Batch submitting query [=========================>-----] 83% eta: 20s Batch submitting query [=========================>-----] 84% eta: 19s Batch submitting query [=========================>-----] 85% eta: 18s Batch submitting query [=========================>-----] 85% eta: 17s Batch submitting query [==========================>----] 86% eta: 16s Batch submitting query [==========================>----] 87% eta: 15s Batch submitting query [==========================>----] 88% eta: 14s Batch submitting query [===========================>---] 89% eta: 13s Batch submitting query [===========================>---] 90% eta: 12s Batch submitting query [===========================>---] 91% eta: 10s Batch submitting query [===========================>---] 92% eta: 9s Batch submitting query [============================>--] 93% eta: 8s Batch submitting query [============================>--] 94% eta: 7s Batch submitting query [============================>--] 95% eta: 6s Batch submitting query [=============================>-] 95% eta: 5s Batch submitting query [=============================>-] 96% eta: 4s Batch submitting query [=============================>-] 97% eta: 3s Batch submitting query [=============================>-] 98% eta: 2s Batch submitting query [==============================>] 99% eta: 1s Batch submitting query [===============================] 100% eta: 0s
## Preview results
dim(MM_biomart_output)
head(MM_biomart_output, 20)
ensembl_transcript_id_version | ensembl_gene_id | external_gene_name | transcript_biotype | gene_biotype |
---|---|---|---|---|
ENSMUST00000000090.7 | ENSMUSG00000000088 | Cox5a | protein_coding | protein_coding |
ENSMUST00000001304.8 | ENSMUSG00000001270 | Ckb | protein_coding | protein_coding |
ENSMUST00000002518.8 | ENSMUSG00000054452 | Aes | protein_coding | protein_coding |
ENSMUST00000004072.9 | ENSMUSG00000003970 | Rpl8 | protein_coding | protein_coding |
ENSMUST00000004480.4 | ENSMUSG00000004366 | Sst | protein_coding | protein_coding |
ENSMUST00000004554.13 | ENSMUSG00000012848 | Rps5 | protein_coding | protein_coding |
ENSMUST00000004673.14 | ENSMUSG00000004558 | Ndrg2 | protein_coding | protein_coding |
ENSMUST00000005406.11 | ENSMUSG00000022892 | App | protein_coding | protein_coding |
ENSMUST00000006435.7 | ENSMUSG00000006273 | Atp6v1b2 | protein_coding | protein_coding |
ENSMUST00000006828.8 | ENSMUSG00000006651 | Aplp1 | protein_coding | protein_coding |
ENSMUST00000007708.13 | ENSMUSG00000007564 | Ppp2r1a | protein_coding | protein_coding |
ENSMUST00000008036.8 | ENSMUSG00000007892 | Rplp1 | protein_coding | protein_coding |
ENSMUST00000011896.6 | ENSMUSG00000011752 | Pgam1 | protein_coding | protein_coding |
ENSMUST00000013842.11 | ENSMUSG00000013698 | Pea15a | protein_coding | protein_coding |
ENSMUST00000014457.14 | ENSMUSG00000014313 | Cox6c | protein_coding | protein_coding |
ENSMUST00000016396.7 | ENSMUSG00000016252 | Atp5e | protein_coding | protein_coding |
ENSMUST00000016463.3 | ENSMUSG00000016319 | Slc25a5 | protein_coding | protein_coding |
ENSMUST00000016571.7 | ENSMUSG00000016427 | Ndufa1 | protein_coding | protein_coding |
ENSMUST00000017922.8 | ENSMUSG00000017778 | Cox7c | processed_transcript | protein_coding |
ENSMUST00000018470.9 | ENSMUSG00000018326 | Ywhab | protein_coding | protein_coding |
%get MM_biomart_output --from R
## Check dataframe dimension after import
print("Dataframe dimensions:", MM_biomart_output.shape, "\n")
## Change transcript ID column header to match TPM>0 read counts table for merging
MM_biomart_output_df = MM_biomart_output.rename(columns = {'ensembl_transcript_id_version':'Name'})
Loading required package: feather
Dataframe dimensions: (54187, 6)
## Merge Biomart output & read counts table sorted by median read count
MM_sorted_counts_BM = MM_sorted_merged.merge(MM_biomart_output_df, on='Name')
## Extract entries of protein-coding transcripts
MM_PC_transcripts = MM_sorted_counts_BM.loc[MM_sorted_counts_BM["transcript_biotype"].str.contains("protein_coding")]
## Check dataframe dimensions
print(MM_PC_transcripts.shape)
(28773, 11)
%get MM_PC_transcripts --from Python3
## Preview the dataframe
head(MM_PC_transcripts, 20)
## Extract to CSV
write.csv(head(MM_PC_transcripts, 20), file = "MM_top_20_PC_transcripts.csv")
Name | SRR5412186 | SRR5412187 | SRR5412188 | SRR5412189 | Median | ensembl_gene_id | description | external_gene_name | transcript_biotype | gene_biotype | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSMUST00000082407.1 | 149567.000 | 158290.000 | 124023.000 | 127249.000 | 138408.0000 | ENSMUSG00000064356 | mitochondrially encoded ATP synthase 8 [Source:MGI Symbol;Acc:MGI:99926] | mt-Atp8 | protein_coding | protein_coding |
3 | ENSMUST00000084013.1 | 18408.000 | 16610.200 | 17065.100 | 16259.700 | 16837.6500 | ENSMUSG00000065947 | mitochondrially encoded NADH dehydrogenase 4L [Source:MGI Symbol;Acc:MGI:102497] | mt-Nd4l | protein_coding | protein_coding |
4 | ENSMUST00000082409.1 | 14061.500 | 15864.600 | 8649.280 | 14477.300 | 14269.4000 | ENSMUSG00000064358 | mitochondrially encoded cytochrome c oxidase III [Source:MGI Symbol;Acc:MGI:102502] | mt-Co3 | protein_coding | protein_coding |
8 | ENSMUST00000082402.1 | 12343.700 | 11872.500 | 12274.100 | 12106.700 | 12190.4000 | ENSMUSG00000064351 | mitochondrially encoded cytochrome c oxidase I [Source:MGI Symbol;Acc:MGI:102504] | mt-Co1 | protein_coding | protein_coding |
9 | ENSMUST00000082408.1 | 11906.000 | 10750.800 | 6576.710 | 11196.300 | 10973.5500 | ENSMUSG00000064357 | mitochondrially encoded ATP synthase 6 [Source:MGI Symbol;Acc:MGI:99927] | mt-Atp6 | protein_coding | protein_coding |
10 | ENSMUST00000082392.1 | 9334.150 | 8862.070 | 7866.730 | 8228.830 | 8545.4500 | ENSMUSG00000064341 | mitochondrially encoded NADH dehydrogenase 1 [Source:MGI Symbol;Acc:MGI:101787] | mt-Nd1 | protein_coding | protein_coding |
11 | ENSMUST00000082405.1 | 16590.000 | 8717.470 | 2515.120 | 7933.510 | 8325.4900 | ENSMUSG00000064354 | mitochondrially encoded cytochrome c oxidase II [Source:MGI Symbol;Acc:MGI:102503] | mt-Co2 | protein_coding | protein_coding |
12 | ENSMUST00000082421.1 | 9164.250 | 7803.200 | 7885.020 | 8024.630 | 7954.8250 | ENSMUSG00000064370 | mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501] | mt-Cytb | protein_coding | protein_coding |
15 | ENSMUST00000194059.1 | 5123.110 | 7072.100 | 6831.800 | 6382.660 | 6607.2300 | ENSMUSG00000102627 | SNRPN upstream reading frame [Source:MGI Symbol;Acc:MGI:1891236] | Snurf | protein_coding | protein_coding |
16 | ENSMUST00000082396.1 | 6200.230 | 6568.710 | 5731.150 | 6274.040 | 6237.1350 | ENSMUSG00000064345 | mitochondrially encoded NADH dehydrogenase 2 [Source:MGI Symbol;Acc:MGI:102500] | mt-Nd2 | protein_coding | protein_coding |
18 | ENSMUST00000082414.1 | 5940.290 | 5942.430 | 5941.220 | 5882.640 | 5940.7550 | ENSMUSG00000064363 | mitochondrially encoded NADH dehydrogenase 4 [Source:MGI Symbol;Acc:MGI:102498] | mt-Nd4 | protein_coding | protein_coding |
21 | ENSMUST00000082419.1 | 6830.540 | 4057.200 | 3750.910 | 4639.650 | 4348.4250 | ENSMUSG00000064368 | mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495] | mt-Nd6 | protein_coding | protein_coding |
27 | ENSMUST00000082418.1 | 2712.540 | 2410.310 | 2297.620 | 2588.100 | 2499.2050 | ENSMUSG00000064367 | mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496] | mt-Nd5 | protein_coding | protein_coding |
42 | ENSMUST00000128482.7 | 1381.500 | 1247.840 | 1248.690 | 1355.200 | 1301.9450 | ENSMUSG00000008683 | ribosomal protein S15A [Source:MGI Symbol;Acc:MGI:2389091] | Rps15a | protein_coding | protein_coding |
43 | ENSMUST00000133193.7 | 1376.850 | 1572.400 | 1199.520 | 1221.830 | 1299.3400 | ENSMUSG00000041607 | myelin basic protein [Source:MGI Symbol;Acc:MGI:96925] | Mbp | protein_coding | protein_coding |
44 | ENSMUST00000082411.1 | 1142.170 | 1516.200 | 1172.820 | 1384.290 | 1278.5550 | ENSMUSG00000064360 | mitochondrially encoded NADH dehydrogenase 3 [Source:MGI Symbol;Acc:MGI:102499] | mt-Nd3 | protein_coding | protein_coding |
46 | ENSMUST00000025563.6 | 1152.680 | 1169.650 | 1484.350 | 1299.900 | 1234.7750 | ENSMUSG00000024661 | ferritin heavy polypeptide 1 [Source:MGI Symbol;Acc:MGI:95588] | Fth1 | protein_coding | protein_coding |
49 | ENSMUST00000040440.6 | 669.157 | 1278.780 | 1121.750 | 1112.190 | 1116.9700 | ENSMUSG00000036438 | calmodulin 2 [Source:MGI Symbol;Acc:MGI:103250] | Calm2 | protein_coding | protein_coding |
55 | ENSMUST00000108707.2 | 727.787 | 878.755 | 956.682 | 964.800 | 917.7185 | ENSMUSG00000008348 | ubiquitin C [Source:MGI Symbol;Acc:MGI:98889] | Ubc | protein_coding | protein_coding |
59 | ENSMUST00000053150.7 | 725.477 | 864.904 | 927.182 | 888.634 | 876.7690 | ENSMUSG00000050621 | ribosomal protein S27, retrogene [Source:MGI Symbol;Acc:MGI:3704345] | Rps27rt | protein_coding | protein_coding |
MM_PC_transcripts_ID <- MM_PC_transcripts$Name
## Create getBM() query for obtaining peptide sequences
MM_BM_peptide_seqs <- getSequence(id = MM_PC_transcripts_ID,
type = 'ensembl_transcript_id_version',
seqType = 'peptide',
mart = MM_ensembl)
## Export to FASTA
exportFASTA(MM_BM_peptide_seqs, file='./MM_non0_pep_Nov11.fasta')
Batch submitting query [>------------------------------] 3% eta: 4m Batch submitting query [=>-----------------------------] 5% eta: 5m Batch submitting query [=>-----------------------------] 7% eta: 5m Batch submitting query [==>----------------------------] 9% eta: 5m Batch submitting query [==>----------------------------] 10% eta: 5m Batch submitting query [===>---------------------------] 12% eta: 5m Batch submitting query [===>---------------------------] 14% eta: 5m Batch submitting query [====>--------------------------] 16% eta: 5m Batch submitting query [====>--------------------------] 17% eta: 5m Batch submitting query [=====>-------------------------] 19% eta: 5m Batch submitting query [=====>-------------------------] 21% eta: 5m Batch submitting query [======>------------------------] 22% eta: 5m Batch submitting query [======>------------------------] 24% eta: 4m Batch submitting query [=======>-----------------------] 26% eta: 4m Batch submitting query [========>----------------------] 28% eta: 4m Batch submitting query [========>----------------------] 29% eta: 4m Batch submitting query [=========>---------------------] 31% eta: 4m Batch submitting query [=========>---------------------] 33% eta: 4m Batch submitting query [==========>--------------------] 34% eta: 4m Batch submitting query [==========>--------------------] 36% eta: 4m Batch submitting query [===========>-------------------] 38% eta: 4m Batch submitting query [===========>-------------------] 40% eta: 3m Batch submitting query [============>------------------] 41% eta: 3m Batch submitting query [============>------------------] 43% eta: 3m Batch submitting query [=============>-----------------] 45% eta: 3m Batch submitting query [=============>-----------------] 47% eta: 3m Batch submitting query [==============>----------------] 48% eta: 3m Batch submitting query [===============>---------------] 50% eta: 3m Batch submitting query [===============>---------------] 52% eta: 3m Batch submitting query [================>--------------] 53% eta: 3m Batch submitting query [================>--------------] 55% eta: 3m Batch submitting query [=================>-------------] 57% eta: 2m Batch submitting query [=================>-------------] 59% eta: 2m Batch submitting query [==================>------------] 60% eta: 2m Batch submitting query [==================>------------] 62% eta: 2m Batch submitting query [===================>-----------] 64% eta: 2m Batch submitting query [===================>-----------] 66% eta: 2m Batch submitting query [====================>----------] 67% eta: 2m Batch submitting query [====================>----------] 69% eta: 2m Batch submitting query [=====================>---------] 71% eta: 2m Batch submitting query [=====================>---------] 72% eta: 2m Batch submitting query [======================>--------] 74% eta: 1m Batch submitting query [=======================>-------] 76% eta: 1m Batch submitting query [=======================>-------] 78% eta: 1m Batch submitting query [========================>------] 79% eta: 1m Batch submitting query [========================>------] 81% eta: 1m Batch submitting query [=========================>-----] 83% eta: 1m Batch submitting query [=========================>-----] 84% eta: 1m Batch submitting query [==========================>----] 86% eta: 45s Batch submitting query [==========================>----] 88% eta: 39s Batch submitting query [===========================>---] 90% eta: 33s Batch submitting query [===========================>---] 91% eta: 28s Batch submitting query [============================>--] 93% eta: 22s Batch submitting query [============================>--] 95% eta: 17s Batch submitting query [=============================>-] 97% eta: 11s Batch submitting query [=============================>-] 98% eta: 6s Batch submitting query [===============================] 100% eta: 0s
## [28616196.bc]
rpsblast -query ~/Lymnaea_CNS_transcriptome_files/7_Interspecies_comparison/7c_Mouse/MM_non0_pep_Nov11.fasta -db Kog \
-out ~/MM_CNS_pep_Nov11_KOG.txt -evalue 1E-5 \
-outfmt "6 qseqid sseqid stitle pident length mismatch gapopen qlen qstart qend slen sstart send evalue bitscore qcovhsp qcovs" \
-max_hsps 1 -max_target_seqs 1
## Replace "[" and "]" with "#" for later import as dataframe
sed -i 's/\[/#/g' MM_CNS_pep_Nov11_KOG.txt
sed -i 's/\]./#/g' MM_CNS_pep_Nov11_KOG.txt
import pandas as pd
import os
os.chdir("/home/zhanglab1/ndong/Lymnaea_CNS_transcriptome_files/7_Interspecies_comparison/7c_Mouse")
MM_KOG = pd.read_csv("MM_CNS_pep_Nov11_KOG.txt", sep='#', header=None, engine="python")
type(MM_KOG)
pandas.core.frame.DataFrame
%get MM_KOG --from Python3
head(MM_KOG, 10)
0 | 1 | 2 | |
---|---|---|---|
0 | ENSMUST00000003156.14 gnl|CDD|229697 KOG1758, KOG1758, KOG1758, Mitochondrial F1F0-ATP synthase, subunit delta/ATP16 | Energy production and conversion | 52.976 168 70 2 169 1 168 159 1 159 1.09e-66 198 99 99 |
1 | ENSMUST00000003912.6 gnl|CDD|228621 KOG0674, KOG0674, KOG0674, Calreticulin | Posttranslational modification, protein turnover, chaperones | 80.523 344 66 1 417 20 362 406 20 363 0.0 589 82 82 |
2 | ENSMUST00000000090.7 gnl|CDD|232008 KOG4077, KOG4077, KOG4077, Cytochrome c oxidase, subunit Va/COX6 | Energy production and conversion | 65.179 112 38 1 147 35 146 149 38 148 1.06e-71 209 76 76 |
3 | ENSMUST00000006027.5 gnl|CDD|229664 KOG1725, KOG1725, KOG1725, Protein involved in membrane traffic (YOP1/TB2/DP1/HVA22 family) | Intracellular trafficking, secretion, and vesicular transport | 50.955 157 73 2 190 23 176 186 26 181 1.26e-63 192 81 81 |
4 | ENSMUST00000001304.8 gnl|CDD|231516 KOG3581, KOG3581, KOG3581, Creatine kinases | Energy production and conversion | 57.641 373 145 5 382 1 370 363 1 363 0.0 625 97 97 |
5 | ENSMUST00000004378.14 gnl|CDD|230609 KOG2670, KOG2670, KOG2670, Enolase | Carbohydrate transport and metabolism | 78.984 433 89 2 435 1 431 433 1 433 0.0 875 99 99 |
6 | ENSMUST00000004673.14 gnl|CDD|230870 KOG2931, KOG2931, KOG2931, Differentiation-related gene 1 protein (NDR1 protein), related proteins | Function unknown | 57.609 184 70 2 372 159 334 222 39 222 2.81e-98 288 47 47 |
7 | ENSMUST00000001566.9 gnl|CDD|229317 KOG1375, KOG1375, KOG1375, Beta tubulin | Cytoskeleton | 84.518 394 20 8 445 37 429 369 1 354 0.0 624 88 88 |
8 | ENSMUST00000002518.8 gnl|CDD|228586 KOG0639, KOG0639, KOG0639, Transducin-like enhancer of split protein (contains WD40 repeats) | Chromatin structure and dynamics | 60.406 197 63 4 198 2 197 705 1 183 5.82e-104 312 99 99 |
9 | ENSMUST00000006828.8 gnl|CDD|231476 KOG3540, KOG3540, KOG3540, Beta amyloid precursor protein | General function prediction only | 41.094 640 298 16 655 43 654 615 27 615 0.0 698 93 93 |
## Count the number of occurrences of each category
KOGs= ["RNA processing and modification", "Chromatin structure and dynamics", "Energy production and conversion", "Cell cycle control",
"Amino acid transport and metabolism", "Nucleotide transport and metabolism", "Carbohydrate transport and metabolism", "Coenzyme transport and metabolism",
"Lipid transport and metabolism", "Translation, ribosomal structure and biogenesis", "Transcription", "Replication, recombination and repair",
"Cell wall/membrane/envelope biogenesis", "Cell motility", "Posttranslational modification", "Inorganic ion transport and metabolism",
"Secondary metabolites", "General function prediction only", "Function unknown", "Signal transduction mechanisms", "Intracellular trafficking",
"Defense mechanisms", "Extracellular structures", "Nuclear structure", "Cytoskeleton"]
data = []
for KOG in KOGs:
print(KOG, MM_KOG[1].str.contains(KOG).sum())
data.append([KOG, MM_KOG[1].str.contains(KOG).sum()])
df = pd.DataFrame(data)
df.columns = ["KOG", "Count"]
df["MM_Percentage"] = df["Count"]/df["Count"].sum()*100
print(df)
df[["KOG", "MM_Percentage"]].to_csv("MM_KOG_summary.txt", sep="\t", index=None)
RNA processing and modification 901 Chromatin structure and dynamics 446 Energy production and conversion 506 Cell cycle control 644 Amino acid transport and metabolism 496 Nucleotide transport and metabolism 255 Carbohydrate transport and metabolism 535 Coenzyme transport and metabolism 157 Lipid transport and metabolism 702 Translation, ribosomal structure and biogenesis 728 Transcription 2510 Replication, recombination and repair 388 Cell wall/membrane/envelope biogenesis 259 Cell motility 53 Posttranslational modification 1903 Inorganic ion transport and metabolism 644 Secondary metabolites 176 General function prediction only 3063 Function unknown 1707 Signal transduction mechanisms 5175 Intracellular trafficking 1479 Defense mechanisms 219 Extracellular structures 421 Nuclear structure 135 Cytoskeleton 1358 KOG Count MM_Percentage 0 RNA processing and modification 901 3.624296 1 Chromatin structure and dynamics 446 1.794047 2 Energy production and conversion 506 2.035398 3 Cell cycle control 644 2.590507 4 Amino acid transport and metabolism 496 1.995173 5 Nucleotide transport and metabolism 255 1.025744 6 Carbohydrate transport and metabolism 535 2.152051 7 Coenzyme transport and metabolism 157 0.631537 8 Lipid transport and metabolism 702 2.823813 9 Translation, ribosomal structure and biogenesis 728 2.928399 10 Transcription 2510 10.096541 11 Replication, recombination and repair 388 1.560740 12 Cell wall/membrane/envelope biogenesis 259 1.041834 13 Cell motility 53 0.213194 14 Posttranslational modification 1903 7.654867 15 Inorganic ion transport and metabolism 644 2.590507 16 Secondary metabolites 176 0.707965 17 General function prediction only 3063 12.320998 18 Function unknown 1707 6.866452 19 Signal transduction mechanisms 5175 20.816573 20 Intracellular trafficking 1479 5.949316 21 Defense mechanisms 219 0.880933 22 Extracellular structures 421 1.693484 23 Nuclear structure 135 0.543041 24 Cytoskeleton 1358 5.462591
MM_non0_transcripts = merge_read_libraries(lib86, lib87, lib88, lib89)
print("\033[1m"+"Rename column headers with library name"+"\033[0;0m")
MM_non0_transcripts.columns = ("Name", "SRR5412186", "SRR5412187", "SRR5412188", "SRR5412189")
MM_non0_transcripts.head(5)
There are 54775 transcripts that are TPM>0 in all libraries Dataframe preview
Name | TPM_x | TPM_y | TPM_x | TPM_y | |
---|---|---|---|---|---|
0 | ENSMUST00000100960.10 | 0.110745 | 0.303512 | 0.826241 | 0.268737 |
1 | ENSMUST00000171587.1 | 0.137712 | 0.594797 | 0.336414 | 0.070419 |
2 | ENSMUST00000194248.1 | 5.069370 | 3.113770 | 0.929923 | 7.161060 |
3 | ENSMUST00000179719.1 | 1.904900 | 1.364820 | 0.587888 | 1.404270 |
4 | ENSMUST00000132882.1 | 1.957810 | 1.315650 | 1.900130 | 1.070410 |
Rename column headers with library name
Name | SRR5412186 | SRR5412187 | SRR5412188 | SRR5412189 | |
---|---|---|---|---|---|
0 | ENSMUST00000100960.10 | 0.110745 | 0.303512 | 0.826241 | 0.268737 |
1 | ENSMUST00000171587.1 | 0.137712 | 0.594797 | 0.336414 | 0.070419 |
2 | ENSMUST00000194248.1 | 5.069370 | 3.113770 | 0.929923 | 7.161060 |
3 | ENSMUST00000179719.1 | 1.904900 | 1.364820 | 0.587888 | 1.404270 |
4 | ENSMUST00000132882.1 | 1.957810 | 1.315650 | 1.900130 | 1.070410 |
## Create data frame of transcripts sorted by median read counts across libraries
median_sorted_MM_non0_transcripts = median_sort_transcripts(MM_non0_transcripts)
Dataframe preview:
Name | SRR5412186 | SRR5412187 | SRR5412188 | SRR5412189 | Median | |
---|---|---|---|---|---|---|
40 | ENSMUST00000082407.1 | 149567.0 | 158290.0 | 124023.00 | 127249.0 | 138408.00 |
50383 | ENSMUST00000082403.1 | 17884.9 | 36408.9 | 47544.30 | 33898.1 | 35153.50 |
50097 | ENSMUST00000187117.1 | 22286.9 | 21194.7 | 20448.70 | 23908.8 | 21740.80 |
36 | ENSMUST00000084013.1 | 18408.0 | 16610.2 | 17065.10 | 16259.7 | 16837.65 |
38 | ENSMUST00000082409.1 | 14061.5 | 15864.6 | 8649.28 | 14477.3 | 14269.40 |
## Create data frame of transcript IDs sorted by median read counts across libraries
sorted_MM_transcripts_IDs = median_sorted_transcript_IDs(MM_non0_transcripts)
There are 54775 TPM>0 transcript IDs.
Name | |
---|---|
0 | ENSMUST00000100960.10 |
1 | ENSMUST00000171587.1 |
2 | ENSMUST00000194248.1 |
3 | ENSMUST00000179719.1 |
4 | ENSMUST00000132882.1 |
MM_CNS_pep_Nov10.fa
---> 31,917 protein sequences'''
## Download refernece protein sequences
wget ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz
mv Mus_musculus.GRCm38.pep.all.fa MM_pep.fa
'''
## Extract [28614463.bc]
filterbyname.sh in=MM_pep.fa out=MM_6789_non0_Nov10.fa names=MM_6789_non0_Nov10.txt include=t substring
## KOG analysis [27923034.bc]
> rpsblast -query ~/CNS-transcriptomes/MM_Marin/MM_CNS_pep_Jun30.fa -db Kog \
-out ~/CNS-transcriptomes/MM_Marin/MM_CNS_pep_Jun30_KOG.txt -evalue 1E-5 \
-outfmt "6 qseqid sseqid stitle pident length mismatch gapopen qlen qstart qend slen sstart send evalue bitscore qcovhsp qcovs" \
-max_hsps 1 -max_target_seqs 1