## Combine predicted protein-coding sequences from current assembly, FX, and FPS sets.
cat Pfam_or_SP_FPS_Apr23.fasta Pfam_or_SP_FX_non0.fasta Pfam_SP_Apr18.fasta > all.fasta
## Simplify FASTA header
sed 's/\ .*//' all.fasta > all_simplified.fasta
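The sed call above cuts every line at its first space, which leaves only the sequence ID in each FASTA header. A minimal Python sketch of the same idea follows, restricted to header lines. The function name `simplify_fasta_headers` and the example records are hypothetical and are not part of the original pipeline.

```python
def simplify_fasta_headers(lines):
    """Yield FASTA lines with each header cut at its first space."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the sequence ID (the text before the first space)
            line = line.split(" ", 1)[0]
        yield line


# Made-up records, for illustration only
example = [
    ">seq1 type=protein; aalen=5",
    "MKTAY",
    ">seq2",
    "MSLLK",
]
for out_line in simplify_fasta_headers(example):
    print(out_line)
```

Unlike the sed one-liner, this sketch leaves sequence lines untouched even if they happen to contain a space.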
## Remove redundant sequences with CD-HIT (default identity threshold; -T 0 uses all available threads)
cd-hit -i all_simplified.fasta -o LS_all_CDHIT.fasta -G 1 -T 0 -t 1
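## Search the clustered proteins against nr with DIAMOND blastp, keeping only the top hit per query
diamond blastp -d ~/nr_diamond.dmnd -q LS_all_CDHIT.fasta \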
-o /home/zhanglab1/ndong/LS_all_CDHIT_nr_Apr25.txt --max-target-seqs 1 --max-hsps 1 \
-f 6 qseqid qtitle sseqid stitle pident length mismatch gapopen qlen qstart qend slen sstart send evalue bitscore qcovhsp \
--more-sensitive --evalue 1E-5 --unal 1 -p 10
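The tabular output requested above has no header row, so the column names have to be supplied when the file is read back in. The sketch below uses only the standard library and takes its column list from the `-f 6` fields in the command. The helper names and the two sample rows are made up for illustration. It also assumes that queries reported through `--unal 1` carry `*` in the `sseqid` column, which should be checked against the actual output file.

```python
import csv
import io

# Field order copied from the "-f 6 ..." option of the diamond command
DIAMOND_COLUMNS = [
    "qseqid", "qtitle", "sseqid", "stitle", "pident", "length", "mismatch",
    "gapopen", "qlen", "qstart", "qend", "slen", "sstart", "send",
    "evalue", "bitscore", "qcovhsp",
]


def read_diamond_table(handle):
    """Read a headerless tab-separated DIAMOND table into a list of dicts."""
    reader = csv.DictReader(
        handle, fieldnames=DIAMOND_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def split_by_hit(rows, no_hit_marker="*"):
    """Separate queries with a hit from those without (marker is an assumption)."""
    hits = [row for row in rows if row["sseqid"] != no_hit_marker]
    misses = [row for row in rows if row["sseqid"] == no_hit_marker]
    return hits, misses


# Two made-up rows: one aligned query and one unaligned query
hit_row = ["q1", "q1 desc", "s1", "s1 desc", "90.0", "100", "10", "0",
           "100", "1", "100", "100", "1", "100", "1e-50", "200", "100"]
miss_row = ["q2", "q2 desc", "*", "*"] + ["-1"] * 13
sample = "\t".join(hit_row) + "\n" + "\t".join(miss_row) + "\n"

rows = read_diamond_table(io.StringIO(sample))
hits, misses = split_by_hit(rows)
print(len(hits), "with hit;", len(misses), "without hit")
```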
## Assess completeness with BUSCO; results are written to the ./run_BUSCO_LS_all_CDHIT_Apr25 folder
python run_BUSCO.py -i ~/LS_all_CDHIT.fasta \
-o BUSCO_LS_all_CDHIT_Apr25 -l ~/metazoa_odb9/ -m prot -c 10 -e 1e-05
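## Annotate the clustered proteins against the KOG database with rpsblast, keeping only the top hit per query
rpsblast -query LS_all_CDHIT.fasta \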
-db Kog -out ~/LS_all_CDHIT_KOG_Apr24.txt \
-evalue 1E-5 \
-outfmt "6 qseqid sseqid stitle pident length mismatch gapopen qlen qstart qend slen sstart send evalue bitscore qcovhsp qcovs" \
-max_hsps 1 -max_target_seqs 1
## Replace "[" with "#", and "]" plus the character after it with "#", so the KOG category can be split into its own column on import
sed -i 's/\[/#/g' LS_all_CDHIT_KOG_Apr24.txt
sed -i 's/\]./#/g' LS_all_CDHIT_KOG_Apr24.txt
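The two sed substitutions above turn the square brackets around the KOG category into a column separator. As an alternative, the category can be pulled out of the title directly with a regular expression, leaving the file unmodified. This is only a sketch: `split_title` is a hypothetical helper, and the example title is reconstructed from the first row of the table in this notebook on the assumption that the category originally sat in square brackets at the end of the title.

```python
import re

# Matches one "[...]" group that contains no nested brackets
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def split_title(stitle):
    """Return (description, category) for a title ending in '[category]'.

    If no bracketed group is present, the category is None.
    """
    matches = list(BRACKETED.finditer(stitle))
    if not matches:
        return stitle.strip(), None
    last = matches[-1]  # the category is taken to be the last bracketed group
    return stitle[: last.start()].strip(), last.group(1).strip()


# Assumed form of an rpsblast stitle value before the sed step
title = "KOG1767, 40S ribosomal protein S25 [Translation, ribosomal structure and biogenesis]."
description, category = split_title(title)
print(description)
print(category)
```

Taking the last bracketed group guards against descriptions that contain brackets of their own.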
import pandas as pd
import os
os.chdir("/home/zhanglab1/ndong/Lymnaea_CNS_transcriptome_files/6_Aggregate_annotate_PC_sequences")
LS_KOG = pd.read_csv("LS_all_CDHIT_KOG_Apr24.txt", sep='#', header=None, engine="python")
type(LS_KOG)
pandas.core.frame.DataFrame
%get LS_KOG --from Python3
head(LS_KOG, 10)
 | 0 | 1 | 2 |
---|---|---|---|
0 | evgLocus_FPS_392 gnl|CDD|229706 KOG1767, KOG1767, KOG1767, 40S ribosomal protein S25 | Translation, ribosomal structure and biogenesis | 67.089 79 26 0 122 39 117 110 31 109 7.90e-34 111 65 65 |
1 | evgLocus_FPS_535 gnl|CDD|228047 KOG0096, KOG0096, KOG0096, GTPase Ran/TC4/GSP1 (nuclear protein transport pathway), small G protein superfamily | Intracellular trafficking, secretion, and vesicular transport | 78.605 215 46 0 215 1 215 216 2 216 1.37e-140 389 100 100 |
2 | evgLocus_FPS_601 gnl|CDD|228369 KOG0420, KOG0420, KOG0420, Ubiquitin-protein ligase | Posttranslational modification, protein turnover, chaperones | 63.043 184 63 3 180 1 179 184 1 184 9.51e-103 290 99 99 |
3 | evgLocus_FPS_855 gnl|CDD|228140 KOG0191, KOG0191, KOG0191, Thioredoxin/protein disulfide isomerase | Posttranslational modification, protein turnover, chaperones | 34.091 132 78 3 254 116 247 383 36 158 8.02e-20 85.3 52 52 |
4 | evgLocus_FPS_1362 gnl|CDD|227992 KOG0041, KOG0041, KOG0041, Predicted Ca2+-binding protein, EF-Hand protein superfamily | General function prediction only | 62.105 190 71 1 198 5 194 244 56 244 8.88e-80 235 96 96 |
5 | evgLocus_FPS_1587 gnl|CDD|232591 KOG4664, KOG4664, KOG4664, Cytochrome oxidase subunit III and related proteins | Energy production and conversion | 61.667 60 23 0 60 1 60 261 182 241 3.18e-18 72.5 100 100 |
6 | evgLocus_FPS_1742 gnl|CDD|230462 KOG2523, KOG2523, KOG2523, Predicted RNA-binding protein with PUA domain | Translation, ribosomal structure and biogenesis | 64.088 181 63 2 204 23 202 181 1 180 1.19e-96 276 88 88 |
7 | evgLocus_FPS_2112 gnl|CDD|228823 KOG0877, KOG0877, KOG0877, 40S ribosomal protein S2/30S ribosomal protein S5 | Translation, ribosomal structure and biogenesis | 89.130 138 14 1 242 45 182 213 3 139 3.70e-90 262 57 57 |
8 | evgLocus_FPS_2218 gnl|CDD|228807 KOG0861, KOG0861, KOG0861, SNARE protein YKT6, synaptobrevin/VAMP syperfamily | Intracellular trafficking, secretion, and vesicular transport | 59.296 199 80 1 199 1 199 198 1 198 5.62e-103 292 100 100 |
9 | evgLocus_FPS_2324 gnl|CDD|231450 KOG3513, KOG3513, KOG3513, Neural cell adhesion molecule L1 | Signal transduction mechanisms | 27.586 87 48 3 236 148 234 1051 257 328 3.33e-08 51.2 37 37 |
## Count the number of occurrences of each category
KOGs= ["RNA processing and modification", "Chromatin structure and dynamics", "Energy production and conversion", "Cell cycle control",
"Amino acid transport and metabolism", "Nucleotide transport and metabolism", "Carbohydrate transport and metabolism", "Coenzyme transport and metabolism",
"Lipid transport and metabolism", "Translation, ribosomal structure and biogenesis", "Transcription", "Replication, recombination and repair",
"Cell wall/membrane/envelope biogenesis", "Cell motility", "Posttranslational modification", "Inorganic ion transport and metabolism",
"Secondary metabolites", "General function prediction only", "Function unknown", "Signal transduction mechanisms", "Intracellular trafficking",
"Defense mechanisms", "Extracellular structures", "Nuclear structure", "Cytoskeleton"]
data = []
for KOG in KOGs:
    # Literal substring match against the category column (column 1)
    count = LS_KOG[1].str.contains(KOG, regex=False).sum()
    print(KOG, count)
    data.append([KOG, count])
df = pd.DataFrame(data)
df.columns = ["KOG", "Count"]
df["LS_Percentage"] = df["Count"]/df["Count"].sum()*100
print(df)
df[["KOG", "LS_Percentage"]].to_csv("LS_KOG_summary.txt", sep="\t", index=False)
RNA processing and modification 387
Chromatin structure and dynamics 200
Energy production and conversion 286
Cell cycle control 360
Amino acid transport and metabolism 352
Nucleotide transport and metabolism 133
Carbohydrate transport and metabolism 387
Coenzyme transport and metabolism 103
Lipid transport and metabolism 376
Translation, ribosomal structure and biogenesis 391
Transcription 1002
Replication, recombination and repair 286
Cell wall/membrane/envelope biogenesis 259
Cell motility 21
Posttranslational modification 994
Inorganic ion transport and metabolism 316
Secondary metabolites 210
General function prediction only 1624
Function unknown 825
Signal transduction mechanisms 2249
Intracellular trafficking 611
Defense mechanisms 103
Extracellular structures 226
Nuclear structure 58
Cytoskeleton 490

 | KOG | Count | LS_Percentage |
---|---|---|---|
0 | RNA processing and modification | 387 | 3.159442 |
1 | Chromatin structure and dynamics | 200 | 1.632786 |
2 | Energy production and conversion | 286 | 2.334884 |
3 | Cell cycle control | 360 | 2.939015 |
4 | Amino acid transport and metabolism | 352 | 2.873704 |
5 | Nucleotide transport and metabolism | 133 | 1.085803 |
6 | Carbohydrate transport and metabolism | 387 | 3.159442 |
7 | Coenzyme transport and metabolism | 103 | 0.840885 |
8 | Lipid transport and metabolism | 376 | 3.069638 |
9 | Translation, ribosomal structure and biogenesis | 391 | 3.192097 |
10 | Transcription | 1002 | 8.180260 |
11 | Replication, recombination and repair | 286 | 2.334884 |
12 | Cell wall/membrane/envelope biogenesis | 259 | 2.114458 |
13 | Cell motility | 21 | 0.171443 |
14 | Posttranslational modification | 994 | 8.114948 |
15 | Inorganic ion transport and metabolism | 316 | 2.579802 |
16 | Secondary metabolites | 210 | 1.714426 |
17 | General function prediction only | 1624 | 13.258225 |
18 | Function unknown | 825 | 6.735244 |
19 | Signal transduction mechanisms | 2249 | 18.360683 |
20 | Intracellular trafficking | 611 | 4.988162 |
21 | Defense mechanisms | 103 | 0.840885 |
22 | Extracellular structures | 226 | 1.845049 |
23 | Nuclear structure | 58 | 0.473508 |
24 | Cytoskeleton | 490 | 4.000327 |
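The counting step above matches each category name as a substring of the annotation text, then expresses each count as a percentage of all matches. The same logic can be sketched without pandas as shown below. The function names and the three-item annotation list are hypothetical and only illustrate the calculation.

```python
def count_categories(annotations, categories):
    """Count how many annotation strings contain each category name."""
    return {
        category: sum(category in text for text in annotations)
        for category in categories
    }


def to_percentages(counts):
    """Convert counts to percentages of the total number of matches."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Made-up annotations, for illustration only
annotations = [
    "Cell cycle control, cell division, chromosome partitioning",
    "Transcription",
    "Transcription",
]
categories = ["Cell cycle control", "Transcription", "Cytoskeleton"]

counts = count_categories(annotations, categories)
for category, pct in to_percentages(counts).items():
    print(category, counts[category], round(pct, 2))
```

Substring matching is what lets a shortened name such as "Cell cycle control" match a longer category label. Note that the percentages are relative to the sum of all category matches, not to the number of sequences: in the output above, 387 matches out of a total of 12,249 gives about 3.16%.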