This notebook will take three specified budding yeast genes and identify & collect the protein sequences of homologs among a collection of deduced proteins available for a genome from Ensembl.
This notebook builds on the previous ones. See those for introduction and credits.
This is meant to demonstrate using BLAST+, Python, and a few of my Python scripts to accomplish a task in a series of steps.
How to modify the input to use your own sequence or sequences is described.
The file used here contains the deduced sequences predicted proteins encoded by the genome of Encephalitozoon cuniculi. (The details of these sequences are just below in an expandable section; however, the option doesn't render statically via Github; use nbviewer or launch an active session via MyBinder.)
From the Ensembl genome page for Encephalitozoon cuniculi, under 'Gene Annoation', you'd click 'FASTA' next to the text 'Download genes, cDNAs, ncRNA, proteins -'. That leads to the Downloads page which lists:
cdna/ 8/17/20, 1:24:00 PM
cds/ 8/17/20, 1:24:00 PM
dna/ 8/17/20, 1:24:00 PM
dna_index/ 8/17/20, 1:24:00 PM
ncrna/ 8/17/20, 1:24:00 PM
pep/ 8/17/20, 1:24:00 PM
Specifically the deduced protein sequences is the pep
entry at the bottom of that list and can be directly retrieved from:
ftp://ftp.ensemblgenomes.org/pub/fungi/release-48/fasta/fungi_microsporidia1_collection/encephalitozoon_cuniculi_ecuniii_l_gca_001078035/pep/
Source information for the Encephalitozoon cuniculi sequence data:
https://www.ebi.ac.uk/ena/browser/view/GCA_001078035.1
Sadly, the sequence file we use here cannot be directly retrieved via curl
or wget
from the source because the port for ftp is blocked on MyBinder.org to prevent abuse. However, you could easily download the file and then upload it to the active session running via MyBinder.org. To save on that step it is already included in this repository and so is present when the session launches.
In this demonstration we'll get protein sequences for three genes of S. cerevisiae and make a multi-FASTA file from them. If you already have your own protein sequences in a file, you can upload that to this running session and skip to the step below where the input file is specified and edit the seq_file_to_use
variable to be the name of your file. The sequence file in FASTA format you provide will need to be in the notebooks
directory alongside this notebook. This will be revisited in greater detail below.
To get the demonstration protein sequences, we'll use a script I developed that fetches protein sequences for S. cerevisiae genes when provided wit a gene's systematic name, standard name, or alias as defined at gene page at yeastgenome.org. See more about that script here.
!curl -O https://raw.githubusercontent.com/fomightez/yeastmine/master/get_protein_seq_as_FASTA.py
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 11399 100 11399 0 0 50823 0 --:--:-- --:--:-- --:--:-- 50888
The intermine webservices package that script relies on has not been updated on package distribution sites in a while and so we are going to get the current version directly from Github using the following command. We'll also add a package called pyfaidx
that makes handling FASTA-formated files easier.
%pip install https://github.com/intermine/intermine-ws-python/archive/dev.zip
%pip install pyfaidx
Collecting https://github.com/intermine/intermine-ws-python/archive/dev.zip Downloading https://github.com/intermine/intermine-ws-python/archive/dev.zip - 150.5 kB 4.9 MB/s 0:00:00 Preparing metadata (setup.py) ... done Building wheels for collected packages: intermine Building wheel for intermine (setup.py) ... done Created wheel for intermine: filename=intermine-1.13.0-py3-none-any.whl size=75660 sha256=5f2f7c4d4b2ecae2cac9ff52c43a956dd2dc03ad8c7c2f24ffc543e1d6d299c5 Stored in directory: /tmp/pip-ephem-wheel-cache-_wq3fh2r/wheels/74/4c/cc/b2bc6341109f116ae83cc8424243b1a820cd4fcdf99a4f6932 Successfully built intermine Installing collected packages: intermine Successfully installed intermine-1.13.0 Note: you may need to restart the kernel to use updated packages. Collecting pyfaidx Downloading pyfaidx-0.8.1.1-py3-none-any.whl.metadata (25 kB) Requirement already satisfied: setuptools in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pyfaidx) (68.2.2) Requirement already satisfied: importlib-metadata in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pyfaidx) (6.8.0) Requirement already satisfied: zipp>=0.5 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from importlib-metadata->pyfaidx) (3.17.0) Downloading pyfaidx-0.8.1.1-py3-none-any.whl (28 kB) Installing collected packages: pyfaidx Successfully installed pyfaidx-0.8.1.1 Note: you may need to restart the kernel to use updated packages.
Now we'll use it to get the protein sequences corresponding to the three genes form a list.
(If you see, errors like ModuleNotFoundError: No module named 'urlparse'
...ImportError: cannot import name 'MutableMapping' from 'collections' (/srv/conda/envs/notebook/lib/python3.10/collections/__init__.py)
as you run this next cell, then read below this next cell for possible solution.)
seqs_to_get = ["NOP1","VPH1","RPB1"]
for seq in seqs_to_get:
%run get_protein_seq_as_FASTA.py {seq}
looking up the gene associated with NOP1...getting protein sequence... File of protein sequence saved as 'S288C_YDL014W_NOP1_protein.fsa'. Finished. looking up the gene associated with VPH1...getting protein sequence... File of protein sequence saved as 'S288C_YOR270C_VPH1_protein.fsa'. Finished. looking up the gene associated with RPB1...getting protein sequence... File of protein sequence saved as 'S288C_YDL140C_RPO21_protein.fsa'. Finished.
(If you see basil2
among the output, disregard it. I believe it simply an indicator that it is using the development version of the code.)
(If you saw errors, like ModuleNotFoundError: No module named 'urlparse'
...ImportError: cannot import name 'MutableMapping' from 'collections' (/srv/conda/envs/notebook/lib/python3.10/collections/__init__.py)
as you run the cell, it is possibly because intermine needs updating. Foruntately it is easily fixed by making a new cell and running two commands: Rename the code to fix it by running in the new cell !mv possible_webservice_fix_for_intermine_code.py webservice.py
. Then to fix run !cp ./webservice.py /srv/conda/envs/notebook/lib/python3.10/site-packages/intermine/webservice.py
. Now delete that extra cell and try running the cell above again, and then if it works continue on with the rest.)
With those protein sequences we'll combine them into one multi-FASTA sequence file using the cell below. (Most of the code in it is concerned with getting a list of the individual FASTA files without needing to hardcode them into this notebook. The putting the files together is really done in the !cat
command.)
The collected individual sequences are funneled into one sequence file by running the next cells. This may seem counterproductive because later we split the sequences back out to individual files again. However, it is done in this manner because BLAST needs a file as input and this way it is easier for others to plug-in their own sequences in FASTA form right into this notebook by uploading it here and editing the value of file name for the seq_file_to_use
variable below. This way they don't need to relist their gene names to adapt this notebook.
import os
import sys
import fnmatch
fasta_files = []
name_part_to_match = ".fsa"
for file in os.listdir('.'):
if fnmatch.fnmatch(file, '*'+name_part_to_match):
#print (file)
#first_part_filen = file.rsplit(extension_to_handle,1)[0]
fasta_files.append(file)
!cat {" ".join(fasta_files)} >protein_sequences.fasta
That should now produce a file named protein_sequences.fasta
in the current working directory. We can check that is the case. If you run the code in the next cell it will first display the current working directory, and then the list of files in that directory. We use the %%bash%
magic tag at the top of the cell to specify to run those commands in the shell. (For those that are Jupyter power users, you may realize this is unnecessary for these two commands; however, it is used here as what unix commands work without anything special is one of those things that novices find hard to follow when learning on Jupyter.)
%%bash
pwd
echo "-*--above is working directory; below is the list of files in that directory-*--"
ls
/home/jovyan/notebooks -*--above is working directory; below is the list of files in that directory-*-- BLAST on Command Line and Integrating with Python.ipynb BLAST on the Command Line with Advanced Python.ipynb Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.dna.toplevel.fa.gz Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa.gz get_protein_seq_as_FASTA.py GSD protein_sequences.fasta S288C_YDL014W_NOP1_protein.fsa S288C_YDL140C_RPO21_protein.fsa S288C_YOR270C_VPH1_protein.fsa Searching for coding sequences in genomes using BLAST and Python.ipynb Searching for homologs among deduced proteins from a genome using BLAST and Python.ipynb Untitled.ipynb webservice.py
You should see protein_sequences.fasta
listed there.
The file protein_sequences.fasta
contains the sequences we'll use in the remainder of this notebook to look for homologs among the deduced protein sequences from Encephalitozoon cuniculi. If you already have a file of your own protein sequence(s) in FASTA format, you have some choices. One is to simply edit the content of protein_sequences.fasta
to replace it with your sequence or sequences. The more direct way, and one that allows you to skip any of the above cells once you know what you are doing is that you upload your sequence file to this session and change the file name below to match your sequence file. To do that you edit the text between the quotes. Note that the sequence file in FASTA format you provide will need to be located in the notebooks
directory alongside this notebook. To help in troubleshooting the upload and placing the file in the correct directory, you should see your file listed now if you re-run the cell above.
seq_file_to_use = "protein_sequences.fasta"
Sadly, the file containing the deduced sequences predicted proteins encoded by the genome of Encephalitozoon cuniculi we use here cannot be directly retrieved via curl
or wget
from the source because the port for ftp is blocked on MyBinder.org to prevent abuse. However, you could easily download the file and then upload it to the active session running via MyBinder.org. To save on that step it is already included in this repository and so is present when the session launches.
We need to first extract the sequence file from a compressed form using the following command:
%%bash
gunzip Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa.gz
Let's rename the sequence file something more manageable, ec.pep.fas
.
%%bash
mv Encephalitozoon_cuniculi_ecuniii_l_gca_001078035.ECIIIL.pep.all.fa ec.pep.fas
Get the blast_to_df
script that will be used to convert the raw output from BLAST to something useful in Python by running the following cell. We'll also go ahead and import the main function from the script that allows sending BLAST output to Python dataframes so that we can use it here. Also we prepare to make files from each sequence so that we can feed as input into BLAST.
import os
file_needed = "blast_to_df.py"
if not os.path.isfile(file_needed):
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py
import pandas as pd
from blast_to_df import blast_to_df
from pyfaidx import Fasta
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 13771 100 13771 0 0 57280 0 --:--:-- --:--:-- --:--:-- 57141
We'll also define a function we can use later to save the individual sequences as single files for input into BLAST. This apparent counterproductivity approach where we start out with single files of the sequences in FASTA format and later save them again is to make it easier for researchers with sequences already in a file to plug into this workflow as mentioned above.
def one_seq_to_file(prot_seq, description, name_to_use):
'''
function to save a sequence as a file in FASTA-format
NOTE THAT NOT JUST USING `!faidx --split-files <seq_file>` because
want to know what files will be named for each. In fact can use same name
because will delete.
'''
# save the protein sequence as FASTA
chunk_size = 70 #<---amino acids per line to have in FASTA (note this chunking to
# multiple lines is the opposite of what PatMatch's `unjustify_fasta.pl` does)
prot_seq_chunks = [prot_seq[i:i+chunk_size] for i in range(
0, len(prot_seq),chunk_size)]
prot_seq_fa = ">" + description + "\n"+ "\n".join(prot_seq_chunks)
p_output_file_name = name_to_use
with open(p_output_file_name, 'w') as output:
output.write(prot_seq_fa)
sys.stderr.write("\n\nProtein sequence saved as "
"`{}`.".format(p_output_file_name))
Now you are prepared to run BLAST to search the deduced protein sequences from the Encephalitozoon cuniculi genome for the homologs of the three yeast genes. If you provided sequences as part of this, it will look for homologs of those among the deduced protein sequences from the Encephalitozoon cuniculi genome.
This going to use protein sequences from the Encephalitozoon cuniculi genome and make a database so it is searchable and then search for matches to each the genes. The information on the best match will be collected. One use for that information will be collecting the corresponding sequences later.
!makeblastdb -in ec.pep.fas -dbtype prot
Building a new DB, current time: 03/06/2024 22:04:27 New DB name: /home/jovyan/notebooks/ec.pep.fas New DB title: ec.pep.fas Sequence type: Protein Keep MBits: T Maximum file size: 3000000000B Adding sequences from FASTA; added 1834 sequences in 0.137166 seconds.
records = Fasta(seq_file_to_use)
dfs = []
for seq in records:
one_seq_to_file(str(seq), seq.long_name, "blast_input_seq.fa") # if you are seeing the caret ,a.k.a., less-than-symbol,
# get included in the name, switch `seq.long_name` to `seq.name`. I've seen where if the FASTA sequence file isn't
# standard and instead includes a blank line above the description lines (after the first), the `seq.long_name` will
# include `>` at the start. I chose to use `seq.long_name` to pass so that one has the richest set of
# options for adpating the code for naming of the files.
result = !blastp -query blast_input_seq.fa -db ec.pep.fas -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -task blastp
dfs.append(blast_to_df(result.n))
!rm blast_input_seq.fa
Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe... A dataframe of the data has been saved as a file in a manner where other Python programs can access it (pickled form). RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl' Returning a dataframe with the information as well. Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe... A dataframe of the data has been saved as a file in a manner where other Python programs can access it (pickled form). RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl' Returning a dataframe with the information as well. Protein sequence saved as `blast_input_seq.fa`.Provided results read and converted to a dataframe... A dataframe of the data has been saved as a file in a manner where other Python programs can access it (pickled form). RESULTING DATAFRAME is stored as ==> 'BLAST_pickled_df.pkl' Returning a dataframe with the information as well.
dfs
results in a list of dataframes, one for each search.
For example, the first one in the list can be displayed by running the following code:
dfs[0]
qseqid | sseqid | stitle | pident | qcovs | length | mismatch | gapopen | qstart | qend | sstart | send | qframe | sframe | frames | evalue | bitscore | qseq | sseq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NOP1 | KMV65234 | KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9... | 58.123 | 82 | 277 | 100 | 4 | 57 | 324 | 10 | 279 | 1 | 1 | 1/1 | 2.610000e-112 | 324.0 | GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH... | GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF... |
1 | NOP1 | KMV65287 | KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:1... | 28.889 | 25 | 90 | 49 | 4 | 160 | 240 | 141 | 224 | 1 | 1 | 1/1 | 1.100000e-02 | 33.1 | DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGR... | EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDK... |
2 | NOP1 | KMV66016 | KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:1... | 22.059 | 21 | 68 | 51 | 1 | 217 | 284 | 10 | 75 | 1 | 1 | 1/1 | 9.800000e-02 | 28.9 | IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLK... | VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLE... |
3 | NOP1 | KMV66763 | KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:9... | 28.289 | 44 | 152 | 92 | 7 | 13 | 156 | 32 | 174 | 1 | 1 | 1/1 | 4.900000e+00 | 25.0 | GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFG... | SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFP... |
4 | NOP1 | KMV65333 | KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:2... | 31.915 | 14 | 47 | 32 | 0 | 222 | 268 | 463 | 509 | 1 | 1 | 1/1 | 9.400000e+00 | 23.9 | EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV | EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV |
Displaying the results for the first one in the list, at index zero, shows that the first hit for Nop1p, the highly conserved protein fibrillarin, has high percent identity over 277 residues and a good bitscore. The others are not as good.
(Note if you'd prefer to see the BLAST results as text this is covered after a few more cells below under 'Text result'.)
Because the description in the stitle
column gets cut off, this following trick can be used to see more of that and that column. The sequences aren't as useful but they are shown,
#trick from based on https://stackoverflow.com/a/30691921 and https://stackoverflow.com/a/25352191/8508004
with pd.option_context('display.max_rows', None, 'display.max_columns', None,'display.max_colwidth', None):
display(dfs[0])
qseqid | sseqid | stitle | pident | qcovs | length | mismatch | gapopen | qstart | qend | sstart | send | qframe | sframe | frames | evalue | bitscore | qseq | sseq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NOP1 | KMV65234 | KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M970_100740 transcript:KMV65234 gene_biotype:protein_coding transcript_biotype:protein_coding description:fibrillarin | 58.123 | 82 | 277 | 100 | 4 | 57 | 324 | 10 | 279 | 1 | 1 | 1/1 | 2.610000e-112 | 324.0 | GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RHAGVYIARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMGGLDELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGRELISMAKKRPNIIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVFAREVQKLREERIKPLEQLTLEPYERDHCIVVGRYMRSG | GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRFPGVYISRGKEDLLLTRNLVPGVSVYGEKRVAVD-------LEGMKVEYRVWNAYRSKLAAGIVCGAENIHMEPGSKVLYLGASSGTTVSHVSDIVGKDGVVYAVEFSERSGRDLINMSMKRPNIVPIIEDARYPSRYRMLVPIVDCIFSDVSQPDQTRIVALNAQYFLKEGGGVDVSIKANCVNSAVPAETVFADEVNILRKNSIRPKEQVTLEPFEKDHAMIIGRFKLSA |
1 | NOP1 | KMV65287 | KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene:M970_101280 transcript:KMV65287 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein | 28.889 | 25 | 90 | 49 | 4 | 160 | 240 | 141 | 224 | 1 | 1 | 1/1 | 1.100000e-02 | 33.1 | DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGREL--ISMAKKRPNII-------PIIEDARHPQKYRMLIGMVDCV | EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDKHLGPASIKCIRPDLLITESTYGSITRDCRKVKEREFLKAVSDCV |
2 | NOP1 | KMV66016 | KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gene:M970_061160 transcript:KMV66016 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein | 22.059 | 21 | 68 | 51 | 1 | 217 | 284 | 10 | 75 | 1 | 1 | 1/1 | 9.800000e-02 | 28.9 | IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAET | VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLEDEAFDIKGLSAEEIGKLIPEDT |
3 | NOP1 | KMV66763 | KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:M970_010930 transcript:KMV66763 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein | 28.289 | 44 | 152 | 92 | 7 | 13 | 156 | 32 | 174 | 1 | 1 | 1/1 | 4.900000e+00 | 25.0 | GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFGGRGGSRGGARGGSR-----GGRGGAAGGARGGAKVVIEPHRHAGVYI-ARGKEDLLVTKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIM | SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFPMKGEAVTGGRYELRMSEDIGVQGPTGEGGQQGERCQEE------VYVDAKERDSSNRRCSVRPANEVFGREMCRLHSSPLEDG--KDRIE-KIESMLRNDLADGIV |
4 | NOP1 | KMV65333 | KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene:M970_101740 transcript:KMV65333 gene_biotype:protein_coding transcript_biotype:protein_coding description:glycyl-tRNA synthetase | 31.915 | 14 | 47 | 32 | 0 | 222 | 268 | 463 | 509 | 1 | 1 | 1/1 | 9.400000e+00 | 23.9 | EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV | EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV |
The individual dataframes of results can be used in some of the ways demonstrated earlier in this series of notebooks.
For example, it would be easy to filter based on the values in the bitscore column, like in the following cell. Note to make the filter step easier to read syntax-wise, first I assign the first dataframe from the list of datafames to a simpler moniker, of just df
.
Then I specify to subset that dataframe to just cases where the values in the dataframe's 'bitscore' column exceed 100.
df = dfs[0].copy() #make a copy assigned to `df` so next line easier to read
df[df['bitscore'] > 100]
qseqid | sseqid | stitle | pident | qcovs | length | mismatch | gapopen | qstart | qend | sstart | send | qframe | sframe | frames | evalue | bitscore | qseq | sseq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NOP1 | KMV65234 | KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:9... | 58.123 | 82 | 277 | 100 | 4 | 57 | 324 | 10 | 279 | 1 | 1 | 1/1 | 2.610000e-112 | 324.0 | GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RH... | GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRF... |
If you are more used to viewing the text output from BLAST to assess how things worked, you can modify the code used to generate the dataframes above in order show that instead of generating dataframes. In an attempt to make it easier to scroll through to view the results by limiting the output in the window, the results will be limited to the first gene in the list, Nop1, by adding break
so the for
loop ends when it is hit.
Here is a reworked version to show the text output.
records = Fasta(seq_file_to_use)
for seq in records:
one_seq_to_file(str(seq), seq.long_name, "blast_input_seq.fa")
!blastp -query blast_input_seq.fa -db ec.pep.fas -task blastp
!rm blast_input_seq.fa
break
Protein sequence saved as `blast_input_seq.fa`.
BLASTP 2.15.0+ Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Reference for composition-based statistics: Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005. Database: ec.pep.fas 1,834 sequences; 653,245 total letters Query= NOP1 YDL014W Length=328 Score E Sequences producing significant alignments: (Bits) Value KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M... 324 3e-112 KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene... 33.1 0.011 KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gen... 28.9 0.098 KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:... 25.0 4.9 KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene... 23.9 9.4 >KMV65234 pep supercontig:ECIIIL:CH10_ECIII-L:96494:97369:1 gene:M970_100740 transcript:KMV65234 gene_biotype:protein_coding transcript_biotype:protein_coding description:fibrillarin Length=291 Score = 324 bits (830), Expect = 3e-112, Method: Compositional matrix adjust. Identities = 161/277 (58%), Positives = 208/277 (75%), Gaps = 16/277 (6%) Query 57 GRGGSRGGA--RGGSRGGRGGAAGGARGGA------KVVIEPH-RHAGVYIARGKEDLLV 107 GR +GG R G GRG +G KV++EPH R GVYI+RGKEDLL+ Sbjct 10 GRKFQKGGKPFRSGKGEGRGRMNNKKKGSVNAGLDRKVLVEPHPRFPGVYISRGKEDLLL 69 Query 108 TKNMAPGESVYGEKRISVEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIMGGLDELFIAPG 167 T+N+ PG SVYGEKR++V+ + KVEYRVWN +RSKLAAGI+ G + + + PG Sbjct 70 TRNLVPGVSVYGEKRVAVD-------LEGMKVEYRVWNAYRSKLAAGIVCGAENIHMEPG 122 Query 168 KKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGRELISMAKKRPNIIPIIEDARHP 227 KVLYLGA+SGT+VSHVSD+VG +GVVYAVEFS R GR+LI+M+ KRPNI+PIIEDAR+P Sbjct 123 SKVLYLGASSGTTVSHVSDIVGKDGVVYAVEFSERSGRDLINMSMKRPNIVPIIEDARYP 182 Query 228 QKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCIDSTVDAETVFA 287 +YRML+ +VDC+F+DV+QPDQ RI+ALN+ FLK+ GGV +SIKANC++S V AETVFA Sbjct 183 SRYRMLVPIVDCIFSDVSQPDQTRIVALNAQYFLKEGGGVDVSIKANCVNSAVPAETVFA 242 Query 288 REVQKLREERIKPLEQLTLEPYERDHCIVVGRYMRSG 324 EV LR+ I+P EQ+TLEP+E+DH +++GR+ S Sbjct 243 DEVNILRKNSIRPKEQVTLEPFEKDHAMIIGRFKLSA 279 >KMV65287 pep supercontig:ECIIIL:CH10_ECIII-L:168660:170150:1 gene:M970_101280 transcript:KMV65287 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein Length=496 Score = 33.1 bits (74), Expect = 0.011, Method: Compositional matrix adjust. Identities = 26/90 (29%), Positives = 42/90 (47%), Gaps = 15/90 (17%) Query 160 DELFIAPGKKVLYLGAASGTSVSHVSDVVGPEGVVYAVEFSHRPGREL--ISMAKKRPNI 217 ++ +I P Y G G ++ HV VVG + VVY ++S P + L S+ RP++ Sbjct 141 EDFYITP----YYAGHVLGAAMFHV--VVGDQSVVYTGDYSTTPDKHLGPASIKCIRPDL 194 Query 218 I-------PIIEDARHPQKYRMLIGMVDCV 240 + I D R ++ L + DCV Sbjct 195 LITESTYGSITRDCRKVKEREFLKAVSDCV 224 >KMV66016 pep supercontig:ECIIIL:CH06_ECIII-L:150158:150508:-1 gene:M970_061160 transcript:KMV66016 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein Length=106 Score = 28.9 bits (63), Expect = 0.098, Method: Composition-based stats. Identities = 15/68 (22%), Positives = 35/68 (51%), Gaps = 2/68 (3%) Query 217 IIPIIEDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVVISIKANCI 276 + P +++ RH Y +++G++D F + + D R I +++ L+D+ + + A I Sbjct 10 VTPALKERRH--TYAIIVGIIDVTFVLLQRKDGEREICSVANLHLEDEAFDIKGLSAEEI 67 Query 277 DSTVDAET 284 + +T Sbjct 68 GKLIPEDT 75 >KMV66763 pep supercontig:ECIIIL:CH01_ECIII-L:92624:94390:-1 gene:M970_010930 transcript:KMV66763 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein Length=588 Score = 25.0 bits (53), Expect = 4.9, Method: Compositional matrix adjust. Identities = 43/152 (28%), Positives = 65/152 (43%), Gaps = 17/152 (11%) Query 13 GGSRGGFGGRGGSRGGARGG-SRGGFGGR-GGSRGGARGGSRGGFGGRGGSRGGARGGSR 70 G R G G + G RGG S G GR GG R G RG F +G + G R R Sbjct 32 SGGRRGAGQKSQRIRGYRGGTSSGSHLGRSGGKRSDRSGCERGSFPMKGEAVTGGRYELR 91 Query 71 -----GGRGGAAGGARGGAKVVIEPHRHAGVYI-ARGKEDLLVTKNMAPGESVYGEKRIS 124 G +G G + G + E VY+ A+ ++ ++ P V+G + Sbjct 92 MSEDIGVQGPTGEGGQQGERCQEE------VYVDAKERDSSNRRCSVRPANEVFGREMCR 145 Query 125 VEEPSKEDGVPPTKVEYRVWNPFRSKLAAGIM 156 + EDG ++E ++ + R+ LA GI+ Sbjct 146 LHSSPLEDG--KDRIE-KIESMLRNDLADGIV 174 >KMV65333 pep supercontig:ECIIIL:CH10_ECIII-L:231151:232890:1 gene:M970_101740 transcript:KMV65333 gene_biotype:protein_coding transcript_biotype:protein_coding description:glycyl-tRNA synthetase Length=579 Score = 23.9 bits (50), Expect = 9.4, Method: Compositional matrix adjust. Identities = 15/47 (32%), Positives = 21/47 (45%), Gaps = 0/47 (0%) Query 222 EDARHPQKYRMLIGMVDCVFADVAQPDQARIIALNSHMFLKDQGGVV 268 ED+R +++ I V C + D+ LN FL D G VV Sbjct 463 EDSRPVFRFKPAIAPVQCAIGYLIHFDEFNEHILNIKKFLTDNGLVV 509 Lambda K H a alpha 0.319 0.141 0.423 0.792 4.96 Gapped Lambda K H a alpha sigma 0.267 0.0410 0.140 1.90 42.6 43.6 Effective search space used: 126581391 Database: ec.pep.fas Posted date: Mar 6, 2024 10:04 PM Number of letters in database: 653,245 Number of sequences in database: 1,834 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Neighboring words threshold: 11 Window for multiple hits: 40
It clearly shows in the description of the best match for Nop1p that it is annotated as fibrillarin, and so it seems indeed the top hit identified is the ortholog of S. cerevisiae Nop1p.
To make it easier to review the results, you send them to a text file that you review here or download. This modification of the code when run, will send the search results for all the sequences to a separate file.
records = Fasta(seq_file_to_use)
!touch results.txt
!rm results.txt # adding a deletion so if you run this cell more than once it
# will start a fresh file and then append to it
!touch results.txt
for seq in records:
one_seq_to_file(str(seq), seq.long_name, "blast_input_seq.fa")
!blastp -query blast_input_seq.fa -db ec.pep.fas -task blastp >>results.txt
!rm blast_input_seq.fa
Protein sequence saved as `blast_input_seq.fa`. Protein sequence saved as `blast_input_seq.fa`. Protein sequence saved as `blast_input_seq.fa`.
If you are using the demonstration with the sequences as it was originally writen, the following command will show the size of the results.txt
file as 43k.
%%bash
ls -lah results.txt
-rw-r--r-- 1 jovyan jovyan 43K Mar 6 22:04 results.txt
Go to the dashboard by clicking on the Jupyter icon in the upper left and then you can review it like a text file. Or download this file if you'd like to keep it, or if you think you can view it better outside of Jupyter.