This notebook builds on the previous one, BLAST on the Command Line and Integrating with Python. See that one for introduction and credits.
The previous notebook relied on the more traditional/universal routes of passing data, which involve file intermediates. Here, a number of more advanced Python/Jupyter approaches and tricks for working in the Jupyter environment are illustrated.
Get the data and the blast_to_df
script, and set up the database, by running these commands.
Repeating these steps if you have already done them this session causes no harm, so go ahead and run these two cells.
!curl -O http://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py
import pandas as pd
!makeblastdb -in chrmt.fsa -dbtype nucl
%%bash
curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/S288C.mt.genome.fa.gz
gunzip -f S288C.mt.genome.fa.gz
# Up until sometime around September/October 2018, the above-referenced file was named `S288c.mt.genome.fa.gz` on that server,
# and I built things around that name. The name has since changed to the more-correct `S288C.mt.genome.fa.gz`, but I am going
# to convert the name back to the earlier form so that subsequent commands keep working, the file stays more clearly
# distinguished from the SGD S288C reference sequence, and things continue to match what was seen before.
mv S288C.mt.genome.fa S288c.mt.genome.fa
curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/SK1.mt.genome.fa.gz
gunzip -f SK1.mt.genome.fa.gz
curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/Y12.mt.genome.fa.gz
gunzip -f Y12.mt.genome.fa.gz
var=$(echo "S288c.mt.genome.fa" | cut -d'.' --complement -f2-).mito
sed -i "1s/.*/>$var/" S288c.mt.genome.fa
var=$(echo "SK1.mt.genome.fa" | cut -d'.' --complement -f2-).mito
sed -i "1s/.*/>$var/" SK1.mt.genome.fa
var=$(echo "Y12.mt.genome.fa" | cut -d'.' --complement -f2-).mito
sed -i "1s/.*/>$var/" Y12.mt.genome.fa
Now you are prepared to run BLAST on the command line.
This is going to rely on approaches very similar to those illustrated here and here.
We obtained the blast_to_df.py
script in the preparation steps above. However, instead of using it as an external script as in the first notebook, we want to use its core function within this notebook for the options that involve no file intermediates. Much as we imported a lot of other useful modules in the first notebook and in a cell above, you can run the next cell to bring into this notebook's computational environment the main function associated with the blast_to_df.py
script, aptly named blast_to_df
. (As written below, the command to do that looks a bit redundant; however, the first from part of the command actually references the blast_to_df.py
script. It doesn't need the .py
extension because the import machinery only deals with such files.)
from blast_to_df import blast_to_df
We can demonstrate that worked by calling the function.
blast_to_df()
If the module was not imported, you'd see ModuleNotFoundError: No module named 'blast_to_df'
, but instead you should see it saying it is missing results
to act on because you passed it nothing.
After importing the main function of that script into this running notebook, you are ready to demonstrate the additional approaches that don't make file intermediates. These approaches work without the intermediates because, instead of being directed to a file, the stdout (the shell's standard output stream) containing the results made by the BLAST command is directed to Python or to variables that Python can access. Then, for processing the BLAST results to a dataframe, the imported blast_to_df
function is used within the computational environment of the notebook and the result assigned to a variable in the running notebook. In the end, the results are in an active dataframe in the notebook without needing to read the pickled dataframe. Bear in mind the pickled dataframe still gets made, and it is good to download and keep it, since you'll find it convenient for reading back in and getting back into an analysis without needing to rerun BLAST.
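When you do later want to resume from that pickled dataframe, pandas reads it back directly. A minimal round-trip sketch (the dataframe contents here are stand-ins, but the default filename matches what blast_to_df writes):

```python
import pandas as pd

# Round-trip sketch: `blast_to_df` pickles its dataframe (by default as
# "BLAST_pickled_df.pkl"); `pd.read_pickle` restores it without rerunning BLAST.
df = pd.DataFrame({"qseqid": ["S288c.mito"], "bitscore": [9999.0]})  # stand-in results
df.to_pickle("BLAST_pickled_df.pkl")
restored = pd.read_pickle("BLAST_pickled_df.pkl")
print(restored.equals(df))  # True
```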
First we'll walk through one of the several methods I have found for doing this, carrying it all the way through to the next step. This first example uses an approach illustrated here. The result that is returned is of the type IPython.utils.text.SList
, which offers some handy utility attributes, as detailed here.
!cat S288c.mt.genome.fa Y12.mt.genome.fa SK1.mt.genome.fa > pacbio.mt.fa
result = !blastn -query pacbio.mt.fa -db chrmt.fsa -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -task blastn
from blast_to_df import blast_to_df
blast_df = blast_to_df(result.n)
blast_df.head()
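The `.n` attribute used above is what makes the SList handy here: the capture from a `!command` is essentially a list of output lines, and `.n` is those lines newline-joined into a single string (there is also `.s`, space-joined, and `.l`, the plain list). A rough stand-in in plain Python, without IPython:

```python
# Hypothetical captured BLAST output: an SList behaves like this list of lines.
lines = ["q1\ts1\t95.0", "q2\ts2\t88.2"]

slist_n = "\n".join(lines)   # equivalent to `result.n`: one newline-joined string
slist_s = " ".join(lines)    # equivalent to `result.s`: space-joined

print(slist_n)
```

That newline-joined string form is what blast_to_df expects, which is why `result.n` is passed rather than `result` itself.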
(Note that I left in this block of commands the import statement that you had just previously run. Python evaluates the statement but skips re-importing, because the module is already cached from the earlier import. I originally separated that import out above to highlight it; however, this placement better reflects how you'd actually use it as part of the pertinent block of code.)
The option demonstrated in the above cell has the distinction that all the steps can be combined in one cell of the notebook, as demonstrated below. This is because the !<command>
approach makes a temporary subshell outside of the notebook environment, to which it sends the command after the exclamation point. After that command is processed and any communication handled, the subshell where it ran is discarded. Given the transient nature of this environment, you'll find !cd
never seems to work, as they discuss here; search for !cd
to find the pertinent section, where they also show the line-magic solution. (Here is another good resource in this regard.)
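For example, `!cd` changes the working directory only inside that throwaway subshell, while the `%cd` line magic changes it for the notebook process itself. A sketch of notebook cells illustrating the difference (not meant to be run as part of this workflow):

```
!cd /tmp    # only affects the temporary subshell, which is then discarded
!pwd        # prints the original directory, unchanged
%cd /tmp    # the line magic changes directory for the notebook session itself
```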
The additional options for passing the results, demonstrated below, instead rely on cell magics, and so the output needs to be captured from one cell before the next steps can be undertaken in an additional cell. While the extra cells involved are not a big deal, I find the !<command>
approach can be nice for streamlining things when making mini-pipelines/workflows that use the Jupyter environment as 'glue' between Python and command-line use.
In preparation for demonstrating the other options, let's tag that pickled dataframe as corresponding to this demonstration above by running the next cell.
!mv BLAST_pickled_df.pkl BLAST_pickled_dfOPT1.pkl
The additional options for passing the results will be stepped through in a manner similar to 'Option 1', using a different approach for step 1 each time, with step 2 varied to handle the output as necessary.
In this option, the %%capture
cell magic is used, and then the attributes of the resulting IPython.utils.capture
object let you easily get the stdout and/or stderr as a string; see here.
%%capture out
!cat S288c.mt.genome.fa Y12.mt.genome.fa SK1.mt.genome.fa > pacbio.mt.fa
!blastn -query pacbio.mt.fa -db chrmt.fsa -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -task blastn
from blast_to_df import blast_to_df
blast_df = blast_to_df(out.stdout)
!mv BLAST_pickled_df.pkl BLAST_pickled_dfOPT2.pkl
blast_df.head()
Note that in the demonstration of this option, the renaming of the pickled dataframe has been merged into that second step, too.
%%bash --out output
cat S288c.mt.genome.fa Y12.mt.genome.fa SK1.mt.genome.fa > pacbio.mt.fa
blastn -query pacbio.mt.fa -db chrmt.fsa -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -task blastn
from blast_to_df import blast_to_df
blast_df = blast_to_df(output, pickle_df=False)
blast_df.head()
Note that in the demonstration of this option, an additional argument was provided to the blast_to_df()
function to indicate that the dataframe should not be pickled this time.
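As an aside on the %%bash route: that cell magic also accepts an --err flag, so standard error can be captured into its own variable alongside stdout, in case a command writes diagnostics there. A sketch (the variable names are just placeholders):

```
%%bash --out output --err error
blastn -query pacbio.mt.fa -db chrmt.fsa -outfmt "6 qseqid sseqid bitscore" -task blastn
```

After that cell runs, `error` would hold anything blastn wrote to its standard error stream, while `output` holds the results as before.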
Although not ideal, if you had a VERY LARGE set of sequences to query and/or you only needed a tiny piece of information out of the BLAST results, it might not make sense, due to disk space, to concatenate all of the associated files into a single multi-FASTA file. Instead of the approach I suggested as the recommended way in the previous notebook, you could use Python and an approach from the first section of this notebook to automate things.
%%time
import fnmatch
import os
import sys
from blast_to_df import blast_to_df
collected_scores = []
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.mt.genome.fa'):
        blast_result = !blastn -query {file} -db chrmt.fsa -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -task blastn
        blast_df = blast_to_df(blast_result.n, pickle_df=False)
        high_score_df = blast_df.sort_values('bitscore', ascending=False)
        collected_scores.append(high_score_df.iloc[0]["bitscore"])
#print(collected_scores)
sys.stderr.write("\n\n\n\nThe best scores were {}".format(collected_scores)) # using `sys.stderr.write` instead of
# `print` because stderr and stdout otherwise get out of sync, as the placement of the `%%time` report shows
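A small aside on the loop above: since only the top bitscore from each dataframe is kept, the full sort can be replaced by `Series.max`, which gives the same value directly (a minor variation, not the notebook's original code; the dataframe here is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in for one strain's BLAST results dataframe.
blast_df = pd.DataFrame({"bitscore": [512.0, 30100.0, 187.0]})

# Equivalent to blast_df.sort_values('bitscore', ascending=False).iloc[0]["bitscore"]:
best = blast_df["bitscore"].max()
print(best)  # 30100.0
```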
Two other notebooks in this series, one entitled 'Searching for coding sequences in genomes using BLAST and Python' and one entitled 'Searching for homologs among deduced proteins from a genome using BLAST and Python', build on what has been introduced in these two introductory notebooks to accomplish tasks reminiscent of a real workflow.
In the case of the notebook 'Searching for coding sequences in genomes using BLAST and Python' the task is to identify orthologs of a budding yeast gene in the genomes of some different strains and wild cousins.
In the case of the notebook 'Searching for homologs among deduced proteins from a genome using BLAST and Python' the task is to identify homologs of budding yeast proteins among the deduced proteins of another organism.