This will get example sequence data from an external source, prepare it, and then run BLAST with it. Then, subsequent steps will parse the resulting data into something useful in Python. And, even cover how to convert to Excel.
To make things easier and to allow integrating with the nice features provided by the Jupyter environment, we will access the command line from inside a Jupyter notebook using the !
symbol at the start of the command to direct it to the shell. (Alternatively, cells to direct to the shell can be started with %%bash
on the first line. Or, if you feel more confortable working on the actual command line, you can open a terminal from within this active Jupyter environment by clicking the Jupyter logo in the upper left corner of this notebook and then selecting New
> Terminal
form the dropdown on the right of the Jupyter environment dashboard and run the initial parts without the !
or %%bash
tags.)
If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
Some tips:
CREDITS/RESOURCES:
The portions dealing with running BLAST on command line are based on:
See here for illustration of the approach used for feeding the results into Python, and here for a listing of explanations for the text codes involved in the flags. (Using NCBI BLAST+ Service with Biopython might also be of interest.)
The Tables here are particularly useful for enhancing your queries with custom options.
My script blast_to_df.py
is used as well and can be found in my blast-utilities repository.
Typically need for command line BLAST:
Normally the first is the major hurdle, but here that is already handled. BLAST WILL WORK IN THE ACTIVE JUPYTER ENVIRONMENT AT LAUNCH.
(For those curious, see here for the official documentation; however, conda makes it easier.)
If you got this far, you probably already have sequences you need to search.
We are going to make a custom database here.
We'll use the FASTA-formatted sequence of S. cerevisiae mitochondrial genome to make the database in this simple example. Retrieve that with the command in the next cell.
Click in the next cell and hit shift-enter
or press the Run
button on the toolbar above.
!curl -O http://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa
!makeblastdb -in chrmt.fsa -dbtype nucl
Note the use of the exclamation point to direct the command to the command line shell.
Now that you have indexed database for mitochdondrial chromosome, you need some sequences to BLAST against it.
The next cell will get a few FASTA-formatted sequences of other mitochondria assembled primarily from long read data, see Yue et al., 2017 PMID: 28416820. It will also fix the first line of each of those so they have unique identifiers based on the file name. Otherwise they would all be chrMT
.
%%bash
curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/S288C.mt.genome.fa.gz
gunzip -f S288C.mt.genome.fa.gz
# Up until sometime around September / October 2018, the above referenced file was named `S288c.mt.genome.fa.gz` on that server and
# I built things around that. The name has since changed to the more-correct `S288C.mt.genome.fa.gz`, but I am going
# to convert name to earlier form to make subsequent commands work, be more distinguished from SGD S288C reference sequence,
# and make things continue match what was seen before.
mv S288C.mt.genome.fa S288c.mt.genome.fa
curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/SK1.mt.genome.fa.gz
gunzip -f SK1.mt.genome.fa.gz
curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/Y12.mt.genome.fa.gz
gunzip -f Y12.mt.genome.fa.gz
var=$(echo "S288c.mt.genome.fa" | cut -d'.' --complement -f2-).mito
sed -i "1s/.*/>$var/" S288c.mt.genome.fa
var=$(echo "SK1.mt.genome.fa" | cut -d'.' --complement -f2-).mito
sed -i "1s/.*/>$var/" SK1.mt.genome.fa
var=$(echo "Y12.mt.genome.fa" | cut -d'.' --complement -f2-).mito
sed -i "1s/.*/>$var/" Y12.mt.genome.fa
Note we used the %%bash
cell magics in the notebook cell to signal that this large code block was all to be run in the command line shell.
Now you are prepared to run BLAST on the command line.
To start simple, you can run one of the sequences against the database with the basic command next. It will output a lot of data.
!blastn -query S288c.mt.genome.fa -db chrmt.fsa
Already, you'll note that large amount of data is not easy to navigate here.
We can address that some before we move on to using Python to make it much easier to deal with the results.
For dealing with all this data, without Python you can direct the output to a file and then access portions of it.
!blastn -query S288c.mt.genome.fa -db chrmt.fsa -out S288C.mt.x.chrmt.txt
Now that there is a file, you can view the first 45 lines by running the next command.
!head -45 S288C.mt.x.chrmt.txt
That shows that there is a 57 kb fragment that matches very well the sequence we used to build the database.
Likewise tail
could be used and combinations to look at parts of the file, like in the next cell.
!tail -21979 S288C.mt.x.chrmt.txt | head -45
But that is clunky when there are much better ways using Python to read BLAST results that makes them easily queried by programmatic means.
To do this we need to move from outputing in a human readable form to something more parseable by a computer. We will do that in the next section, but first I'll end this section by briefly pointing out another advantange running BLAST on the the command line affords by showing one example of using flags to control settings in depth.
There are a lot of settings that can be adjusted/controlled via the command line and the command line exposes many more (I think) than are available in the web browser based interface (I think). Additionally, trying and adjusting these on the command line can be much more efficient if you start combining it with programmatic means that the this notebook and others in this series will try to introduce.
When we ran the command !blastn -query S288c.mt.genome.fa -db chrmt.fsa -out S288C.mt.x.chrmt.txt
above there were a lot of options that we let be determined by the software or the defaults programmed into the software. However, you don't need to accept those defaults and can adjust things as needed. Here is an example command were we use what are called 'flags' to provide parameters for various settings.
!blastn -query S288c.mt.genome.fa -db chrmt.fsa -task blastn -gapopen 5 -gapextend 2 -reward 2 -penalty -3 -word_size 11 -out flag_example.txt
The Tables here are particularly helpful for understanding what those flags mean. Briefly, we controlled a lot of the settings BLAST uses, and you'll note this query is computationally much more demanding as it takes longer to run this command than !blastn -query S288c.mt.genome.fa -db chrmt.fsa -out S288C.mt.x.chrmt.txt
took run above. I just wanted to touch on this ability here; you can explore this further in this binder with your own sequences. However, because experimenting with these settings can generate a lot of data, you may wish to return to trying things after seeing the next section because you can more easily deal with the generated data using a few abilitues Python adds.
Note: If you are not seeing hits, you may be using too stringent of setings as megablast is the default currently run with the command blastn
as discussed in this post entitled 'When Blast is not Blast'. You probably want -task blastn
included.
html_text = !blastn -query S288c.mt.genome.fa -db chrmt.fsa -task blastn -gapopen 5 -gapextend 2 -reward 2 -penalty -3 -word_size 11 -html
from IPython.core.display import HTML
HTML(html_text.n)
Why html_text.n
is used will become understandable in the course of working through the next notebook, entitled 'BLAST on the Command Line with Advanced Python' here. For now, it will suffice to say that I'm using a shortcut to get the HTML text output here to show you can get the more formatted version by using the HTML flag and then display it in Jupyter, if you need. However, the goal of this is to use BLAST in Jupyter to automate and so the focus is meant to be on computationally parseable output and not the human-readable variations.
The next sections and the next notebook build towards that more progammatic direction using Blast on the command line in conjunction with Python/Jupyter.
As I mentioned above, to make things more efficient and easier to parse, we'd like to move towards letting the computer handle our BLAST data. To do this we'll build on the approach outline at the start of the document here that involves using 'flags' to control the settings as just discussed above.
See here for a listing of explanations for the text codes involved in the flags.
Run the next cell to make a new file S288C.mt.x.chrmt.c.txt
.
!blastn -query S288c.mt.genome.fa -db chrmt.fsa -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -task blastn -out S288C.mt.x.chrmt.c.txt
Let's peek at what that looks like by looking at the first line.
!head -1 S288C.mt.x.chrmt.c.txt
You'll notice that definitely look different, and we've moved away from 'human readable' for certain.
Next, we'll use Python, which is great for working with text based data to parse this into a much more convenient form. Plus, we'll make it human readable at the same time. Then with that in hand will return to using BLAST by moving into adding in options to the BLAST run commands to control the settings in order customize the queries beyond the defaults.
We are going to use dataframes as form to handle the results. These are a really conevenient datatype and Python has an entire library, called Pandas, for dealing with them. This Jupyter environment already has the Pandas Python module installed. We'll use a script that will access it when it runs to take the BLAST results and make them into a dataframe.
To get that script, you can run the next cell. (It is not included in the repository where this launches from to insure you always get the most current version, which is assumed to be the best available at the time.)
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py
We have the script now. And we already have a file for it to process. To process the results, run the next command where we use Python to run the script and direct it at the results file, S288C.mt.x.chrmt.c.txt
, we made just a few cells ago.
%run blast_to_df.py S288C.mt.x.chrmt.c.txt
As of writing this, the script we are using outputs a file that is a binary, compact form of the dataframe. (That means it is tiny and not human readable. It is called 'pickled'. Saving in that form may seem odd, but as illustrated here below this is is a very malleable form. And even more pertinent for dealing with BLAST results in Jupyter notebooks, there is actually an easier way to interact with this script when in Jupyter notebook that skips saving this intermediate file. Please see the next notebook on using BLAST results with Python for more about that if you find that awkward. To lessen the Python wizardry for now, I am going to take the file intermediate path, and perhaps I should have saved it in a tab-separated form, but that would take up more space and some of these files can get rather large. It is easy to convert back and forth using the pickled form assuming you can match the Pandas/Python versions. I have found Python 3 to Python 3 you need to convert to the text based table and use that as an intermediate. In other words, the current Python3/Pandas pickled form is not directly readable by current Python 2.7/Pandas.)
We can take that file where the dataframe is pickled, and bring it into active memory in this notebook with another command form the Pandas library. First we have to import the Pandas library because where we used earlier was in termporary environment where blast_to_df.py
ran in the last cell, which actually is different than the computational envrionment associated with this notebook.
Run the next command to bring the dataframe into active memory.
import pandas as pd
df = pd.read_pickle("BLAST_pickled_df.pkl")
When that last cell ran, you won't notice any output, but something happened. We can look at that dataframe by calling it in a cell.
df
You'll notice that it is so large, that the Jupyter environment represents just the head and tail to make it more reasonable. There are ways you can have Jupyter display it all which we won't go into here.
Instead we'll start to show some methods of dataframes that make them convenient. For example, you can use the head
method to see the start like we used on the command line above.
df.head()
So Pandas dataframes are a excellent way to store data in a way that clear but at the same time provides a lot of power and convenience for dealing with rather large sets of data. We'll only touch upon this a little here. For example, there are more convenience methods and functions that Pandas, such as df.describe()
and df.tail
, that you can explore.
But let's run BLAST again with different data, and bring the results into a dataframe in a more direct way. And we'll then demonstrate the utility of the dataframe form with an applied example demonstrating the ease with which you can extract relevant data using Pandas.
Above the steps were laid out step-by-step to make clear what is happening. However, this can be done in a more compact way all in a single Jupyter cell. (In fact, if you were really doing an analysis of this collection of PacBio generated sequences, you'd take another step as dicussed below.) The next cell will run another BLAST and make the data in a dataframe.
(This will depend on running the 'Preparation' section above and additionally the following two commands from above:
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py
import pandas as pd
And so if you are picking this back up later, run the first section and then add some cells and run those extra two commands.)
!blastn -query Y12.mt.genome.fa -db chrmt.fsa -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -task blastn -out Y12.mt.x.chrmt.c.txt
%run blast_to_df.py Y12.mt.x.chrmt.c.txt --df_output next_BLAST_pickled_df.pkl
df = pd.read_pickle("next_BLAST_pickled_df.pkl")
df.head()
From my work with these genomes, I know that what the curating researchers defined as "start" position of the circular chromosome differs from the arbitrary "start" that the Saccharomyces Genome Database (SGD) uses for S288C mitochdonrial start position. For comparing the position of genes, etc., it is important for me to be able to match these other sequences to the SGD reference sequence.
Again, from my previous work, I know that in most cases for these sequence, the "start" used by the curating researchers, is easily located from the BLAST data be finding the segment that matches the start of the SGD sequence. That translates to wanting to find where the value in the sstart
(subject start) column matches 1. For the PacBio S288C that we looked at in the first step-by-step example above, the second row in the result viewable df.head()
revealed that. But it didn't here. We can find it easily though via the power of Pandas by coding a conditional selection to say we want any rows there sstart
equals 1
. The next cell will run this subsetting operation.
start_loc_df = df[df['sstart']==1]
start_loc_df
Running that reveals the answer, the qstart
(query start) position of 54123 matches the SGD reference sequence position 1.
In fact, we can extract that exact value using Pandas to references it specifically. We'll assign that to a variable in the next cell because we are going to use it next. Additionally we are going to make sure we sort the start_loc_df
on bitscore
. This sorting is serving an pedagogical and practical purpose here:
#1 it exposes another Pandas operation that easy to perform.
#2 it would be important if there were two matches satisfying df['sstart']==1
that we take the better scoring one. This situation is not happening with this example, but does with other examples from this set of a dozen PacBio sequences (only three of which did we actually retrieve in this example notebook), and so I am adding it here to generalize the processs further.
start_loc_df = start_loc_df.sort_values(['sstart','bitscore'],ascending=[1,0])
start_loc = start_loc_df.iloc[0]["qstart"]
start_loc
So now you have start_loc
defined as the position in the PacBio Y12 mitochondrial sequence as the one that matches the SGD sequence. To further demonstrate the utility of combining BLAST results with Python, in the next cell you can adjust the sequence of the PacBio Y12 mitochondrial sequence to match the 'start' position SGD uses.
from Bio import SeqIO
with open("Y12.mt.genome.fa") as handle:
mito_seq = SeqIO.read(handle, "fasta")
# fix, based on where it says "i.e. shift the starting point on this plasmid," @
#http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
left = mito_seq[:start_loc-1] # use one less because of zero indexing in Python
right = mito_seq[start_loc-1:]
new_mito_seq = right + left
# write result after fix
SeqIO.write(
new_mito_seq, "Y12.mt.genome.START_ADJUSTED.fa", "fasta");
The saved sequence Y12.mt.genome.START_ADJUSTED.fa
should begin at that same point, and now it would be ready for use with the SGD reference sequence (available here or already downloaded into this active Jupyter environment as chrmt.fsa
in the course of the preparation steps) in an alignment by Kalign. If you were to do that, you'd see the both ends of that long alignment actually align.
(In case you are new to the Jupyter environment, to get to the dashboard where the file directory is visible, click on the Jupyter logo in the upper left of this notebook page.)
If we had tried running the alignment without moving the parts to match the SGD sequence, the sequences would align in the 'middle', but the two 'ends' would be left overhanging. In other words, without the corrective shift, the large 'start' segment of the SGD sequence would appear as an unaligned overhang with about 33 thousand base pairs into the alignment where upon the Y12 'start' sequence would finally match SGD sequence farily well, and that would be used as an anchor for this 'middle' matching. After that matching segment a large fragment of Y12 would then remain overhanging.
You can demonstrate those two situations I just described easily for yourself with the sequences available in this Jupyter environment by submitting them to Kalign.
Aside: Other Alternatives
Of course, you can work with tabular BLAST output other ways. For example, AWK used on the command line can easily grab the best hit as described here by Mick Watson:
"Also if you have multiline tabular BLAST results, just get top hit by
awk '!a[$1]++'"
-Source: https://twitter.com/BioMickWatson/status/1017486331511963648
Demonstrating that:
!cat S288C.mt.x.chrmt.c.txt|awk '!a[$1]++'
This section might be better titled, "How you'd really do what was shown above".
Above we dealt with PacBio-generated sequences S228C and then Y12 in turn comparing them the SGD reference sequence for S288C. However, it is much more convenient to combine the asociated sequences you need to compare into one file and submit in one BLAST job. As you'll see, the dataframe structure makes it easy to keep these combined, but easily accessible individually. This enhances reproducibility because you then have one master file to deal with and you know you had the same settings applied to all the involved queries. This is espeically important once you start using 'flags' to customize settings.
Let's combine all the PacBio data we have been dealing with above and one we haven't touched yet in one. (There are actually a total of a dozen in the real set.) First we'll concatenate all the individual FASTA files we were dealing with into one file that file that consist of multiple squences in FASTA format that is sometimes referred to as Multi-FASTA format. Then the next steps are just the same as above.
(As with the last section, the next cell working will depend on running the 'Preparation' section above and additionally the following two commands from above:
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/blast-utilities/blast_to_df.py
import pandas as pd
And so if you are picking this back up later, run the first section and then add some cells and run those extra two commands.)
!cat S288c.mt.genome.fa Y12.mt.genome.fa SK1.mt.genome.fa > pacbio.mt.fa
!blastn -query pacbio.mt.fa -db chrmt.fsa -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -task blastn -out pacbio.mt.x.chrmt.c.txt
%run blast_to_df.py pacbio.mt.x.chrmt.c.txt --df_output pb.x.chrmt_pickled_df.pkl
df = pd.read_pickle("pb.x.chrmt_pickled_df.pkl")
df.head()
Promising. But the qseqid
(query sequence identifier) only shows data for S288c.mito
visible right now. Let's run the df.tail()
command next.
df.tail()
Okay, there is SK1.mito
that we haven't touched prior to this and a high index approaching 5000 on the far left. So all the same data we have seen before, plus the SK1
data is in there.
We can easily access indivudal parts by adding conditions based on the qseqid
to selections for the dataframe filtering/subsetting.
For example, above we got the 'start' value for Y12 that corresponded to position #1 of the SGD sequence. You can repeat that for SK1 by running the next cell.
SK1_df = df[df['qseqid'] == "SK1.mito"]
start_loc_df = SK1_df.sort_values(['sstart','bitscore'],ascending=[1,0])
start_loc = start_loc_df.iloc[0]["qstart"]
start_loc
Yay, a number. You can see that more in context by showing the start of the pertinent dataframe.
start_loc_df.head()
So you could again adjust the sequence based on that number so the 'starts' match. There is no need to actually do that because the point was to illustrate pb.x.chrmt_pickled_df.pkl
is easier to generate and much better to deal with than the individual approaches I had done earlier as to make this process less overwhelming.
I've tried to sell you on the power of the Python/Pandas dataframe, but it isn't for all uses or everyone. However, most everyone is accustomed to dealing with text based tables or even Excel. In fact, a text-based based table perhaps tab or comma-delimited would be the better way to archive the data we are generating here. Python/Pandas makes it easy to go from the dataframe form to these tabular forms. You can even go back later from the table to the dataframe, which may be inportant if you are going to different versions of Python/Pandas as I briefly mentioned parenthetically above.
First, generating a text-based table.
#Save / write a TSV-formatted (tab-separated values/ tab-delimited) file
df.to_csv('blast_data.tsv', sep='\t',index = False) #add `,header=False` to leave off header, too
Because df.to_csv()
defaults to dealing with csv, you can simply use df.to_csv('example.csv',index = False)
for comma-delimited (comma-separated) files.
You can see that worked by looking at the first line with the next command. (Feel free to make the number higher or delete the number all together. I restricted it just to first line to make output smaller.)
!head -1 blast_data.tsv
If you had need to go back from a tab-separated table to a dataframe, you can run something like in the following cell.
reverted_df = pd.read_csv('blast_data.tsv', sep='\t')
reverted_df.to_pickle('reverted_df.pkl') # OPTIONAL: pickle that data too
For a comma-delimited (CSV) file you'd use df = pd.read_csv('example.csv')
because pd.read_csv()
method defaults to comma as the separator (sep
parameter).
You can verify that read from the text-based table by viewing it with the next line.
reverted_df.head()
Generating an Excel spreadsheet from a dataframe.
Because this is an specialized need, there is a special module needed that I didn't bother installing by default and so it needs to be installed before generating the Excel file. Running the next cell will do both.
%pip install openpyxl
# save to excel (KEEPS multiINDEX, and makes sparse to look good in Excel straight out of Python)
df.to_excel('blast_data.xlsx') # after openpyxl installed
You'll need to download the file first to your computer and then view it locally as there is no viewer in the Jupyter environment.
Adiitionally, it is possible to add styles to dataframes and the styles such as shading of cells and coloring of text will be translated to the Excel document made as well.
Excel files can be read in to dataframes directly without needing to go to a text based intermediate first using the xlrd
module.
%pip install xlrd
# read Excel, after xlrd installed
df_from_excel = pd.read_excel('blast_data.xlsx') # after xlrd installed
That can be viewed to convince yourself it worked by running the next command.
df_from_excel.head()
Hopefully, you now have an environment that you can run BLAST and manipulate the output readily. Your next step would be to add your data and repeat. You can get to the Jupyter dashboard by clicking the the Jupyter logo in the upper left corner of this notebook. There you can upload files from your computer using a typical GUI file handling interface. Or, you can use the command line to retrieve files.
Note that you can also use that same Jupyter dashboard, to launch new notebooks to start working with your own data in a fresh notebook in the same session as you ran the examples above.
Remember if you make any useful files to get them ASAP as these Binder-associated instances are temporary. Additionally, grab the pickled dataframe files as well as they can be read in again to continue an analysis of any results without need for re-rerunning any BLAST.
Additionallly, now that you can quickly reformat the data into useful forms, you may want return to the brief introduction above of 'flags' to adjust options and explore those taking advantage of how running BLAST on the command line offers the ability to quickly test various options and compare and contrast the output. Related to that, you may want to learn more about Pandas. Here are some resources for that:
Beyond that...
There is another notebook, entitled 'BLAST on the Command Line with Advanced Python' here in this repo that shows different ways these steps can be streamlined when running in the Jupyter environment. For example, there isn't a need to rely on file intermediates. If you are using Jupyter notebooks, to make a workflow or pipeline, you definitely want to take a look at this approach. Please see here for more advanced use of BLAST and Python.
The Binder-luanchable version that you are using here too limiting for your needs?
NCBI has made available an Amazon Machine Image (AMI) that will generate a remote cloud compute instance already set to run standalone BLAST, see here. Additional documentation is here.