Shortcuts: (Links to primary content on this page)
Blast Databases
Fasta files of interest
Steven does Module 1
-- Student Inquiry Based (problem solving)
-- Class Implemented
-- Discovery Driven
4 tenets
Everyone needs a Public Lab Notebook
!pwd
/Applications/BLAST/ncbi-blast-2.2.26+
!head w1_Intro.ipynb
{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ {
Assignment: Blast a large fasta file and create a tab-delimited output file.
Screenshot of Blast page at NCBI.
Download Stand-alone BLAST
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
pwd
u'/Users/sr320/Dropbox/Steven/ipython_nb/fish546'
cd /Applications/BLAST/ncbi-blast-2.2.26+/
/Applications/BLAST/ncbi-blast-2.2.26+
ls
ChangeLog README doc/ LICENSE bin/ ncbi_package_info
cd bin
/Applications/BLAST/ncbi-blast-2.2.26+/bin
ls
blast_formatter* blastx* makembindex* tblastn* blastdb_aliastool* convert2blastmask* makeprofiledb* tblastx* blastdbcheck* deltablast* psiblast* update_blastdb.pl* blastdbcmd* dustmasker* rpsblast* windowmasker* blastn* legacy_blast.pl* rpstblastn* blastp* makeblastdb* segmasker*
!blastn -help
USAGE blastn [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-task task_name] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-perc_identity float_value] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps_per_subject int_value] [-penalty penalty] [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value] [-template_type type] [-template_length int_value] [-dust DUST_options] [-filtering_db filtering_database] [-window_masker_taxid window_masker_taxid] [-window_masker_db window_masker_db] [-soft_masking soft_masking] [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-window_size int_value] [-off_diagonal_range int_value] [-use_index boolean] [-index_name string] [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-html] [-max_target_seqs num_sequences] [-num_threads int_value] [-remote] [-version] DESCRIPTION Nucleotide-Nucleotide BLAST 2.2.26+ OPTIONAL ARGUMENTS -h Print USAGE and DESCRIPTION; ignore other arguments -help Print USAGE, DESCRIPTION and ARGUMENTS description; ignore other arguments -version Print version number; ignore other arguments *** Input query options -query <File_In> Input file name Default = `-' -query_loc <String> Location on the query sequence in 1-based offsets (Format: start-stop) -strand <String, `both', `minus', `plus'> Query strand(s) to search against database/subject Default = `both' *** General search options -task <String, 
Permissible values: 'blastn' 'blastn-short' 'dc-megablast' 'megablast' 'rmblastn' > Task to execute Default = `megablast' -db <String> BLAST database name * Incompatible with: subject, subject_loc -out <File_Out> Output file name Default = `-' -evalue <Real> Expectation value (E) threshold for saving hits Default = `10' -word_size <Integer, >=4> Word size for wordfinder algorithm (length of best perfect match) -gapopen <Integer> Cost to open a gap -gapextend <Integer> Cost to extend a gap -penalty <Integer, <=0> Penalty for a nucleotide mismatch -reward <Integer, >=0> Reward for a nucleotide match -use_index <Boolean> Use MegaBLAST database index -index_name <String> MegaBLAST database index name *** BLAST-2-Sequences options -subject <File_In> * Incompatible with: db, gilist, seqidlist, negative_gilist, db_soft_mask, db_hard_mask -subject_loc <String> Location on the subject sequence in 1-based offsets (Format: start-stop) * Incompatible with: db, gilist, seqidlist, negative_gilist, db_soft_mask, db_hard_mask, remote *** Formatting options -outfmt <String> alignment view options: 0 = pairwise, 1 = query-anchored showing identities, 2 = query-anchored no identities, 3 = flat query-anchored, show identities, 4 = flat query-anchored, no identities, 5 = XML Blast output, 6 = tabular, 7 = tabular with comment lines, 8 = Text ASN.1, 9 = Binary ASN.1, 10 = Comma-separated values, 11 = BLAST archive format (ASN.1) Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers. 
The supported format specifiers are: qseqid means Query Seq-id qgi means Query GI qacc means Query accesion qaccver means Query accesion.version qlen means Query sequence length sseqid means Subject Seq-id sallseqid means All subject Seq-id(s), separated by a ';' sgi means Subject GI sallgi means All subject GIs sacc means Subject accession saccver means Subject accession.version sallacc means All subject accessions slen means Subject sequence length qstart means Start of alignment in query qend means End of alignment in query sstart means Start of alignment in subject send means End of alignment in subject qseq means Aligned part of query sequence sseq means Aligned part of subject sequence evalue means Expect value bitscore means Bit score score means Raw score length means Alignment length pident means Percentage of identical matches nident means Number of identical matches mismatch means Number of mismatches positive means Number of positive-scoring matches gapopen means Number of gap openings gaps means Total number of gaps ppos means Percentage of positive-scoring matches frames means Query and subject frames separated by a '/' qframe means Query frame sframe means Subject frame btop means Blast traceback operations (BTOP) When not provided, the default value is: 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore', which is equivalent to the keyword 'std' Default = `0' -show_gis Show NCBI GIs in deflines? -num_descriptions <Integer, >=0> Number of database sequences to show one-line descriptions for Default = `500' * Incompatible with: max_target_seqs -num_alignments <Integer, >=0> Number of database sequences to show alignments for Default = `250' * Incompatible with: max_target_seqs -html Produce HTML output? 
*** Query filtering options -dust <String> Filter query sequence with DUST (Format: 'yes', 'level window linker', or 'no' to disable) Default = `20 64 1' -filtering_db <String> BLAST database containing filtering elements (i.e.: repeats) -window_masker_taxid <Integer> Enable WindowMasker filtering using a Taxonomic ID -window_masker_db <String> Enable WindowMasker filtering using this repeats database. -soft_masking <Boolean> Apply filtering locations as soft masks Default = `true' -lcase_masking Use lower case filtering in query and subject sequence(s)? *** Restrict search or results -gilist <String> Restrict search of database to list of GI's * Incompatible with: negative_gilist, seqidlist, remote, subject, subject_loc -seqidlist <String> Restrict search of database to list of SeqId's * Incompatible with: gilist, negative_gilist, remote, subject, subject_loc -negative_gilist <String> Restrict search of database to everything except the listed GIs * Incompatible with: gilist, seqidlist, remote, subject, subject_loc -entrez_query <String> Restrict search with the given Entrez query * Requires: remote -db_soft_mask <String> Filtering algorithm ID to apply to the BLAST database as soft masking * Incompatible with: db_hard_mask, subject, subject_loc -db_hard_mask <String> Filtering algorithm ID to apply to the BLAST database as hard masking * Incompatible with: db_soft_mask, subject, subject_loc -perc_identity <Real, 0..100> Percent identity -culling_limit <Integer, >=0> If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit * Incompatible with: best_hit_overhang, best_hit_score_edge -best_hit_overhang <Real, (>=0 and =<0.5)> Best Hit algorithm overhang value (recommended value: 0.1) * Incompatible with: culling_limit -best_hit_score_edge <Real, (>=0 and =<0.5)> Best Hit algorithm score edge value (recommended value: 0.1) * Incompatible with: culling_limit -max_target_seqs <Integer, >=1> Maximum number of aligned 
sequences to keep * Incompatible with: num_descriptions, num_alignments *** Discontiguous MegaBLAST options -template_type <String, `coding', `coding_and_optimal', `optimal'> Discontiguous MegaBLAST template type * Requires: template_length -template_length <Integer, Permissible values: '16' '18' '21' > Discontiguous MegaBLAST template length * Requires: template_type *** Statistical options -dbsize <Int8> Effective length of the database -searchsp <Int8, >=0> Effective length of the search space -max_hsps_per_subject <Integer, >=0> Override maximum number of HSPs per subject to save for ungapped searches (0 means do not override) Default = `0' *** Search strategy options -import_search_strategy <File_In> Search strategy to use * Incompatible with: export_search_strategy -export_search_strategy <File_Out> File name to record the search strategy used * Incompatible with: import_search_strategy *** Extension options -xdrop_ungap <Real> X-dropoff value (in bits) for ungapped extensions -xdrop_gap <Real> X-dropoff value (in bits) for preliminary gapped extensions -xdrop_gap_final <Real> X-dropoff value (in bits) for final gapped alignment -no_greedy Use non-greedy dynamic programming extension -min_raw_gapped_score <Integer> Minimum raw gapped score to keep an alignment in the preliminary gapped and traceback stages -ungapped Perform ungapped alignment only? -window_size <Integer, >=0> Multiple hits window size, use 0 to specify 1-hit algorithm -off_diagonal_range <Integer, >=0> Number of off-diagonals to search for the 2nd hit, use 0 to turn off Default = `0' *** Miscellaneous options -parse_deflines Should the query and subject defline(s) be parsed? -num_threads <Integer, >=1> Number of threads (CPUs) to use in the BLAST search Default = `1' * Incompatible with: remote -remote Execute search remotely? * Incompatible with: gilist, seqidlist, negative_gilist, subject_loc, num_threads
Things to consider
You can make your own database from any fasta format file.
Some of the commonly used databases are found at NCBI and Uniprot.
A list of all NCBI databases for download is available at
ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Preformatted and fasta files are available.
Fasta files can be downloaded from UniProt at http://www.uniprot.org/downloads.
Do note that all of these fasta files / databases are routinely updated so it is important to know where to get the most recent version.
I have downloaded a number of these locally; they can be viewed at
http://eagle.fish.washington.edu/whale/index.php?dir=blast%2Fdb%2F. To use the files "locally" you will need to mount the computer hummmingbird.fish.washington.edu.
A description of these databases is shown below.
from IPython.display import HTML
HTML('<iframe src=https://docs.google.com/spreadsheet/pub?key=0AtV_gF766XZAdHpubXFXVmZlM0lBVnVSZURUMUdZMGc&output=html&widget=true width=100% height=750></iframe>')
ID | Description | url | comment |
---|---|---|---|
Cgigas_v9tran | Pacific oyster transcriptome; [.gz] 28027 gene (CDS only) sequences (via gigadb.org) | ftp://climb.genomics.cn/pub/10.5524/100001_101000/100030/gene_v9/oyster.v9.glean.final.rename.gff.cds.gz | note |
Cgigas_v9prot | Pacific oyster proteome; [.gz] 28027 protein sequences (via gigadb.org) | ftp://climb.genomics.cn/pub/10.5524/100001_101000/100030/gene_v9/oyster.v9.glean.final.rename.gff.pep.gz | note |
V_tubiashii | Assembly of Vibrio tubiashii (RE22) partial genome | http://files.figshare.com/91912/ContigsequencesRE22_6588.fa | note |
Olurida_v3tran | Olympia oyster transcriptome version 3 | http://eagle.fish.washington.edu/cnidarian/Olurida_transcriptome_v3.fasta | note |
mystery | mystery | http://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa | note |
!%s/ .*// Olurida_transcriptome_v3.fasta > test
/bin/sh: line 0: fg: no job control
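The expression `%s/ .*//` is a vim substitute command, which is why the shell rejects it. A minimal Python sketch follows, assuming the goal was to cut each fasta header at the first space so that only the sequence ID remains. The function names are hypothetical and not part of the original notebook.

```python
def strip_header_descriptions(lines):
    """Yield fasta lines with each header line cut at the first space."""
    for line in lines:
        if line.startswith(">"):
            # Keep only the sequence ID; drop the free-text description.
            yield line.rstrip("\n").split(" ", 1)[0] + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line


def strip_fasta_file(in_path, out_path):
    """Write a copy of in_path with shortened headers to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(strip_header_descriptions(src))
```

For example, `strip_fasta_file("Olurida_transcriptome_v3.fasta", "test")` would mirror the file names used in the failed cell.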
#can I tunnel into genefish to run blast
Download appropriate software from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
I downloaded the file to this location on my computer: /Volumes/Bay3/Software/.
cd /Volumes/Bay3/Software
/Volumes/Bay3/Software
#command that lists files (-1 = one file per line), only those that start with "ncbi"
!ls -1 ncbi*
ncbi-blast-2.2.27+-universal-macosx.tar.gz ncbi-blast-2.2.28+-universal-macosx.tar.gz ncbi-blast-2.2.29+-universal-macosx.tar.gz ncbi-blast-2.2.26+: ChangeLog LICENSE README bin blastdb.ncbirc db doc ncbi_package_info out query ncbi-blast-2.2.27+: ChangeLog LICENSE README bin db doc ncbi_package_info ncbi-blast-2.2.28+: ChangeLog LICENSE README bin db doc ncbi_package_info
#unzipping [-]x --extract --get; -v, --verbose; -z, --gzip; -f, --file F
!tar -xzvf ncbi-blast-2.2.29+-universal-macosx.tar.gz
x ncbi-blast-2.2.29+/ x ncbi-blast-2.2.29+/bin/ x ncbi-blast-2.2.29+/bin/makembindex x ncbi-blast-2.2.29+/bin/tblastn x ncbi-blast-2.2.29+/bin/psiblast x ncbi-blast-2.2.29+/bin/rpsblast x ncbi-blast-2.2.29+/bin/legacy_blast.pl x ncbi-blast-2.2.29+/bin/blastdbcmd x ncbi-blast-2.2.29+/bin/makeblastdb x ncbi-blast-2.2.29+/bin/tblastx x ncbi-blast-2.2.29+/bin/blastn x ncbi-blast-2.2.29+/bin/blastp x ncbi-blast-2.2.29+/bin/segmasker x ncbi-blast-2.2.29+/bin/dustmasker x ncbi-blast-2.2.29+/bin/blastx x ncbi-blast-2.2.29+/bin/blast_formatter x ncbi-blast-2.2.29+/bin/windowmasker x ncbi-blast-2.2.29+/bin/blastdb_aliastool x ncbi-blast-2.2.29+/bin/convert2blastmask x ncbi-blast-2.2.29+/bin/update_blastdb.pl x ncbi-blast-2.2.29+/bin/deltablast x ncbi-blast-2.2.29+/bin/blastdbcheck x ncbi-blast-2.2.29+/bin/rpstblastn x ncbi-blast-2.2.29+/bin/makeprofiledb x ncbi-blast-2.2.29+/doc/ x ncbi-blast-2.2.29+/doc/README.txt x ncbi-blast-2.2.29+/README x ncbi-blast-2.2.29+/ncbi_package_info x ncbi-blast-2.2.29+/LICENSE x ncbi-blast-2.2.29+/ChangeLog
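The same extraction can be done from Python instead of the shell. This is a sketch using the standard-library `tarfile` module; the function name `extract_tarball` is hypothetical, and the archive is assumed to be gzip-compressed like the one above.

```python
import tarfile


def extract_tarball(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive; return the member names."""
    # "r:gz" plays the role of the -z (gzip) and -x (extract) flags of tar.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names
```

Printing the returned list gives roughly what `-v` (verbose) shows in the shell version.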
#command that lists files (-1 = one file per line), only those that start with "ncbi"
!ls -1 ncbi*
ncbi-blast-2.2.27+-universal-macosx.tar.gz ncbi-blast-2.2.28+-universal-macosx.tar.gz ncbi-blast-2.2.29+-universal-macosx.tar.gz ncbi-blast-2.2.26+: ChangeLog LICENSE README bin blastdb.ncbirc db doc ncbi_package_info out query ncbi-blast-2.2.27+: ChangeLog LICENSE README bin db doc ncbi_package_info ncbi-blast-2.2.28+: ChangeLog LICENSE README bin db doc ncbi_package_info ncbi-blast-2.2.29+: ChangeLog LICENSE README bin doc ncbi_package_info
cd ncbi-blast-2.2.29+/
/Volumes/Bay3/Software/ncbi-blast-2.2.29+
cd bin
/Volumes/Bay3/Software/ncbi-blast-2.2.29+/bin
ls -1
blast_formatter* blastdb_aliastool* blastdbcheck* blastdbcmd* blastn* blastp* blastx* convert2blastmask* deltablast* dustmasker* legacy_blast.pl* makeblastdb* makembindex* makeprofiledb* psiblast* rpsblast* rpstblastn* segmasker* tblastn* tblastx* update_blastdb.pl* windowmasker*
#check to see if it "works"
!blastx -h
USAGE blastx [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps_per_subject int_value] [-max_intron_length length] [-seg SEG_options] [-soft_masking soft_masking] [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines] [-query_gencode int_value] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-html] [-max_target_seqs num_sequences] [-num_threads int_value] [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version] DESCRIPTION Translated Query-Protein Subject BLAST 2.2.28+ Use '-help' to print detailed descriptions of command line arguments
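The banner above reports 2.2.28+ although the working directory is the 2.2.29+ bin. One likely explanation (an assumption, not confirmed in this notebook) is that `!blastx` was resolved through PATH to an older install and not to the copy in the current directory. A small sketch for checking which executable actually runs; `report_executable` is a hypothetical helper and assumes Python 3.7 or later.

```python
import shutil
import subprocess


def report_executable(name="blastx"):
    """Return (path, version text) for the program found on PATH."""
    path = shutil.which(name)
    if path is None:
        # Nothing by that name on PATH.
        return None, None
    # -version is listed in the USAGE line above.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return path, result.stdout.strip()
```

To be sure the new copy is used, call it by its full path, e.g. `/Volumes/Bay3/Software/ncbi-blast-2.2.29+/bin/blastx`.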
I would like to make a database of UniProt/Swiss-Prot.
cd /Volumes/Bay3/Software/
/Volumes/Bay3/Software
cd ncbi-blast-2.2.29+/
[Errno 2] No such file or directory: 'ncbi-blast-2.2.29+/' /Volumes/Bay3/Software/ncbi-blast-2.2.29+/db
cd db
[Errno 2] No such file or directory: 'db' /Volumes/Bay3/Software/ncbi-blast-2.2.29+/db
ls
uniprot_sprot.fasta.gz
!gzip -d uniprot_sprot.fasta.gz
ls
uniprot_sprot.fasta
pwd
u'/Volumes/Bay3/Software/ncbi-blast-2.2.29+/db'
#note: I am working in the db directory, so I can just use file names. Most of the time you would use the complete path.
!makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_sprot_r2013_12
Building a new DB, current time: 01/08/2014 11:34:36 New DB name: uniprot_sprot_r2013_12 New DB title: uniprot_sprot.fasta Sequence type: Protein Keep Linkouts: T Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 541954 sequences in 53.9535 seconds.
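A quick way to confirm the database was written is to look for its component files. The sketch below assumes a single-volume protein database, for which makeblastdb writes `.phr`, `.pin` and `.psq` files next to the `-out` prefix; the helper name is hypothetical.

```python
import os

# Core files of a single-volume protein BLAST database.
PROTEIN_DB_EXTENSIONS = (".phr", ".pin", ".psq")


def missing_db_files(db_prefix, extensions=PROTEIN_DB_EXTENSIONS):
    """Return the expected database files that do not exist on disk."""
    return [db_prefix + ext for ext in extensions
            if not os.path.exists(db_prefix + ext)]
```

An empty list from `missing_db_files("uniprot_sprot_r2013_12")` would suggest the build completed; very large databases are split into volumes and would need a different check.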
#creating new directory;
!pwd
/Volumes/Bay3/Software/ncbi-blast-2.2.29+/db
cd ..
/Volumes/Bay3/Software/ncbi-blast-2.2.29+
!mkdir query
ls
ChangeLog README db/ ncbi_package_info LICENSE bin/ doc/ query/
cd query/
/Volumes/Bay3/Software/ncbi-blast-2.2.29+/query
#getting file from url to local location
!wget http://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa
--2014-01-08 11:40:14-- http://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa Resolving eagle.fish.washington.edu... 128.95.149.81 Connecting to eagle.fish.washington.edu|128.95.149.81|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 2030182 (1.9M) [text/plain] Saving to: `Ab_4denovo_CLC6_a.fa' 100%[======================================>] 2,030,182 --.-K/s in 0.03s 2014-01-08 11:40:14 (68.2 MB/s) - `Ab_4denovo_CLC6_a.fa' saved [2030182/2030182]
#let's get a preview
!head Ab_4denovo_CLC6_a.fa
>solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_1 ACACCCCACCCCAACGCACCCTCACCCCCACCCCAACAATCCATGATTGAATACTTCATC TATCCAAGACAAACTCCTCCTACAATCCATGATAGAATTCCTCCAAAAATAATTTCACAC TGAAACTCCGGTATCCGAGTTATTTTGTTCCCAGTAAAATGGCATCAACAAAAGTAGGTC TGGATTAACGAACCAATGTTGCTGCGTAATATCCCATTGACATATCTTGTCGATTCCTAC CAGGATCCGGACTGACGAGATTTCACTGTACGTTTATGCAAGTCATTTCCATATATAAAA TTGGATCTTATTTGCACAGTTAAATGTCTCTATGCTTATTTATAAATCAATGCCCGTAAG CTCCTAATATTTCTCTTTTCGTCCGACGAGCAAACAGTGAGTTTACTGTGGCCTTCAGCA AAAGTATTGATGTTGTAAATCTCAGTTGTGATTGAACAATTTGCCTCACTAGAAGTAGCC TTC
#word count
!wc Ab_4denovo_CLC6_a.fa
35092 35092 2030182 Ab_4denovo_CLC6_a.fa
#how many sequences? let's count ">" as we know each contig has 1
!grep -c ">" Ab_4denovo_CLC6_a.fa
5490
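The notebook does not show the blastx command that produced the output file inspected next. Below is a sketch of one way such a run might be assembled, using only flags listed in the blastx USAGE text. The e-value cutoff, the single target sequence per query, and the thread count are assumptions and not the settings actually used; `build_blastx_command` is a hypothetical helper.

```python
def build_blastx_command(query, db, out, evalue="1e-20", threads=1):
    """Return a blastx argument list that writes tabular (outfmt 6) output."""
    return [
        "blastx",
        "-query", query,            # nucleotide fasta file to translate and search
        "-db", db,                  # prefix given to makeblastdb -out
        "-out", out,                # tab-delimited results file
        "-evalue", str(evalue),     # assumed cutoff, not from the notebook
        "-outfmt", "6",             # 6 = tabular
        "-max_target_seqs", "1",    # assumed: keep one subject per query
        "-num_threads", str(threads),
    ]


cmd = build_blastx_command(
    "query/Ab_4denovo_CLC6_a.fa",
    "db/uniprot_sprot_r2013_12",
    "out/Ab_4denovo_CLC6_a_uniprot_blastx.tab",
)
print(" ".join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems with paths that contain `+` or spaces.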
!head /Volumes/Bay3/Software/ncbi-blast-2.2.29\+/out/Ab_4denovo_CLC6_a_uniprot_blastx.tab
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.46 171 30 0 1 513 35 205 1e-101 301 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.38 65 16 0 3 197 121 185 1e-27 104 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_6 sp|P12234|MPCP_BOVIN 76.62 77 18 0 2 232 286 362 2e-23 98.6 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_9 sp|Q41629|ADT1_WHEAT 82.26 62 11 0 3 188 170 231 3e-27 104 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_13 sp|Q32NG4|PDDC1_XENLA 54.44 90 40 1 1 270 140 228 1e-27 106 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_23 sp|Q9GNE2|RL23_AEDAE 97.22 72 2 0 67 282 14 85 1e-42 142 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_31 sp|Q3V1H3|HPHL1_MOUSE 53.38 133 59 1 2 391 23 155 5e-42 153 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_32 sp|Q641Y2|NDUS2_RAT 88.03 117 14 0 2 352 334 450 1e-70 224 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_37 sp|Q9D3D9|ATPD_MOUSE 56.10 123 54 0 2 370 46 168 7e-42 144 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_39 sp|Q39613|CYPH_CATRO 75.00 120 23 1 55 393 1 120 7e-49 160
!wc /Volumes/Bay3/Software/ncbi-blast-2.2.29\+/out/Ab_4denovo_CLC6_a_uniprot_blastx.tab
664 7968 84910 /Volumes/Bay3/Software/ncbi-blast-2.2.29+/out/Ab_4denovo_CLC6_a_uniprot_blastx.tab
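The wc output shows 664 lines and 7968 words, i.e. 12 fields per line, which matches the 'std' column set described in the blastn help text. A minimal sketch for reading such a file into named fields, assuming the default column order; the function names are hypothetical.

```python
import csv

# Default column order for -outfmt 6, from the help text ('std').
STD_COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
               "qstart qend sstart send evalue bitscore").split()


def read_blast_tab(lines):
    """Yield one dict per hit from tab-delimited BLAST output lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(STD_COLUMNS, row))


def count_queries_with_hits(lines):
    """Count distinct query sequences that have at least one hit."""
    return len({hit["qseqid"] for hit in read_blast_tab(lines)})
```

Passing an open file handle for `Ab_4denovo_CLC6_a_uniprot_blastx.tab` to `count_queries_with_hits` shows how many of the 5490 contigs were matched. All values are left as strings; convert `pident`, `evalue` and `bitscore` with `float()` as needed.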