BLASTN C.gigas against NCBI nt DB

Code explanation (by line)

  1. Calls blastn program (!blastn) and specifies to use blastn (-task blastn).
  2. Specifies which query file to blast.
  3. Specifies which database file to blast against (located on Hummingbird /Volumes/Data/blast_dbs).
  4. Specifies output format. In this case, output format 6 with subject scientific names.
  5. Specifies maximum number of DB matches to save.
  6. Specifies number of CPUs to use.
  7. Output file name.
  8. Directs stderr output to file instead of printing to screen.
In [7]:
!blastn -task blastn \
-query Owl/halfshell/EmmaBS400.fa \
-db nt \
-outfmt "6 sscinames" \
-max_target_seqs 1 \
-num_threads 16 \
-out 20150501_nt_blastn.tab \
2> 20150501_nt_blastn.err
^C

Interrupted kernel because I glanced at the output file and saw this (after running for ~6hrs!):

In [8]:
!head -30 20150501_nt_blastn.tab
Methanohalophilus mahii DSM 5219
uncultured organism
uncultured organism
Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Mus musculus
Brassica rapa subsp. pekinensis
Vanderwaltozyma polyspora DSM 70294
Leishmania braziliensis MHOM/BR/75/M2904
Dictyostelium discoideum AX4
Dictyostelium discoideum AX4
Homo sapiens
Maribacter sp. HTCC2170
Dictyostelium discoideum AX4
Mus musculus
Dictyostelium discoideum AX4
Dictyostelium fasciculatum
Chrysemys picta bellii
Rattus norvegicus
Rattus norvegicus
Rattus norvegicus
Rattus norvegicus
Rattus norvegicus
Beta vulgaris
Beta vulgaris
Beta vulgaris

The issue is that the output only contains the species info and not the rest of the formatting (e-vals, bit scores, etc) that are part of BLASTn format 6.

It turns out that specifying format 6 just specifies tab-delimited output. If an "additional" feature is specified from the default output of format 6, then the defaults no longer apply; you have to specify them all.

Re-run BLAST with correct output formatting

In [ ]:
!blastn -task blastn \
-query Owl/halfshell/EmmaBS400.fa \
-db nt \
-outfmt "6 qseqid sseqid pident length evalue stitle sscinames" \
-max_target_seqs 1 \
-num_threads 16 \
-out 20150501_nt_blastn.tab \
2> 20150501_nt_blastn.err

Verify file output

In [10]:
!head -10 20150501_nt_blastn.tab
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_1	gi|292665689|gb|CP001994.1|	86.00	50	1e-04	Methanohalophilus mahii DSM 5219, complete genome	Methanohalophilus mahii DSM 5219
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_2	gi|534503879|gb|KF524813.1|	99.70	4658	0.0	Uncultured organism clone 89_8 Microvirus J protein, Capsid proteins, Major spike protein (G protein), Microvirus H protein, Bacteriophage replication gene A protein, and Phage protein C genes, complete cds; and Bacteriophage scaffolding protein D gene, partial cds	uncultured organism
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_2	gi|534503879|gb|KF524813.1|	100.00	802	0.0	Uncultured organism clone 89_8 Microvirus J protein, Capsid proteins, Major spike protein (G protein), Microvirus H protein, Bacteriophage replication gene A protein, and Phage protein C genes, complete cds; and Bacteriophage scaffolding protein D gene, partial cds	uncultured organism
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_3	gi|72493824|dbj|AP008934.1|	81.13	53	0.009	Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305 DNA, complete genome	Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4	gi|28475607|gb|AC117201.2|	70.56	180	2e-04	Mus musculus BAC clone RP23-216E6 from 16, complete sequence	Mus musculus
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4	gi|28475607|gb|AC117201.2|	70.81	161	0.033	Mus musculus BAC clone RP23-216E6 from 16, complete sequence	Mus musculus
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4	gi|28475607|gb|AC117201.2|	70.25	158	0.11	Mus musculus BAC clone RP23-216E6 from 16, complete sequence	Mus musculus
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4	gi|28475607|gb|AC117201.2|	69.44	180	0.11	Mus musculus BAC clone RP23-216E6 from 16, complete sequence	Mus musculus
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4	gi|28475607|gb|AC117201.2|	68.78	189	0.40	Mus musculus BAC clone RP23-216E6 from 16, complete sequence	Mus musculus
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4	gi|28475607|gb|AC117201.2|	69.70	165	1.4	Mus musculus BAC clone RP23-216E6 from 16, complete sequence	Mus musculus

Count number of matched sequences

In [17]:
!wc -l 20150501_nt_blastn.tab
   67326 20150501_nt_blastn.tab

Check best matches based on e-value

The code below cuts columns (fields; -f) 5 and 7 (e-value and species) of the input file, sorts and displays first 100 entries (head -100)

In [15]:
!cut -f5,7 20150501_nt_blastn.tab | sort | head -100
0.0	Alteromonas macleodii str. 'Ionian Sea U4'
0.0	Alteromonas macleodii str. 'Ionian Sea U4'
0.0	Canis lupus familiaris
0.0	Comamonas sp. 7D-2
0.0	Comamonas sp. 7D-2
0.0	Comamonas sp. 7D-2
0.0	Comamonas sp. 7D-2
0.0	Crassostrea angulata
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Crassostrea gigas
0.0	Cycloclasticus zancles 7-ME
0.0	Cycloclasticus zancles 7-ME
0.0	Cycloclasticus zancles 7-ME
0.0	Dinoroseobacter shibae DFL 12
0.0	Dinoroseobacter shibae DFL 12
0.0	Dinoroseobacter shibae DFL 12
0.0	Dinoroseobacter shibae DFL 12
0.0	Dinoroseobacter shibae DFL 12
0.0	Dinoroseobacter shibae DFL 12
0.0	Dinoroseobacter shibae DFL 12
0.0	Dinoroseobacter shibae DFL 12
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
0.0	Homo sapiens
sort: write failed: standard output: Broken pipe
sort: write error

Extract all ouput with e-values less than or equal to 0.001

The code uses awk to obtain all lines in the file that have e-values less than or equal to 0.001 ($5<=0.001). The e-values are found in column (i.e. field) 5 of the input file ($5).

In [79]:
%%bash
awk ' $5<=0.001' 20150501_nt_blastn.tab > 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt

Count number of lines produced from previous command

In [81]:
!wc -l 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt
   40700 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt

Store number of lines as Python variable

The code stores the number of lines in the Python variable "total" from the output of the bash command (designated by the "!").

In [92]:
total = !wc -l < 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt

Verify variable has correct number

Notice that the number of lines is stored as a string list. This is denoted by the brackets and single quotes in the output.

In [93]:
print total
['   40700']

Convert string list variable to integer

This enables the number to be more easily used for subsequent calculations.

In [96]:
total = int(total[0])

Verify variable is now and integer and no longer a string list

This is confirmed by the lack of brackets around the output.

In [97]:
print total
40700

Count the number of each species in the BLAST matches

The code uses awk again to extract all lines with e-values <=0.001. Those lines are then cut on column (i.e. field; -f) 7, sorted (which is necessary for the subsequent "uniq" command), and the uniq entries are counted (uniq -c).

In [70]:
%%bash
awk ' $5<=0.001' 20150501_nt_blastn.tab | cut -f7 | sort | uniq -c > 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt

Check output file

In [71]:
!head 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
   2 'Nostoc azollae' 0708
   1 Acanthamoeba castellanii
  13 Acanthamoeba castellanii mamavirus
  89 Acanthamoeba polyphaga mimivirus
   1 Acanthocheilonema viteae
   2 Acanthocystis turfacea Chlorella virus TN603.4.2
   1 Acanthopagrus schlegelii
   4 Acaryochloris marina MBIC11017
   1 Acaryochloris sp. HICR111A
   2 Acetobacter pasteurianus 386B

Remove leading whitespace from file

Although difficult to notice, the previous command where the number of unique entities are counted results in an output with a bunch of spaces at the beginning of each line that I don't want.

The command below utilizes sed. Sed creates a backup of the input file (-i.bu; ".bu" is a custom suffix for the backup file - you can enter anything you'd like) and then removes (substitutes, thus the s/) all spaces (*/) found only at the beginning (^) of each line.

In [72]:
!sed -i.bu 's/^ *//g' 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt

Verify removal of leading whitespaces

Note how the output is shifted to the left compared to how it was above.

In [73]:
!head -30 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
2 'Nostoc azollae' 0708
1 Acanthamoeba castellanii
13 Acanthamoeba castellanii mamavirus
89 Acanthamoeba polyphaga mimivirus
1 Acanthocheilonema viteae
2 Acanthocystis turfacea Chlorella virus TN603.4.2
1 Acanthopagrus schlegelii
4 Acaryochloris marina MBIC11017
1 Acaryochloris sp. HICR111A
2 Acetobacter pasteurianus 386B
3 Acetobacterium woodii DSM 1030
8 Acetohalobium arabaticum DSM 5501
2 Achromobacter denitrificans
1 Acidianus hospitalis W1
1 Acidiphilium cryptum JF-5
1 Acidovorax sp. KKS102
1 Aciduliprofundum boonei T469
2 Acinetobacter baumannii
2 Acinetobacter baumannii AYE
1 Acinetobacter baumannii BJAB07104
4 Acinetobacter baumannii BJAB0715
2 Acinetobacter baumannii BJAB0868
1 Acinetobacter baumannii SDF
1 Acinetobacter calcoaceticus
11 Acinetobacter calcoaceticus PHEA-2
4 Acinetobacter oleivorans DR1
8 Acinetobacter sp. ADP1
2 Acinetobacter sp. C42
1 Acinetobacter sp. M-1
1 Acontias meleagris

Count number of unique species

In [75]:
!wc -l 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
    1608 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt

Copy file to Eagle

In [77]:
!cp 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt /Volumes/Eagle/Arabidopsis/

View top 15 species of BLAST matches

The code sorts the file in reverse (-r) numerical (-n) order on the first column (-k1) and displays the first 15 lines (head -15).

In [78]:
!sort -rn -k1 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt | head -15
9737 Dictyostelium discoideum AX4
8248 Danio rerio
6047 Homo sapiens
2739 Mus musculus
956 Rattus norvegicus
411 Solanum lycopersicum
395 Volvox carteri f. nagariensis
377 Dictyostelium fasciculatum
369 Hucho taimen
368 Dictyostelium purpureum
365 Schistosoma mansoni
360 Botryotinia fuckeliana
312 Lodderomyces elongisporus NRRL YB-4239
265 Crassostrea gigas
248 Octadecabacter antarcticus 307

Store Crassostrea gigas counts in variable

The code uses bash grep to find any lines with "Crassostrea gigas" on them and then uses awk to print the first column (i.e. field; $1) which contains the count.

In [108]:
gigas_count = !grep 'Crassostrea gigas' 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt | awk '{ print $1 }'

Confirm count is stored in variable

In [109]:
print gigas_count
['265']

Convert Python variable value from string list to floating number

Conversion to float is necessary based on the desired output we want later. The output I want later will be a value less than 1. Integers can only display whole numbers.

In [110]:
gigas_count = float(gigas_count[0])

Verify conversion from string list to float

In [111]:
print gigas_count
265.0

Calculate percentage of Crassostrea gigas BLAST hits

In [113]:
print (gigas_count/total) * 100
0.651105651106