FASTQC Analysis

In [2]:
!fastqc --version
FastQC v0.11.2
In [1]:
!fastqc 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Started analysis of 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 5% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 10% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 15% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 20% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 25% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 30% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 35% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 40% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 45% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 50% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 55% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 60% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 65% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 70% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 75% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 80% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 85% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 90% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 95% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Approx 100% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Analysis complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
In [8]:
!cp 2112_lane1_NoIndex_L001_R1_001_fastqc.* \
/Volumes/Eagle/Arabidopsis/iPythonNotebooks/
In [9]:
!mv /Volumes/Eagle/Arabidopsis/iPythonNotebooks/2112_lane1_NoIndex_L001_R1_001_fastqc.html \
/Volumes/Eagle/Arabidopsis/iPythonNotebooks/20150313_2112_lane1_NoIndex_L001_R1_001_fastqc.html
In [10]:
!mv /Volumes/Eagle/Arabidopsis/iPythonNotebooks/2112_lane1_NoIndex_L001_R1_001_fastqc.zip \
/Volumes/Eagle/Arabidopsis/iPythonNotebooks/20150313_2112_lane1_NoIndex_L001_R1_001_fastqc.zip
In [1]:
from IPython.display import HTML
HTML('<iframe src=http://eagle.fish.washington.edu/Arabidopsis/iPythonNotebooks/20150313_2112_lane1_NoIndex_L001_R1_001_fastqc.html width=100% height=700></iframe>')
Out[1]:

Illumina Index Identification Methods Comparison

In [6]:
pwd
Out[6]:
u'/Users/Sam/Documents'

Count the total number of sequences in the FASTQ file and store in variable

This command uses bash commands to count the number of lines in the FASTQ file (wc-l), divides the total number of lines by 4 (there are 4 lines per read in Illumina FASTQ files). The echo command is used to print the result to the screen, which gets stored in the variable: TotalSeqs

In [127]:
TotalSeqs = !echo $((`wc -l < 2112_lane1_NoIndex_L001_R1_001.fastq` / 4))
In [128]:
#Prints the value stored in TotalSeqs.
#Notice that this is a Python string list and is not an integer!
print TotalSeqs
['16000000']
In [129]:
#Converts the value in the TotalSeqs string list at index 0 (TotalSeqs[0]) to 
#an integer value of base 10.
#This conversion will be used repeatedly throughout this notebook to allow 
#mathematical calculations using the numbers generated by bash commands.
TotalSeqs = int(TotalSeqs[0], 10)
In [130]:
print TotalSeqs
16000000

Use bash grep and wc -l to count all the instances of each of the full-length TruSeq adaptor/index sequences.

The index sequence is indicated in each of the respective variable names.

Additionally, the Epigentek barcode number is indicated in the variable names (e.g. BC1 = barcode 1).

In [43]:
BC1_ATCACG_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [44]:
#Converts the value in the BC1_ATCACG_full string list at index 0 (BC1_ATCACG_full[0]) to 
#an integer value of base 10.
BC1_ATCACG_full = int(BC1_ATCACG_full[0] ,10)
In [45]:
print BC1_ATCACG_full
294374

In [46]:
BC3_TTAGGC_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [47]:
#Converts the value in the BC3_TTAGGC_full string list at index 0 (BC3_TTAGGC_full[0]) to 
#an integer value of base 10.
BC3_TTAGGC_full = int(BC3_TTAGGC_full[0], 10)
In [48]:
print BC3_TTAGGC_full
244638

In [49]:
BC4_TGACCA_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [50]:
#Converts the value in the BC4_TGACCA_full string list at index 0 (BC4_TGACCA_full[0]) to 
#an integer value of base 10.
BC4_TGACCA_full = int(BC4_TGACCA_full[0], 10)
In [51]:
print BC4_TGACCA_full
388498

In [80]:
BC5_ACAGTG_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [82]:
#Converts the value in the BC5_ACAGTG_full string list at index 0 (BC5_ACAGTG_full[0]) to 
#an integer value of base 10.
BC5_ACAGTG_full = int(BC5_ACAGTG_full[0], 10)
In [83]:
print BC5_ACAGTG_full
308463

In [55]:
BC6_GCCAAT_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [56]:
#Converts the value in the BC6_GCCAAT_full string list at index 0 (BC6_GCCAAT_full[0]) to 
#an integer value of base 10.
BC6_GCCAAT_full = int(BC6_GCCAAT_full[0], 10)
In [57]:
print BC6_GCCAAT_full
211205

In [58]:
BC7_CAGATC_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [59]:
#Converts the value in the BC7_CAGATC_full string list at index 0 (BC7_CAGATC_full[0]) to 
#an integer value of base 10.
BC7_CAGATC_full = int(BC7_CAGATC_full[0], 10)
In [60]:
print BC7_CAGATC_full
504685

Total Number of Full-length Barcodes Identified

In [69]:
#Adds all of the counts from each full-length Illumina adaptor/index sequence.
#Saves to variable "sum_full".
sum_full = BC1_ATCACG_full + BC3_TTAGGC_full + BC4_TGACCA_full + BC5_ACAGTG_full + BC6_GCCAAT_full + BC7_CAGATC_full
In [70]:
print sum_full
1951863

Percentage of Reads Containing Full-length Barcodes

In [131]:
#Calculates percentage of reads having full-lenght Illumina adaptor/index sequences.
#Uses "float" to convert integer values to floating point decimals. Necessary since 
#the calculation on integers would be < 1 & would result in an answer of '0'.
print ((float(sum_full)/TotalSeqs)*100)
12.19914375

Use bash grep and wc -l to count all the instances of each of the TruSeq index sequences.

The index sequence is indicated in each of the respective variable names.

Additionally, the Epigentek barcode number is indicated in the variable names (e.g. BC1 = barcode 1).

In [84]:
BC1_ATCACG = !grep -o 'ATCACG' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [85]:
#Converts the value in the BC1_ATCACG string list at index 0 (BC1_ATCACG[0]) to 
#an integer value of base 10.
BC1_ATCACG = int(BC1_ATCACG[0] ,10)
In [86]:
print BC1_ATCACG
700818

In [87]:
BC3_TTAGGC = !grep -o 'TTAGGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [88]:
#Converts the value in the BC3_TTAGGC string list at index 0 (BC3_TTAGGC[0]) to 
#an integer value of base 10.
BC3_TTAGGC = int(BC3_TTAGGC[0] ,10)
In [89]:
print BC3_TTAGGC
387329

In [15]:
BC4_TGACCA = !grep -o 'TGACCA' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [16]:
#Converts the value in the BC4_TGACCA string list at index 0 (BC4_TGACCA[0]) to 
#an integer value of base 10.
BC4_TGACCA = int(BC4_TGACCA[0])
In [17]:
print BC4_TGACCA
727528

In [93]:
BC5_ACAGTG = !grep -o 'ACAGTG' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [94]:
#Converts the value in the BC5_ACAGTG string list at index 0 (BC5_ACAGTG[0]) to 
#an integer value of base 10.
BC5_ACAGTG = int(BC5_ACAGTG[0] ,10)
In [95]:
print BC5_ACAGTG
544521

In [105]:
BC6_GCCAAT = !grep -o 'GCCAAT' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [106]:
#Converts the value in the BC6_GCCAAT string list at index 0 (BC6_GCCAAT[0]) to 
#an integer value of base 10.
BC6_GCCAAT = int(BC6_GCCAAT[0] ,10)
In [107]:
print BC6_GCCAAT
469213

In [108]:
BC7_CAGATC = !grep -o 'CAGATC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
In [109]:
#Converts the value in the BC7_CAGATC string list at index 0 (BC7_CAGATC[0]) to 
#an integer value of base 10.
BC7_CAGATC = int(BC7_CAGATC[0] ,10)
In [110]:
print BC7_CAGATC
1107028

Total Number of Short Barcodes Identified

In [111]:
#Adds all of the counts from each Illumina adaptor/index sequence.
#Saves to variable "sum_short".
sum_short = BC1_ATCACG + BC3_TTAGGC + BC4_TGACCA + BC5_ACAGTG + BC6_GCCAAT + BC7_CAGATC
In [112]:
print sum_short
3936437

Percentage of Reads Containing Short Barcodes

In [132]:
#Calculates percentage of reads having full-lenght Illumina adaptor/index sequences.
#Uses "float" to convert integer values to floating point decimals. Necessary since 
#the calculation on integers would be < 1 & would result in an answer of '0'.
print ((float(sum_short)/TotalSeqs)*100)
24.60273125

Use fastx_barcode_splitter to identify full-length TruSeq adaptor/index sequences.

The fastx_barcode_splitter is a component of fastx_toolkit-0.0.13.2

In [138]:
#The full-lengths barcode file used by fastx_barcode_splitter.
!head TruSeqBarcodesLong.txt
BC7_CAGATC	GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGC
BC4_TGACCA	GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGC
BC5_ACAGTG	GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC
BC1_ATCACG	GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGC
BC3_TTAGGC	GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC
BC6_GCCAAT	GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC

Look for full-length barcodes at beginning of lines

In [139]:
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile TruSeqBarcodesLong.txt)
#Specify to look for barcode at beginning of file (--bol)
#Specify output location and append a prefix to new file name (--prefix ./bol_)
#Specify new file name suffix (--suffix ".fastq")
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile TruSeqBarcodesLong.txt \
--bol \
--prefix ./bol_long_ \
--suffix ".fastq" | \
tee bol_long_stats.txt
Barcode	Count	Location
BC1_ATCACG	359478	./bol_long_BC1_ATCACG.fastq
BC3_TTAGGC	299176	./bol_long_BC3_TTAGGC.fastq
BC4_TGACCA	472062	./bol_long_BC4_TGACCA.fastq
BC5_ACAGTG	378448	./bol_long_BC5_ACAGTG.fastq
BC6_GCCAAT	257815	./bol_long_BC6_GCCAAT.fastq
BC7_CAGATC	605680	./bol_long_BC7_CAGATC.fastq
unmatched	13627341	./bol_long_unmatched.fastq
total	16000000

Look for full-length barcodes at end of lines

In [140]:
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile TruSeqBarcodesLong.txt)
#Specify to look for barcode at beginning of file (--eol)
#Specify output location and append a prefix to new file name (--prefix ./eol_)
#Specify new file name suffix (--suffix ".fastq")
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile TruSeqBarcodes.txt \
--eol \
--prefix ./eol_long_ \
--suffix ".fastq" | \
tee eol_long_stats.txt
Barcode	Count	Location
BC1_ATCACG	136	./eol_long_BC1_ATCACG.fastq
BC3_TTAGGC	180	./eol_long_BC3_TTAGGC.fastq
BC4_TGACCA	388	./eol_long_BC4_TGACCA.fastq
BC5_ACAGTG	138	./eol_long_BC5_ACAGTG.fastq
BC6_GCCAAT	99	./eol_long_BC6_GCCAAT.fastq
BC7_CAGATC	336	./eol_long_BC7_CAGATC.fastq
unmatched	15998723	./eol_long_unmatched.fastq
total	16000000

Use fastx_barcode_splitter to identify TruSeq index sequences.

In [141]:
#The full-lenghts barcode file used by fastx_barcode_splitter.
!head TruSeqBarcodesShort.txt
BC7_CAGATC	CAGATC
BC4_TGACCA	TGACCA
BC5_ACAGTG	ACAGTG
BC1_ATCACG	ATCACG
BC3_TTAGGC	TTAGGC
BC6_GCCAAT	GCCAAT

Look for index sequences at beginning of lines

In [142]:
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile TruSeqBarcodesShort.txt)
#Specify to look for barcode at beginning of file (--bol)
#Specify output location and append a prefix to new file name (--prefix ./bol_)
#Specify new file name suffix (--suffix ".fastq")
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile TruSeqBarcodesShort.txt \
--bol \
--prefix ./bol_ \
--suffix ".fastq" | \
tee bol_short_stats.txt
Barcode	Count	Location
BC1_ATCACG	54841	./bol_BC1_ATCACG.fastq
BC3_TTAGGC	24864	./bol_BC3_TTAGGC.fastq
BC4_TGACCA	63480	./bol_BC4_TGACCA.fastq
BC5_ACAGTG	12874	./bol_BC5_ACAGTG.fastq
BC6_GCCAAT	25066	./bol_BC6_GCCAAT.fastq
BC7_CAGATC	24092	./bol_BC7_CAGATC.fastq
unmatched	15794783	./bol_unmatched.fastq
total	16000000

Look for index sequences at end of lines

In [143]:
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile TruSeqBarcodesShort.txt)
#Specify to look for barcode at beginning of file (--eol)
#Specify output location and append a prefix to new file name (--prefix ./eol_)
#Specify new file name suffix (--suffix ".fastq")
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile TruSeqBarcodesShort.txt \
--eol \
--prefix ./eol_ \
--suffix ".fastq" | \
tee eol_short_stats.txt
Barcode	Count	Location
BC1_ATCACG	86091	./eol_BC1_ATCACG.fastq
BC3_TTAGGC	27144	./eol_BC3_TTAGGC.fastq
BC4_TGACCA	63478	./eol_BC4_TGACCA.fastq
BC5_ACAGTG	30680	./eol_BC5_ACAGTG.fastq
BC6_GCCAAT	54759	./eol_BC6_GCCAAT.fastq
BC7_CAGATC	162844	./eol_BC7_CAGATC.fastq
unmatched	15575004	./eol_unmatched.fastq
total	16000000