#!/usr/bin/env python
# coding: utf-8

# FASTQ
# =====
# 
# This notebook explores [FASTQ], the most common format for storing sequencing reads.
# 
# [FASTA] and FASTQ are rather similar, but FASTQ is almost always used for storing *sequencing reads* (with associated quality values), whereas FASTA is used for storing all kinds of DNA, RNA or protein sequences (without associated quality values).
# 
# Before delving into the format, I should mention that there are great tools and libraries for parsing and manipulating FASTQ, e.g. [FASTX], and [BioPython]'s [SeqIO] module.  If your needs are relatively simple, you might try using these tools and libraries and skip reading this document.
# 
# [FASTA]: http://en.wikipedia.org/wiki/FASTA_format
# [FASTQ]: http://en.wikipedia.org/wiki/FASTQ_format
# [BioPython]: http://biopython.org/wiki/Main_Page
# [SeqIO]: http://biopython.org/wiki/SeqIO
# [FASTX]: http://hannonlab.cshl.edu/fastx_toolkit/

# ### Basic format
# Here's a single sequencing read in FASTQ format:
# 
#     @ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1
#     AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATGAGAGCTCAGAAGTAACAGTTGCTTTCAGTCCCATAAAAACAGTCCTACAA
#     +
#     BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGFEGFGFFDFFHBGDGFHGEFGHFGHGFGFFFEHGGFGGDGHGFEEHFFHGE
# 
# It's spread across four lines.  The four lines are:
# 
# 1. "`@`" followed by a read name
# 2. Nucleotide sequence
# 3. "`+`", possibly followed by some info, but ignored by virtually all tools
# 4. Quality sequence (explained below)
# 
# Here is a very simple Python function for parsing a file of FASTQ records:

# In[1]:


def parse_fastq(fh):
    """ Parse reads from a FASTQ filehandle.  For each read, we
        return a name, nucleotide-string, quality-string triple. """
    reads = []
    while True:
        first_line = fh.readline()
        if len(first_line) == 0:
            break  # end of file
        name = first_line[1:].rstrip()
        seq = fh.readline().rstrip()
        fh.readline()  # ignore line starting with +
        qual = fh.readline().rstrip()
        reads.append((name, seq, qual))
    return reads

fastq_string = '''@ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1
AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG
+
BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF
@ERR294379.136275489 HS24_09441:8:2311:1917:99340#42/1
CTTAAGTATTTTGAAAGTTAACATAAGTTATTCTCAGAGAGACTGCTTTT
+
@@AHFF?EEDEAF?FEEGEFD?GGFEFGECGE?9H?EEABFAG9@CDGGF
@ERR294379.97291341 HS24_09441:8:2201:10397:52549#42/1
GGCTGCCATCAGTGAGCAAGTAAGAATTTGCAGAAATTTATTAGCACACT
+
CDAF<FFDEHEFDDFEEFDGDFCHD=GHG<GEDHDGJFHEFFGEFEE@GH'''

from io import StringIO

parse_fastq(StringIO(fastq_string))


# The nucleotide string can sometimes contain the character "`N`".  `N` essentially means "no confidence." The sequencer knows there's a nucleotide there but doesn't know whether it's an A, C, G or T.
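# 
# As a rough illustration, the cell below counts the `N` calls in a read and reports where they occur.  The helper name `find_no_calls` and the short example read are made up for this notebook; they are not part of any library.

# In[ ]:


def find_no_calls(read_seq):
    """ Return the 0-based offsets of every N in a nucleotide string. """
    return [i for i, base in enumerate(read_seq) if base.upper() == 'N']

# Made-up read with two ambiguous positions (illustration only)
example_seq = 'ACGTNACGTTNA'
no_call_offsets = find_no_calls(example_seq)
print(no_call_offsets)                          # offsets of the N calls
print(len(no_call_offsets) / len(example_seq))  # fraction of the read that is N
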

# ### Read name
# 
# Read names often contain information about:
# 
# 1. The scientific study for which the read was sequenced.  E.g. the string `ERR294379` (an [SRA accession number](http://www.ebi.ac.uk/ena/about/sra_format)) in the read names corresponds to [this study](http://www.ncbi.nlm.nih.gov/sra/?term=ERR294379).
# 2. The sequencing instrument, and the exact *part* of the sequencing instrument, where the DNA was sequenced.  See the [FASTQ format](http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers) Wikipedia article for specifics on how the Illumina software encodes this information.
# 3. Whether the read is part of a *paired-end read* and, if so, which end it is.  Paired-end reads will be discussed further below.  The `/1` you see at the end of the read names above indicates that the read is the first end of a paired-end read.
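# 
# To make the three kinds of information concrete, here is a minimal sketch that splits a read name like the ones above into labeled fields.  It assumes the layout `<accession> <instrument>:<lane>:<tile>:<x>:<y>#<index>/<end>`, which matches the older Illumina identifiers described in the Wikipedia article; names from other instruments or software versions will not fit.  The function name `parse_read_name` and the field names are my own choices.

# In[ ]:


def parse_read_name(read_name):
    """ Split a read name into labeled fields.
        ASSUMPTION: the name looks like
        '<accession> <instrument>:<lane>:<tile>:<x>:<y>#<index>/<end>';
        any other layout raises ValueError. """
    accession, illumina_id = read_name.split()
    illumina_id, end = illumina_id.rsplit('/', 1)  # which end of the pair
    coords, index = illumina_id.rsplit('#', 1)     # multiplexing index
    instrument, lane, tile, x, y = coords.split(':')
    return {'accession': accession, 'instrument': instrument,
            'lane': int(lane), 'tile': int(tile),
            'x': int(x), 'y': int(y),
            'index': index, 'end': int(end)}

parse_read_name('ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1')
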

# ### Quality values
# 
# Quality values are probabilities.  Each nucleotide in each sequencing read has an associated quality value.  A nucleotide's quality value encodes the probability that the nucleotide was *incorrectly called* by the sequencing instrument and its software.  If the nucleotide is `A`, the corresponding quality value encodes the probability that the nucleotide at that position is actually *not* an `A`.
# 
# Quality values are encoded in two senses: first, the relevant probabilities are rescaled using the Phred scale, which is a negative log scale.  In other words, if *p* is the probability that the nucleotide was incorrectly called, we encode this as *Q* where *Q* = -10 \* log10(*p*).
# 
# For example, if *Q* = 30, then *p* = 0.001, a 1-in-1000 chance that the nucleotide is wrong.  If *Q* = 20, then *p* = 0.01, a 1-in-100 chance.  If *Q* = 10, then *p* = 0.1, a 1-in-10 chance.  And so on.
# 
# Second, scaled quality values are *rounded* to the nearest integer and encoded using [ASCII printable characters](http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters).  For example, using the Phred33 encoding (which is by far the most common), a *Q* of 30 is encoded as the ASCII character with code 33 + 30 = 63, which is "`?`".  A *Q* of 20 is encoded as the ASCII character with code 33 + 20 = 53, which is "`5`".  And so on.
# 
# Let's define some relevant Python functions:

# In[2]:


import math


def phred33_to_q(qual):
    """ Turn Phred+33 ASCII-encoded quality into Phred-scaled integer """
    return ord(qual) - 33

def q_to_phred33(Q):
    """ Turn Phred-scaled integer into Phred+33 ASCII-encoded quality """
    return chr(Q + 33)

def q_to_p(Q):
    """ Turn Phred-scaled integer into error probability """
    return 10.0 ** (-0.1 * Q)

def p_to_q(p):
    """ Turn error probability into Phred-scaled integer """
    return int(round(-10.0 * math.log10(p)))


# In[3]:


# Here are the examples I discussed above

# Convert Qs into ps
q_to_p(30), q_to_p(20), q_to_p(10)


# In[4]:


p_to_q(0.00011) # note that result is rounded


# In[5]:


q_to_phred33(30), q_to_phred33(20)


# To convert an entire string of Phred33-encoded quality values into the corresponding *Q* or *p* values, I can do the following:

# In[6]:


# Take the first read from the small example above
name, seq, qual = parse_fastq(StringIO(fastq_string))[0]
q_string = list(map(phred33_to_q, qual))
p_string = list(map(q_to_p, q_string))
print(q_string)
print(p_string)


# You might wonder how the sequencer and its software can *know* the probability that a nucleotide is incorrectly called.  It can't; this number is just an estimate.  To describe exactly how it's estimated is beyond the scope of this notebook; if you're interested, search for academic papers with "base calling" in the title.  Here's a helpful [video by Rafa Irizarry](http://www.youtube.com/watch?v=eXkjlopwIH4).
# 
# A final note: other ways of encoding quality values were proposed and used in the past.  For example, Phred64 uses an ASCII offset of 64 instead of 33, and Solexa64 uses "odds" instead of the probability *p*.  But Phred33 is by far the most common today and you will likely never have to worry about this.
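# 
# For completeness, here is a hedged sketch of how those older encodings differ.  Phred64 only changes the ASCII offset.  The Solexa scale replaces *p* with the odds *p* / (1 - *p*), so *Q* = -10 \* log10(*p* / (1 - *p*)); solving for *p* gives *p* = odds / (1 + odds).  The function names below are my own and are not taken from any library.

# In[ ]:


def phred64_to_q(qual):
    """ Turn Phred+64 ASCII-encoded quality into Phred-scaled integer """
    return ord(qual) - 64

def solexa_q_to_p(Q_solexa):
    """ Turn a Solexa-scaled quality into an error probability.
        The Solexa scale is Q = -10 * log10(p / (1 - p)). """
    odds = 10.0 ** (-0.1 * Q_solexa)  # odds = p / (1 - p)
    return odds / (1.0 + odds)

# 'h' is Q = 40 under Phred64; the same character under Phred33 would be Q = 71
print(phred64_to_q('h'))

# The two scales nearly agree at high quality and diverge at low quality:
# compare each value with 10 ** (-0.1 * Q) on the Phred scale
print(solexa_q_to_p(30), solexa_q_to_p(0))
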

# ### Paired-end reads
# 
# Sequencing reads can come in *pairs*.  Instead of reporting a single snippet of nucleotides from the genome, the sequencer might report a *pair* of snippets that appear *close to each other* in the genome.  To accomplish this, the sequencer sequences *both ends* of a longer *fragment* of DNA.
# 
# Here is some simple Python code that mimics how the sequencer obtains one paired-end read:

# In[7]:


# Let's just make a random genome of length 1K
import random
random.seed(637485)
genome = ''.join([random.choice('ACGT') for _ in range(1000)])
genome


# In[8]:


# The sequencer draws a fragment from the genome of length, say, 250
offset = random.randint(0, len(genome) - 250)
fragment = genome[offset:offset+250]
fragment


# In[9]:


# Then it reads sequences from either end of the fragment
end1, end2 = fragment[:75], fragment[-75:]
end1, end2


# In[10]:


# And because of how the whole biochemical process works, the
# second end is always from the opposite strand from the first.

# function for reverse-complementing
_revcomp_trans = str.maketrans("ACGTacgt", "TGCAtgca")
def reverse_complement(s):
    return s[::-1].translate(_revcomp_trans)

end2 = reverse_complement(end2)
end1, end2


# FASTQ can be used to store paired-end reads.  Say we have 1000 paired-end reads.  We should store them in a *pair* of FASTQ files.  The first FASTQ file (say, `reads_1.fq`) would contain all of the first ends and the second FASTQ file (say, `reads_2.fq`) would contain all of the second ends.  In both files, the ends would appear in corresponding order.  That is, the first entry in `reads_1.fq` is paired with the first entry in `reads_2.fq` and so on.
# 
# Here is a Python function that parses a pair of files containing paired-end reads.

# In[11]:


def parse_paired_fastq(fh1, fh2):
    """ Parse paired-end reads from a pair of FASTQ filehandles.
        For each pair, we return a pair of (name, nucleotide-string,
        quality-string) triples: one for the first end and one for
        the second end. """
    reads = []
    while True:
        first_line_1, first_line_2 = fh1.readline(), fh2.readline()
        if len(first_line_1) == 0 and len(first_line_2) == 0:
            break  # end of both files
        if len(first_line_1) == 0 or len(first_line_2) == 0:
            # one file ran out before the other, so the ends can't be paired
            raise ValueError('FASTQ files contain different numbers of reads')
        name_1, name_2 = first_line_1[1:].rstrip(), first_line_2[1:].rstrip()
        seq_1, seq_2 = fh1.readline().rstrip(), fh2.readline().rstrip()
        fh1.readline()  # ignore line starting with +
        fh2.readline()  # ignore line starting with +
        qual_1, qual_2 = fh1.readline().rstrip(), fh2.readline().rstrip()
        reads.append(((name_1, seq_1, qual_1), (name_2, seq_2, qual_2)))
    return reads

fastq_string1 = '''@509.6.64.20524.149722/1
AGCTCTGGTGACCCATGGGCAGCTGCTAGGGAGCCTTCTCTCCACCCTGA
+
HHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIHHIHFHHF
@509.4.62.19231.2763/1
GTTGATAAGCAAGCATCTCATTTTGTGCATATACCTGGTCTTTCGTATTC
+
HHHHHHHHHHHHHHEHHHHHHHHHHHHHHHHHHHHHHHDHHHHHHGHGHH'''

fastq_string2 = '''@509.6.64.20524.149722/2
TAAGTCAGGATACTTTCCCATATCCCAGCCCTGCTCCNTCTTTAAATAAT
+
HHHHHHHHHHHHHHHHHHHH@HHFHHHEFHHHHHHFF#FFFFFFFHHHHH
@509.4.62.19231.2763/2
CTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAANAAGAGCTTACTA
+
HHHHHHHHHHHHHHHHHHEHEHHHFHGHHHHHHHH>@#@=44465HHHHH'''

parse_paired_fastq(StringIO(fastq_string1), StringIO(fastq_string2))
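

# Pairing by position only works if the two files really are in corresponding order.  One cheap sanity check, sketched below, is to confirm that the two names in each pair agree once the end marker is removed.  This assumes the ends are marked with a trailing `/1` or `/2`, as in the example above; other naming schemes need a different rule.  The helper name `mate_names_match` is made up for this notebook.

# In[ ]:


def mate_names_match(name_1, name_2):
    """ True if two read names are identical once a trailing /1 or /2
        is removed.  ASSUMPTION: ends are marked with a /1 or /2 suffix. """
    def strip_end(read_name):
        return read_name[:-2] if read_name.endswith(('/1', '/2')) else read_name
    return strip_end(name_1) == strip_end(name_2)

# Names copied from the paired example above
print(mate_names_match('509.6.64.20524.149722/1', '509.6.64.20524.149722/2'))
# A mismatched pair, as you would see if the files were out of order
print(mate_names_match('509.6.64.20524.149722/1', '509.4.62.19231.2763/2'))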


# ### Other comments
# 
# In all the examples above, the reads in the FASTQ file are all the same length.  This need not be the case, though it is usually true for datasets generated by sequencing-by-synthesis instruments.  FASTQ files can contain reads of various lengths.
# 
# FASTQ files often have extension `.fastq` or `.fq`.
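# 
# If you want to see whether a file mixes read lengths, a small tally is enough.  The sketch below assumes the simple four-lines-per-record layout described earlier; the function name `read_length_counts` and the three tiny records are made up for illustration and are not real sequencing data.

# In[ ]:


from collections import Counter
from io import StringIO

def read_length_counts(fh):
    """ Count how many reads of each length a FASTQ filehandle contains. """
    counts = Counter()
    while True:
        if len(fh.readline()) == 0:
            break  # end of file
        read_seq = fh.readline().rstrip()
        fh.readline()  # ignore line starting with +
        read_qual = fh.readline().rstrip()
        # every nucleotide should have exactly one quality value
        if len(read_seq) != len(read_qual):
            raise ValueError('sequence and quality strings differ in length')
        counts[len(read_seq)] += 1
    return counts

# Made-up records with two different read lengths (illustration only)
varied_fastq = '''@read_a
ACGTACGT
+
IIIIIIII
@read_b
ACGTA
+
IIIII
@read_c
ACGTACGT
+
IIIIIIII'''

read_length_counts(StringIO(varied_fastq))
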

# ### Other resources
# 
# * [Wikipedia page for FASTQ format](http://en.wikipedia.org/wiki/Fastq_format)
# * [BioPython], which has [its own ways of parsing FASTQ](http://biopython.org/wiki/SeqIO)
# * [FASTX] toolkit
# * [seqtk]
# * [FastQC]
# 
# [BioPython]: http://biopython.org/wiki/Main_Page
# [SeqIO]: http://biopython.org/wiki/SeqIO
# [SAMtools]: http://samtools.sourceforge.net/
# [FASTX]: http://hannonlab.cshl.edu/fastx_toolkit/
# [FASTQC]: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
# [seqtk]: https://github.com/lh3/seqtk
# 
# © Copyright [Ben Langmead](http://www.cs.jhu.edu/~langmea) 2014--2019