Notebook

SAM¶

This notebook explores parsing and understanding the SAM (Sequence Alignment/Map) and related BAM format. SAM is an extremely common format for representing read alignments. Most widely-used read alignment tools output SAM alignments.

SAM is a text format. There is a closely related binary format called BAM. There are two types of BAM files: unsorted or sorted. Various tools, notably SAMtools and Picard, can convert back and forth between SAM and BAM, and can sort an existing BAM file. When we say a BAM file is sorted we almost always mean that the alignments are sorted left-to-right along the reference genome. (It's also possible to sort a BAM file by read name, though that's only occassionally useful.)

Once you have an interesting set of read alignments that you would like to keep for a while and perhaps analyze further, it's a good idea to keep them in sorted BAM files. This is because:

They will be well compressed. BAM files are smaller than corresponding SAM files, and sorted BAM files are smaller than corresponding unsorted BAM files.
From a sorted BAM file, it's easy to extract just the alignments that overlap a specified stretch of the genome, making it easy to convert from sorted BAM to many other useful formats.

That said, most alignment tools output SAM (not BAM), and the alignments come out in an arbitrary order -- not sorted.

An authoritative and complete document describing the SAM and BAM formats is the SAM specification. This document is thorough, but, being a specification, it does not describe is the various "dialects" of legal SAM emitted by popular tools. I'll cover some of that here.

In [1]:

# Here's a string representing a three-line SAM file.  I'm temporarily
# ignoring the fact that SAM files usually have several header lines at
# the beginning.
samStr = '''\
r1	0	gi|9626243|ref|NC_001416.1|	18401	42	122M	*	0	0	TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG	+"@6<:27(F&5)9)"B:%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>	AS:i:-5	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:59G13G21G26	YT:Z:UU
r2	0	gi|9626243|ref|NC_001416.1|	8886	42	275M	*	0	0	NTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC	(#!!'+!$""%+(+)'%)%!+!(&++)''"#"#&#"!'!("%'""("+&%$%*%%#$%#%#!)*'(#")(($&$'&%+&#%*)*#*%*')(%+!%%*"$%"#+)$&&+)&)*+!"*)!*!("&&"*#+"&"'(%)*("'!$*!!%$&&&$!!&&"(*"$&"#&!$%'%"#)$#+%*+)!&*)+(""#!)!%*#"*)*')&")($+*%%)!*)!('(%""+%"$##"#+(('!*(($*'!"*('"+)&%#&$+('**$$&+*&!#%)')'(+(!%+	AS:i:-14	XN:i:0	XM:i:8	XO:i:0	XG:i:0	NM:i:8	MD:Z:0A0C0G0A108C23G9T81T46	YT:Z:UU
r3	16	gi|9626243|ref|NC_001416.1|	11599	42	338M	*	0	0	GGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT	7F$%6=$:9B@/F'>=?!D?@0(:A*)7/>9C>6#1<6:C(.CC;#.;>;2'$4D:?&B!>689?(0(G7+0=@37F)GG=>?958.D2E04C<E,*AD%G0.%$+A:'H;?8<72:88?E6((CF)6DF#.)=>B>D-="C'B080E'5BH"77':"@70#4%A5=6.2/1>;9"&-H6)=$/0;5E:<8G!@::1?2DC7C*;@*#.1C0.D>H/20,!"C-#,6@%<+<D(AG-).?&#0.00'@)/F8?B!&"170,)>:?<A7#1(A@0E#&A.*DC.E")AH"+.,5,2>5"2?:G,F"D0B8D-6$65D<D!A/38860.*4;4B<*31?6	AS:i:-22	XN:i:0	XM:i:8	XO:i:0	XG:i:0	NM:i:8	MD:Z:80C4C16A52T23G30A8T76A41	YT:Z:UU'''

In [2]:

# I'll read this string in line-by-line as though it were a file.
# I'll (lightly) parse the alignment records as I go.
import string
from io import StringIO  # reading from string rather than file
for ln in StringIO(samStr):
    qname, flag, rname, pos, mapq, cigar, rnext, \
    pnext, tlen, seq, qual, extras = str.split(ln, '\t', 11)
    print(qname, len(seq)) # print read name, length of read sequence

r1 122
r2 275
r3 338

SAM fields¶

As you can see from this last example, each SAM record is on a separate line, and it consists of several tab-delimited fields. The SAM specification is the authoritative source for information on what all the fields mean exactly, but here's a brief summary of each of the fields (in order):

qname is the name of the read
flags is a bit field encoding some yes/no pieces of information about whether and how the read aligned
rname is the name of the reference sequence that the read aligned to (if applicable). E.g., might be "chr17" meaning "chromosome 17"
pos is the 1-based offset into the reference sequence where the read aligned.
mapq for an aligned read, this is a confidence value; high when we're very confident we've found the correct alignment, low when we're not confident
cigar indicates where any gaps occur in the alignment
rnext only relevant for paired-end reads; name of the reference sequence where other end aligned
pnext only relevant for paired-end reads; 1-based offset into the reference sequence where other end aligned
tlen only relevant for paired-end reads; fragment length inferred from alignment
seq read sequence. If read aligned to reference genome's reverse-complement, this is the reverse complement of the read sequence.
qual quality sequence. If read aligned to reference genome's reverse-complement, this is the reverse of the quality sequence.
extras tab-separated "extra" fields, usually optional and aligner-specific but often very important!

Field 1, qname is the name of the read. Read names often contain information about:

The scientific study for which the read was sequenced.
The sequencing instrument, and the exact part of the sequencing instrument, where the DNA was sequenced.

Field 5, mapq, encodes the probability p that the alignment reported is incorrect. The probability is encoded as an integer Q on the Phred scale:

$$ Q = -10 \cdot \log_{10}(p) $$

Fields 7, 8 and 9 (rnext, pnext and tlen) are only relevant if the read is part of a pair. By way of background, sequencers can be configured to report pairs of DNA snippets that appear close to each other in the genome. To accomplish this, the sequencer sequences both ends of a longer fragment of DNA. When this is the case rnext and pnext tell us where the other end of the pair aligned, and tlen tells us the length of the fragment, as inferred from the alignments of the two ends.

Field 10, seq, is the nucleotide sequence of the read. The nucleotide sequence is reverse complemented if the read aligned to the reverse complement of the reference genome. (This is equivalent to the reverse complement of the read aligning to the genome.) seq can contain the character "N". N essentially means "no confidence." The sequencer knows there's a nucleotide there but doesn't know whether it's an A, C, G or T.

Field 11, qual, is the quality sequence of the read. Each nucleotide in seq has a corresponding quality value in this string. A nucleotide's quality value encodes the probability that the nucleotide was incorrectly called by the sequencing instrument and its software. For details on this encoding, see the FASTQ notebook.

Flags¶

The flags field is a bitfield. Individual bits correspond to certain yes/no properties of the alignment. Here are the most relevant ones:

Bit 0 (least significant): 1 if read is paired-end, 0 otherwise
Bit 1: for paired-end reads only: 1 if the pair aligns concordantly, 0 otherwise
Bit 2: 1 if read failed to align, 0 otherwise
Bit 3: for paried-end reads only: 1 if the other end failed to align, 0 otherwise
Bit 4: 1 if read aligned to Crick strand, 0 if Watson strand
Bit 5: for paired-end reads only: 1 if the other end aligned to Crick strand, 0 if Watson strand
Bit 6: for paired-end reads only: 1 if this is the first (#1) end, 0 if this is the second (#2) end
Bit 7: for paired-end reads only: 0 if this is the first (#1) end, 1 if this is the second (#2) end

There are a few more that are used less often; see the [SAM specification] for details.

Alignment score¶

How do we know how good an alignment is, i.e., how well the read sequence matches the corresponding referene sequence. This information is spread across a few places:

The AS:i extra field
The cigar field
The MD:Z extra field

While the AS:i extra field is not required by the specification, all the most popular tools output it. The integer appearing in this field is an alignment score. The higher the score, the more similar the read sequence is to the reference sequence.

Different tools use differen scales for AS:i. Sometimes (e.g. in Bowtie 2's --end-to-end alignment mode)

End-to-end versus local alignment¶

Some alignment tools such as Bowtie, BWA and Bowtie 2 in --end-to-end mode, will attempt to align a sequencing read end-to-end. In other words, the read will align such that every nucleotide of the read participates.

Alignment shape¶

We would like to know exactly how the read aligned to the reference, including where all the mismatches and gaps are, and which characters appear opposite the gaps. This is not possible just by looking at the read sequence together with the CIGAR string. That doesn't tell us what reference characters appear in the mismatched position, or in the positions involved in deletions. Instead, we combine information from the (a) read sequence, (b) CIGAR string, and (c) MD:Z string. The CIGAR and MD:Z strings are both described in the SAM specification.

First we construct a function to parse the CIGAR field:

In [3]:

def cigarToList(cigar):
    ''' Parse CIGAR string into a list of CIGAR operations.  For more
        info on CIGAR operations, see SAM spec:
        http://samtools.sourceforge.net/SAMv1.pdf '''
    ret, i = [], 0
    op_map = {'M':0, # match or mismatch
              '=':0, # match
              'X':0, # mismatch
              'I':1, # insertion in read w/r/t reference
              'D':2, # deletion in read w/r/t reference
              'N':3, # long gap due e.g. to splice junction
              'S':4, # soft clipping due e.g. to local alignment
              'H':5, # hard clipping
              'P':6} # padding
    # Seems like = and X together are strictly more expressive than M.
    # Why not just have = and X and get rid of M?  Space efficiency,
    # mainly.  The titans discuss: http://www.biostars.org/p/17043/
    while i < len(cigar):
        run = 0
        while i < len(cigar) and cigar[i].isdigit():
            # parse one more digit of run length
            run *= 10
            run += int(cigar[i])
            i += 1
        assert i < len(cigar)
        # parse cigar operation
        op = cigar[i]
        i += 1
        assert op in op_map
        # append to result
        ret.append([op_map[op], run])
    return ret

In [4]:

cigarToList('10=1X10=')

Out[4]:

[[0, 10], [0, 1], [0, 10]]

Next we construct a function to parse the MD:Z extra field:

In [5]:

def mdzToList(md):
    ''' Parse MD:Z string into a list of operations, where 0=match,
        1=read gap, 2=mismatch. '''
    i = 0;
    ret = [] # list of (op, run, str) tuples
    while i < len(md):
        if md[i].isdigit(): # stretch of matches
            run = 0
            while i < len(md) and md[i].isdigit():
                run *= 10
                run += int(md[i])
                i += 1 # skip over digit
            if run > 0:
                ret.append([0, run, ""])
        elif md[i].isalpha(): # stretch of mismatches
            mmstr = ""
            while i < len(md) and md[i].isalpha():
                mmstr += md[i]
                i += 1
            assert len(mmstr) > 0
            ret.append([1, len(mmstr), mmstr])
        elif md[i] == "^": # read gap
            i += 1 # skip over ^
            refstr = ""
            while i < len(md) and md[i].isalpha():
                refstr += md[i]
                i += 1 # skip over inserted character
            assert len(refstr) > 0
            ret.append([2, len(refstr), refstr])
        else:
            raise RuntimeError('Unexpected character in MD:Z: "%d"' % md[i])
    return ret

In [6]:

# Each element in the list returned by this call is itself a list w/ 3
# elements.  Element 1 is the MD:Z operation (0=match, 1=mismatch,
# 2=deletion).  Element 2 is the length and element 3 is the relevant
# sequence of nucleotides from the reference.
mdzToList('10A5^AC6')

Out[6]:

[[0, 10, ''], [1, 1, 'A'], [0, 5, ''], [2, 2, 'AC'], [0, 6, '']]

Now we can write a fucntion that takes a read sequennce, a parsed CIGAR string, and a parse MD:Z string and combines information from all three to make what I call a "stacked alignment."

In [7]:

def cigarMdzToStacked(seq, cgp, mdp_orig):
    ''' Takes parsed CIGAR and parsed MD:Z, generates a stacked alignment:
        a pair of strings with gap characters inserted (possibly) and where
        characters at at the same offsets are opposite each other in the
        alignment.  Only knows how to handle CIGAR ops M=XDINSH right now.
    '''
    mdp = mdp_orig[:]
    rds, rfs = [], []
    mdo, rdoff = 0, 0
    for c in cgp:
        op, run = c
        skipping = (op == 4 or op == 5)
        assert skipping or mdo < len(mdp)
        if op == 0: # CIGAR op M, = or X
            # Look for block matches and mismatches in MD:Z string
            mdrun = 0
            runleft = run
            while runleft > 0 and mdo < len(mdp):
                op_m, run_m, st_m = mdp[mdo]
                run_comb = min(runleft, run_m)
                runleft -= run_comb
                assert op_m == 0 or op_m == 1
                rds.append(seq[rdoff:rdoff + run_comb])
                if op_m == 0: # match from MD:Z string
                    rfs.append(seq[rdoff:rdoff + run_comb])
                else: # mismatch from MD:Z string
                    assert len(st_m) == run_comb
                    rfs.append(st_m)
                mdrun += run_comb
                rdoff += run_comb
                # Stretch of matches in MD:Z could span M and I CIGAR ops
                if run_comb < run_m:
                    assert op_m == 0
                    mdp[mdo][1] -= run_comb
                else:
                    mdo += 1
        elif op == 1: # CIGAR op I
            rds.append(seq[rdoff:rdoff + run])
            rfs.append("-" * run)
            rdoff += run
        elif op == 2: # D
            op_m, run_m, st_m = mdp[mdo]
            assert op_m == 2
            assert run == run_m
            assert len(st_m) == run
            mdo += 1
            rds.append("-" * run)
            rfs.append(st_m)
        elif op == 3: # N
            rds.append("-" * run)
            rfs.append("-" * run)
        elif op == 4: # S
            rds.append(seq[rdoff:rdoff + run].lower())
            rfs.append(' ' * run)
            rdoff += run
        elif op == 5: # H
            rds.append('!' * run)
            rfs.append(' ' * run)
        elif op == 6: # P
            raise RuntimeError("Don't know how to handle P in CIGAR")
        else:
            raise RuntimeError('Unexpected CIGAR op: %d' % op)
    assert mdo == len(mdp)
    return ''.join(rds), ''.join(rfs)

In [8]:

# Following example includes gaps and mismatches
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGATAAACC', cigarToList('12M2D17M2I14M'), mdzToList('12^AT30G0'))

Out[8]:

('GGACGCTCAGTA--GTGACGATAGCTGAAAACCCTGTACGATAAACC',
 'GGACGCTCAGTAATGTGACGATAGCTGAAAA--CTGTACGATAAACG')

In [9]:

# Following example also includes soft clipping (CIGAR: S)
# SAM spec: Soft clipping: "clipped sequences present in SEQ"
# We print them in lowercase to emphasize their clippedness
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC', cigarToList('12M2D17M2I8M6S'), mdzToList('12^AT25'))

Out[9]:

('GGACGCTCAGTA--GTGACGATAGCTGAAAACCCTGTACGAgaagcc',
 'GGACGCTCAGTAATGTGACGATAGCTGAAAA--CTGTACGA      ')

In [10]:

# Following example also includes hard clipping (CIGAR: H)
# SAM spec: Hard clipping: "clipped sequences NOT present in SEQ"
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC', cigarToList('12M2D17M2I8M6S3H'), mdzToList('12^AT25'))
# Note: don't see hard clipping in practice much

Out[10]:

('GGACGCTCAGTA--GTGACGATAGCTGAAAACCCTGTACGAgaagcc!!!',
 'GGACGCTCAGTAATGTGACGATAGCTGAAAA--CTGTACGA         ')

In [11]:

# Following example also includes skipping (CIGAR: N), as seen in
# TopHat alignments
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC',
                  cigarToList('12M2D10M10N7M2I8M6S3H'),
                  mdzToList('12^AT25'))

Out[11]:

('GGACGCTCAGTA--GTGACGATAG----------CTGAAAACCCTGTACGAgaagcc!!!',
 'GGACGCTCAGTAATGTGACGATAG----------CTGAAAA--CTGTACGA         ')

From the stacked alignment, it's easy to do other things. E.g. we can turn a stacked alignment into a new CIGAR string that uses the = and X operations instead of the less specific M operation:

In [12]:

def cigarize(rds, rfs):
    off = 0
    oplist = []
    lastc, cnt = '', 0
    for i in range(len(rds)):
        c = None
        if rfs[i] == ' ':
            c = 'S'
        elif rds[i] == '-' and rfs[i] == '-':
            c = 'N'
        elif rds[i] == '-':
            c = 'D'
        elif rfs[i] == '-':
            c = 'I'
        elif rds[i] != rfs[i]:
            c = 'X'
        else:
            c = '='
        if c == lastc:
            cnt += 1
        else:
            if len(lastc) > 0:
                oplist.append((lastc, cnt))
            lastc, cnt = c, 1
    if len(lastc) > 0:
        oplist.append((lastc, cnt))
    return ''.join(map(lambda x: str(x[1]) + x[0], oplist))

In [13]:

x, y = cigarMdzToStacked('ACGTACGT', cigarToList('8M'), mdzToList('4G3'))

In [14]:

cigarize(x, y)

Out[14]:

'4=1X3='

In [15]:

x, y = cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC',
                         cigarToList('12M2D10M10N7M2I8M6S3H'),
                         mdzToList('12^AT25'))

In [16]:

cigarize(x, y)

Out[16]:

'12=2D10=10N7=2I8=9S'

Other resources¶

SAM specification
SAMtools (and associated publication)
pysam (Python library)
Picard (Java library)
Bamtools (C++ library)
Rsamtools (R library)