#!/usr/bin/env python
# coding: utf-8

# Repetitive DNA elements ("repeats") are DNA sequences prevalent in genomes, especially of higher eukaryotes. Repeats make up about [50% of the human genome](http://www.nature.com/nature/journal/v409/n6822/abs/409860a0.html) and over [80% of the maize genome](http://www.sciencemag.org/content/326/5956/1112). Repeats can be categorized as *interspersed*, where similar DNA sequences are spread throughout the genome, or *tandem*, where similar sequences are adjacent (see [Treangen and Salzberg](http://www.nature.com/nrg/journal/v13/n2/full/nrg3164.html)). Some interspersed repeats are long segmental duplications, but most are relatively short transposons and retrotransposons. Though repeats are sometimes referred to as “junk,” they are involved in processes of current scientific interest, including genome expansion, speciation, and epigenetic regulation (see [Fedoroff](http://www.sciencemag.org/content/338/6108/758)). Some are still actively expressed and duplicated, including in the human genome (see [Witherspoon et al](http://genome.cshlp.org/content/23/7/1170.long), [Tyekucheva et al](http://www.biomedcentral.com/1471-2164/12/495)).

# #### RepeatMasker
# 
# [RepeatMasker](http://www.repeatmasker.org/) is both a tool for identifying repeats in a genome sequence, and a [database of repeats](http://www.repeatmasker.org/genomicDatasets/RMGenomicDatasets.html) that have been found.  The database covers some well known model species, like human, chimpanzee, gorilla, rhesus, rat, mouse, horse, cow, cat, dog, chicken, zebrafish, bee, fruitfly and roundworm.  People often use RepeatMasker to remove ("mask out") repetitive sequences from the genome so that they can be ignored (or otherwise treated specially) in later analyses, though that's not our goal here.
# 
# It's intructive to click on some of the species listed in [the database](http://www.repeatmasker.org/genomicDatasets/RMGenomicDatasets.html) and examine the associated bar and pie charts describing their repeat content.  For example, note the differences between the bar charts for [human](http://www.repeatmasker.org/species/hg.html) and [mouse](http://www.repeatmasker.org/species/mm.html), especially for SINE/Alu and LINE/L1.

# #### Working with RepeatMasker databases
# 
# Let's obtain and parse a RepeatMasker database.  We'll start with [roundworm](http://www.repeatmasker.org/species/ce.html) because it's relatively small (only about 2.5 megabytes compressed).

# In[1]:


import urllib.request
rm_site = 'http://www.repeatmasker.org'
fn = 'ce10.fa.out.gz'
url = '%s/genomes/ce10/RepeatMasker-rm405-db20140131/%s' % (rm_site, fn)
urllib.request.urlretrieve(url, fn)


# In[2]:


import gzip
import itertools
fh = gzip.open(fn, 'rt')
for ln in itertools.islice(fh, 10):
    print(ln, end='')


# Above are the first several lines of the `.out.gz` file for the roundworm (*C. elegans*).  The columns have headers, which are somewhat helpful.  More detail is available in the [RepeatMasker documentation](http://www.repeatmasker.org/webrepeatmaskerhelp.html) under "How to read the results".  (Note that in addition to the 14 fields descrived in the documentation, there's also a 15th `ID` field.)

# Here's an extremely simple class that parses a line from these files and stores the individual values in its fields:

# In[3]:


class Repeat(object):
    def __init__(self, ln):
        # parse fields
        (self.swsc, self.pctdiv, self.pctdel, self.pctins, self.refid,
         self.ref_i, self.ref_f, self.ref_remain, self.orient, self.rep_nm,
         self.rep_cl, self.rep_prior, self.rep_i, self.rep_f, self.unk) = ln.split()
        # int-ize the reference coordinates
        self.ref_i, self.ref_f = int(self.ref_i), int(self.ref_f)


# We can parse a file into a list of Repeat objects:

# In[4]:


def parse_repeat_masker_db(fn):
    reps = []
    with gzip.open(fn) if fn.endswith('.gz') else open(fn) as fh:
        fh.readline()  # skip header
        fh.readline()  # skip header
        fh.readline()  # skip header
        while True:
            ln = fh.readline()
            if len(ln) == 0:
                break
            reps.append(Repeat(ln.decode('UTF8')))
    return reps


# In[5]:


reps = parse_repeat_masker_db('ce10.fa.out.gz')


# #### Extracting repeats from the genome in FASTA format
# 
# Now let's obtain the genome for the roundworm in FASTA format.  For more information on FASTA, see [the FASTA notebook](http://nbviewer.ipython.org/gist/BenLangmead/8307011).  As seen above, the name of the genome assembly used by RepeatMasker is `ce10`.  We can get it from the UCSC server.  It's around 30 MB.

# In[6]:


ucsc_site = 'http://hgdownload.cse.ucsc.edu/goldenPath'
fn = 'chromFa.tar.gz'
urllib.request.urlretrieve("%s/ce10/bigZips/%s" % (ucsc_site, fn), fn)


# In[7]:


get_ipython().system('tar zxvf chromFa.tar.gz')


# Let's load chromosome I into a string so that we can see the sequences of the repeats.

# In[8]:


from collections import defaultdict

def parse_fasta(fns):
    ret = defaultdict(list)
    for fn in fns:
        with open(fn, 'rt') as fh:
            for ln in fh:
                if ln[0] == '>':
                    name = ln[1:].rstrip()
                else:
                    ret[name].append(ln.rstrip())
    for k, v in ret.items():
        ret[k] = ''.join(v)
    return ret


# In[9]:


genome = parse_fasta(['chrI.fa', 'chrII.fa', 'chrIII.fa', 'chrIV.fa', 'chrM.fa', 'chrV.fa', 'chrX.fa'])


# In[10]:


genome['chrI'][:1000]  # printing just the first 1K nucleotides


# Note the combination of lowercase and uppercase.  Actually, that relates to our discussion here.  The lowercase stretches are repeats!  The UCSC genome sequences use the lowercase/uppercase distinction to make it clear where the repeats are -- and they know this because they ran RepeatMasker on the genome beforehand.  In this case, the two repeats you can see are both simple hexamer repeats.  Also, note that their position in the genome corresponds to the first two rows of the RepeatMasker database that we printed above.

# We write a function that, given a Repeat and given a dictionary containing the sequences of all the chromosomes in the genome, outputs each repeat string.

# In[11]:


def extract_repeat(rep, genome):
    assert rep.refid in genome
    return genome[rep.refid][rep.ref_i-1:rep.ref_f]


# In[12]:


extract_repeat(reps[0], genome)


# In[13]:


extract_repeat(reps[1], genome)


# In[14]:


extract_repeat(reps[2], genome)


# Let's specifically try to extract a repeat from the DNA/CMC-Chapaev family.

# In[15]:


chapaevs = filter(lambda x: 'DNA/CMC-Chapaev' == x.rep_cl, reps)


# In[16]:


[extract_repeat(chapaev, genome) for chapaev in chapaevs]


# #### How are repeats related?
# 
# Look at the repeat family/class names for the first several repeats in the roundworm database:

# In[17]:


from operator import attrgetter

'  '.join(map(attrgetter('rep_cl'), reps[:60]))


# You'll notice a few things.  (1) The family names seem to have some hierarchical relationships; e.g. `DNA/TcMar-Tc1` seems to be more specific than `DNA`, (2) some of them end in a question mark, (3) some of them are `Unknown`.  I don't really know what these mean or what to do as a result -- you'll have to navigate that issue.  Seems like you can often look up the family names on RepeatMasker's site and find more detailed info (e.g., here are the details for [DNA/TcMar-Tc1](http://www.repeatmasker.org/cgi-bin/ViewRepeat?id=Tc1-7_Xt)).

# #### Alternatives to RepeatMasker
# 
# [Dfam](http://www.dfam.org/) is an alternative.  Note that Dfam ultimately relies on Repbase for its "seed alignments."  Also, the only the human genome has a pre-built Dfam database, as far as I can tell.