Tutorial:

paired-end ddRAD analysis w/ merged reads

(or other two-cutter based datatypes)

pyRAD v.3.0.4


Topics:

  • Setup of params file for paired-end ddRAD (pairddrad)
  • Assemble simulated data set
  • Visualize ideal results on simulated data

What is ddRAD?

The double digest library preparation method was developed and described by Peterson et al. 2012. Here I describe paired-end ddRAD data and my recommendations for how to set up a params file to analyze them in pyRAD.

A ddRAD library is prepared by cutting genomic DNA with two different restriction enzymes and selecting the intervening fragments that fall within a certain size window. These fragments contain overhangs from the respective cut sites on either side. One end has a barcode + Illumina adapter ligated, and the other end has only the reverse Illumina adapter ligated. The first reads may come in one or multiple files with "_R1_" in the name, and the second reads come in separate files with "_R2_". Second read files contain the corresponding pairs from the first read files in exactly the same order.
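The same-order requirement can be verified directly by comparing read headers. Below is a minimal sketch on synthetic two-read files (the demo_R1_/demo_R2_ names are invented for illustration):

```shell
## build a tiny synthetic read pair (invented demo files, two reads each)
printf '@read0 1:N:0:\nACGT\n+\nIIII\n@read1 1:N:0:\nACGT\n+\nIIII\n' > demo_R1_.fastq
printf '@read0 2:N:0:\nTTTT\n+\nIIII\n@read1 2:N:0:\nTTTT\n+\nIIII\n' > demo_R2_.fastq

## fastq headers sit on every 4th line; the ID before the first space must
## match between files -- no diff output means the pairs are in the same order
diff <(awk 'NR%4==1{print $1}' demo_R1_.fastq) \
     <(awk 'NR%4==1{print $1}' demo_R2_.fastq) \
  && echo "pairs in order"
```

On real data, substitute your "_R1_" and "_R2_" file names (decompressing first if gzipped).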




Your data will likely come to you non-demultiplexed (meaning not sorted by which individual the reads belong to). You can demultiplex the reads according to their barcodes using pyRAD or separate software.

A number of programs are available to check for overlap of paired-end reads, and you can run your data through one of these programs before input to pyRAD. At the time of writing, I recommend the software PEAR (https://github.com/xflouris/PEAR), which I demonstrate on pairddrad data in a separate tutorial here.

The Example data

For this tutorial I simulated paired-end ddRAD loci on a 12 taxon species tree, shown below. You can download the data using the script below and assemble these data by following along with all of the instructions.

In [1]:
%%bash
## download the data
wget -q http://www.dereneaton.com/downloads/simpairddrads.zip -O simpairddrads.zip
## unzip the data
unzip simpairddrads.zip
Archive:  simpairddrads.zip
  inflating: simpairddrads_R1_.fastq.gz  
  inflating: simpairddrads_R2_.fastq.gz  
  inflating: simpairddrads.barcodes  

Assembling the data set with pyRAD

We first create an empty params.txt input file for the pyRAD analysis.
The following command will create a template which we will fill in with all relevant parameter settings for the analysis.

In [2]:
%%bash
pyrad -n
	new params.txt file created

Take a look at the default options. Each line designates a parameter; everything after the "##" symbol is a comment, which here gives a description of the parameter, with the affected steps shown in parentheses. The first 14 parameters are required; numbers 15-37 are optional.

In [3]:
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **======================== affected step ==
./                          ## 1. Working directory                                 (all)
./*.fastq.gz                ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes                ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                     ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                       ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                           ## 7. N processors (parallel)                           (all)
6                           ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                           ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                         ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                         ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad,merged (all)
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
c88d6m4p3                   ## 14. Prefix name for final output (no spaces)         (s7)
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
                       ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7)
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Edit the params file

I will use the script below to substitute new values, but you should simply use any text editor to make changes. For this analysis I made the following changes from the defaults:


 6. set the two restriction enzymes used to generate the ddRAD data
 10. lowered clustering threshold to .85
 11. set datatype to pairddrad
 14. changed the output name prefix
 19. mismatches for demultiplexing set to 0 (exact match).
 24. Raised maxH. Lower is better for filtering paralogs.
 30. added additional output formats (e.g., nexus,SNPs,STRUCTURE)


In [4]:
%%bash
sed -i '/## 6. /c\TGCAG,AATT                 ## 6. cutsites... ' ./params.txt
sed -i '/## 10. /c\.85                       ## 10. lowered clust thresh' ./params.txt
sed -i '/## 11. /c\pairddrad                 ## 11. datatype... ' ./params.txt
sed -i '/## 14. /c\c85d6m4p3                 ## 14. prefix name ' ./params.txt
sed -i '/## 19./c\0                     ## 19. errbarcode... ' ./params.txt
sed -i '/## 24./c\10                    ## 24. maxH... ' ./params.txt
sed -i '/## 30./c\*                     ## 30. outformats... ' ./params.txt

Let's take a look at the edited params.txt file

In [5]:
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.4  **======================== affected step ==
./                          ## 1. Working directory                                 (all)
./*.fastq.gz                ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes                ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                     ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG,AATT                 ## 6. cutsites... 
2                           ## 7. N processors (parallel)                           (all)
6                           ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                           ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. lowered clust thresh
pairddrad                 ## 11. datatype... 
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
c85d6m4p3                 ## 14. prefix name 
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
0                     ## 19. errbarcode... 
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
10                    ## 24. maxH... 
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
*                     ## 30. outformats... 
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Step 1: De-multiplexing the data


Four examples of acceptable input file name formats for paired-end data:

   1. xxxx_R1_001.fastq      xxxx_R2_001.fastq
   2. xxxx_R1_001.fastq.gz   xxxx_R2_001.fastq.gz
   3. xxxx_R1_001.fq.gz      xxxx_R2_001.fq.gz
   4. xxxx_R1_001.fq         xxxx_R2_001.fq

The file ending can be .fastq, .fq, or .gz.
There should be a unique name or number shared by each pair, as well as the characters "_R1_" and "_R2_".
For every file name containing "_R1_" there should be a corresponding "_R2_" file.
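That correspondence is easy to verify before running step 1; the sketch below uses invented lane1/lane2 file names, with lane2 deliberately missing its mate:

```shell
## two lanes: lane1 has both mates, lane2 is missing its _R2_ file
touch lane1_R1_001.fastq.gz lane1_R2_001.fastq.gz lane2_R1_001.fastq.gz

## derive each expected mate name by swapping _R1_ for _R2_
for r1 in lane*_R1_*; do
    r2=${r1/_R1_/_R2_}
    if [ -f "$r2" ]; then echo "$r1 -> ok"; else echo "$r1 -> missing $r2"; fi
done
```

Here the loop reports lane1 as ok and lane2 as missing its "_R2_" file.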

If your data are already demultiplexed, skip step 1 and see step 2 below.


In [6]:
%%bash
pyrad -p params.txt -s 1

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 1: sorting reads by barcode
	 .

Now we can look at the stats output for this below:

In [7]:
%%bash
cat stats/s1.sorting.txt
file    	Nreads	cut_found	bar_matched
simpairddrads_.fastq.gz	240000	240000	240000


sample	true_bar	obs_bars	N_obs
2G0    	AATATT    	AATATT	20000   
1D0    	ATATGG    	ATATGG	20000   
1B0    	ATGAAG    	ATGAAG	20000   
1A0    	CATCAT    	CATCAT	20000   
3L0    	GAGTTG    	GAGTTG	20000   
1C0    	GTATAA    	GTATAA	20000   
3K0    	GTGGAA    	GTGGAA	20000   
2H0    	GTTTAT    	GTTTAT	20000   
3I0    	TATATA    	TATATA	20000   
2E0    	TGAAAG    	TGAAAG	20000   
3J0    	TGTAGT    	TGTAGT	20000   
2F0    	TTGGTT    	TTGGTT	20000   

nomatch  	_            	0

The de-multiplexed reads are written to separate files for each individual in a new directory, fastq/, created within your working directory.

In [8]:
%%bash 
ls fastq/
1A0_R1.fq.gz
1A0_R2.fq.gz
1B0_R1.fq.gz
1B0_R2.fq.gz
1C0_R1.fq.gz
1C0_R2.fq.gz
1D0_R1.fq.gz
1D0_R2.fq.gz
2E0_R1.fq.gz
2E0_R2.fq.gz
2F0_R1.fq.gz
2F0_R2.fq.gz
2G0_R1.fq.gz
2G0_R2.fq.gz
2H0_R1.fq.gz
2H0_R2.fq.gz
3I0_R1.fq.gz
3I0_R2.fq.gz
3J0_R1.fq.gz
3J0_R2.fq.gz
3K0_R1.fq.gz
3K0_R2.fq.gz
3L0_R1.fq.gz
3L0_R2.fq.gz
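The per-sample read counts reported in the s1 stats file can be cross-checked against these files directly, since a fastq record is always four lines. A minimal sketch on a synthetic stand-in (demo_1A0_R1.fq.gz is invented):

```shell
## synthetic stand-in for a demultiplexed file, two reads long
printf '@r0\nACGT\n+\nIIII\n@r1\nACGT\n+\nIIII\n' | gzip > demo_1A0_R1.fq.gz

## each fastq record is 4 lines, so reads = lines / 4
echo $(( $(gunzip -c demo_1A0_R1.fq.gz | wc -l) / 4 ))
```

Run against fastq/1A0_R1.fq.gz this should report the 20000 reads shown in the stats table.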

An individual file will look like the one below:

In [9]:
%%bash
## FIRST READS file
## I show the first 12 lines and 80 characters to print clearly here
less fastq/1A0_R1.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R1_0 1:N:0:
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTAC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_1 1:N:0:
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTAC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_2 1:N:0:
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTAC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
In [10]:
%%bash
## SECOND READS file
less fastq/1A0_R2.fq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R2_0 1:N:0:
AATTATTAACCCAGACGAGTTGTCAGAGAGTGCTTTACCCCTTCTACGGTCCTCTGGGAGATTCGTGTGATTGTACTGGT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R2_1 1:N:0:
AATTATTAACCCAGACGAGTTGTCAGAGAGTGCTTTACCCCTTCTACGGTCCTCTGGGAGATTCGTGTGATTGTACTGGT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R2_2 1:N:0:
AATTATTAACCCAGACGAGTTGTCAGAGAGTGCTTTACCCCTTCTACGGTCCTCTGGGAGATTCGTGTGATTGTACTGGT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

The reads were sorted into a separate file for the first (R1) and second (R2) reads for each individual.

If your data were previously de-multiplexed you need the following things before step 2:

  • your sorted file names should be formatted like those above, but with your sample names substituted for 1A0, 1B0, etc.
  • the files can be gzip-compressed (.gz) or not (.fq or .fastq).
  • the barcodes should already be removed (i.e., no longer on the left side of reads).
  • the restriction site should not be removed; if it is, enter a '@' symbol before the location of your sorted data.
  • enter the location of your sorted data on line 18 of the params file.
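For example, line 18 can be filled in with the same sed idiom used elsewhere in this tutorial. The sketch below runs against a one-line stand-in for params.txt, and the path ./sorted_data/*.fq.gz is hypothetical; prepend '@' to it only if the restriction site was already trimmed:

```shell
## one-line stand-in for the template line that `pyrad -n` writes
echo '                       ## 18.opt.: loc. of de-multiplexed data                      (s2)' > demo_params.txt

## point line 18 at the sorted data (hypothetical path)
sed -i '/## 18./c\./sorted_data/*.fq.gz   ## 18. loc. of de-multiplexed data' demo_params.txt
cat demo_params.txt
```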


Step 2: Quality filtering

Next we apply the quality filtering.

  • We left the quality filter (line 21) at its default value of 0, meaning it filters based only on quality scores and does not check for the presence of Illumina adapters. Low quality sites are converted to Ns, and any read with more than X Ns is discarded, where X is the number set on line 9 of the params file. If you used PEAR to check for merged reads then this filter can stay at 0; otherwise I suggest setting it to 1 (check for adapters) or 2 (a more stringent check).
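As a concrete illustration of the phred+33 encoding these quality filters rely on (line 20 sets the offset): a quality character's ASCII code minus 33 is its Q score, so any character below '5' (ASCII 53) encodes Q < 20. For example:

```shell
## ASCII code of a quality character, here 'B' (the char used in the sim data)
ascii=$(printf '%d' "'B")

## phred+33: Q = ASCII - 33
echo "quality char 'B' -> Q$(( ascii - 33 ))"   ## Q33, well above the Q20 cutoff
```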
In [11]:
%%bash
pyrad -p params.txt -s 2

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 2: quality filtering 
	............

Statistics for the number of reads that passed filtering can be found in the stats/ directory.

In [12]:
%%bash 
cat stats/s2.rawedit.txt
sample	Nreads	exclude	trimmed	passed
1B0	20000	0	0	20000
3I0	20000	0	0	20000
2H0	20000	0	0	20000
3J0	20000	0	0	20000
3K0	20000	0	0	20000
2G0	20000	0	0	20000
2E0	20000	0	0	20000
1C0	20000	0	0	20000
1D0	20000	0	0	20000
2F0	20000	0	0	20000
3L0	20000	0	0	20000
1A0	20000	0	0	20000

    Nreads = total number of reads for a sample
    exclude = reads that were excluded
    trimmed = reads that had adapter trimmed but were kept
    passed = total kept reads
    

The filtered data files are converted to fasta format and written to a directory called edits/. I show this below:

In [13]:
%%bash
ls edits/
1A0.edit
1B0.edit
1C0.edit
1D0.edit
2E0.edit
2F0.edit
2G0.edit
2H0.edit
3I0.edit
3J0.edit
3K0.edit
3L0.edit

Here you can see that the first and second reads have been concatenated into a single read separated by 'nnnn' in these files, and the read quality scores have now been discarded.


In [14]:
%%bash
head -n 8 edits/1A0.edit
>1A0_0_pair
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTACCGAATGGTTGCGAnnnnTCATAAAACATTACTCTAAGACCAGTACAATCACACGAATCTCCCAGAGGACCGTAGAAGGGGTAAAGCACTCTCTGACAACTCGTCTGGGTTAATAATT
>1A0_1_pair
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTACCGAATGGTTGCGAnnnnTCATAAAACATTACTCTAAGACCAGTACAATCACACGAATCTCCCAGAGGACCGTAGAAGGGGTAAAGCACTCTCTGACAACTCGTCTGGGTTAATAATT
>1A0_2_pair
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTACCGAATGGTTGCGAnnnnTCATAAAACATTACTCTAAGACCAGTACAATCACACGAATCTCCCAGAGGACCGTAGAAGGGGTAAAGCACTCTCTGACAACTCGTCTGGGTTAATAATT
>1A0_3_pair
TGCAGTTACCTACTGTGATCGCCTAGACGGCAGTAAAACCGATGAGGCCCTCTCTAGAGTAACGGCTGAACTTATCCTACCGAATGGTTGCGAnnnnTCATAAAACATTACTCTAAGACCAGTACAATCACACGAATCTCCCAGAGGACCGTAGAAGGGGTAAAGCACTCTCTGACAACTCGTCTGGGTTAATAATT
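Because the separator is a fixed 'nnnn', the two original reads are easy to recover from any line of an edit file; a minimal awk sketch on a shortened, made-up sequence:

```shell
## a shortened stand-in for one concatenated line from edits/1A0.edit
pair="TGCAGTTACCTACTGTnnnnTCATAAAACATTACTC"

## split on the fixed 'nnnn' separator
echo "$pair" | awk -F'nnnn' '{print "R1:", $1; print "R2:", $2}'
```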

Step 3: Within-sample clustering

From here you have two options for how to proceed. The first is to treat the concatenated reads like single-end ddRAD sequences and cluster them as such. To do this, change the datatype option in the params file from "pairddrad" to "ddrad" and proceed with steps 3-7; pyRAD will detect and remove the 'nnnn' separator and cluster the long reads.

Alternatively, the more complex option is to cluster the reads separately, which I call the "split clustering" method. This will do the following:

  • de-replicate concatenated reads
  • split them and cluster first reads only
  • match second reads back to firsts
  • align second read clusters separately from first-read clusters
  • discard loci for which second reads align poorly (incomplete digestion or paralogs)

This is illustrated graphically below:

The screen captures below show split clustering performed on an empirical paired ddRAD data set:

Well-aligned clusters are written to a .clustS.gz file, and poorly aligned pairs, which are excluded from further analysis, are written to a separate file ending with .badpairs.gz.


The benefit of this method over clustering concatenated first+second reads is that it can retain second reads that are either more divergent than the clustering threshold or contain many low quality base calls (Ns), thus potentially yielding more SNPs. It also clusters faster on very large data sets.

The drawbacks are that it discards sequences whose first and second reads do not appear to come from the same DNA fragment, whereas the concatenated clustering method would likely cluster them separately, and that even a single incompletely digested second read can cause an otherwise good cluster to be discarded.

The split clustering method filters alignments of second reads based on the number of indels in the alignment, which can be set on line 27 of the params file. The default values are 3,6,99,99, meaning that for within-sample clusters we allow 3 indels in first read clusters and 6 in second read clusters, and for across-sample clusters we allow 99 in each. If only two numbers are set (e.g., 3,10) this is equivalent to 3,3,10,10.
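If you want different thresholds, line 27 can be edited with the same sed idiom used earlier; the values below simply restate the 3,6,99,99 defaults. Sketched against a one-line stand-in for params.txt:

```shell
## one-line stand-in for the template line that `pyrad -n` writes
echo '                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)' > demo_params.txt

## w1,w2 = within-sample first/second read clusters; a1,a2 = across-sample
sed -i '/## 27./c\3,6,99,99               ## 27. maxIndels: w1,w2,a1,a2' demo_params.txt
cat demo_params.txt
```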

In [14]:
%%bash
## we are using the split-clustering method since the 
## datatype is still set to pairddrad
pyrad -p params.txt -s 3

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	de-replicating files for clustering...

	step 3: within-sample clustering of 12 samples at 
	        '.85' similarity. Running 2 parallel jobs
	 	with up to 6 threads per job. If needed, 
		adjust to avoid CPU and MEM limits

	sample 1C0 finished, 1000 loci
	sample 2H0 finished, 1000 loci
	sample 3L0 finished, 1000 loci
	sample 3I0 finished, 1000 loci
	sample 1D0 finished, 1000 loci
	sample 2F0 finished, 1000 loci
	sample 3K0 finished, 1000 loci
	sample 1B0 finished, 1000 loci
	sample 1A0 finished, 1000 loci
	sample 2G0 finished, 1000 loci
	sample 2E0 finished, 1000 loci
	sample 3J0 finished, 1000 loci
In [15]:
%%bash
head -n 23 stats/s3.clusters.txt
taxa	total	dpt.me	dpt.sd	d>5.tot	d>5.me	d>5.sd	badpairs
1A0	1000	20.0	0.0	1000	20.0	0.0	0
1B0	1000	20.0	0.0	1000	20.0	0.0	0
1C0	1000	20.0	0.0	1000	20.0	0.0	0
1D0	1000	20.0	0.0	1000	20.0	0.0	0
2E0	1000	20.0	0.0	1000	20.0	0.0	0
2F0	1000	20.0	0.0	1000	20.0	0.0	0
2G0	1000	20.0	0.0	1000	20.0	0.0	0
2H0	1000	20.0	0.0	1000	20.0	0.0	0
3I0	1000	20.0	0.0	1000	20.0	0.0	0
3J0	1000	20.0	0.0	1000	20.0	0.0	0
3K0	1000	20.0	0.0	1000	20.0	0.0	0
3L0	1000	20.0	0.0	1000	20.0	0.0	0

    ## total = total number of clusters, including singletons
    ## dpt.me = mean depth of clusters
    ## dpt.sd = standard deviation of cluster depth
    ## >N.tot = number of clusters with depth greater than N
    ## >N.me = mean depth of clusters with depth greater than N
    ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N
    ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)

Steps 4 & 5: Consensus base calling

We next make consensus base calls for each cluster within each individual. First we estimate the error rate and heterozygosity within each sample:

In [16]:
%%bash
pyrad -p params.txt -s 4

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 4: estimating error rate and heterozygosity
	............

Calling consensus sequences applies a number of filters, as listed in the params file, to remove potential paralogs or highly repetitive markers from the data set. For paired-end data the maxN filter only applies to first reads, since we don't mind allowing low quality base calls in second reads. The maxH filter applies to both reads together. The diploid filter (only allow 2 haplotypes in a consensus sequence) also applies to the two reads together.
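If you want to see how close your data come to the maxN cutoff, you can tally Ns per consensus sequence. A minimal sketch on a synthetic two-sequence fasta file (demo.consens is a stand-in for your real consensus output):

```shell
## toy consensus file: first sequence has 2 Ns, second has none
printf '>1A0_0\nACGTNNACGTA\n>1A0_1\nACGTACGTACG\n' > demo.consens

## gsub() returns the number of substitutions, i.e. the N count per sequence
awk '!/^>/ { printf "%d\t%d\n", ++i, gsub(/N/, "N") }' demo.consens
```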

In [17]:
%%bash
pyrad -p params.txt -s 5

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 5: created consensus seqs for 12 samples, using H=0.00141 E=0.00050
	............
In [18]:
%%bash
cat stats/s5.consens.txt
taxon	nloci	f1loci	f2loci	nsites	npoly	poly
3I0	1000	1000	1000	183003	268	0.0014645
2G0	1000	1000	1000	183006	287	0.0015683
3K0	1000	1000	1000	183008	266	0.0014535
2H0	1000	1000	1000	183003	272	0.0014863
3J0	1000	1000	1000	183008	274	0.0014972
1B0	1000	1000	1000	183010	269	0.0014699
1D0	1000	1000	1000	183008	245	0.0013387
2E0	1000	1000	1000	183009	232	0.0012677
2F0	1000	1000	1000	183004	264	0.0014426
1C0	1000	1000	1000	183008	267	0.001459
1A0	1000	1000	1000	183004	245	0.0013388
3L0	1000	1000	1000	183004	241	0.0013169

    ## nloci = number of loci
    ## f1loci = number of loci with >N depth coverage
    ## f2loci = number of loci with >N depth and passed paralog filter
    ## nsites = number of sites across f loci
    ## npoly = number of polymorphic sites in nsites
    ## poly = frequency of polymorphic sites

Step 6: Across-sample clustering

This step clusters consensus sequences across samples. It can take a long time for very large data sets. If run normally it will print its progress to the screen so you can judge how long it might take. This example will take less than a minute.

In [19]:
%%bash
pyrad -p params.txt -s 6
vsearch v1.0.7_linux_x86_64, 7.5GB RAM, 4 cores
https://github.com/torognes/vsearch


	finished clustering

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 6: clustering across 12 samples at '.85' similarity 

Reading file /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/clust.85/cat.firsts_ 100%
1116009 nt in 12000 seqs, min 93, max 94, avg 93
Indexing sequences 100%
Counting unique k-mers 100%
Clustering 100%
Writing clusters 100%
Clusters: 1000 Size min 12, max 12, avg 12.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters

Step 7: Statistics and output files

Alignment of loci, statistics, and output of various formatted data files.

In [20]:
%%bash
pyrad -p params.txt -s 7
	ingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0
	addon 
	exclude 
	
	final stats written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/stats/c85d6m4p3.stats
	output files being written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_pairddRAD/outfiles/ directory

	writing nexus file
	writing phylip file
	  + writing full SNPs file
	  + writing unlinked SNPs file
	  + writing STRUCTURE file
	  + writing geno file
	  ** must enter group/clade assignments for treemix output 
	writing vcf file
	writing alleles file
	  ** must enter group/clade assignments for migrate-n output 

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

...

Stats for this run. In this case 1000 loci were shared across all 12 samples.

In [21]:
%%bash
cat stats/c85d6m4p3.stats

1000        ## loci with > minsp containing data
1000        ## loci with > minsp containing data & paralogs removed
1000        ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
1A0	1000
1B0	1000
1C0	1000
1D0	1000
2E0	1000
2F0	1000
2G0	1000
2H0	1000
3I0	1000
3J0	1000
3K0	1000
3L0	1000


## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-		-
3	-		-
4	0	*	1000
5	0	*	1000
6	0	*	1000
7	0	*	1000
8	0	*	1000
9	0	*	1000
10	0	*	1000
11	0	*	1000
12	1000	*	1000


## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	0	0	70	0
1	2	2	179	179
2	8	18	256	691
3	20	78	218	1345
4	43	250	144	1921
5	78	640	77	2306
6	123	1378	36	2522
7	146	2400	12	2606
8	150	3600	6	2654
9	126	4734	1	2663
10	93	5664	1	2673
11	83	6577	0	2673
12	61	7309	0	2673
13	32	7725	0	2673
14	16	7949	0	2673
15	11	8114	0	2673
16	3	8162	0	2673
17	3	8213	0	2673
18	1	8231	0	2673
19	1	8250	0	2673
total var= 8250
total pis= 2673
sampled unlinked SNPs= 1000
sampled unlinked bi-allelic SNPs= 1000

Output files

In [22]:
%%bash
ls outfiles/
c85d6m4p3.alleles
c85d6m4p3.excluded_loci
c85d6m4p3.geno
c85d6m4p3.loci
c85d6m4p3.nex
c85d6m4p3.phy
c85d6m4p3.snps
c85d6m4p3.str
c85d6m4p3.unlinked_snps
c85d6m4p3.vcf
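
Among these, the .loci file is the most human-readable: each locus is a block of sample lines terminated by a '//' line that also flags variable sites. Counting those terminator lines therefore counts loci; a minimal sketch on an invented toy two-locus file:

```shell
## toy .loci file with two loci, separated by '//' lines
printf '>1A0  ACGT\n>1B0  ACTT\n//     *|\n>1A0  GGGG\n>1B0  GGGG\n//      |\n' > demo.loci

## each locus ends with a '//' line
grep -c '^//' demo.loci
```

Run on outfiles/c85d6m4p3.loci this should report the 1000 loci shown in the stats file.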

Stats files

In [23]:
%%bash
ls stats/
c85d6m4p3.stats
Pi_E_estimate.txt
s1.sorting.txt
s2.rawedit.txt
s3.clusters.txt
s5.consens.txt