Example PE-GBS (w/ merged reads) notebook

Paired-end GBS (or similarly, paired Ez-RAD) data may commonly contain first and second reads that overlap to some degree and therefore should be merged. This notebook will outline how to analyze such data in pyRAD (v.3.03 or newer). I will demonstrate this example using an empirical data set of mine. To begin, we will create an empty directory within our current directory in which to perform the assembly.

In [5]:
%%bash
mkdir -p analysis_pyrad

Next we create an empty params file by calling pyRAD with the -n option.

In [82]:
%%bash
pyrad -n
	new params.txt file created

Enter params file arguments

Then we enter some arguments into the params file, which can be done using any text editor. I use the script below to automate it here. Some important arguments for dealing with PE-GBS data in particular include the following:

  • Set the correct datatype. In the case of this analysis, we will use 'pairgbs' for de-multiplexing in step 1, but then switch it to 'merged' for steps 2-7 after we merge the paired reads.

  • The filter setting (param 21) is not needed in this case because we will be using an external read merging program that applies this filtering.

  • We will, however, still probably want to trim the overhanging edges of reads (param 29), which helps to remove barcodes or adapters that were missed.

  • It can also be helpful with this data type to use a low mindepth setting in conjunction with majority rule consensus base calls (param 31). This will tell pyrad to make statistical base calls for all sites with depth >mindepth, but to make majority rule base calls at sites with depth lower than mindepth, but higher than the majority rule minimum depth. I first run the analysis with the option turned off, and then at the end of the tutorial show the difference with it turned on.

In [83]:
%%bash
sed -i '/## 1. /c\analysis_pyrad/      ## 1. working directory   ' params.txt
sed -i '/## 2. /c\/home/deren/Longiflorae/*.gz   ## 2. working directory     ' params.txt
sed -i '/## 3. /c\/home/deren/barcodes/cyatho_longi.barcodes  ## 3. barcodes ' params.txt
sed -i '/## 7. /c\12                   ## 7. N processors   ' params.txt
sed -i '/## 9. /c\6                    ## 9. NQual     ' params.txt
sed -i '/## 10. /c\.85                 ## 10. Wclust     ' params.txt
sed -i '/## 11. /c\pairgbs             ## 11. datatype   ' params.txt
sed -i '/## 14. /c\c85d6m4p3           ## 14. output prefix  ' params.txt
sed -i '/## 29./c\1,1                  ## 29. trim edges ' params.txt
sed -i '/## 30./c\p                    ## 30. output formats ' params.txt

Let's view the params file

In [84]:
cat params.txt
==** parameter inputs for pyRAD version 3.0.3  **======================== affected step ==
analysis_pyrad/      ## 1. working directory   
/home/deren/Longiflorae/*.gz   ## 2. working directory     
/home/deren/barcodes/cyatho_longi.barcodes  ## 3. barcodes 
vsearch                     ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                       ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
12                   ## 7. N processors   
6                           ## 8. Mindepth: min coverage for a cluster              (s4,s5)
6                    ## 9. NQual     
.85                 ## 10. Wclust     
pairgbs             ## 11. datatype   
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
c85d6m4p3           ## 14. output prefix  
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
1,1                  ## 29. trim edges 
p                    ## 30. output formats 
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Step 1: De-multiplex reads

In [19]:
%%bash
pyrad -p params.txt -s 1

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 1: sorting reads by barcode
	 ...........................................................

Let's look at the first 100 lines of the stats output for demultiplexing

It looks like most reads from each file were matched to a barcode.

In [20]:
%%bash
head -n 100 analysis_pyrad/stats/s1.sorting.txt
file    	Nreads	cut_found	bar_matched
lane2_NoIndex_L002_049.fastq.gz	4000000	3721343	3563126
lane2_NoIndex_L002_039.fastq.gz	4000000	3775744	3746367
lane2_NoIndex_L002_038.fastq.gz	4000000	3749702	3712199
lane2_NoIndex_L002_008.fastq.gz	4000000	3788251	3763415
lane2_NoIndex_L002_030.fastq.gz	4000000	3756116	3721905
lane2_NoIndex_L002_042.fastq.gz	4000000	3821669	3761365
lane2_NoIndex_L002_027.fastq.gz	4000000	3809029	3786687
lane2_NoIndex_L002_046.fastq.gz	4000000	3800963	3717126
lane2_NoIndex_L002_037.fastq.gz	4000000	3746059	3721532
lane2_NoIndex_L002_045.fastq.gz	4000000	3698917	3593579
lane2_NoIndex_L002_019.fastq.gz	4000000	3778683	3750877
lane2_NoIndex_L002_044.fastq.gz	4000000	3751756	3661419
lane2_NoIndex_L002_058.fastq.gz	4000000	3755612	3729931
lane2_NoIndex_L002_032.fastq.gz	4000000	3823466	3804376
lane2_NoIndex_L002_025.fastq.gz	4000000	3821273	3800999
lane2_NoIndex_L002_055.fastq.gz	4000000	3789040	3767778
lane2_NoIndex_L002_023.fastq.gz	4000000	3832294	3814024
lane2_NoIndex_L002_014.fastq.gz	4000000	3832280	3812920
lane2_NoIndex_L002_013.fastq.gz	4000000	3835097	3816100
lane2_NoIndex_L002_031.fastq.gz	4000000	3827911	3808997
lane2_NoIndex_L002_011.fastq.gz	4000000	3845888	3828585
lane2_NoIndex_L002_022.fastq.gz	4000000	3837573	3819811
lane2_NoIndex_L002_024.fastq.gz	4000000	3829959	3810628
lane2_NoIndex_L002_034.fastq.gz	4000000	3611235	3590798
lane2_NoIndex_L002_056.fastq.gz	4000000	3802234	3779975
lane2_NoIndex_L002_026.fastq.gz	4000000	3819201	3799143
lane2_NoIndex_L002_021.fastq.gz	4000000	3835070	3816075
lane2_NoIndex_L002_052.fastq.gz	4000000	3828882	3810331
lane2_NoIndex_L002_041.fastq.gz	4000000	3835218	3791068
lane2_NoIndex_L002_051.fastq.gz	4000000	3833143	3814474
lane2_NoIndex_L002_047.fastq.gz	4000000	3759517	3607421
lane2_NoIndex_L002_059.fastq.gz	4000000	3757946	3719770
lane2_NoIndex_L002_054.fastq.gz	4000000	3819129	3798453
lane2_NoIndex_L002_028.fastq.gz	4000000	3795559	3771229
lane2_NoIndex_L002_048.fastq.gz	4000000	3748548	3623691
lane2_NoIndex_L002_035.fastq.gz	4000000	3266305	3243997
lane2_NoIndex_L002_043.fastq.gz	4000000	3817002	3744305
lane2_NoIndex_L002_057.fastq.gz	4000000	3743088	3719312
lane2_NoIndex_L002_018.fastq.gz	4000000	3794176	3769286
lane2_NoIndex_L002_017.fastq.gz	4000000	3807601	3784147
lane2_NoIndex_L002_050.fastq.gz	4000000	3745707	3651533
lane2_NoIndex_L002_033.fastq.gz	4000000	3821688	3802696
lane2_NoIndex_L002_040.fastq.gz	4000000	3761738	3722996
lane2_NoIndex_L002_053.fastq.gz	4000000	3822961	3803422
lane2_NoIndex_L002_029.fastq.gz	4000000	3781432	3755276
lane2_NoIndex_L002_016.fastq.gz	4000000	3822901	3802934
lane2_NoIndex_L002_036.fastq.gz	4000000	3805280	3783717
lane2_NoIndex_L002_020.fastq.gz	4000000	3732223	3681151
lane2_NoIndex_L002_005.fastq.gz	4000000	3814967	3793880
lane2_NoIndex_L002_006.fastq.gz	4000000	3812538	3792209
lane2_NoIndex_L002_001.fastq.gz	4000000	3838930	3821090
lane2_NoIndex_L002_007.fastq.gz	4000000	3799427	3776091
lane2_NoIndex_L002_015.fastq.gz	4000000	3822598	3802203
lane2_NoIndex_L002_004.fastq.gz	4000000	3823816	3803572
lane2_NoIndex_L002_010.fastq.gz	4000000	3761201	3730496
lane2_NoIndex_L002_009.fastq.gz	4000000	3777890	3750566
lane2_NoIndex_L002_002.fastq.gz	4000000	3826291	3807448
lane2_NoIndex_L002_003.fastq.gz	4000000	3824097	3804683
lane2_NoIndex_L002_012.fastq.gz	4000000	3838714	3820626


sample	true_bar	obs_bars	N_obs
d31733    	AAAAGTT    	AAAAGTT	5114026    
d31733    	AAAAGTT    	AANAGTT	813193     
d31733    	AAAAGTT    	AAGAGTT	90500      
d31733    	AAAAGTT    	AACAGTT	57040      
d31733    	AAAAGTT    	AAAAGGT	19502      
d31733    	AAAAGTT    	AAAAGAT	19214      
d31733    	AAAAGTT    	AATAGTT	13373      
d31733    	AAAAGTT    	AAAGGTT	9539       
d31733    	AAAAGTT    	AAAAGCT	8946       
d31733    	AAAAGTT    	GAAAGTT	8772       
d31733    	AAAAGTT    	NAAAGTT	8297       
d31733    	AAAAGTT    	AAAAGTC	7090       
d31733    	AAAAGTT    	AAAAATT	6140       
d31733    	AAAAGTT    	AGAAGTT	6042       
d31733    	AAAAGTT    	AAAATTT	4801       
d31733    	AAAAGTT    	CAAAGTT	4089       
d31733    	AAAAGTT    	AAAAGTG	3887       
d31733    	AAAAGTT    	AAACGTT	3255       
d31733    	AAAAGTT    	ACAAGTT	2946       
d31733    	AAAAGTT    	TAAAGTT	1997       
d31733    	AAAAGTT    	AAAAGTA	1857       
d31733    	AAAAGTT    	ATAAGTT	1145       
d31733    	AAAAGTT    	AAATGTT	994        
d31733    	AAAAGTT    	AAAACTT	896        
d31733    	AAAAGTT    	ANAAGTT	512        
d33291    	AACGCCT    	AACGCCT	949865     
d33291    	AACGCCT    	AANGCCT	150664     
d33291    	AACGCCT    	AAGGCCT	18149      
d33291    	AACGCCT    	AAAGCCT	14152      
d33291    	AACGCCT    	AATGCCT	3499       
d33291    	AACGCCT    	AACGACT	2636       
d33291    	AACGCCT    	AACGCTT	2091       
d33291    	AACGCCT    	AACGTCT	2053       
d33291    	AACGCCT    	GACGCCT	1775       
d33291    	AACGCCT    	AACGCCC	1714       
d33291    	AACGCCT    	AGCGCCT	1543       
d33291    	AACGCCT    	NACGCCT	1474       

Merge paired-reads

For this we will use the program PEAR. This step is very important. I can pretty much assure you that you will have many paired end reads in your library that overlap, and if you do not merge these reads they can really mess up your analysis. In this case the vast majority of paired reads are overlapping (~80%), and in fact, the merged reads are the only portion of the data that we will be using for analysis.

The first thing we have to do is unzip the reads if they are gzipped. If you demultiplexed your data in pyrad they will be in the fastq/ directory.

In [21]:
%%bash
## For this example I subselect only the samples from 
## this library that start with the letter 'd'
gunzip analysis_pyrad/fastq/d*.gz

I prefer using a stringent filter setting, including the -q option which trims parts of the read that are low quality. This often results in removing all reads that are not merged. You can assemble merged and non-merged reads separately, as described in the 'non-merged PE-GBS tutorial', but here we will focus on just analyzing the merged reads.

In [35]:
%%bash
for gfile in analysis_pyrad/fastq/*_R1.fq;
    do pear -f $gfile \
            -r ${gfile/_R1.fq/_R2.fq} \
            -o ${gfile/_R1.fq/} \
            -n 33 \
            -t 33 \
            -q 10 \
            -j 20  >> pear.log 2>&1;
done

Set the data location to the de-multiplexed & merged data

In [37]:
%%bash
## set location of demultiplexed data that are 'pear' filtered
sed -i '/## 18./c\analysis_pyrad/fastq/*.assembled.fastq ## 18. demulti data loc ' ./params.txt

Set the datatype to "merged"

In [46]:
%%bash
sed -i '/## 11./c\merged            ## 11. data type ' ./params.txt

Step 2: Quality filtering

In [50]:
%%bash
pyrad -p params.txt -s 2

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

	sorted .fastq from analysis_pyrad/fastq/*.assembled.fastq being used
	step 2: editing raw reads 
	........................

Let's take a look at the results

In [51]:
%%bash
cat analysis_pyrad/stats/s2.rawedit.txt
sample 	Nreads	passed	passed.w.trim	passed.total
d19long1.assembled	2973258	2865913	0	2865913
d30181.assembled	4178962	4048613	0	4048613
d30695.assembled	4435144	4249855	0	4249855
d31733.assembled	4757784	4598646	0	4598646
d33291.assembled	847503	819072	0	819072
d34041.assembled	988252	969387	0	969387
d35178.assembled	2479633	2429555	0	2429555
d35320.assembled	1308443	1269046	0	1269046
d35371.assembled	572512	559953	0	559953
d35422.assembled	3069039	3016381	0	3016381
d39103.assembled	2629656	2523785	0	2523785
d39104.assembled	3415956	3353178	0	3353178
d39114.assembled	2215501	2133972	0	2133972
d39187.assembled	907321	875889	0	875889
d39253.assembled	5665909	5444296	0	5444296
d39404.assembled	2567688	2447216	0	2447216
d39531.assembled	3254639	3128372	0	3128372
d39968.assembled	3561187	3441543	0	3441543
d40328.assembled	1651265	1601896	0	1601896
d41058.assembled	2444960	2395622	0	2395622
d41237.assembled	1847033	1815168	0	1815168
d41389.assembled	2491305	2433467	0	2433467
d41732.assembled	2344042	2291187	0	2291187
decor21.assembled	5056601	4926349	0	4926349

    Nreads = total number of reads for a sample
    passed = retained reads that passed quality filtering at full length
    passed.w.trim= retained reads that were trimmed due to detection of adapters
    passed.total  = total kept reads of sufficient length
    note: you can set the option in params file to include trimmed reads of xx length. 

Step 3: Clustering

In [ ]:
%%bash
pyrad -p params.txt -s 3

Let's take a look at the results

In [51]:
%%bash
cat analysis_pyrad/stats/s3

and let's look at an example cluster file

In [57]:
%%bash
gzip -cd analysis_pyrad/clust.85/d35422.assembled.clustS.gz | head -n 100 | cut -b 1-80
>d35422.assembled_2120_r1;size=51;
TGCAGCGCTACCTGCCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA
>d35422.assembled_274175_r1;size=15;+
TGCAGCGCTACCTGCCCGATCTGTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA
>d35422.assembled_2265219_r1;size=2;+
TGCAGCGCTACCTGCCGGATCTGTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA
>d35422.assembled_216079_r1;size=1;+
TGCAGCGCTACCTGCCCGATCTGCTCGAGAAGCGGACGAACTGAGAATGAACCTGCA
>d35422.assembled_2495995_r1;size=1;+
TGCAGCGCTACCTGCCCGGTCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA
>d35422.assembled_590416_r1;size=1;+
TGCAGCGCTATCTGCCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA
>d35422.assembled_1947641_r1;size=1;+
TGCAGCGCTACCTGCCCGATCTTTTCGAGAAGCGGACGAGCTGAGAATGAACCTGCA
>d35422.assembled_1601389_r1;size=1;+
TGCAGCGCTACCTGCCGGATCTGTTCGAGAAGCGGACGAACTGAGAATGACACTGCA
>d35422.assembled_190367_r1;size=1;+
TGCAGCGCTACCTGCCCGATCCTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA
>d35422.assembled_2631365_r1;size=1;+
TGCAGCGCTACCTGCCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCG
>d35422.assembled_126737_r1;size=1;+
TGCAGCGCTACCTGCCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTNCA
>d35422.assembled_288692_r1;size=1;+
TGCAGCGCTACCTGCCGGATCNGTTCGAGAAGCGGACGAACTGAGAATGACACTGCA
>d35422.assembled_2194406_r1;size=1;+
TGCAGCGCTANCTNCCCGATCTTTTCGAGAAGCGGANNANCTGAGAATGAACCTGCA
>d35422.assembled_831862_r1;size=1;+
TGCAGCGCTACCTGNCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA
//
//
>d35422.assembled_61639_r1;size=24;
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC
>d35422.assembled_825340_r1;size=8;+
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGTCAGTTCATCCAGTCGCATCCGCGCGACGCCCTGGTGCCGTCC
>d35422.assembled_88436_r1;size=2;+
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC
>d35422.assembled_354389_r1;size=2;+
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC
>d35422.assembled_2229078_r1;size=1;+
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTACGTCAGTTCATCCAGTCGCATCCGCGCGACTCCCTGGTGCCGTCC
>d35422.assembled_1405977_r1;size=1;+
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATNCAGTCGCANCCCCGCGACGCCCTGGTGCCGNCC
>d35422.assembled_2738651_r1;size=1;+
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC
>d35422.assembled_2565401_r1;size=1;+
TGCAGCGTCAGTACGAGCTGGCGGAGATGGGCCTGCGTCAGTTCATCCAGTCCCATCCGCGTGATGCACTGGTACCGGCC
>d35422.assembled_2060896_r1;size=1;+
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCGCGCGACGCCCTCGTGCCGGCC
>d35422.assembled_27774_r1;size=1;+
TGCAGCGTCAGTACGAGCTGGCGGAGATGGGCCTGCGTCAGTTCATCCAGTCCCATCCGCGTGATGCACTGGTACCGGCC
>d35422.assembled_2776613_r1;size=1;+
TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC
>d35422.assembled_2566934_r1;size=1;+
TGCAGCGCCAGTACGAGCAGGCCGGGATGGGCCTGCGTCAGTTCATCCAGTCGCATCCGCGCGACGCCCTGGTGCCGTCC
//
//
>d35422.assembled_781655_r1;size=2;+
TGCAGGACCAGGGCTTCGCCGTGACCGCGAACATGGTGACCAGCCGCATCGTCGTGGACGCCGTGGGCACCTTCGACGCC
>d35422.assembled_1218064_r1;size=2;
TGCAGGACCAGGGCTTCGCCGTGACCGCGAACATGGTGACCAGTCGTGCTGTGGTCGAAGCCGTCGGCGCCTTCGACGCC
>d35422.assembled_81417_r1;size=1;+
TGCAGGACCAGGGCTTCGCCGTGNCCGCGAACATGGTNACCACCCGCGCCGTTGTCGACGCCGTCGGCACCTTCGACGCC
>d35422.assembled_1447392_r1;size=1;+
TGCAGGACCAGGGCTTCGCNGTGACCNCGAACATGGTGACCAGCCGCATCGTCGTGGACGCCGTGGGCACCTTCGACGCC
>d35422.assembled_1916059_r1;size=1;+
TGCAGGACCAGGGCTTCGCCGTGACCGCGAACATGGTGACCAGCCGCGTCGTCGTGGACGCCGTGGGCACCTTCGACGCC
//
//
>d35422.assembled_1086507_r1;size=1;
TGCAGNNANC-AAAAAAAATATATG-TTTTTTTTTTTTGCTGCA
>d35422.assembled_1935566_r1;size=1;+
TGCAGNTAACNAAAAAAAATATATGTTTTTTTTTTTTTGCTGCA
//
//
>d35422.assembled_105061_r1;size=2;
TGCAGGTCTCCTCCGACGTCTACTACTACACGGTCGGCTCGAAGCTCTTCTCGAGGCCGAACAAGCCCCTGCA
>d35422.assembled_405948_r1;size=2;+
TGCAGGTCTCCTCCGACGTCTACTACTACACGGTCGGCGCGAACCTCTTCGAGAGGAAGAACAAGCCGCTGCA
>d35422.assembled_1112236_r1;size=1;+
TGCAGGTNNCCTCNGACGTCTACTACTACACGGTCGGCGCGAACCTCTTCTCCAAGCCGAACAAGCCGCTGCA
>d35422.assembled_129605_r1;size=1;+
TGCAGGTGTCCTCCGACGTCTACTACTACACCGTCGGGTCGAAGCTCTTCTCGAGGCCGAACAAGCCCCTGCA
>d35422.assembled_108168_r1;size=1;+
TGCAGGTGTCCTCCGACGTCTACTACTACACGGTCGGCGCGAACCTCTTCTCGAAGCCGAACAAGCCCCTGCA
>d35422.assembled_1796775_r1;size=1;+
TGCAGGTGTCCTCCGACGTCTACTACTACACGGTCGGCGCGAACCTCTTCTCCAAGCCGAACAAGCCGCTGCA
//
//
>d35422.assembled_2015519_r1;size=1;+
TGCAGGGGGNCGCGGGGAGCAACCCCAGCTCGGTGCACGTGAGACAGGTGTCGATCACGGGCGGGGACCTCCTGGGGCCT
>d35422.assembled_1326334_r1;size=1;
TGCAGGGGGTCGCGGGGAGCAACCCCAGCTCGGTGCACGTGAGACAGGTGTCGATCACGGGCGGGGACCTCCTGGGGCCT
//
//
>d35422.assembled_735556_r1;size=2;
TGCAGCGCCGCCAGCGGCTGCAGAACTGCCCCCGAGTTTCCAGCTTCTGCA
>d35422.assembled_2203593_r1;size=1;+
TGCAGCGCCTGGACGCCCTGCAGAACTGCCCCCGAGTTTCCAGCTTCTGCA
//
//
gzip: stdout: Broken pipe

Step 4: Parameter estimation

In [ ]:
%%bash
pyrad -p params.txt -s 4
In [59]:
%%bash
cat analysis_pyrad/stats/Pi_E_estimate.txt
taxa	H	E
d35371.assembled	0.0182707	0.00219242	
d39187.assembled	0.02781985	0.00272968	
d39104.assembled	0.02073342	0.0015139	
d35320.assembled	0.02621882	0.00175474	
d34041.assembled	0.02897603	0.00381241	
d33291.assembled	0.02675046	0.00311296	
d40328.assembled	0.02060546	0.00267209	
d35422.assembled	0.04092042	0.00224104	
d19long1.assembled	0.02280653	0.00095164	
d41389.assembled	0.02985094	0.00193657	
d41237.assembled	0.05196614	0.00204698	
d39114.assembled	0.02263828	0.00168706	
d39404.assembled	0.04004115	0.00165295	
d41058.assembled	0.03910496	0.00179648	
d39531.assembled	0.03286906	0.0014618	
d41732.assembled	0.0367377	0.00251955	
d35178.assembled	0.0248532	0.00261068	
d31733.assembled	0.03005955	0.00205113	
d30181.assembled	0.02449057	0.00222622	
d39968.assembled	0.03075244	0.00329168	
d39253.assembled	0.03181217	0.00343958	
decor21.assembled	0.03917313	0.00253141	
d30695.assembled	0.02899983	0.00255832	
d39103.assembled	0.03842947	0.00396182	

Step 5: Consenus base calling

You may want to adjust:

  • max cluster size (param 33)
  • max number of Ns
  • max number of Hs
In [ ]:
%%bash
pyrad -p params.txt -s 5
In [60]:
%%bash
cat analysis_pyrad/stats/s5.consens.txt
taxon          	nloci	f1loci	f2loci	nsites	npoly	poly
d35371.assembled	60885	7839	6432	489780	2313	0.0047225
d41237.assembled	293759	10550	5791	433633	2696	0.0062172
d35320.assembled	156419	12616	9039	777103	3193	0.0041089
d41058.assembled	329719	13087	8587	620904	3583	0.0057706
d35422.assembled	169511	13976	8440	588597	3865	0.0065665
d19long1.assembled	161936	11666	8611	763559	2339	0.0030633
d33291.assembled	169929	14920	10442	913599	3976	0.004352
d34041.assembled	197867	19992	14191	931867	6604	0.0070868
d39187.assembled	96755	12685	8535	752923	3913	0.0051971
d39104.assembled	101266	8902	6827	546599	1931	0.0035328
d41732.assembled	452240	23593	15770	1129545	7020	0.0062149
d41389.assembled	211387	21245	14809	1053031	6331	0.0060122
d39404.assembled	243761	23389	13769	1106220	7083	0.0064029
d40328.assembled	136690	24752	18911	1544641	7199	0.0046606
d39114.assembled	218883	25257	18878	1562653	7444	0.0047637
d39531.assembled	332792	26717	16869	1393571	7705	0.005529
d35178.assembled	450842	47869	38116	2467261	21384	0.0086671
decor21.assembled	577428	46634	28116	2021202	14304	0.007077
d31733.assembled	439816	40309	27103	2178840	11820	0.0054249
d30181.assembled	516149	49126	36637	2879253	13306	0.0046213
d39103.assembled	642379	48620	27395	2247944	13442	0.0059797
d30695.assembled	485553	48503	32034	2797840	15100	0.005397
d39253.assembled	431026	51267	31529	2654672	14942	0.0056286
d39968.assembled	456174	61773	39746	3309319	18475	0.0055827

    ## nloci = number of loci
    ## f1loci = number of loci with >N depth coverage
    ## f2loci = number of loci with >N depth and passed paralog filter
    ## nsites = number of sites across f loci
    ## npoly = number of polymorphic sites in nsites
    ## poly = frequency of polymorphic sites

Cluster across samples

In [ ]:
%%bash
pyrad -p params.txt -s 6

Assemble final data set

In [ ]:
%%bash
pyrad -p params.txt -s 7

Examine results

In [61]:
%%bash
cat analysis_pyrad/stats/c88d6m4p3.stats

17401       ## loci with > minsp containing data
16418       ## loci with > minsp containing data & paralogs removed
16418       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
d19long1.assembled	2232
d30181.assembled  	5279
d30695.assembled  	8028
d31733.assembled  	5717
d33291.assembled  	3529
d34041.assembled  	4291
d35178.assembled  	4252
d35320.assembled  	3451
d35371.assembled  	2535
d35422.assembled  	2776
d39103.assembled  	6100
d39104.assembled  	2311
d39114.assembled  	4695
d39187.assembled  	3249
d39253.assembled  	7324
d39404.assembled  	5203
d39531.assembled  	6176
d39968.assembled  	6724
d40328.assembled  	4921
d41058.assembled  	1872
d41237.assembled  	1308
d41389.assembled  	4676
d41732.assembled  	4007
decor21.assembled 	7367


## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-		-
3	-		-
4	5937	*	16418
5	3301	*	10481
6	2050	*	7180
7	1386	*	5130
8	910	*	3744
9	588	*	2834
10	412	*	2246
11	329	*	1834
12	241	*	1505
13	167	*	1264
14	140	*	1097
15	121	*	957
16	97	*	836
17	106	*	739
18	91	*	633
19	76	*	542
20	93	*	466
21	107	*	373
22	105	*	266
23	98	*	161
24	63	*	63


## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	2451	0	6616	0
1	2053	2053	3490	3490
2	1932	5917	1952	7394
3	1655	10882	1254	11156
4	1376	16386	865	14616
5	1193	22351	611	17671
6	959	28105	383	19969
7	806	33747	266	21831
8	660	39027	210	23511
9	499	43518	184	25167
10	429	47808	118	26347
11	324	51372	110	27557
12	266	54564	77	28481
13	250	57814	70	29391
14	214	60810	48	30063
15	172	63390	39	30648
16	171	66126	31	31144
17	139	68489	24	31552
18	104	70361	16	31840
19	107	72394	10	32030
20	79	73974	16	32350
21	87	75801	10	32560
22	59	77099	4	32648
23	58	78433	2	32694
24	43	79465	5	32814
25	35	80340	2	32864
26	36	81276	2	32916
27	27	82005	0	32916
28	33	82929	2	32972
29	36	83973	1	33001
30	23	84663	0	33001
31	23	85376	0	33001
32	13	85792	0	33001
33	20	86452	0	33001
34	12	86860	0	33001
35	14	87350	0	33001
36	10	87710	0	33001
37	7	87969	0	33001
38	8	88273	0	33001
39	7	88546	0	33001
40	3	88666	0	33001
41	2	88748	0	33001
42	2	88832	0	33001
43	2	88918	0	33001
44	4	89094	0	33001
45	2	89184	0	33001
46	4	89368	0	33001
47	0	89368	0	33001
48	2	89464	0	33001
49	3	89611	0	33001
50	2	89711	0	33001
51	0	89711	0	33001
52	1	89763	0	33001
53	1	89816	0	33001
total var= 89816
total pis= 33001
In [81]:
%%bash
head -n 100 analysis_pyrad/outfiles/c88d6m4p3.loci | cut -b 1-88
>d31733.assembled      GTACAGCACCGGGTAGCGGCGGRGGCTAGCGGCGTAGCCGGGCGGCAGGTACACCCACACCCGGC
>d39114.assembled      GTACAGCACCGGGTAGCGGCGGRGGCTAGCGGCGTAGCCGGGCGGCAGGTACACCCACACCCGGC
>d39968.assembled      GTACAGCACCGGGTAGCGGCGGGGGCTAGCNGCGTAGCCGGGCGGCAGGTACACCCACACCCGGC
>decor21.assembled     GTACAGCACCGGGTAGCGGCGGRGGCTAGCGGCGTAGCCGGGCGGCAGGTACACCCACACCCGGC
//                                           *                                          
>d30695.assembled      AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC
>d31733.assembled      AATTGTGAGTACCGTCGTGACCGANGGCACCGAGGC
>d39103.assembled      AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC
>d39531.assembled      AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC
>d41389.assembled      AATTGTGAGTACCGTCGTRACCGA-GGCACCGAGGC
>d41732.assembled      AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC
>decor21.assembled     AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC
//                                       -                 |
>d30695.assembled      CCRTGGTGCGGCTGGCGCAAGGTGCGSGASCTSGCGGTGGCGCAGGGC
>d31733.assembled      CCATGGTGCGGCTGGCGCAAGGTGCGSGACCTSGCGGTGGCGCAGGGC
>d33291.assembled      CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGGCGCAGGGC
>d34041.assembled      CCGTGGTGCGGCTGGCGCAAGGTGCGCGARCTCGCGGTGGCGCAGGGY
>d35320.assembled      CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGNCGCAGGGC
>d35422.assembled      CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGGCGCAGGGC
>d39253.assembled      CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGGCGCAGGGY
>d39531.assembled      CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGGCGCAGGGC
//                       *                       *  *  *              *|
>d30181.assembled      GCCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCGCCATTCCCCATGGTACGCCGCA
>d30695.assembled      GCCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCGCCATTCCCCATGGTACGCCGCA
>d31733.assembled      GCCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCGCCATTCCCCATGGTACGCCGCA
>d39114.assembled      GCCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCNCCATTCCCCATGGTACGCCGCA
>d39404.assembled      GNCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCGCCATTCCCCATGGTACGCCGCA
//                                                                                      
>d30181.assembled      CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGCAGGGGCAGCAGCTTGACCTTGCAGGTTCACGACG
>d30695.assembled      CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGCAGGGGCAGCAGCTTGACCTTGCAGGTTCACGACG
>d31733.assembled      CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGCAGGGGCAGCAGCTTGACCTTGCAGGTTCACGACG
>d39114.assembled      CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGNAGGGGCAGNAGCTTGACCTTNCAGGTTCACGACG
>d39253.assembled      CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGCAGGGGCAGCAGCTTGACCTTGCAGGTTCACGACG
>d39404.assembled      CCAGGGCAGCGAAGGCCTGAGCGTCGNCAGCAGGGGCAGCAGCTTGACCTTGCAGGNTCACGACG
//                                                                                      
>d35320.assembled      GCCCGAGTCGGCCCCGACCAGCCAGCCCAGGACGTCGGAGCGGTAGGTGGCGAAGTCGTAGAGCT
>d39114.assembled      ACCGGAGTCCGCGCCGACCAGCCAGCCCAGGACGTCGGAGCGGTAGGTGGCGAAGTCGTAGAGCT
>d39531.assembled      RCCGGAGNCCGCRCCGACCAGCCAGCCCAGGACGTCGGAGCGGTAGGTGGCGAAGTCGTAGAGCT
>decor21.assembled     GCCGGAGNCCGCACCGACCAGCCAGCCCAGGACGTCGGAGCGGTAGGTGGCGAAGTCGTAGAGCT
//                     *  -     -  *                                                    
>d19long1.assembled    ATATTTTTGGAGGATGAAGAGTTTCTGTCTTCTGCATTCTCTTCTTCTTCTTCTTCCTCTTAGGT
>d35422.assembled      ATATTTTTGGAGGATGACGAGTTTCTGTCTTCTGCATTC---TCTTCTTCTTCTTCCTCTTAGGT
>d39104.assembled      ATATTTTTGGAGGATGACGAGTTTCTGTCTTCTGCATTC---TCTTCTTCTTCCTCCTCTTAGGT
>d41058.assembled      ATATTTTTGGAGGATGAAGATTTTCTGTCTTCTGCATTC---TCTTCTTCTTCCTCCTCTTAGGT
//                                      *  -                                *           
>d35320.assembled      CGCCGGGCGTTCGAACCGAGCGCGGGCGTCGCGCCGCACGCGGCGCGGCGCGNC
>d39103.assembled      CGCCGGGCGTTCGAACCGAGCGCGGGCGTCGCGCCGCNCGCGGCGCGGCGCGCC
>d39404.assembled      CGCCGGGCGTTCGARCCGAGCGCGGGCRTCGCGCCGCACGCGGCGCGSCGNGCC
>d39531.assembled      CGCCGSGCGTTCGARCCGAGCGCGGGCRTCGCGCCGCACGCGGCGCGGCGCGCC
>d39968.assembled      CGCCGGGCGTTCGAACCGAGCGCGGGCGTCGCGCCNCACGCGNCGCGGCGCGCC
//                          -        *            *                   -      |
>d30695.assembled      GGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT
>d31733.assembled      GGCGTAYGACTCGTCGTTCAGCCAGGCCAGCGT
>d34041.assembled      GGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT
>d39114.assembled      NGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT
>d39253.assembled      GGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT
>d41732.assembled      GGCGTAYNACTCGTCGTTCAGCCAGGCCAGCGT
>decor21.assembled     GGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT
//                           *                          |
>d30695.assembled      ACGCGAGGCCCATTCCAAA-TTCGAGAGGACTTACACAGATATCGCCGCAGATAACTTTTACGAT
>d39253.assembled      ACGCGAGGCCCATTCCAAAGTTYGAGAGGACTTACACAGATATCGCCGCAGATAACTTTTACGAT
>d39968.assembled      ACGCGAGGCCCATTCCAAAGTTCGAGAGGACTTACACAGATATCGCCGCAGATAACTTTTACGAT
>d40328.assembled      ACGCGAGGCCCATTCCAAAGTTCGAGAGGACTTACACAGATATCGCCGCAGAYAAYTNTTACGAT
//                                           -                             -  -         
>d30181.assembled      TCGGTGCTTCCGGAACTCCGCAACGNG
>d30695.assembled      TCGGTGCTTCCGGAACTCCGCAACGCG
>d39103.assembled      TCGGTGCTTCCGGAACTGCGCAACGCG
>d39968.assembled      TCGGTGCTTCCGGAACTCCGCAACGCG
//                                      -         |
>d33291.assembled      GTTGCGCGCGATCTCCACCATCTGCTGCTGGCCGATGCCCAGCGTGCCGACCCGCGTGCCCGGGT
>d35178.assembled      GTTGCGCGCGATCTCCACCATCTGCTGCTGGCCGATGCCSAGCGTGCCGACCMGCGTGCCCGGGT
>d35320.assembled      GTTGCGSGCGATCTCCACCATCTGCTGCTGGCCGATGCCSAGCGTGCCGACCCGCGTGCCCGGGT
>d39253.assembled      GTTGCGCGCGATCTCCACCATCTGCTGCTGGCCGATGCCCAGCGTGCCGACCNGCGTGCCCGGGT
>d40328.assembled      GTTGCGCGCGATCTCSACCATCTGCTGCTGSCCGATGCCGAGCGTGCCGACCMGCGTGCCSGGGT
//                           -        -              -        *            *       -    
>d30695.assembled      CGACGACACGTCGGAGACCGGCTCCTCGACGAGAACCACTCCGGCAGCGCGACACGCACTCAGTA
>d39531.assembled      CGACGACACGTCGGAGACCRGCTCCTCGACGAGAACCACTCCGGCAGCGCGACACGCANTCAGTA
>d41389.assembled      CGACGACACGTCGGAGACCGGCTCCTCGACGAGAACCACTCCGGCAGCGCGACACGCACTCAGTA
>decor21.assembled     CGACGACACGTCGGAGACCRNCTCCTCGACGAGAACCACTCCGGCAGCGCGACACGCANTCAGTA
//                                        *                                             
>d31733.assembled      CGGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC
>d35178.assembled      CNGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC
>d35320.assembled      CNGCAGCTCCAGCTGGGGCCGCGCCAGCCGCTGCCGCAGCTC
>d35422.assembled      CGGCAGCTCNAGCTGGGGCCGCGCCAGCCGCTGCCGCNGNTC
>d39103.assembled      CGGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC
>d39187.assembled      CGGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC
>d39253.assembled      CGGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC
>d39404.assembled      CGGYNGCNCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC
>d39968.assembled      CGGCAGCTCCAGCCGGGGCTGCNCCAGCCGCTGNCGCAGCTC
>decor21.assembled     CGGNAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC
//                        -         *     *                      |
>d30181.assembled      AGCCTGCTSGGCTACGTNGTCCGCTGGGTCGACGCCGGGGTGGG
>d39187.assembled      AGCCTGCTSGGCTACGTCGTCCGCTGGGTCGACGCCGGNGTGGG
>d39253.assembled      AGCCTGCTGGGCTACGTCGTCCGCTGGGTCGACGCCGGNGTGGG
>d39968.assembled      AGCCTGCTGGGCTACGTCGTCCGCTGGGTCGACGCCGGCGTGGG
//                             *                             -     |
>d30181.assembled      CAGCGAGGTGTCCACGCGGCTGCCCTGGCCGGTGCGCAGCCGCTTCACGTAGGCGGTGACCACGC
>d33291.assembled      CAGCGAGGTGTCCACGCGGCTGCCCTGGCCGGTGCGCAGCCGCTTCACGTAGGCGGTGACCACGC
>d39103.assembled      CAGCGAGGTGTCCACGCGGCTGCCCTGGCCGGTGCGCAGCCGCTTCACGTAGGCGGTGACCACGC
>d39187.assembled      CAGCGAGGTGTCCACGCGGCTGCCCTGGCCGGTGCGCAGCCGCTTCACGTAGGCGGTGACCACGC

Infer a tree

In [ ]:
%%bash
raxmlHPC-PTHREADS-AVX -s analysis_pyrad/outfiles/c88d6m4p3.phy \
                      -m GTRGAMMA \
                      -f a \
                      -x 12345 \
                      -p 12345 \
                      -N 100 \
                      -w /home/deren/Documents/PEGBS/analysis_raxml/ \
                      -n c88d6m4p3 \
                      -o "d35320.assembled" \
                      -T 20 &> /dev/null
In [72]:
%load_ext rmagic
In [73]:
%%R
library(ape)
tre <- read.tree("analysis_raxml/RAxML_bipartitions.c88d6m4p3")
plot(tre)
nodelabels(tre$node.label)

Other ways to improve assembly

If you look at the stats output for step 5 you can see that we discarded most of our data for being too low of coverage. One way to salvage these data is to set the minimum depth for majority rule consensus base calls to 2 (param 31). In this way, statistical base calls are still made for all stacks in which the depth is >mindepth (param 8), but for everything between (2-mindepth) majority rule is used to make base calls. This will probably increase the size of this data set ~10X, as you can see in the example S5 stats file below.

In [70]:
%%bash
## This is the step 5 results after I moved the consens.gz files out of 
## the clust.85/ directory and replaced them with new consens calls 
## generated by running step 5 with the majority rule consensus 
## option turned on. 

tail -n 32 analysis_pyrad/stats/s5.consens.txt
taxon          	nloci	f1loci	f2loci	nsites	npoly	poly
d35371.assembled	60885	24564	23062	1780882	2350	0.0013196
d41058.assembled	329719	90853	84689	6244501	3656	0.0005855
d41237.assembled	293759	90845	84120	5741117	2794	0.0004867
d33291.assembled	169929	67772	62376	5502056	4098	0.0007448
d34041.assembled	197867	89535	83219	5919976	6740	0.0011385
d35422.assembled	169511	56563	50274	3627878	3974	0.0010954
d35320.assembled	156419	52912	48780	4266120	3268	0.000766
d39187.assembled	96755	45614	41058	3775082	4079	0.0010805
d39104.assembled	101266	34065	31273	2410273	1958	0.0008124
d19long1.assembled	161936	39462	35591	3167415	2402	0.0007583
d41389.assembled	211387	77211	70021	5475971	6483	0.0011839
d40328.assembled	136690	65435	59329	5145197	7386	0.0014355
d41732.assembled	452240	181118	170999	12626514	7201	0.0005703
d39114.assembled	218883	92037	85134	7595069	7721	0.0010166
d39404.assembled	243761	89814	78903	6913918	7306	0.0010567
d39531.assembled	332792	112016	101057	8879145	7975	0.0008982
d35178.assembled	450842	201307	190564	13878623	21667	0.0015612
d31733.assembled	439816	175627	160987	13694620	12159	0.0008879
decor21.assembled	577428	227921	207326	15697724	14727	0.0009382
d30181.assembled	516149	203114	189290	15726272	13709	0.0008717
d39103.assembled	642379	235667	211348	18531562	13982	0.0007545
d30695.assembled	485553	174196	155764	14331377	15685	0.0010945
d39253.assembled	431026	172635	151210	13507456	15403	0.0011403
d39968.assembled	456174	186053	162430	14528464	19145	0.0013178

    ## nloci = number of loci
    ## f1loci = number of loci with >N depth coverage
    ## f2loci = number of loci with >N depth and passed paralog filter
    ## nsites = number of sites across f loci
    ## npoly = number of polymorphic sites in nsites
    ## poly = frequency of polymorphic sites