Example de novo GBS analysis using pyRAD


Please direct questions about pyRAD usage to the google group thread (link)


This tutorial focuses on the analysis of single-end GBS reads as well as methods for salvaging short fragments containing adapter sequence that often result from poor size selection on this type of library preparation. If you have not yet read the full tutorial, you should start there for a broader description of how pyRAD works.

This tutorial is written as an IPython notebook. Each cell that begins with the header %%bash or with ! should be executed in a command line shell, for example by copying and pasting the text (but excluding the header). If you have IPython notebook installed on your machine you can download this notebook and edit/execute the cells yourself.


A short description of the genotyping-by-sequencing (GBS) library preparation.

Genomic DNAs are digested with a single restriction enzyme such that the resulting DNA fragments have the same cutsite overhang on both ends. A key difference between GBS and other library types is that the adapter+barcode can ligate to either ends of these fragments, meaning that the sequence read may have been sequenced from either end.

GBS, like ddRAD, is cheaper to prepare than RAD, and easier to prepare in your own lab. Some GBS preps do not include a size selection step, in which case the data generally include many short fragments that when read from either end in different reads will overlap considerably. For this reason pyrad uses reverse complement clustering to identify that these reads come from the same fragment. Some people add a size selection step to GBS preps, in which case, depending on the size range selected, overlap between sequenced reads from either ends of fragments can be reduced.

In this tutorial I show how to test whether overlap occurs in your GBS data, and how to setup the params file to perform reverse complement clustering in the case that it does occur; or to not perform this type of clustering if your data do not overlap (reverse complement clustering is a bit slower.)


The bash script in the cell below can be used to download an example GBS data set and unarchive it into your current directory. This data set was simulated with the following parameters

  • 12 taxa, 1 sampled individual per taxon
  • 2000 GBS loci (1000 DNA fragments sequenced from each end as first reads)
  • 20X sequencing depth
  • theta = 0.014
  • sequencing error rate = 0.005
  • no mutations to restriction sites
  • no indels
  • fragment size window (50-300)

This size selection window was chosen to demonstrate the analysis of short fragments.

Example data set

In [1]:
%%bash
wget -q http://dereneaton.com/downloads/simGBS.zip  
unzip simGBS.zip
Archive:  simGBS.zip
  inflating: simGBS.barcodes         
  inflating: simGBS_R1.fastq.gz      

The two necessary files below should now be located in your current directory/

  • simGBS_R1.fastq.gz : Illumina fastQ formatted reads (gzip compressed)
  • simGBS.barcodes : barcode map file

pyRAD Analysis

We begin by creating the params.txt file, which sets all of the parameters for the pyrad analysis. A default template should always be created first using the -n option in pyRAD. Then you can use any text editor to fill in the relevant lines with parameter valuels.

In [3]:
%%bash
pyRAD -n
	new params.txt file created

Let's take a look at the default settings.

In [4]:
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.2  **======================== affected step ==
./                          ## 1. Working directory                                 (all)
./*.fastq.gz                ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes                ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                     ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                       ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                           ## 7. N processors (parallel)                           (all)
6                           ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                           ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                         ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                         ## 11. Datatype: rad,gbs,ddrad,pairgbs,pairddrad        (all)
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
c88d6m4p3                   ## 14. Prefix name for final output (no spaces)         (s7)
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
                       ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7)
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

We want to change a few of these. You should do this by editing params.txt with any text editor. I use the script below to automate the changes here. The most important changes below are the following effects:

  • line 11 sets the datatype to 'gbs', meaning we will test for reverse-complement clusters in step 3
  • line 21 sets filter to 1, meaning we will trim reads with an Illumina adapter detected in step 2.
  • line 32 sets the minimum length of kept trimmed reads to 50.
In [5]:
%%bash
sed -i '/## 10. /c\.85                 ## 10. lowered clust thresh... ' params.txt
sed -i '/## 11. /c\gbs                 ## 11. changed datatype to gbs ' params.txt
sed -i '/## 14. /c\c85m4p3             ## 14. outprefix... ' params.txt
sed -i '/## 21./c\1                    ## 21. set filter to 1 ' params.txt
sed -i '/## 24./c\8                    ## 24. increased maxH ' params.txt
sed -i '/## 30./c\*                    ## 30. all output formats ' params.txt
sed -i '/## 32./c\50                   ## 32. keep fragments longer than 50 ' params.txt
In [6]:
%%bash
cat params.txt
==** parameter inputs for pyRAD version 3.0.2  **======================== affected step ==
./                          ## 1. Working directory                                 (all)
./*.fastq.gz                ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes                ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                     ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                      ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                       ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
2                           ## 7. N processors (parallel)                           (all)
6                           ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                           ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                 ## 10. lowered clust thresh... 
gbs                 ## 11. changed datatype to gbs 
4                           ## 12. MinCov: min samples in a final locus             (s7)
3                           ## 13. MaxSH: max inds with shared hetero site          (s7)
c85m4p3             ## 14. outprefix... 
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
1                    ## 21. set filter to 1 
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
8                    ## 24. increased maxH 
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
*                    ## 30. all output formats 
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
50                   ## 32. keep fragments longer than 50 
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Let's take a look at what the raw data look like.

Your input data will be in fastQ format, usually ending in .fq or .fastq. Your data could be split among multiple files, or all within a single file. The file/s may be compressed with gzip so that they have a .gz, but do not need to be compressed. The location of these files should be entered on line 2 of the params file unless your data are already demultiplexed in which case you can skip this step and instead list the location of your demultiplexed files on line 18. Below are the first three reads in the example data set.

In [7]:
%%bash 
less simGBS_R1.fastq.gz | head -n 12 | cut -c 1-80
@lane1_fakedata0_R1_0 1:N:0:
ATAATATGCAGATCTGTAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_0 1:N:0:
ATAATATGCAGAGAAAGTCCAAAAATATATATCCCAATTACCCATCGACCTACGTTGTTCGCGATACCCAAGTTACTTTT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata0_R1_1 1:N:0:
ATAATATGCAGATCTGTAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Each read takes four lines. The first is the name of the read (its location on the plate). The second line contains the sequence data. The third line is a spacer. And the fourth line the quality scores for the base calls. In this case arbitrarily high since the data were simulated.

These are 100 bp single-end reads prepared as GBS. The first six bases form the barcode and the next five bases (TGCAG) the restriction site overhang. All following bases make up the sequence data.

Step 1: de-multiplexing

This step uses information in the barcodes file to sort data into a separate file for each sample, which is then saved in a new directory called fastq/

In [8]:
%%bash
pyRAD -p params.txt -s 1

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 1: sorting reads by barcode
	 .
In [9]:
%%bash
ls fastq/
1A0_R1.fq.gz
1B0_R1.fq.gz
1C0_R1.fq.gz
1D0_R1.fq.gz
2E0_R1.fq.gz
2F0_R1.fq.gz
2G0_R1.fq.gz
2H0_R1.fq.gz
3I0_R1.fq.gz
3J0_R1.fq.gz
3K0_R1.fq.gz
3L0_R1.fq.gz

The statistics for step 1

For each raw data file the number of reads is shown, the number for which the cut site was detected, and the number for which the detected barcode matched in the barcodes file. Then for each true barcode it lists the observed barcodes that matched within the allowed number of mismatches, and their number of occurrences. Finally at the bottom the non-matching barcodes are shown, and their occurrences as well.

In [10]:
%%bash
cat stats/s1.sorting.txt
file    	Nreads	cut_found	bar_matched
simGBS_R1.fastq.gz	480000	480000	480000


sample	true_bar	obs_bars	N_obs
2H0    	AAATAT    	AAATAT	40000   
1C0    	AAGGAG    	AAGGAG	40000   
1B0    	ATAATA    	ATAATA	40000   
1A0    	CATCAT    	CATCAT	40000   
3L0    	GAGGTG    	GAGGTG	40000   
2G0    	GATGTA    	GATGTA	40000   
3J0    	GGTTGA    	GGTTGA	40000   
2E0    	TAAGAT    	TAAGAT	40000   
3I0    	TAGGTA    	TAGGTA	40000   
1D0    	TGGGTT    	TGGGTT	40000   
3K0    	TGTGTA    	TGTGTA	40000   
2F0    	TTTTTG    	TTTTTG	40000   

nomatch  	_            	0

Step 2: quality filtering

Step 2 filters reads based on quality scores, and can be used to detect and trim off Illumina adapters if present, which we expect to have in these simulated data. Filter option 1 looks for a fuzzy match to the reverse complement cutsite and the Illumina adapter. If present the adapter is chopped off and the read is either discarded or kept if the remaining length is longer than the set minimum length (here 50) it is kept. If you wanted, you could instead use external software to filter adapter sequences from your data.

In [11]:
%%bash
pyRAD -p params.txt -s 2

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

	step 2: editing raw reads 
	............
In [13]:
%%bash
ls edits/
1A0.edit
1B0.edit
1C0.edit
1D0.edit
2E0.edit
2F0.edit
2G0.edit
2H0.edit
3I0.edit
3J0.edit
3K0.edit
3L0.edit

The filtered data are written in fasta format (quality scores removed) into a new directory called edits/

In [14]:
%%bash
head -n 10 edits/1A0.edit | cut -c 1-80
>1A0_0_r1
TGCAGATCTGCAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAGATCGCA
>1A0_1_r1
TGCAGAGAAAGTCCAAAAATATATATCCCAATTACCCATCGACCTACGTTGTTCGCGATACCCAAGTTACTTTTTGAGGA
>1A0_2_r1
TGCAGATCTGCAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAGATCGCA
>1A0_3_r1
TGCAGAGAAAGTCCAAAAATATATATCCCAATTACCCATCGACCTACGTTGTTCGCGATACCCAAGTTACTTTTTGAGGA
>1A0_4_r1
TGCAGATCTGCAAAGCCGTATGTAATTCTGTTGTAGTTTAGACTTCTCACTGTCGTTAGTTATTTAATGAGGAGATCGCA

You can see here that of the 40K reads for each sample about 39,900 pass filtering. Of these, about 33,000 passed with no adapter sequence detected, and about 6,000 had the adapter trimmed off but still passed because their remaining length was longer than 50 bp. Only about 100 reads were excluded from each sample.

In [16]:
%%bash
cat stats/s2.rawedit.txt
sample 	Nreads	passed	passed.w.trim	passed.total
1A0	40000	33560	6340	39900
1B0	40000	33558	6341	39899
1C0	40000	33561	6340	39901
1D0	40000	33521	6340	39861
2E0	40000	33520	6361	39881
2F0	40000	33521	6359	39880
2G0	40000	33520	6361	39881
2H0	40000	33541	6340	39881
3I0	40000	33540	6350	39890
3J0	40000	33538	6342	39880
3K0	40000	33521	6360	39881
3L0	40000	33539	6340	39879

    Nreads = total number of reads for a sample
    passed = retained reads that passed quality filtering at full length
    passed.w.trim= retained reads that were trimmed due to detection of adapters
    passed.total  = total kept reads of sufficient length
    note: you can set the option in params file to include trimmed reads of xx length. 

Step 3: clustering within-samples

Step 3 de-replicates and then clusters reads within samples by the set clustering threshold and writes the clusters to new files in a directory called clust.xx.

Below I show the results of this analysis which uses reverse complement clustering to detect overlap among sequences originating from either end of short GBS fragments. I then show how to check for overlap in your data, and how to turn off reverse complement clustering if overlap is not detected, because it will allow your analysis to run much faster.

In [17]:
%%bash
pyRAD -p params.txt -s 3

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	de-replicating files for clustering...

	step 3: within-sample clustering of 12 samples at 
	        '.85' similarity. Running 2 parallel jobs
	 	with up to 6 threads per job. If needed, 
		adjust to avoid CPU and MEM limits

	sample 1C0 finished, 1594 loci
	sample 1A0 finished, 1594 loci
	sample 1B0 finished, 1594 loci
	sample 3I0 finished, 1594 loci
	sample 2H0 finished, 1593 loci
	sample 3J0 finished, 1592 loci
	sample 3K0 finished, 1594 loci
	sample 2G0 finished, 1594 loci
	sample 2E0 finished, 1594 loci
	sample 3L0 finished, 1593 loci
	sample 1D0 finished, 1592 loci
	sample 2F0 finished, 1593 loci

overlapping clusters

As you can see below, within the first few clusters we find two reverse complement matching clusters. The first cluster has reads from either end that overlap almost completely, meaning this fragment was only a little more than 100 bp long, and we sequenced 100 bp data. The third cluster is offset a little more. I found that reads which ovelap by less than about 30% are difficult to detect, and the minimum amount of overlap allowed is not a parameter you can change in the params file, but it can be edited in the pyrad code.

In [18]:
%%bash
less clust.85/1A0.clustS.gz | head -n 26 | cut -c 1-80
>1A0_27700_r1;size=20;
TGCAGACTTACTGAGTAAGTTCCATTAAAAATGAATTTCGGAGATGAATAGTGTGATCTCTGCCCGGTGAGGTTGCTGCG
>1A0_27701_c1;size=20;-
-------------------------------------------------AGTGTGATCTCTGCCCGGTGAGGTTGCTGCG
//
//
>1A0_5203_c1;size=19;-
-----CGTCGTATGTCGCTCAGCACCTCGCGTGTATGTTGTTCTGGCCACCATGGACTGACAAGACGTCCGCACGGTTAA
>1A0_5202_c1;size=19;
TGCAGCGTCGTATGTCGCTCAGCACCTCGCGTGTATGTTGTTCTGGCCACCATGGACTGACAAGACGTCCGCACGGTTAA
>1A0_5200_c1;size=1;+
TGCAGCGTCGTATGTCGCTCAGCACCTCGCGTGTATCTTGTTCTGGCCACCATGGACTGACAAGACGTCCGCACGGTTAA
>1A0_5201_c1;size=1;-
-----CGTCGTATGTCGCTCAGCACCTCGCGTGTATCTTGTTCTGGCCACCATGGACTGACAAGACGTCCGCACGGTTAA
//
//
>1A0_26781_r1;size=19;
TGCAGTTGAACTAACCAGCGTTTTTACTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACAAACAGAGGGC
>1A0_26780_c1;size=18;-
--------------------------CTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACAAACAGAGGGC
>1A0_26804_c1;size=1;-
--------------------------CTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACGAACAGAGGGC
>1A0_26812_c1;size=1;-
--------------------------CTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACAAACAGAGGGC
>1A0_26805_r1;size=1;+
TGCAGTTGAACTAACCAGCGTTTTTACTACTGGCGAAAATAGCTGGTACCATACCCTCGTTGTACAGACGAACAGAGGGC

The stats output tells you how many clusters were found, and their mean depth of coverage. It also tells you how many pass your minimum depth setting. You can use this information to decide if you wish to increase or decrease the mindepth. It also shows histograms of coverage depth for each sample. Below I show sample 1A0 which clearly shows the number of samples in this data set that matched normally plus those which matched in the reverse complement direction, which doubled the coverage.

In [19]:
%%bash
head -n 53 stats/s3.clusters.txt
taxa	total	dpt.me	dpt.sd	d>5.tot	d>5.me	d>5.sd	badpairs
1A0	1594	25.031	8.678	1594	25.031	8.678	0
1B0	1594	25.031	8.679	1594	25.031	8.679	0
1C0	1594	25.032	8.705	1593	25.047	8.687	0
1D0	1592	25.038	8.709	1591	25.053	8.691	0
2E0	1594	25.019	8.698	1593	25.035	8.68	0
2F0	1593	25.034	8.68	1593	25.034	8.68	0
2G0	1594	25.019	8.698	1593	25.035	8.68	0
2H0	1593	25.035	8.707	1592	25.05	8.689	0
3I0	1594	25.025	8.686	1594	25.025	8.686	0
3J0	1592	25.05	8.689	1592	25.05	8.689	0
3K0	1594	25.019	8.698	1593	25.035	8.68	0
3L0	1593	25.034	8.68	1593	25.034	8.68	0

    ## total = total number of clusters, including singletons
    ## dpt.me = mean depth of clusters
    ## dpt.sd = standard deviation of cluster depth
    ## >N.tot = number of clusters with depth greater than N
    ## >N.me = mean depth of clusters with depth greater than N
    ## >N.sd = standard deviation of cluster depth for clusters with depth greater than N
    ## badpairs = mismatched 1st & 2nd reads (only for paired ddRAD data)

HISTOGRAMS

    
sample: 1A0
bins	depth_histogram	cnts
   :	0------------50-------------100%
0 	                                    0
5 	                                    0
10 	                                    0
15 	                                    0
20 	***********************             1193
25 	                                    0
30 	                                    0
35 	                                    0
40 	*********                           401
50 	                                    0
100 	                                    0
250 	                                    0
500 	                                    0

sample: 1B0
bins	depth_histogram	cnts
   :	0------------50-------------100%
0 	                                    0
5 	                                    0
10 	                                    0
15 	*                                   2
20 	***********************             1191
25 	                                    0
30 	                                    0

Toggling reverse-complement clustering

For any single end GBS analysis I suggest that you first begin step 3 with the datatype set to gbs, which tells pyRAD to perform reverse complement clustering. Let this run for about 15 minutes, and then look at the resulting data file in the clust directory that has a ".u" ending, using the unix command "less". I use the command "head" below to print only the first 20 lines.

In [20]:
%%bash
head -n 20 clust.85/1A0.u
1A0_1021_c1;size=20;	1A0_1020_c1;size=20;	100.0	0	-	92.6
1A0_10501_r1;size=20;	1A0_10500_r1;size=20;	100.0	0	-	35.1
1A0_11221_r1;size=20;	1A0_11220_r1;size=20;	100.0	0	-	38.3
1A0_11341_r1;size=20;	1A0_11340_r1;size=20;	100.0	0	-	74.5
1A0_11381_r1;size=20;	1A0_11380_r1;size=20;	100.0	0	-	72.3
1A0_11481_r1;size=20;	1A0_11480_r1;size=20;	100.0	0	-	76.6
1A0_11721_r1;size=20;	1A0_11720_r1;size=20;	100.0	0	-	88.3
1A0_11941_r1;size=20;	1A0_11940_r1;size=20;	100.0	0	-	92.6
1A0_12061_r1;size=20;	1A0_12060_r1;size=20;	100.0	0	-	80.9
1A0_121_c1;size=20;	1A0_120_c1;size=20;	100.0	0	-	92.3
1A0_12221_r1;size=20;	1A0_12220_r1;size=20;	100.0	0	-	43.6
1A0_12421_r1;size=20;	1A0_12420_r1;size=20;	100.0	0	-	88.3
1A0_1261_c1;size=20;	1A0_1260_c1;size=20;	100.0	0	-	93.3
1A0_12961_r1;size=20;	1A0_12960_r1;size=20;	100.0	0	-	76.6
1A0_13801_r1;size=20;	1A0_13800_r1;size=20;	100.0	0	-	35.1
1A0_1421_c1;size=20;	1A0_1420_c1;size=20;	100.0	0	-	92.3
1A0_14521_r1;size=20;	1A0_14520_r1;size=20;	100.0	0	-	76.6
1A0_15241_r1;size=20;	1A0_15240_r1;size=20;	100.0	0	-	37.2
1A0_1541_c1;size=20;	1A0_1540_c1;size=20;	100.0	0	-	91.7
1A0_1581_c1;size=20;	1A0_1580_c1;size=20;	100.0	0	-	92.3

The ".u" file records the clustering process, showing which hits match to seeds, their measured similarity, the direction in which they match, and the number of indels. If your data do not contain any reverse complement clusters, meaning you see only "+" in column 5, and no "-", then you do not need to use reverse complement clustering. In that case you could change the datatype option on from 'gbs' to 'rad' and you would get a speed increase. If speed is not a concern, or if you see many "-" in the .u file then you should perform the rest of your analysis using the 'gbs' option.

Steps 4 & 5: Call consensus sequences

Step 4 jointly infers the error-rate and heterozygosity across samples.

In [21]:
%%bash
pyRAD -p params.txt -s 4

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 4: estimating error rate and heterozygosity
	............
In [17]:
%%bash
cat stats/Pi_E_estimate.txt
taxa	H	E
1C0	0.00127475	0.00049401	
2F0	0.00135011	0.0004923	
3I0	0.00151533	0.00046613	
1D0	0.0012648	0.00050004	
1B0	0.00149071	0.00049023	
3K0	0.00153561	0.00051244	
2H0	0.00134803	0.00049295	
2E0	0.00138114	0.00049448	
2G0	0.00162647	0.00050323	
1A0	0.00146453	0.00053174	
3L0	0.00158455	0.00051403	
3J0	0.00123198	0.00050979	

Step 5 calls consensus sequences using the parameters inferred above, and filters for paralogs.

In [22]:
%%bash
pyRAD -p params.txt -s 5

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 5: creating consensus seqs for 12 samples, using H=0.00142 E=0.00050
	............
In [23]:
%%bash
cat stats/s5.consens.txt
taxon               	nloci	f1loci	f2loci	nsites	npoly	poly
3J0       	1592	1592	1582	144424	186	0.0012879
3L0       	1593	1593	1583	144401	219	0.0015166
1A0       	1594	1594	1584	144633	211	0.0014589
2G0       	1594	1593	1586	144633	233	0.001611
2H0       	1593	1592	1575	143678	195	0.0013572
2E0       	1594	1593	1584	144296	207	0.0014346
3K0       	1594	1593	1582	144157	212	0.0014706
1B0       	1594	1594	1586	144695	214	0.001479
1D0       	1592	1591	1579	144128	186	0.0012905
3I0       	1594	1594	1591	145171	213	0.0014672
2F0       	1593	1593	1587	144886	193	0.0013321
1C0       	1594	1593	1586	144815	183	0.0012637

    ## nloci = number of loci
    ## f1loci = number of loci with >N depth coverage
    ## f2loci = number of loci with >N depth and passed paralog filter
    ## nsites = number of sites across f loci
    ## npoly = number of polymorphic sites in nsites
    ## poly = frequency of polymorphic sites

Step 6: Cluster across samples

Step 6 clusters consensus sequences across samples. Again this uses reverse complement clustering if the datatype is set to 'gbs'. This step can take a long time for very large data sets (>100 individuals). I suggest trying it first. If it takes too long you can implement the hierarchical clustering method (described in detail in a separate tutorial).

In [24]:
%%bash
pyRAD -p params.txt -s 6
vsearch v1.0.7_linux_x86_64, 7.5GB RAM, 4 cores
https://github.com/torognes/vsearch


	finished clustering

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------


	step 6: clustering across 12 samples at '.85' similarity 

Reading file /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_GBS/clust.85/cat.haplos_ 100%
1828942 nt in 19005 seqs, min 57, max 152, avg 96
Indexing sequences 100%
Counting unique k-mers 100%
Clustering 100%
Writing clusters 100%
Clusters: 1594 Size min 1, max 19, avg 11.9
Singletons: 1, 0.0% of seqs, 0.1% of clusters
In [25]:
%%bash
pyRAD -p params.txt -s 7
	ingroup 1A0,1B0,1C0,1D0,2E0,2F0,2G0,2H0,3I0,3J0,3K0,3L0
	addon 
	exclude 
	
	final stats written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_GBS/stats/c85m4p3.stats
	output files being written to:
	 /home/deren/Dropbox/Public/PyRAD_TUTORIALS/tutorial_GBS/outfiles/ directory

	writing nexus file
	writing phylip file
	  + writing full SNPs file
	  + writing unlinked SNPs file
	  + writing STRUCTURE file
	  + writing geno file
	  ** must enter group/clade assignments for treemix output 
	writing vcf file
	writing alleles file
	  ** must enter group/clade assignments for migrate-n output 

     ------------------------------------------------------------
      pyRAD : RADseq for phylogenetics & introgression analyses
     ------------------------------------------------------------

...

Final stats output

As you can see we did not recover all 2000 loci in the data set. This is because some reads were filtered for containing Illumina adapters, and others were merged when they overlapped. Below I print the first several loci in the ".loci" output to show the variable length loci that are recovered from this GBS analysis that included fragmented and overlapping reads.

In [26]:
%%bash
cat stats/c85m4p3.stats

1591        ## loci with > minsp containing data
1591        ## loci with > minsp containing data & paralogs removed
1591        ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
1A0	1581
1B0	1583
1C0	1584
1D0	1579
2E0	1582
2F0	1585
2G0	1584
2H0	1574
3I0	1588
3J0	1581
3K0	1580
3L0	1581


## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-		-
3	-		-
4	0	*	1591
5	0	*	1591
6	1	*	1591
7	3	*	1590
8	2	*	1587
9	8	*	1585
10	10	*	1577
11	37	*	1567
12	1530	*	1530


## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	26	0	466	0
1	112	112	543	543
2	259	630	328	1199
3	257	1401	169	1706
4	314	2657	60	1946
5	259	3952	18	2036
6	180	5032	6	2072
7	95	5697	1	2079
8	52	6113	0	2079
9	20	6293	0	2079
10	9	6383	0	2079
11	6	6449	0	2079
12	2	6473	0	2079
total var= 6473
total pis= 2079
sampled unlinked SNPs= 1565
sampled unlinked bi-allelic SNPs= 1445
In [29]:
%%bash
head -n 200 outfiles/c85m4p3.loci | cut -c 1-85
>1A0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>1B0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGAYAATGCATCTTGCTTATCTAAT
>1C0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>1D0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>2E0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>2F0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>2G0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>3I0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>3J0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>3K0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
>3L0    TAGCGGTCTTGAGCTAGAGGGACGAAGAGACTAATTGACAATGCATCTTGCTTATCTAAT
//                                            -                     |
>1A0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
>1B0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
>1C0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
>1D0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
>2E0    TAATCAAAACAGTAWGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
>2F0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTCCGTTGCCGTCCGGGACCGCGCT
>2G0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTTCCGTCCGGGACCGCGCT
>2H0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
>3I0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
>3J0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCSTCCGGGACCGCGCT
>3K0    TAATCAAAACAGTATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
>3L0    TAATCAAAACAATATGGTCTTAGACCATGGCAAGCAACGTGATATCGATTTATTGCGTTGCCGTCCGGGACCGCGCT
//                 -  -                                       -    -  -              
>1A0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>1B0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>1C0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>1D0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>2E0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGACGTGGTCAAGTT
>2F0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGACGTGGTCAAGTT
>2G0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>2H0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>3I0    TCGTAGGCCTGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>3J0    TCGTAGGCCTGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>3K0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
>3L0    TCGTAGGCCGGCGCATCCGGTGCCGTTTATGTTTATTACTGTACAATGACTTGCAACTGCACAGATGTGGTCAAGTT
//               *                                                       *           
>1A0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>1B0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>1C0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>1D0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>2E0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>2F0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>2G0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>2H0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>3I0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>3J0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>3K0    ATACATAGTACAGTCCTCTAATTGGAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
>3L0    ATACATAGTACAGTCCTCTAATTGCAGCTCCAAGAACACCGAGGTCGTGTACAGATGGTTGTCTGCCCAAGCGGAAT
//                              -                                                    
>1A0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>1B0    KGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>1C0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>1D0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>2E0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTCATATCCAAGTGTCTGGGATGAAACGAATAATGGACTAC
>2F0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTCATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>2G0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTCATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>2H0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>3I0    TGCATATGCGGCTCACGGCCAGGCTTCCTCAGGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>3J0    TGCATATGCGGCTCACGGCCAGGCTTCCTCASGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>3K0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
>3L0    TGCATATGCGGCTCACGGCCAGGCTTCCTCACGTTCTTGATATCCATGTGTCTGGGATGAAACGAATAATGGACTAC
//      -                              *      *       -                              
>1A0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC
>1B0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC
>1C0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC
>1D0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC
>2E0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACTACAGTTGCCTCCAGGAATAATC
>2F0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACTACAGTTGCCTCCAGGAATAATC
>2G0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACTACAGTTGCCTCCAGGAATAATC
>2H0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC
>3I0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC
>3J0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC
>3K0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACAGTTGCCTCCAGGAATAATC
>3L0    ACTGAGCAAAGCTATACATGTGCATAGAAATGGTGGGTTTTAATGGTCGACCACGACCGTTGCCTCCAGGAATAATC
//                                                            *  -                   
>1A0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>1B0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGRGGTCATGGTATAATTCCCTA
>1C0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>1D0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>2E0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>2F0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>2G0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>2H0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>3I0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>3J0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>3K0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
>3L0    TCATTCGAAAGCGACGTACTCGGAGTAGCTACACACGTCTTAGCGTGTCGGTTTAGGGGTCATGGTATAATTCCCTA
//                                                              -                    
>1A0    CRTTGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
>1B0    CATTGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
>1C0    CATTGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
>1D0    CATTGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
>2E0    CATGGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTWTTTCTCTAAC
>2F0    CATGGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAC
>2G0    CATGGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAC
>2H0    CATGGTTTAACAGAGGTGTTTATAGACATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
>3I0    CATGGTTTAACAGAGGTGTTTATAGAGATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
>3J0    CATGGTTTAACAGAGGTGTTTATAGAGATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
>3K0    CATGGTTTAACAGAGGTGTTTATAGAGATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
>3L0    CATGGTTTAACAGAGGTGTTTATAGAGATCCCAACTGGGCTGGCTATAAGTACGACGTTAACCAGTTTTTCTCTAAG
//       - *                      *                                       -         *
>1A0    AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT
>1B0    AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT
>1C0    AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT
>1D0    AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT
>2E0    AAAATGGCAGCCGTACGCGAGCGTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTGCCCTT
>2F0    AAAATGGCAGCCGTACGCGAGCGTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTGCCCTT
>2G0    AAAATGGCAGCCGTACGCGAGCGTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTGCCCTT
>2H0    AAAATGGCAGCCGTACGCGAGCGTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTGCCCTT
>3I0    AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT
>3J0    AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT
>3K0    AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT
>3L0    AAAATGGCAGCCGTACGCGAGCTTCGCGGTTGTACTGGGGGGGCGCAGCGTCCAGCCCTCTCACCCGAGCTCCTCTT
//                            *                                                * *   
>1A0    CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA
>1B0    CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA
>1C0    CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA
>1D0    CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA
>2E0    CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA
>2F0    CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA
>2G0    CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA
>2H0    CTAGTACTTACGTAGAACGCCCCCCGCCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTCTCTA
>3I0    CTAGTACTTACGTAGAACGCCCCCCGTCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTGTCTA
>3J0    CTAGTACTTACGTAGAACGCCCCCCGTCGCTATGATGCTCCCCAAGCAGCTGTGGAGTGTGTCTA
>3K0    CTAGTACTTACGTAGAACGCCCCCCGTCACTATGATGCTCCCCAAGCAGCTGTGGAGTGTGTCTA
>3L0    CTAGTACTTACGTAGARCGCCCCCCGTCGCTATGATGCTCCCCAAGCAGCCGTGGAGTGTCTCTA
//                      -         * -                     -         *    |
>1A0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>1B0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>1C0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>1D0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>2E0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>2F0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>2G0    ATAAATCTGTAGAGACGTAGACTAAGCGTGGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>2H0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>3I0    ATAAATCTGTAGAGACGTAGACTAAGCATCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>3J0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>3K0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
>3L0    ATAAATCTGTAGAGACGTAGACTAAGCGTCGGTGGACAGTGCACGGGTCTTGTGTTTTTACTTAATAGAGCAAACCC
//                                 - -                                               
>1A0    GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>1B0    GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>1C0    GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>1D0    GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>2E0    GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>2F0    GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>2G0    GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>2H0    GACATCGTTATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>3I0    GACATCGTGATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>3J0    GACATCGTGATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>3K0    GACATCGTGATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
>3L0    GACATCGTGATGATTGAATTAGACACTAAGGGGGAGCGGTCGCTCGGTCTCAGGATCTGGGACTAAAATA
//              *                                                             |
>1A0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>1B0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAAGGATAACGGTCC
>1C0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>1D0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>2E0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>2F0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>2G0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>2H0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>3I0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>3J0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>3K0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTGTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
>3L0    TTATTGCTCGTGGCAGAAGGAATGATAAGTCCGCAGTTCTCTTCTCTTATGCCGGACGTGAATAACGATAACGGTCC
//                                              -                        -           
>1A0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGACCTTCTTTGCCACGCGGGC
>1B0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>1C0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>1D0    CGATGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>2E0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>2F0    CGCTGATTGTTTATTTCAATAGCTATATCAGGAWTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>2G0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACYAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>2H0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>3I0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCMACGCGGGC
>3J0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>3K0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
>3L0    CGCTGATTGTTTATTTCAATAGCTATATCAGGATTATTAGAACCAACTGGGAGCGAGGCCTTCTTTGCCACGCGGGC
//        -                              -         -             -          -        
>1A0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>1B0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>1C0    CACCGATGAGACGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGCCTTTTGGGCTACTTCGAAT
>1D0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>2E0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>2F0    CACCGATGAGCCGRGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>2G0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>2H0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>3I0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>3J0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>3K0    CACCGAYGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCATCTCGTCCCCTGGGACTTTTGGGCTACTTCGAAT
>3L0    CACCGATGAGCCGGGGGATATCCAAATGGTTTATTAAACAGCAKCTCGTCCCCTGGGACTTTTGGACTACTTCGAAT
//            -   -  -                             -             -       -           
>1A0    TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTACAGCACAGTGACTCG
>1B0    TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTACAGCACAGTGACTCG
>1C0    TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTACAGCGCAGTGACTCG
>1D0    TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCTTCTTCGCACTACAGCACAGTGACTCG
>2E0    TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTTCAGCACAGTGACTCG
>2F0    TTGCCCCCGCAGTCCAAACTCACATGGCGGCGCCGTTGCGGGGAGAGTTCATCTTCGCACTACAGCACAGTGACTCG