## Simulate a really big data set for testing¶

This is will be used to test memory limits in ipyrad, and also for comparing run times with other software. RAD data sets have three size dimensions that greatly affect run times: the number of taxa, the number of loci, and the length of reads. Here we will set all three to quite large values, by simulating paired-end 150bp reads for 300 taxa for 100K loci. For simplicity we do not simulate any missing data, but this is not expected to affect run times or memory limits except to the extent that it increased the total number of loci in the data set.

In [7]:
## import Python libraries

In [9]:
%%bash

## this will take about XX minutes to run, sorry, the code is not parallelized
## we simulate 360 tips by using the default 12 taxon tree and requesting 40
## individuals per taxon. Default is theta=0.002. Crown age= 5*2Nu (check this)
simrrls -L 100000 -f pairddrad -l 150 -n 30 -o Big_i360_L100K

## because it takes a long time to simulate this data set, you can alternatively
## The data set is hosted on anaconda, just run the following to get it.

Process is terminated.


In [20]:
data = ip.Assembly("bigHorsePaired")

data.set_params("project_dir", "bigdata")
data.set_params("raw_fastq_path", "bigHorsePaired_R*_.fastq.gz")
data.set_params("barcodes_path", "bigHorsePaired_barcodes.txt")
data.set_params("restriction_overhang", ("TGCAG", "CCGG"))
data.get_params()

  New Assembly: bigHorsePaired
0   assembly_name               bigHorsePaired
1   project_dir                 ./bigdata
2   raw_fastq_path              ./bigHorsePaired_R*_.fastq.gz
3   barcodes_path               ./bigHorsePaired_barcodes.txt
4   sorted_fastq_path
5   assembly_method             denovo
6   reference_sequence
8   restriction_overhang        ('TGCAG', 'CCGG')
9   max_low_qual_bases          5
10  phred_Qscore_offset         33
11  mindepth_statistical        6
12  mindepth_majrule            6
13  maxdepth                    10000
14  clust_threshold             0.85
15  max_barcode_mismatch        0
17  filter_min_trim_len         35
18  max_alleles_consens         2
19  max_Ns_consens              (5, 5)
20  max_Hs_consens              (8, 8)
21  min_samples_locus           4
22  max_SNPs_locus              (20, 20)
23  max_Indels_locus            (8, 8)
24  max_shared_Hs_locus         0.5
25  edit_cutsites               (0, 0)
26  trim_overhang               (0, 0, 0, 0)
27  output_formats              ['l', 'p', 's', 'v']
28  pop_assign_file


## Demultiplex the data files¶

All of the software methods we are comparing will use the same demultiplexed data files, so we will not count this step towards the total run times comparisons.

In [ ]:
data.run('1')


## The data files¶

These are the demultiplexed data files that each method will analyze.

In [ ]:
ls -l bigdata/bigHorsePaired_fastqs/


### Assemble the data set with ipyrad¶

In [ ]:
data.run("234567")


### Assemble the data set with stacks¶

In [ ]:



### Assemble the data set with ddocent¶

In [ ]:



### Assemble the data set with AftrRAD¶

We do not attempt this, since this software was unable to assemble much smaller data sets.

### Compare run times¶

In [ ]: