Assumptions- as with most worflows in development - a working directory is used.
#SR edits - adding direct link to ipynb viewer and raw files
!date
Fri Feb 21 13:31:01 PST 2014
filename='BSMAP2MK_workflow'
!echo 'http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/''{filename}''.ipynb'
http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/BSMAP2MK_workflow.ipynb
!echo 'http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/''{filename}''.ipynb'
http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/BSMAP2MK_workflow.ipynb
#file ID
fid="CgF_nov"
#TIMESTAMP
date=!date +%m%d_%H%M
#working directory (parent)
wd="/Volumes/web/Mollusk/bs_larvae_exp/"
#where is bsmap
bsmap="/Users/Shared/Apps/bsmap-2.73/"
#fastq files location R1 location
R1="/Volumes/web/trilobite/Crassostrea_gigas_HTSdata/batterbox/FCC39EM/Sample_BS_CgF/filtered_BS_CgF_TTAGGC_L004_R1.fastq.gz"
#fastq files location R2 location
#comment out if SE
#R2="/Volumes/web/trilobite/Crassostrea_gigas_HTSdata/batterbox/FCC39EM/Sample_BS_CgF/filtered_BS_CgF_TTAGGC_L004_R1.fastq.gz"
#genome file
genome="/Volumes/web/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa"
cd {wd}
/Volumes/web/Mollusk/bs_larvae_exp
mkdir {fid}_{date}
cd {fid}_{date}
/Volumes/web/Mollusk/bs_larvae_exp/CgF_nov
#option - number of processes
!{bsmap}bsmap -a {R1} -d {genome} -o bsmap_out.sam -p 1
BSMAP v2.73 Start at: Wed Feb 12 16:25:40 2014 Input reference file: /Volumes/web/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa (format: FASTA) Load in 7658 db seqs, total size 557717710 bp. 9 secs passed total_kmers: 43046721 Create seed table. 25 secs passed max number of mismatches: read_length * 8% max gap size: 0 kmer cut-off ratio:5e-07 max multi-hits: 100 max Ns: 5 seed size: 16 index interval: 4 quality cutoff: 0 base quality char: '!' min fragment size:28 max fragemt size:500 start from read #1 end at read #4294967295 additional alignment: T in reads => C in reference mapping strand: ++,-+ Single-end alignment(1 threads) Input read file: /Volumes/web/trilobite/Crassostrea_gigas_HTSdata/batterbox/FCC39EM/Sample_BS_CgF/filtered_BS_CgF_TTAGGC_L004_R1.fastq.gz (format: gzipped FASTQ) Output file: bsmap_out.sam (format: SAM) Thread #0: 50000 reads finished. 29 secs passed Thread #0: 100000 reads finished. 34 secs passed Thread #0: 150000 reads finished. 39 secs passed Thread #0: 200000 reads finished. 43 secs passed Thread #0: 250000 reads finished. 48 secs passed Thread #0: 300000 reads finished. 53 secs passed Thread #0: 345868 reads finished. 57 secs passed Total number of aligned reads: 218893 (63%) Done. Finished at Wed Feb 12 16:26:38 2014 Total time consumed: 58 secs
!python {bsmap}methratio.py -d {genome} -u -z -g -o methratio_out.txt -s {bsmap}samtools bsmap_out.sam
@ Wed Feb 12 16:26:38 2014: reading reference /Volumes/web/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa ... @ Wed Feb 12 16:27:11 2014: reading bsmap_out.sam ... [samopen] SAM header is present: 7658 sequences. @ Wed Feb 12 16:27:19 2014: combining CpG methylation from both strands ... @ Wed Feb 12 16:27:48 2014: writing methratio_out.txt ... @ Wed Feb 12 16:29:45 2014: done. total 168921 valid mappings, 1626861 covered cytosines, average coverage: 1.11 fold.
#command for only obtaining the context '__CG_'
!grep "[A-Z][A-Z]CG[A-Z]" <methratio_out.txt> methratio_out_CG.txt
#obtaining a filtered file with at least 5x coverage
!awk '$8 >= 5' <methratio_out_CG.txt> methratio_out_CG5x.txt
#Now I need to format my files to be read into the methylKit package in R
!/Volumes/web/Mollusk/bs_larvae_exp/methratio.awk.sh methratio_out_CG5x.txt > methratio_out_CG_mkit.txt
!head methratio_out_CG_mkit.txt
chr.Base chr base strand coverage freqC freqT C12802.90 C12802 90 F 1 0.00 1.00 C12802.99 C12802 99 F 1 0.00 1.00 C12802.119 C12802 119 F 1 0.00 1.00 C12802.149 C12802 149 F 1 0.00 1.00 C12960.67 C12960 67 F 1 0.00 1.00 C12960.123 C12960 123 F 1 0.00 1.00 C13208.148 C13208 148 F 1 1.00 0.00 C13208.184 C13208 184 F 1 1.00 0.00 C13766.145 C13766 145 F 1 0.00 1.00