Olympia Oyster (Pat) -Initial- PacBio Data Analysis

Includes conversion, mapping RNA-seq data, and blast comparison of transcriptome
Focus is on scaffolds made from assembly of reads > 10k bp

In [1]:
#updated - trying out new software
!date
Sat Feb 22 10:43:06 PST 2014

pbh5tools

In [45]:
!bash5tools.py m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5.1 --outFilePrefix myreads --outType fasta --readType Raw
usage: bash5tools.py [-h] [--verbose] [--version] [--profile] [--debug] [--outFilePrefix OUTFILEPREFIX]
                     [--readType {ccs,subreads,unrolled}] [--outType OUTTYPE] [--minLength MINLENGTH]
                     [--minReadScore MINREADSCORE] [--minPasses MINPASSES]
                     input.bas.h5
bash5tools.py: error: argument --readType: invalid choice: 'Raw' (choose from 'ccs', 'subreads', 'unrolled')
In [43]:
!bash5tools.py -h
usage: bash5tools.py [-h] [--verbose] [--version] [--profile] [--debug] [--outFilePrefix OUTFILEPREFIX]
                     [--readType {ccs,subreads,unrolled}] [--outType OUTTYPE] [--minLength MINLENGTH]
                     [--minReadScore MINREADSCORE] [--minPasses MINPASSES]
                     input.bas.h5

Tool for extracting data from .bas.h5 files

positional arguments:
  input.bas.h5          input .bas.h5 filename

optional arguments:
  -h, --help            show this help message and exit
  --verbose, -v         Set the verbosity level (default: None)
  --version             show program's version number and exit
  --profile             Print runtime profile at exit (default: False)
  --debug               Run within a debugger session (default: False)
  --outFilePrefix OUTFILEPREFIX
                        output filename prefix [None]
  --readType {ccs,subreads,unrolled}
                        read type (ccs, subreads, or unrolled) []
  --outType OUTTYPE     output file type (fasta, fastq) [fasta]

Read filtering arguments:
  --minLength MINLENGTH
                        min read length [0]
  --minReadScore MINREADSCORE
                        min read score, valid only with --readType={unrolled,subreads} [0]
  --minPasses MINPASSES
                        min number of CCS passes, valid only with --readType=ccs [0]
!wget http://eagle.fish.washington.edu/whale/Pat/Analysis_Results/m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5
In [33]:
pwd
Out[33]:
u'/Users/sr320/Desktop'
In [37]:
!wc m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5.1
^C

Initial analysis below


Data was provided by core facility with comments:

Data are grouped into folders by SMRT cell. Folders are generically named by the position of the cell in the run (A01_1, B01_1, etc.). Reads are available in PacBio's HDF5 format as .bas.h5 files. Metadata for each SMRT cell is available in PacBio's XML format.

Tools for analysis of .bas.h5 files are available on PacBio's DevNet site (http://pacbiodevnet.com/) including the SMRT Analysis toolkit which can perform de novo assembly and resequencing.

Data was downloaded locally.
http://eagle.fish.washington.edu/whale/index.php?dir=Pat%2F

Two files

Screenshot%207/8/13%2012:20%20PM

Conversion to fastq

via Giles.

Basically I took the bas.h5 file and ran it through the pbh5tools, specifically bash5tools.py.

The exact command I ran was

bash5tools.py --readType unrolled --outType fastq inputfile.bas.h5

where inputfile.bas.h5 was the name of the file you got from them.

My only question was whether I did the --readType correctly. I found some info on that online as well as some examples of running this command and they all talked about using a "Raw" mode which was removed from this version of bash5tools so I assumed it was the unrolled option because the other two options (ccs and subreads) where the same in previous versions.

To Install,

First you need numpy and h5py, both are python libraries. I installed these via MacPorts. Then you to install the pbcore library, which is a python library provided by PacBios (same website as the pbh5tools).

Basically download pbcore, unpack, then run python setup.py build sudo python setup.py install

This will install it with all the other python libraries, then do same with pbh5tools. You might need to add things to your PYTHONPATH and PATH environment variables. If you want to, you can customize where it goes by adding a --prefix to the install command but then you have to adjust PYTHONPATH and PATH (which is what I did).

Links:

Understanding PacBios reads https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Understanding-PacBio-transcriptome-data#wiki-readexplained

Examples for bash5tools.py http://seqanswers.com/forums/archive/index.php/t-16895.html

Github for tools https://github.com/PacificBiosciences


resulting in
http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_1.fastq

Importing into CLC

info via CLC

CLC bio tools are currently not optimized to handle the specific error profile of PacBio reads, and we therefore do not yet provide support for using this data type.

If you still wish to try using PacBio data in the Genomics Workbench, then if you have your data in standard fastq format, you should be able to succeed in importing it into the CLC bio Genomics Workbench using the Illumina import option. If you choose to do this, please choose the option "NCBI/Sanger or Illumina Pipeline 1.8 or later" under 'Quality scores' in the Import wizard .

Our developers are currently evaluating methods for error correction of PacBio reads (using short read data), and for hybrid de novo assembly using PacBio and short read data combined as input. If the outcome of this work matches the performance of tools commonly used for PacBio error correction (e.g. Celera Assembler using the pacBioToCA utility; Koren et al. 2012. Nature Biotechnology), or hybrid de novo assembly (e.g. ALPATHS-LG; Ribeiro et al. 2012. Genome Research), we will be considering providing such functionality in the future.

In [5]:
!head /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa
>OlyO_Pat_PacBio_10k_contig_1
AAAAAAAGGGAGATGTTTTCCTCATGTTGAATTGAATTCTTCAACTCATTTAAATCAGGC
AAGTGTTCAACACTAACGCTAGTTCCGGTTAGCCTGTAGGTCTAATAACTTTTGTCCAAA
CTACCAGGATATACAATTATGCACTGTTTAGCAGGGAGACATCACAAAAGGTATTTAATT
CCGATTACGCAAGAGCTTTTCTGCGTATTGATCAGGTTTTTTGAATCGAGGCAATGCTAC
CTTACAGACAAAGTTTGTTGTTGTTCAGGGGTTTCAATAGTCACGTTTTAAAGGCAGCAT
AATTTCGTTATAATTCTAATGGTCGTTATACAATTTAGTTTACCAATATTTGACCTATCA
TTTGAGTCAAATACTTGGTCCTGATTGTGGTTTCATAGCCATTGTTAAGTGCCGTTTTAC
ACACTGTATTTGACTACGGACTGTTCCGTTACCTGATCAAGACTATGGGGCCTCCCACGG
CGGGTGTGACGCGGTCAACAGGGGATGCCTACTCCTCTCTAGGCAGCCTGATCCCCACTT

Ran cd-hit-est via webserver

Summary: Nothing combined at 90% level (probably to big for cd-est hit)


Importing in iPlant

Screenshot%207/10/13%207:09%20AM

TopHat2-PE
Screenshot%207/10/13%207:10%20AM


Screenshot%207/10/13%207:11%20AM


Reference Genome - OlyO_Pat_PacBio_10k_contigs.fa

Output - OlyO_10kcontigs_TopHatm106

In [32]:
#finished

output file
http://eagle.fish.washington.edu/cnidarian/OlyO10k_filtered_106A_Male_Mix_TAGCTT_L004_R1.bam

Screenshot%207/10/13%201:53%20PM


Annotating 10kcontigs

http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa

Have working copy of transcriptome http://eagle.fish.washington.edu/cnidarian/OlyOv3_400bp.fasta and will blastn against 10k contigs

In [8]:
!makeblastdb -in /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa -dbtype nucl -out /Volumes/Bay3/Software/ncbi-blast-2.2.27\+/db/OlyO_PacBio_10kcontigs

Building a new DB, current time: 07/10/2013 07:40:30
New DB name:   /Volumes/Bay3/Software/ncbi-blast-2.2.27+/db/OlyO_PacBio_10kcontigs
New DB title:  /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 553 sequences in 0.575964 seconds.
In [11]:
!blastn -query /Volumes/web/cnidarian/OlyOv3_400bp.fasta -db /Volumes/Bay3/Software/ncbi-blast-2.2.27\+/db/OlyO_PacBio_10kcontigs -out /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs -outfmt 6 -evalue 1E-20 -num_threads 2 -task blastn
In [12]:
!head /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs
4486232	OlyO_Pat_PacBio_10k_contig_319	89.39	132	4	5	142	264	21178	21048	1e-39	 161
4486232	OlyO_Pat_PacBio_10k_contig_319	90.00	130	4	6	268	390	13091	12964	1e-38	 158
4486232	OlyO_Pat_PacBio_10k_contig_319	90.00	130	4	6	268	390	20765	20638	1e-38	 158
4486232	OlyO_Pat_PacBio_10k_contig_319	87.88	132	6	5	142	264	13504	13374	5e-37	 152
4486232	OlyO_Pat_PacBio_10k_contig_319	84.50	129	9	9	144	267	11561	11683	4e-25	 113
4486232	OlyO_Pat_PacBio_10k_contig_319	84.50	129	9	9	144	267	19235	19357	4e-25	 113
4486232	OlyO_Pat_PacBio_10k_contig_319	82.14	140	9	11	142	265	5867	5728	7e-23	 105
4486255	OlyO_Pat_PacBio_10k_contig_420	77.51	418	31	33	36	397	9827	10237	4e-63	 239
4486256	OlyO_Pat_PacBio_10k_contig_420	77.51	418	31	33	5	366	10237	9827	4e-63	 239
4486297	OlyO_Pat_PacBio_10k_contig_420	78.89	289	34	17	2	276	24459	24184	2e-49	 194
In [13]:
!head /Volumes/web/cnidarian/OlyOv3_400bp.fasta
>4485895 length 400 cvg_67.8_tip_0
ACCCAGAAAGGTTTAAAGAATGTATTTGATGAAGCCATTCTGGCTGCTCTGGAACCTCCTGAACCACCCAAAAAGAAGAAGTGTGTGTTGTTGTAATCTT
TGAACTCTCGTCAGTTTCATGTGTAATCATAGAATGATTTCAACTTGTCATCTGTGGGAAAATCTTGTGCAAAATTAAAAATAAAAACCACTTTTATACA
TGTCTGGATAAGTATTTTCACAGATGGAAGAGTGCGGGTTGAAATAGAGATTATTCCAACTTTCTGAAGAAAAGGAATATTTGAAGTTCCTGAGACGGAA
AAGGCAGGTGTTATTTTCAAGCGAACCACTAGCACAGTGCTGTGGTTTTATTATCCCATATGGGTCCAATGAACATATGATTTGTAAATATATATATAAT

>4485897 length 400 cvg_5.5_tip_1
ACACTGCACATCGCGGTCCTAGATTTTAACGACAACACGCCGTATTTCCTTAACAGTACATATAAATTTAGTGAAAATGAATCCACTTACAACAGAACAA
GAATTGGAGCATTGTATGCCCATGATCTGGACTCGGGACAAAACGCCAATATAACGTTTTCTATCTCTGGAGGAAACAGTCAAGGACATTTTCAAATAGA
TCCATACACGGGTGATTTATTCATAAATGGCCTTATCGACAGAGAAAATGTATCCTCATACAACCTGAGAGTTACTATAAGAGATAATCCGTCTAATCCG
In [14]:
!grep -c "OlyO_Pat_" /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs 
1730

Will try perl script I modified to take this "reverse" blast and make a gff file.

Another option in SQLShare

In [27]:
!2_Blast2gff.pl -i /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs -o /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff -d "OlyOv3_400" -p EXON -s "something"
In [30]:
mv /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff /Volumes/web/cnidarian/
mv: /Volumes/web/cnidarian/OlyO10kcontigs_exon_a.gff: set owner/group (was: 501/20): Operation not permitted

perl script would not write to eagle so had to move

In [31]:
!head /Volumes/web/cnidarian/OlyO10kcontigs_exon_a.gff
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	21178	21048	1e-39	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	13091	12964	1e-38	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	20765	20638	1e-38	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	13504	13374	5e-37	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	11561	11683	4e-25	+	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	19235	19357	4e-25	+	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	5867	5728	7e-23	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	9827	10237	4e-63	+	.	4486255	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	10237	9827	4e-63	-	.	4486256	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	24459	24184	2e-49	-	.	4486297	

Screenshot%207/10/13%208:18%20AM

In [2]:
!wget https://usegalaxy.org/dataset/display?dataset_id=bbd44e69cb8906b586614e867a9985b1&to_ext=bed
In [3]:
pwd
Out[3]:
u'/Users/Steven/Dropbox/Steven/ipython_nb'
In [4]:
ls
BiGill_Gene_Methylation.ipynb*         README.md
BiGill_RNAseq.ipynb                    README_old.md
BiGill_Tran_Elements.ipynb*            Roberts_cv.ipynb
BiGill_array.ipynb                     Ruphi_OA_RNAseq.ipynb*
BiGo - methratio error.ipynb*          TJGR_CpG_binding.ipynb
BiGo_GFF_dev.ipynb*                    TJGR_CpG_islands.ipynb
BiGo_RNAseq.ipynb*                     TJGR_Methylation_GenomeSnapshot.ipynb
BiGo_larvae.ipynb                      TJGR_Mgo_Expression.ipynb*
BiGo_larvae_2.ipynb                    TJGR_OysterGenome_IGV.ipynb*
BiGo_larvae_methylkit.ipynb            TJGR_ProteinAnnot.ipynb
BiGo_methratio.ipynb                   TJGR_pearl.ipynb*
BiGo_methratio_mito.ipynb              UW_SoftwareBootcamp.ipynb
BlackAb_Annot.ipynb*                   _BiGO_expPLOT.ipynb
CC_ampk.ipynb                          _BiGo_RNAseq.pdf
EY_methratio.ipynb                     _BiGo_larvae_bsmap274.ipynb
Gene Specific Methylation.ipynb*       _MGarray_blast.ipynb
LT_C1q.ipynb                           _scratch.ipynb
OA_MS_data_check.ipynb                 fish546/
OlyO_Chi_Exp.ipynb*                    gen_workflows.ipynb
OlyO_GonadExp.ipynb                    img/
OlyO_PacBio.ipynb*                     intersectbed.ipynb
OlyO_transcriptome.ipynb               snippets.md
OsHV_host.ipynb*                       tools/
PAG_2014.ipynb

Version 2 of genome

In [12]:
!/Users/sr320/Dropbox/Steven/gt/fish546/2_Blast2Gff.pl \
-i /Volumes/web/dermochelys/Bioinformatics/IGV\ stuff/otbnos.tab \
-o /Volumes/web/cnidarian/Olytranv2_OlyPacBio_v02.gff \
-d "OlyOvtran_v2" \
-p EXON \
-s "something"
In [5]:
!head /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs
4486232	OlyO_Pat_PacBio_10k_contig_319	89.39	132	4	5	142	264	21178	21048	1e-39	 161
4486232	OlyO_Pat_PacBio_10k_contig_319	90.00	130	4	6	268	390	13091	12964	1e-38	 158
4486232	OlyO_Pat_PacBio_10k_contig_319	90.00	130	4	6	268	390	20765	20638	1e-38	 158
4486232	OlyO_Pat_PacBio_10k_contig_319	87.88	132	6	5	142	264	13504	13374	5e-37	 152
4486232	OlyO_Pat_PacBio_10k_contig_319	84.50	129	9	9	144	267	11561	11683	4e-25	 113
4486232	OlyO_Pat_PacBio_10k_contig_319	84.50	129	9	9	144	267	19235	19357	4e-25	 113
4486232	OlyO_Pat_PacBio_10k_contig_319	82.14	140	9	11	142	265	5867	5728	7e-23	 105
4486255	OlyO_Pat_PacBio_10k_contig_420	77.51	418	31	33	36	397	9827	10237	4e-63	 239
4486256	OlyO_Pat_PacBio_10k_contig_420	77.51	418	31	33	5	366	10237	9827	4e-63	 239
4486297	OlyO_Pat_PacBio_10k_contig_420	78.89	289	34	17	2	276	24459	24184	2e-49	 194
In [4]:
!head /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	21178	21048	1e-39	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	13091	12964	1e-38	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	20765	20638	1e-38	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	13504	13374	5e-37	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	11561	11683	4e-25	+	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	19235	19357	4e-25	+	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	5867	5728	7e-23	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	9827	10237	4e-63	+	.	4486255	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	10237	9827	4e-63	-	.	4486256	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	24459	24184	2e-49	-	.	4486297	
In [11]:
!head /Volumes/web/dermochelys/Bioinformatics/IGV\ stuff/otbnos.tab
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_549	87.16	7716	122	762	6	7148	9554	2135	0.0	7956
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_550	86.10	6915	116	736	873	7148	7401	693	0.0	6665
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_550	87.76	1699	18	167	874	2434	7475	9121	0.0	1810
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_550	85.99	828	18	90	2747	3508	11421	10626	0.0	 797
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_550	80.39	1188	15	153	2441	3428	9248	10417	0.0	 702
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_550	91.46	82	1	5	2559	2637	5083	5005	1e-21	 108
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_1964	81.58	7947	208	1059	6	7148	9076	1582	0.0	5406
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_5046	87.24	4946	52	527	873	5348	5837	10673	0.0	5103
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_252	82.64	6913	77	873	1241	7150	6128	12920	0.0	5081
Olurida_trim_nodups_v2reads_contig_9	OlyO_Pat_PacBio_1_contig_2528	84.63	5140	96	596	6	4590	4524	9524	0.0	4475
In [ ]:
# could just awk etc "!awk '{ print $10 }'
In [13]:
!head /Volumes/web/cnidarian/Olytranv2_OlyPacBio_v02.gff
OlyO_Pat_PacBio_1_contig_549	blastn:OlyOvtran_v2	blastn	9554	2135	0.0	-	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_550	blastn:OlyOvtran_v2	blastn	7401	693	0.0	-	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_550	blastn:OlyOvtran_v2	blastn	7475	9121	0.0	+	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_550	blastn:OlyOvtran_v2	blastn	11421	10626	0.0	-	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_550	blastn:OlyOvtran_v2	blastn	9248	10417	0.0	+	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_550	blastn:OlyOvtran_v2	blastn	5083	5005	1e-21	-	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_1964	blastn:OlyOvtran_v2	blastn	9076	1582	0.0	-	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_5046	blastn:OlyOvtran_v2	blastn	5837	10673	0.0	+	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_252	blastn:OlyOvtran_v2	blastn	6128	12920	0.0	+	.	Olurida_trim_nodups_v2reads_contig_9	
OlyO_Pat_PacBio_1_contig_2528	blastn:OlyOvtran_v2	blastn	4524	9524	0.0	+	.	Olurida_trim_nodups_v2reads_contig_9	
In [ ]: