#!/usr/bin/env python # coding: utf-8 # scikit-bio 0.2.1 introduced the [I/O registry](http://scikit-bio.org/docs/0.5.0/io.html), which provides unified input and output (I/O). Previously, there were many ways to read and write various file formats into and out of scikit-bio objects. The scikit-bio I/O registry provides a single entry point to all I/O using a simple procedural interface. Additionally, the registry dynamically generates an equivalent object-oriented API for any scikit-bio object that can be serialized or deserialized (i.e., written to or read from a file). Finally, the registry supports automatic file format detection via file sniffers. See the [``skbio.io`` docs](http://scikit-bio.org/docs/0.5.0/io.html) for more details on the I/O registry and the benefits it provides. # # We will explore some of the features provided by the I/O registry in the following examples. # ## Procedural and object-oriented API # scikit-bio provides a simple procedural and object-oriented (OO) API for performing file I/O. Both APIs produce the same results and are provided for convenience to the user. Every scikit-bio object that can be read into has a `read` method, and every object that can be written from has a `write` method. These methods and their documentation are *automatically generated* by the I/O registry at runtime! # # For example, let's read a [Newick](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.newick.html) file (taken from [Greengenes](http://www.ncbi.nlm.nih.gov/pubmed/22134646) version `13_8`, 61% OTUs) using the procedural API ([skbio.io.read](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.registry.read.html#skbio.io.registry.read)): # In[1]: import skbio.io from skbio import TreeNode tree_1 = skbio.io.read('data/61_otus.tree', format='newick', into=TreeNode) # Let's now read the same file using the equivalent OO API ([TreeNode.read](http://scikit-bio.org/docs/0.5.0/generated/skbio.tree.TreeNode.read.html#skbio.tree.TreeNode.read)): # In[2]: tree_2 = TreeNode.read('data/61_otus.tree', format='newick') # We obtain equivalent in-memory `TreeNode` objects: # In[3]: tree_1 # In[4]: tree_2 # **Note:** ``TreeNode`` currently does not have ``__eq__``/``equals`` methods, which is why we aren't performing a direct equality comparison here. See [#465](https://github.com/biocore/scikit-bio/issues/465) to track progress on this. # ## Sniffers # What if you don't know the format of a file? The I/O registry provides *file sniffers*, which attempt to determine a file's format by inspecting the beginning of a file. # # For example, suppose we have a file whose format is unknown. To sniff its format: # In[5]: from skbio.io import sniff sniff('data/unknown_file_1') # The sniffer returns a tuple containing the format name and parameters (kwargs), respectively. Format parameters may be returned if the sniffer thinks they are necessary to correctly read the file in the sniffed format. # # In the above example, the sniffer guessed the file format to be [ordination](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.ordination.html). It did not guess any format parameters, which is correct because there are none for the format. # Not every file can be identified by the sniffer. For example, let's try to sniff another unknown file: # In[6]: from skbio.io import UnrecognizedFormatError try: sniff('data/unknown_file_2') except UnrecognizedFormatError as e: print(e) # The sniffer raised an `UnrecognizedFormatError` because it could not identify the file's format (which was intentionally meaningless). A special case of a file that is especially meaningless is the empty file. Sniffing an empty file returns the following: # In[7]: sniff('data/unknown_file_3') # This information can be especially helpful in error messages, when users may have inadvertently created an empty file. # # As previously mentioned, sniffers may return format parameters as kwargs. For example, let's try sniffing another unknown file: # In[8]: sniff('data/unknown_file_4') # The sniffer identified the file as being in [lsmat](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.lsmat.html) format. It also recommends using a comma as the delimiter when reading the file. This differs from the default delimiter, which is the tab character. # # Let's try reading this lsmat-formatted file into a `DistanceMatrix` object: # In[9]: from skbio import DistanceMatrix dm = DistanceMatrix.read('data/unknown_file_4') print(dm) # Note that we didn't have to specify `delimiter=','` in the `DistanceMatrix.read` call because the sniffer is invoked automatically and its recommendations are used by the reader (this can be disabled by explicitly providing the format name and passing `verify=False`). # # We can override the sniffer's recommendations by passing our own delimiter: # In[10]: from skbio import DistanceMatrix from skbio.io import LSMatFormatError try: DistanceMatrix.read('data/unknown_file_4', delimiter='\t') except LSMatFormatError as e: print(e) # Note that we receive a warning stating that the sniffer's recommendations are being ignored in favor of our input. In this case, the lsmat file cannot be successfully read using a tab as the delimiter. # ## Streaming files to a different format # The previous examples showed how to read a file into an object that is represented entirely in memory. In some cases, this is infeasible because you may not have enough memory to store all of the file's contents at once (e.g., reading a 10GB FASTA file on a computer with only 2GB of RAM). # # To address this issue, scikit-bio's I/O registry provides a *streaming* interface via its procedural [skbio.io.read](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.registry.read.html) and [skbio.io.write](http://scikit-bio.org/docs/0.5.0/skbio.io.registry.write.html) functions. When using a streaming reader, pieces of the parsed file are yielded on demand using a Python [generator](https://wiki.python.org/moin/Generators). For example, streaming a FASTA file would yield a single `skbio.Sequence` (or subclass) object at a time instead of loading all sequences at once. # # To illustrate how this works, let's convert an Illumina 1.3 FASTQ file (old Illumina variant) into an Illumina 1.8 FASTQ file (current Illumina variant). This FASTQ file is a small subset (100 sequences) taken from the [Moving Pictures of the Human Microbiome](http://www.ncbi.nlm.nih.gov/pubmed/21624126) study. For more details on FASTQ variants (i.e., different quality score encoding schemes), see [skbio.io.fastq](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.fastq.html). # # First let's create a streaming reader to yield each sequence in our input FASTQ file: # In[11]: seq_gen = skbio.io.read('data/moving_pictures_subset_100.fastq', format='fastq', variant='illumina1.3') seq_gen # Note that nothing has been loaded into memory from disk at this point because `seq_gen` is a generator. We can pass this generator to `skbio.io.write` to have it write a single sequence at a time in the Illumina 1.8 FASTQ variant: # In[12]: skbio.io.write(seq_gen, into='data/moving_pictures_subset_100_illumina1.8.fastq', format='fastq', variant='illumina1.8') # Note that since both streaming readers *and* writers were used, only a single biological sequence was ever loaded into memory at any given point in time. # # Examining the original and converted FASTQ files shows that the quality scores are encoded differently: # In[13]: get_ipython().system('head -4 data/moving_pictures_subset_100.fastq') # In[14]: get_ipython().system('head -4 data/moving_pictures_subset_100_illumina1.8.fastq') # ## Streaming [QSeq](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.qseq.html) to [FASTQ](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.fastq.html) variants # Building on the previous example, let's see a more complex example where we convert a [QSeq](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.qseq.html) file into every [FASTQ](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.fastq.html) variant supported by scikit-bio: Sanger, Illumina 1.3, and Illumina 1.8. The QSeq file contains a subset of the sequences in the [Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms](http://www.nature.com/ismej/journal/v6/n8/full/ismej20128a.html) study. Its quality scores are encoded using the Illumina 1.3 variant. # # **Note:** scikit-bio does not currently support the Solexa FASTQ variant. See [#719](https://github.com/biocore/scikit-bio/issues/719) to track progress on this. # # For each supported variant, we create a streaming reader on the source QSeq file and feed the reader to a streaming writer, supplying the desired variant to convert to: # In[15]: for variant in 'sanger', 'illumina1.3', 'illumina1.8': qseq_gen = skbio.io.read('data/s_1_1_0001_qseq.txt', format='qseq', filter=True, variant='illumina1.3') skbio.io.write(qseq_gen, into='data/s_1_1_0001_%s.fastq' % variant, format='fastq', variant=variant) # Note that we passed `filter=True` when reading the source QSeq file, which filters out sequences that didn't pass CASAVA's filter. This is the QSeq reader's default behavior, but we provide it for explicitness. Thus, the converted FASTQ files will have fewer sequences than the source QSeq file (pass `filter=False` to retain and convert *all* sequences). # # Examining the converted FASTQ files shows that the quality scores are encoded differently, but the IDs and sequence data are identical: # In[16]: get_ipython().system('head -4 data/s_1_1_0001_illumina1.3.fastq') # In[17]: get_ipython().system('head -4 data/s_1_1_0001_illumina1.8.fastq') # In[18]: get_ipython().system('head -4 data/s_1_1_0001_sanger.fastq') # Note that the encoded quality scores are the same for Illumina 1.8 and Sanger because both encoding schemes use the same Phred offset (but they have different [valid ranges](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.fastq.html#quality-score-variants)). # # Next, let's read the original QSeq file and each of the converted FASTQ files into `list` objects: # In[19]: sc_original = list(skbio.io.read('data/s_1_1_0001_qseq.txt', format='qseq', variant='illumina1.3')) sc_illumina13 = list(skbio.io.read('data/s_1_1_0001_illumina1.3.fastq', format='fastq', variant='illumina1.3')) sc_illumina18 = list(skbio.io.read('data/s_1_1_0001_illumina1.8.fastq', format='fastq', variant='illumina1.8')) sc_sanger = list(skbio.io.read('data/s_1_1_0001_sanger.fastq', format='fastq', variant='sanger')) # To verify that the quality scores of each are equal, we define a testing function: # In[20]: import itertools def quality_scores_all_equal(*lists): for l1, l2 in itertools.combinations(lists, 2): for s1, s2 in zip(l1, l2): if (s1.positional_metadata['quality'] != s2.positional_metadata['quality']).any(): return False return True # We see that all lists of sequences are equal, even though their quality scores have been stored on disk using different encoding schemes. The in-memory representations of the Phred quality scores are identical! # In[21]: quality_scores_all_equal(sc_original, sc_illumina13, sc_illumina18, sc_sanger) # ## Writing to [FASTA/QUAL](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.fasta.html) # As a final example, let's write one of our existing `SequenceCollection`s in [FASTA/QUAL](http://scikit-bio.org/docs/0.5.0/generated/skbio.io.format.fasta.html) format. This example is different from what we've seen so far because the file format involves more than a single file (sequence data is stored in a FASTA file, and quality scores are stored in a QUAL file). # # We simply provide two filepaths when calling `write`: # # # *Note: Currently skbio.io only supports generators, not iterables, see [#1031](https://github.com/biocore/scikit-bio/issues/1031) to track progress. # In[22]: generator = (x for x in sc_illumina13) skbio.io.write(generator, into='data/s_1_1_0001.fasta', qual='data/s_1_1_0001.qual', format='fasta') # If we view the resulting FASTA and QUAL files, we see that the IDs, sequence data, and quality scores (now encoded as integer ASCII characters) match our previous results: # In[23]: get_ipython().system('head -4 data/s_1_1_0001.fasta') # In[24]: get_ipython().system('head -4 data/s_1_1_0001.qual')