scikit-bio 0.2.1 introduced the I/O registry, which provides unified input and output (I/O). Previously, there were many ways to read and write various file formats into and out of scikit-bio objects. The scikit-bio I/O registry provides a single entry point to all I/O using a simple procedural interface. Additionally, the registry dynamically generates an equivalent object-oriented API for any scikit-bio object that can be serialized or deserialized (i.e., written to or read from a file). Finally, the registry supports automatic file format detection via file sniffers. See the skbio.io docs for more details on the I/O registry and the benefits it provides.

We will explore some of the features provided by the I/O registry in the following examples.

Procedural and object-oriented API

scikit-bio provides a simple procedural and object-oriented (OO) API for performing file I/O. Both APIs produce the same results and are provided for convenience to the user. Every scikit-bio object that can be read into has a read method, and every object that can be written from has a write method. These methods and their documentation are automatically generated by the I/O registry at runtime!

For example, let's read a Newick file (taken from Greengenes version 13_8, 61% OTUs) using the procedural API (skbio.io.read):

In [1]:
import skbio.io
from skbio import TreeNode
tree_1 = skbio.io.read('data/61_otus.tree', format='newick', into=TreeNode)

Let's now read the same file using the equivalent OO API (TreeNode.read):

In [2]:
tree_2 = TreeNode.read('data/61_otus.tree', format='newick')

We obtain equivalent in-memory TreeNode objects:

In [3]:
tree_1
Out[3]:
<TreeNode, name: unnamed, internal node count: 20, tips count: 22>
In [4]:
tree_2
Out[4]:
<TreeNode, name: unnamed, internal node count: 20, tips count: 22>

Note: TreeNode currently does not have __eq__/equals methods, which is why we aren't performing a direct equality comparison here. See #465 to track progress on this.

Sniffers

What if you don't know the format of a file? The I/O registry provides file sniffers, which attempt to determine a file's format by inspecting the beginning of a file.

For example, suppose we have a file whose format is unknown. To sniff its format:

In [5]:
from skbio.io import sniff
sniff('data/unknown_file_1')
Out[5]:
('ordination', {})

The sniffer returns a tuple containing the format name and parameters (kwargs), respectively. Format parameters may be returned if the sniffer thinks they are necessary to correctly read the file in the sniffed format.

In the above example, the sniffer guessed the file format to be ordination. It did not guess any format parameters, which is correct because there are none for the format.

Not every file can be identified by the sniffer. For example, let's try to sniff another unknown file:

In [6]:
from skbio.io import UnrecognizedFormatError
try:
    sniff('data/unknown_file_2')
except UnrecognizedFormatError as e:
    print(e)
Could not detect the format of 'data/unknown_file_2'

The sniffer raised an UnrecognizedFormatError because it could not identify the file's format (which was intentionally meaningless). A special case of a file that is especially meaningless is the empty file. Sniffing an empty file returns the following:

In [7]:
sniff('data/unknown_file_3')
Out[7]:
('<emptyfile>', {})

This information can be especially helpful in error messages, when users may have inadvertently created an empty file.

As previously mentioned, sniffers may return format parameters as kwargs. For example, let's try sniffing another unknown file:

In [8]:
sniff('data/unknown_file_4')
Out[8]:
('lsmat', {'delimiter': ','})

The sniffer identified the file as being in lsmat format. It also recommends using a comma as the delimiter when reading the file. This differs from the default delimiter, which is the tab character.

Let's try reading this lsmat-formatted file into a DistanceMatrix object:

In [9]:
from skbio import DistanceMatrix
dm = DistanceMatrix.read('data/unknown_file_4')
print(dm)
3x3 distance matrix
IDs:
'a', 'b', 'c'
Data:
[[  0.00000000e+00   1.00000000e-02   4.20000000e+00]
 [  1.00000000e-02   0.00000000e+00   1.20000000e+01]
 [  4.20000000e+00   1.20000000e+01   0.00000000e+00]]

Note that we didn't have to specify delimiter=',' in the DistanceMatrix.read call because the sniffer is invoked automatically and its recommendations are used by the reader (this can be disabled by explicitly providing the format name and passing verify=False).

We can override the sniffer's recommendations by passing our own delimiter:

In [10]:
from skbio import DistanceMatrix
from skbio.io import LSMatFormatError

try:
    DistanceMatrix.read('data/unknown_file_4', delimiter='\t')
except LSMatFormatError as e:
    print(e)
Header must start with delimiter '\t'.
/home/evan/biocore/scikit-bio/skbio/io/registry.py:557: ArgumentOverrideWarning: Best guess was: delimiter=',', continuing with user supplied: '\t'
  ArgumentOverrideWarning)

Note that we receive a warning stating that the sniffer's recommendations are being ignored in favor of our input. In this case, the lsmat file cannot be successfully read using a tab as the delimiter.

Streaming files to a different format

The previous examples showed how to read a file into an object that is represented entirely in memory. In some cases, this is infeasible because you may not have enough memory to store all of the file's contents at once (e.g., reading a 10GB FASTA file on a computer with only 2GB of RAM).

To address this issue, scikit-bio's I/O registry provides a streaming interface via its procedural skbio.io.read and skbio.io.write functions. When using a streaming reader, pieces of the parsed file are yielded on demand using a Python generator. For example, streaming a FASTA file would yield a single skbio.Sequence (or subclass) object at a time instead of loading all sequences at once.

To illustrate how this works, let's convert an Illumina 1.3 FASTQ file (old Illumina variant) into an Illumina 1.8 FASTQ file (current Illumina variant). This FASTQ file is a small subset (100 sequences) taken from the Moving Pictures of the Human Microbiome study. For more details on FASTQ variants (i.e., different quality score encoding schemes), see skbio.io.fastq.

First let's create a streaming reader to yield each sequence in our input FASTQ file:

In [11]:
seq_gen = skbio.io.read('data/moving_pictures_subset_100.fastq',
                        format='fastq',
                        variant='illumina1.3')
seq_gen
Out[11]:
<generator object <genexpr> at 0x7f9e0f3817e0>

Note that nothing has been loaded into memory from disk at this point because seq_gen is a generator. We can pass this generator to skbio.io.write to have it write a single sequence at a time in the Illumina 1.8 FASTQ variant:

In [12]:
skbio.io.write(seq_gen,
               into='data/moving_pictures_subset_100_illumina1.8.fastq',
               format='fastq', 
               variant='illumina1.8')
Out[12]:
'data/moving_pictures_subset_100_illumina1.8.fastq'

Note that since both streaming readers and writers were used, only a single biological sequence was ever loaded into memory at any given point in time.

Examining the original and converted FASTQ files shows that the quality scores are encoded differently:

In [13]:
!head -4 data/moving_pictures_subset_100.fastq
@HWI-EAS440_0386:1:23:17547:1423#0/1
TACGNAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGGGGGGATTGGTGTG
+HWI-EAS440_0386:1:23:17547:1423#0/1
hhhdHddddddddfehhfhhhghggfhhhfhhgggfhhgfgdfcfhehfdgfhggfggfggffgddfgdffdgdaagaaddcbdccc]a^ad__a]_____ba_`a`__^__\]^OWZR\Z\\WYTZ_U^BBBBBBBBBBBBBBBBBBBBBB
In [14]:
!head -4 data/moving_pictures_subset_100_illumina1.8.fastq
@HWI-EAS440_0386:1:23:17547:1423#0/1
TACGNAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGGGGGGATTGGTGTG
+
IIIE)EEEEEEEEGFIIGIIIHIHHGIIIGIIHHHGIIHGHEGDGIFIGEHGIHHGHHGHHGGHEEGHEGGEHEBBHBBEEDCEDDD>[email protected]@B>@@@@@[email protected]@@[email protected]@=>?08;3=;==8:5;@6?######################

Streaming QSeq to FASTQ variants

Building on the previous example, let's see a more complex example where we convert a QSeq file into every FASTQ variant supported by scikit-bio: Sanger, Illumina 1.3, and Illumina 1.8. The QSeq file contains a subset of the sequences in the Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms study. Its quality scores are encoded using the Illumina 1.3 variant.

Note: scikit-bio does not currently support the Solexa FASTQ variant. See #719 to track progress on this.

For each supported variant, we create a streaming reader on the source QSeq file and feed the reader to a streaming writer, supplying the desired variant to convert to:

In [15]:
for variant in 'sanger', 'illumina1.3', 'illumina1.8':
    qseq_gen = skbio.io.read('data/s_1_1_0001_qseq.txt', 
                             format='qseq',
                             filter=True,
                             variant='illumina1.3')
    skbio.io.write(qseq_gen, 
                   into='data/s_1_1_0001_%s.fastq' % variant, 
                   format='fastq',
                   variant=variant)

Note that we passed filter=True when reading the source QSeq file, which filters out sequences that didn't pass CASAVA's filter. This is the QSeq reader's default behavior, but we provide it for explicitness. Thus, the converted FASTQ files will have fewer sequences than the source QSeq file (pass filter=False to retain and convert all sequences).

Examining the converted FASTQ files shows that the quality scores are encoded differently, but the IDs and sequence data are identical:

In [16]:
!head -4 data/s_1_1_0001_illumina1.3.fastq
@M10_68:1:1:19607:29475#0/1
GACATAAGGGTGGTTAGTATACCGGCAAGGACGGGGTTACTAGTGACGTCCTTCCCCGTATGCCGGGCAATAATGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCTGCGGATTGGTTTCGCTGAATCAGATTATT
+
Z__c\JQ`cc[[_[bfff[[`Qbdge_YYOOHO^cF[FUb_VHMHV`T`dBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
In [17]:
!head -4 data/s_1_1_0001_illumina1.8.fastq
@M10_68:1:1:19607:29475#0/1
GACATAAGGGTGGTTAGTATACCGGCAAGGACGGGGTTACTAGTGACGTCCTTCCCCGTATGCCGGGCAATAATGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCTGCGGATTGGTTTCGCTGAATCAGATTATT
+
;@@D=+2ADD<<@<CGGG<<[email protected]::00)0?D'<'[email protected]).)7A5AE######################################################################################################
In [18]:
!head -4 data/s_1_1_0001_sanger.fastq
@M10_68:1:1:19607:29475#0/1
GACATAAGGGTGGTTAGTATACCGGCAAGGACGGGGTTACTAGTGACGTCCTTCCCCGTATGCCGGGCAATAATGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCTGCGGATTGGTTTCGCTGAATCAGATTATT
+
;@@D=+2ADD<<@<CGGG<<[email protected]::00)0?D'<'[email protected]).)7A5AE######################################################################################################

Note that the encoded quality scores are the same for Illumina 1.8 and Sanger because both encoding schemes use the same Phred offset (but they have different valid ranges).

Next, let's read the original QSeq file and each of the converted FASTQ files into list objects:

In [19]:
sc_original = list(skbio.io.read('data/s_1_1_0001_qseq.txt', format='qseq', variant='illumina1.3'))
sc_illumina13 = list(skbio.io.read('data/s_1_1_0001_illumina1.3.fastq', format='fastq', variant='illumina1.3'))
sc_illumina18 = list(skbio.io.read('data/s_1_1_0001_illumina1.8.fastq', format='fastq', variant='illumina1.8'))
sc_sanger = list(skbio.io.read('data/s_1_1_0001_sanger.fastq', format='fastq', variant='sanger'))

To verify that the quality scores of each are equal, we define a testing function:

In [20]:
import itertools

def quality_scores_all_equal(*lists):
    for l1, l2 in itertools.combinations(lists, 2):
        for s1, s2 in zip(l1, l2):
            if (s1.positional_metadata['quality'] != s2.positional_metadata['quality']).any():
                return False
    return True

We see that all lists of sequences are equal, even though their quality scores have been stored on disk using different encoding schemes. The in-memory representations of the Phred quality scores are identical!

In [21]:
quality_scores_all_equal(sc_original, sc_illumina13, sc_illumina18, sc_sanger)
Out[21]:
True

Writing to FASTA/QUAL

As a final example, let's write one of our existing SequenceCollections in FASTA/QUAL format. This example is different from what we've seen so far because the file format involves more than a single file (sequence data is stored in a FASTA file, and quality scores are stored in a QUAL file).

We simply provide two filepaths when calling write:

*Note: Currently skbio.io only supports generators, not iterables, see #1031 to track progress.

In [22]:
generator = (x for x in sc_illumina13)
skbio.io.write(generator, into='data/s_1_1_0001.fasta', qual='data/s_1_1_0001.qual', format='fasta')
Out[22]:
'data/s_1_1_0001.fasta'

If we view the resulting FASTA and QUAL files, we see that the IDs, sequence data, and quality scores (now encoded as integer ASCII characters) match our previous results:

In [23]:
!head -4 data/s_1_1_0001.fasta
>M10_68:1:1:19607:29475#0/1
GACATAAGGGTGGTTAGTATACCGGCAAGGACGGGGTTACTAGTGACGTCCTTCCCCGTATGCCGGGCAATAATGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCTGCGGATTGGTTTCGCTGAATCAGATTATT
>M10_68:1:1:9095:29475#0/1
GAATATCCTTAAGATGGCGTTCAGCAGCCCGCTTGCGGCCAAACTGCTTAACCGTCTTCTCGTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCGGTGGTCTATAGTGTTATTAATATCAAGTTGGGGGAGCACATTGTAGCATTGTGCC
In [24]:
!head -4 data/s_1_1_0001.qual
>M10_68:1:1:19607:29475#0/1
26 31 31 35 28 10 17 32 35 35 27 27 31 27 34 38 38 38 27 27 32 17 34 36 39 37 31 25 25 15 15 8 15 30 35 6 27 6 21 34 31 22 8 13 8 22 32 20 32 36 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
>M10_68:1:1:9095:29475#0/1
16 16 25 32 33 35 35 30 27 35 37 37 35 34 31 10 34 36 30 38 38 40 40 40 37 36 36 37 38 35 38 40 25 37 37 40 37 15 31 37 32 7 23 37 36 35 36 37 40 37 27 36 36 36 35 35 35 35 34 34 34 30 31 33 33 33 33 31 26 20 20 25 31 33 33 33 33 33 35 36 33 33 31 33 33 33 33 33 33 33 33 30 12 24 24 29 33 33 27 29 33 33 33 33 18 29 33 35 35 35 31 33 33 35 33 32 25 29 31 33 35 32 25 29 31 31 29 31 24 5 8 15 15 20 27 30 31 25 31 29 32 27 32 32 33 35 33 33 31 33 32 33