Using the LCA_Database API

LCA_Database objects combine a fast in-memory storage of signatures indexed by their hash values, with taxonomic lineage storage. They are limited to storing scaled DNA signatures with a single ksize; the scaled and ksize values are specified at creation.

Running this notebook.

You can run this notebook interactively via mybinder; click on this button: Binder

A rendered version of this notebook is available at sourmash.readthedocs.io under "Tutorials and notebooks".

You can also get this notebook from the doc/ subdirectory of the sourmash github repository. See binder/environment.yaml for installation dependencies.

What is this?

This is a Jupyter Notebook using Python 3. If you are running this via binder, you can use Shift-ENTER to run cells, and double click on code cells to edit them.

Contact: C. Titus Brown, [email protected] Please file issues on GitHub if you have any questions or comments!

Creating an LCA_Database object

Create an LCA_Database like so:

In [1]:
import sourmash
db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)

Create signatures for some genomes, load them, and add them:

In [2]:
!sourmash sketch dna -p k=31,scaled=1000 genomes/*
== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

computing signatures for files: genomes/akkermansia.fa, genomes/shew_os185.fa, genomes/shew_os223.fa
Computing a total of 1 signature(s).
skipping genomes/akkermansia.fa - already done
skipping genomes/shew_os185.fa - already done
skipping genomes/shew_os223.fa - already done
In [3]:
sig1 = sourmash.load_one_signature('akkermansia.fa.sig', ksize=31)
sig2 = sourmash.load_one_signature('shew_os185.fa.sig', ksize=31)
sig3 = sourmash.load_one_signature('shew_os223.fa.sig', ksize=31)
In [4]:
db.insert(sig1, ident='akkermansia')
db.insert(sig2, ident='shew_os185')
db.insert(sig3, ident='shew_os223')
Out[4]:
490

Run search and gather via the Index API

In [5]:
from pprint import pprint
pprint(db.search(sig1, threshold=0.1))
[(1.0,
  SourmashSignature('CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome', 6822e0b7),
  None)]
In [6]:
pprint(db.search(sig2, threshold=0.1))
[(1.0,
  SourmashSignature('NC_009665.1 Shewanella baltica OS185, complete genome', b47b13ef),
  None),
 (0.22846441947565543,
  SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6),
  None)]
In [7]:
pprint(db.gather(sig3))
[(1.0,
  SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6),
  None)]

Retrieve all signatures with signatures()

In [8]:
for i in db.signatures():
    print(i)
CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome
NC_009665.1 Shewanella baltica OS185, complete genome
NC_011663.1 Shewanella baltica OS223, complete genome

Access identifiers and names

The list of (unique) identifiers in the database can be accessed via the attribute ident_to_idx, which maps to integer identifiers; identifiers can also retrieve full names, which are taken from sig.name() upon insertion.

In [9]:
pprint(db.ident_to_idx.keys())
dict_keys(['akkermansia', 'shew_os185', 'shew_os223'])
In [10]:
pprint(db.ident_to_name)
{'akkermansia': 'CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete '
                'genome',
 'shew_os185': 'NC_009665.1 Shewanella baltica OS185, complete genome',
 'shew_os223': 'NC_011663.1 Shewanella baltica OS223, complete genome'}

Access hash values directly

The attribute hashval_to_idx contains a mapping from individual hash values to sets of idx indices.

See the method _find_signatures() for an example of how this is used in search and gather.

In [11]:
print('{} hash values total in this database'.format(len(db.hashval_to_idx)))
1300 hash values total in this database
In [12]:
all_idx = set()
for idx_set in db.hashval_to_idx.values():
    all_idx.update(idx_set)
print('belonging to signatures with idx {}'.format(all_idx))
belonging to signatures with idx {0, 1, 2}
In [13]:
first_three_hashvals = list(db.hashval_to_idx)[:3]
In [14]:
for hashval in first_three_hashvals:
    print('hashval {} belongs to idxs {}'.format(hashval, db.hashval_to_idx[hashval]))
hashval 17302105753387 belongs to idxs {0}
hashval 95741036335406 belongs to idxs {0}
hashval 165640715598232 belongs to idxs {0}
In [15]:
query_idx = 2
hashval_set = set()
for hashval, idx_set in db.hashval_to_idx.items():
    if query_idx in idx_set:
        hashval_set.add(hashval)
        
print('{} hashvals belong to query idx {}'.format(len(hashval_set), query_idx))

ident = db.idx_to_ident[query_idx]
print('query idx {} matches to ident {}'.format(query_idx, ident))

name = db.ident_to_name[ident]
print('query idx {} matches to name {}'.format(query_idx, name))
490 hashvals belong to query idx 2
query idx 2 matches to ident shew_os223
query idx 2 matches to name NC_011663.1 Shewanella baltica OS223, complete genome

Lineage storage and retrieval

In [16]:
from sourmash.lca import LineagePair
In [17]:
superkingdom = LineagePair('superkingdom', 'Bacteria')
phylum = LineagePair('phylum', 'Verrucomicrobia')
klass = LineagePair('class', 'Verrucomicrobiae')

lineage = (superkingdom, phylum, klass)
In [18]:
db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)
db.insert(sig1, lineage=lineage)
Out[18]:
499
In [19]:
# by default, the identifier is the signature name --
ident = sig1.name
idx = db.ident_to_idx[ident]
print("ident '{}' has idx {}".format(ident, idx))

lid = db.idx_to_lid[idx]
print("lid for idx {} is {}".format(idx, lid))

lineage = db.lid_to_lineage[lid]
display = sourmash.lca.display_lineage(lineage)
print("lineage for lid {} is {}".format(lid, display))
ident 'CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome' has idx 0
lid for idx 0 is 0
lineage for lid 0 is Bacteria;Verrucomicrobia;Verrucomicrobiae

Lineage manipulation

Default taxonomy ranks for lineages

While sourmash lineage functions can work with any taxonomy ranks and any taxonomy names, both the NCBI and GTDB taxonomies use superkingdom/phylum/etc, so there is a hard coded list availalbe via sourmash.lca.taxlist().

In [20]:
print(list(sourmash.lca.taxlist()))
['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'strain']

Given a taxonomy as a list, you can then construct a lineage like so:

In [21]:
linstr1 = ["Bacteria", "Verrucomicrobia", "Verrucomicrobiae",
           "Verrucomicrobiales", "Akkermansiaceae", "Akkermansia",
           "Akkermansia muciniphila", "Akkermansia muciniphila ATCC BAA-835"]

lineage1 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr1) ]
pprint(lineage1)
[LineagePair(rank='superkingdom', name='Bacteria'),
 LineagePair(rank='phylum', name='Verrucomicrobia'),
 LineagePair(rank='class', name='Verrucomicrobiae'),
 LineagePair(rank='order', name='Verrucomicrobiales'),
 LineagePair(rank='family', name='Akkermansiaceae'),
 LineagePair(rank='genus', name='Akkermansia'),
 LineagePair(rank='species', name='Akkermansia muciniphila'),
 LineagePair(rank='strain', name='Akkermansia muciniphila ATCC BAA-835')]
In [22]:
# display lineages as strings with 'sourmash.lca.display_lineage()'
sourmash.lca.display_lineage(lineage1)
Out[22]:
'Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835'

sourmash lowest-common-ancestor functions

The LCA functionality available in sourmash is built around some simple lineage manipulation functions -- build_tree and find_lca.

First, let's define some more lineages --

In [23]:
linstr2 = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
           "Alteromonadales", "Shewanellaceae", "Shewanella",
           "Shewanella baltica", "Shewanella baltica OS185"]
lineage2 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr2) ]

linstr3 = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
           "Alteromonadales", "Shewanellaceae", "Shewanella",
           "Shewanella baltica", "Shewanella baltica OS223"]
lineage3 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr3) ]

print('lineage2 is', sourmash.lca.display_lineage(lineage2))
print('lineage3 is', sourmash.lca.display_lineage(lineage3))
lineage2 is Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185
lineage3 is Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223

Now, build a tree structure that collapses these lineages where it can, and run some LCA analyses. Lineages 1 and 2 collapse to superkingdom:

In [24]:
tree = sourmash.lca.build_tree([lineage1, lineage2])
sourmash.lca.find_lca(tree)
Out[24]:
((LineagePair(rank='superkingdom', name='Bacteria'),), 2)

while lineages 2 and 3 collapse to species:

In [25]:
tree = sourmash.lca.build_tree([lineage2, lineage3])
sourmash.lca.find_lca(tree)
Out[25]:
((LineagePair(rank='superkingdom', name='Bacteria'),
  LineagePair(rank='phylum', name='Proteobacteria'),
  LineagePair(rank='class', name='Gammaproteobacteria'),
  LineagePair(rank='order', name='Alteromonadales'),
  LineagePair(rank='family', name='Shewanellaceae'),
  LineagePair(rank='genus', name='Shewanella'),
  LineagePair(rank='species', name='Shewanella baltica')),
 2)

Convenience functions let you make use of LCA_Database stored lineages

First, let's create a database from 3 signatures, and this time we'll store lineages in there:

In [26]:
db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)
db.insert(sig1, lineage=lineage1)
db.insert(sig2, lineage=lineage2)
db.insert(sig3, lineage=lineage3)
Out[26]:
490

Now, for any collection of hashes, you can retrieve all the lineage assignments into a dictionary: { hashval: set of lineages }

In [27]:
assignments = sourmash.lca.gather_assignments(sig2.minhash.get_mins(), [db])
print('num hashvals:', len(assignments))
num hashvals: 494
/Users/t/miniconda3/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:1: DeprecatedWarning: get_mins is deprecated as of 3.5 and will be removed in 5.0. Use .hashes property instead.
  """Entry point for launching an IPython kernel.

For example, this particular hashvalue belongs to two different lineages:

In [28]:
assignments[196037984804395]
Out[28]:
{(LineagePair(rank='superkingdom', name='Bacteria'),
  LineagePair(rank='phylum', name='Proteobacteria'),
  LineagePair(rank='class', name='Gammaproteobacteria'),
  LineagePair(rank='order', name='Alteromonadales'),
  LineagePair(rank='family', name='Shewanellaceae'),
  LineagePair(rank='genus', name='Shewanella'),
  LineagePair(rank='species', name='Shewanella baltica'),
  LineagePair(rank='strain', name='Shewanella baltica OS185')),
 (LineagePair(rank='superkingdom', name='Bacteria'),
  LineagePair(rank='phylum', name='Proteobacteria'),
  LineagePair(rank='class', name='Gammaproteobacteria'),
  LineagePair(rank='order', name='Alteromonadales'),
  LineagePair(rank='family', name='Shewanellaceae'),
  LineagePair(rank='genus', name='Shewanella'),
  LineagePair(rank='species', name='Shewanella baltica'),
  LineagePair(rank='strain', name='Shewanella baltica OS223'))}

sourmash also includes functions to summarize assignments by counting the number of times a particular lineage occurs; this is used by sourmash lca classify and sourmash lca summarize.

In [29]:
counter = sourmash.lca.count_lca_for_assignments(assignments)

# count_lca_for_assignments returns a collections.Counter object
for lineage, count in counter.most_common():
    print('{} hashes have LCA: {}'.format(count, sourmash.lca.display_lineage(lineage)))
311 hashes have LCA: Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185
183 hashes have LCA: Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica