#!/usr/bin/env python # coding: utf-8 # # The curious case of an optimization bug # TIP: if you're reading this on GitHub, I recommend starting a Binder session to have color output (and an interactive session!). Here is the link: [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/luizirber/nthash_bug/master?filepath=analysis.ipynb) # # Alternatively, use [nbviewer](https://nbviewer.jupyter.org/github/luizirber/nthash_bug/blob/master/analysis.ipynb), which preserves the formatting. # # UPDATE: I opened an [issue](https://github.com/bcgsc/ntHash/issues/7) and seems like the current `master` version is the culprit (more specifically [57af16a](https://github.com/bcgsc/ntHash/tree/57af16a972a553ecccea0cda25af85ac1f96a94b/)). Reverting back to the [`1.0.4` release](https://github.com/bcgsc/ntHash/releases/tag/1.0.4) fixes the problem. There is a new section at the end of this notebook with more details. # ## So, how did you get in this rabbit hole? # # Last week [Will Rowe](https://will-rowe.github.io/) posted the [preprint](https://doi.org/10.1101/408070) for [HULK](https://github.com/will-rowe/hulk), and in a comment in a related discussion [Heng Li](https://twitter.com/lh3lh3/status/1037487551504896001) suggested using `ntHash` to speed up hashing of k-mers, which makes more sense than using murmurhash3 when using sliding windows (since you can avoid recalculating the full hash all the time). He also linked to an answer on [the bioinformatics stack exchange](https://bioinformatics.stackexchange.com/questions/19/are-there-any-rolling-hash-functions-that-can-hash-a-dna-sequence-and-its-revers/293#293) with a very clear explanation by Karel Brinda on how it works. # # So clear was this explanation that I used it to implement a [Rust version](https://github.com/luizirber/nthash/blob/787faba1d9918791076d556dca0ea4bc38c85330/src/lib.rs). And since the default template for libraries (when running `cargo init --lib`) already includes a test example, I wrote a simple one that calculated the ntHash for a 5-mer and compared to the same value generated by the [original implementation (in C++)](https://github.com/bcgsc/ntHash). # # And... the test failed. # # I spent some time figuring out if I did something wrong, if there was something weird with my for loop, up to going to a whiteboard and checking indexing on the string by hand. It seemed to be right, so maybe there was something I was misinterpreting in the Rust syntax? # ## The implementations # # Well, back to the comfort zone: let's try in Python! # # Again, pretty straighforward to implement, but I had to make a `rol` function to rotate a 64-bit integer. Here is the implementation: # In[1]: get_ipython().system('cat nthash.py') # So I ran it, and... it matches my Rust result, but doesn't match the original implementation. Hmm. # # While re-reading the paper I looked into the [supplementary materials](https://academic.oup.com/bioinformatics/article/32/22/3492/2525588), and they have the `NT64` function defined (which is missing in the implementation on GitHub). So I used that to make the canonical version `NTC64`, which takes the minimum hash from both forward (`NT64F`) and reverse (`NT64R`) strands with the same parameters from the [current implementation](https://github.com/bcgsc/ntHash/blob/57af16a972a553ecccea0cda25af85ac1f96a94b/nthash.hpp) (so we can compare them easily). # # Here is the simple implementation, without worrying about optimizations: # In[2]: get_ipython().system('cat nthash_simple.hpp') # And... again it matches my Rust and Python code, but not the original implementation. To compare all of them I wrote a small programs that take a sequence from the first command line argument and generate the ntHash, printing also both forward and reverse strand values. Here is the C++ version, taking a preprocessor directive (set by passing `-DNTHASH_OPT` to the compiler) to compile with the original implementation or with the simple implementation based on the article: # In[3]: get_ipython().system('cat nt_main.cpp') # The Python version is at the end of the `nthash.py` file above. And finally the Rust version (using the `nthash` crate from my repo): # In[4]: get_ipython().system('cat src/main.rs') # ## Quickly run a test: Makefile # # I also put everything in one `Makefile`, so it's easy to compile and run a basic test: # In[5]: get_ipython().system('make -B') # In[6]: get_ipython().system('make test') # But for running more tests it is a bit annoying to go in the Makefile and change it, so we can benefit from the IPython `%%bash` magic: # In[7]: get_ipython().run_cell_magic('bash', '-l', './nt_opt AAAAA\n./nt_article AAAAA\npython nthash.py AAAAA\ncargo run -q AAAAA\n') # ## Comparing the implementations # # But we still have to set the sequence in 4 different places. Let's make a function to do that for us: # In[8]: import numpy as np import pandas as pd from IPython.display import display_html from functools import partial, reduce def color_same(unique, s): # 4 different colors here, because we have at most 4 # (HOPEFULLY) different values colors = dict(zip(unique, ('#ff7f0e', '#2ca02c', '#1f77b4', '#d62728'))) return ['color: {}'.format(colors[v]) for v in s] def run_for_seq(seq): ''' Returns a dataframe and a formatted styler (good for display) ''' opt = get_ipython().run_line_magic('sx', './nt_opt {seq}') simple = get_ipython().run_line_magic('sx', './nt_article {seq}') py = get_ipython().run_line_magic('sx', 'python nthash.py {seq}') rust = get_ipython().run_line_magic('sx', '. ~/.cargo/env && cargo run -q {seq}') def parse(result): out = {} for line in result: k, v = line.strip().split() out[k] = int(v, 16) return out df = pd.DataFrame.from_dict({ 'opt': parse(opt), 'simple': parse(simple), 'py': parse(py), 'rust': parse(rust) }).T unique_values = np.unique(df.values.flatten()) formatter = (df.style.format('0x{:0>16x}') .apply(partial(color_same, unique_values), axis=1)) return df, formatter # With out new function `run_for_seq` it is a bit easier to see patterns (using colors to highlight same values). For example, the output for our previous example (`AAAAA`) is now # In[9]: df, form = run_for_seq('AAAAA') form # ### 1-mers # # Let's start with the simple case: what is the hash for the 1-mers? # In[10]: for nt in "ACGT": df, form = run_for_seq(nt) print(f"Sequence: {nt}") display(form) # So far so good, all values in the columns are the same, # and `fhVal` is the canonical form for `A` and `C` and `rhVal` is the canonical form for `T` and `G` (since it ends up being `h(A)` and `h(C)`). # ### 2-mers: palindromes # # Let's start with 2-mers then! One case to analyze is for palindrome 2-mers like `GC`, where both forward and reverse strands are the same. We expect that all values will be the same, and indeed that's what we see for `GC`: # In[11]: df, form = run_for_seq('GC') form # But it's not what we see for the other cases (but at least `opt` is consistent and is giving the same value for `fhVal` and `rhVal`): # In[12]: for seq in ('AT', 'TA', 'GC', 'CG'): df, form = run_for_seq(seq) print(f"Sequence: {seq}") display(form) # ### 2-mers: same base # # When we check 2-mers with the same base, something interesting: `rhVal(CC)` matches `fhVal(GG)` for all implementations, but for all the other cases we see the same pattern (of `opt` disagreeing with the other implementations). # In[13]: for seq in ('AA', 'TT', 'CC', 'GG'): df, form = run_for_seq(seq) print(f"Sequence: {seq}") display(form) # It seems like there is something off with the optimizations implemented in `ntHash`, but I'm not sure how to help to track down more than providing some reduced test cases. # ## How to avoid similar problems? # ### Testing (unit and property-based) # # From reading the original code we see that there are no tests (the [`nttest.cpp`](https://github.com/bcgsc/ntHash/blob/57af16a972a553ecccea0cda25af85ac1f96a94b/nttest.cpp) code is for comparing `ntHash` with other hash functions). Unit testing would be a good start (because it helps to catch simple bugs), but because the codebase went through an optimization phase there is an option that really shines for this use case: property based testing. # # Unit tests are usually implementing by giving an input to a function and checking if the return value is correct. Property based testing extend this idea further by defining properties that the function must respect, and then generating random inputs to try to falsify the property. [This post](https://fsharpforfunandprofit.com/posts/property-based-testing/) has a great explanation of the process. # # I first heard about property based testing via [Hypothesis](https://hypothesis.works/), but sadly it is a Python library and I wasn't planning to write bindings for ntHash. There is an alternative for C++, [autocheck](https://github.com/thejohnfreeman/autocheck/), which is more similar to [QuickCheck](https://en.wikipedia.org/wiki/QuickCheck) and, for this purpose, is good enough to demonstrate the idea. # # For this specific use case, we can use an oracle to implement a property: the unoptimized implementation is our oracle, and the results for the optimized version must match the results for the unoptimized version. In `autocheck` this property can be written like this: # ```c++ # struct prop_nt_oracle { # bool operator () (const string& seq) { # uint64_t fhVal, rhVal, hValOpt, hValArticle; # # hValOpt = NTC64(seq.c_str(), seq.size(), fhVal, rhVal); # hValArticle = nthash::NTC64(seq.c_str(), seq.size(), fhVal, rhVal); # # return hValOpt == hValArticle; # } # }; # ``` # # The full code (using `catch` as a test runner) is in `test/test.cpp`: # In[14]: get_ipython().system('cat test/test.cpp') # I defined a new input generator, `seq_generator`, to create relevant genomic sequences for our test cases. It just takes a `size` and randomly generate a string of that size with `ACTGN` characters. # # To run this test case, there is a Makefile in the `test` directory that compiles everything: # In[15]: get_ipython().system(' cd test/ && make') get_ipython().system(' test/test') # And if the implementations don't match, it will keep finding inputs that falsify our property (in this case, any input where the optimized implementation doesn't match the simple one). # ## Update: ~~optimization~~ versioning problem # # I opened an [issue](https://github.com/bcgsc/ntHash/issues/7) and seems like the current `master` version is the culprit (more specifically [57af16a](https://github.com/bcgsc/ntHash/tree/57af16a972a553ecccea0cda25af85ac1f96a94b/)). Reverting back to the [`1.0.4` release](https://github.com/bcgsc/ntHash/releases/tag/1.0.4) fixes the problem. # # So, in the end it wasn't a case of an optimization bug, but the case of a versioning bug =] # # Re-running the analysis code with the `1.0.4` matches expectations: # In[16]: def run_for_seq(seq): ''' Returns a dataframe and a formatted styler (good for display) ''' opt = get_ipython().run_line_magic('sx', './nt_104 {seq}') simple = get_ipython().run_line_magic('sx', './nt_article {seq}') py = get_ipython().run_line_magic('sx', 'python nthash.py {seq}') rust = get_ipython().run_line_magic('sx', '. ~/.cargo/env && cargo run -q {seq}') def parse(result): out = {} for line in result: k, v = line.strip().split() out[k] = int(v, 16) return out df = pd.DataFrame.from_dict({ '1.0.4': parse(opt), 'simple': parse(simple), 'py': parse(py), 'rust': parse(rust) }).T unique_values = np.unique(df.values.flatten()) formatter = (df.style.format('0x{:0>16x}') .apply(partial(color_same, unique_values), axis=1)) return df, formatter # In[17]: df, form = run_for_seq('AAAAA') form # ### 1-mers # In[18]: for nt in "ACGT": df, form = run_for_seq(nt) print(f"Sequence: {nt}") display(form) # ### 2-mers: palindromes # In[19]: for seq in ('AT', 'TA', 'GC', 'CG'): df, form = run_for_seq(seq) print(f"Sequence: {seq}") display(form) # ### 2-mers: same base # In[20]: for seq in ('AA', 'TT', 'CC', 'GG'): df, form = run_for_seq(seq) print(f"Sequence: {seq}") display(form) # In[21]: get_ipython().system(' cd test && make test_104') get_ipython().system(' test/test_104')