Short study of the Lempel-Ziv complexity

In this short Jupyter notebook aims at defining and explaining the Lempel-Ziv complexity.

I will give examples, and benchmarks of different implementations.

  • Reference: Abraham Lempel and Jacob Ziv, « On the Complexity of Finite Sequences », IEEE Trans. on Information Theory, January 1976, p. 75–81, vol. 22, n°1.

Short definition

The Lempel-Ziv complexity is defined as the number of different substrings encountered as the stream is viewed from begining to the end.

As an example:

>>> s = '1001111011000010'
>>> lempel_ziv_complexity(s)  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
8

Marking in the different substrings, this sequence $s$ has complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(s) = 6$ because $s = 1001111011000010 = 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010$.

Other examples:

>>> lempel_ziv_complexity('1010101010101010')  # 1, 0, 10, 101, 01, 010, 1010
7
>>> lempel_ziv_complexity('1001111011000010000010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000
9
>>> lempel_ziv_complexity('100111101100001000001010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
10

Python implementation

In [112]:
def lempel_ziv_complexity(sequence):
    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""
    sub_strings = set()
    n = len(sequence)
    ind = 0
    inc = 1
    # this while loop runs at most n times
    while True:
        if ind + inc > len(sequence):
            break
        # this can take some time, takes O(inc)
        sub_str = sequence[ind : ind + inc]
        # and this also, takes a O(log |size set|) in worst case
        # max value for inc = n / size set at the end
        # so worst case is that the set contains sub strings of the same size
        # and the worst loop takes a O(n / |S| * log(|S|))
        # ==> so if n/|S| is constant, it gives O(n log(n)) at the end
        # but if n/|S| = O(n) then it gives O(n^2)
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

Tests (1/2)

In [4]:
s = '1001111011000010'
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
Out[4]:
8
In [5]:
%timeit lempel_ziv_complexity(s)
6.47 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]:
lempel_ziv_complexity('1010101010101010')  # 1, 0, 10, 101, 01, 010, 1010
Out[6]:
7
In [7]:
lempel_ziv_complexity('1001111011000010000010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000
Out[7]:
9
In [8]:
lempel_ziv_complexity('100111101100001000001010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
Out[8]:
10
In [9]:
%timeit lempel_ziv_complexity('100111101100001000001010')
8.82 µs ± 795 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]:
import random

def random_string(size, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    return "".join(random.choices(alphabet, k=size))

def random_binary_sequence(size):
    return random_string(size, alphabet="01")
In [13]:
random_string(100)
random_binary_sequence(100)
Out[13]:
'JYLRJBDBFGTBMLFKNYQMJXIZJKKEVONGVKUCHNSSJCYROTATJDACWKCLWDEULMZWSQHJFFCQGMRCINHRIOLMEWWEPTOUUECJWAAN'
Out[13]:
'1110000010101011101000110010001100000011101110011010100101100110100010110110111000111000101100001010'
In [43]:
for (r, name) in zip(
    [random_string, random_binary_sequence],
    ["random strings in A..Z", "random binary sequences"]
    ):
    print("\nFor {}...".format(name))
    for n in [10, 100, 1000, 10000, 100000]:
        print("  of sizes {}, Lempel-Ziv complexity runs in:".format(n))
        %timeit lempel_ziv_complexity(r(n))
For random strings in A..Z...
  of sizes 10, Lempel-Ziv complexity runs in:
7.64 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  of sizes 100, Lempel-Ziv complexity runs in:
49.6 µs ± 6.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  of sizes 1000, Lempel-Ziv complexity runs in:
591 µs ± 78.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  of sizes 10000, Lempel-Ziv complexity runs in:
5.2 ms ± 770 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  of sizes 100000, Lempel-Ziv complexity runs in:
52.2 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For random binary sequences...
  of sizes 10, Lempel-Ziv complexity runs in:
6.04 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  of sizes 100, Lempel-Ziv complexity runs in:
46.8 µs ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  of sizes 1000, Lempel-Ziv complexity runs in:
491 µs ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  of sizes 10000, Lempel-Ziv complexity runs in:
5.76 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  of sizes 100000, Lempel-Ziv complexity runs in:
65.3 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can start to see that the time complexity of this function seems to grow linearly as the size grows.


Cython implementation

As this blog post explains it, we can easily try to use Cython in a notebook cell.

See the Cython documentation for more information.

In [16]:
%load_ext cython
In [45]:
%%cython
import cython

ctypedef unsigned int DTYPE_t

@cython.boundscheck(False) # turn off bounds-checking for entire function, quicker but less safe
def lempel_ziv_complexity_cython(str sequence not None):
    """Lempel-Ziv complexity for a string, in simple Cython code (C extension)."""
    
    cdef set sub_strings = set()
    cdef str sub_str = ""
    cdef DTYPE_t n = len(sequence)
    cdef DTYPE_t ind = 0
    cdef DTYPE_t inc = 1
    while True:
        if ind + inc > len(sequence):
            break
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

Let try it!

In [37]:
s = '1001111011000010'
lempel_ziv_complexity_cython(s)  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
Out[37]:
8
In [29]:
%timeit lempel_ziv_complexity(s)
%timeit lempel_ziv_complexity_cython(s)
4.97 µs ± 590 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
843 ns ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [30]:
lempel_ziv_complexity_cython('1010101010101010')  # 1, 0, 10, 101, 01, 010, 1010
Out[30]:
7
In [31]:
lempel_ziv_complexity_cython('1001111011000010000010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000
Out[31]:
9
In [32]:
lempel_ziv_complexity_cython('100111101100001000001010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
Out[32]:
10

Now for a test of the speed?

In [46]:
for (r, name) in zip(
    [random_string, random_binary_sequence],
    ["random strings in A..Z", "random binary sequences"]
    ):
    print("\nFor {}...".format(name))
    for n in [10, 100, 1000, 10000, 100000]:
        print("  of sizes {}, Lempel-Ziv complexity in Cython runs in:".format(n))
        %timeit lempel_ziv_complexity_cython(r(n))
For random strings in A..Z...
  of sizes 10, Lempel-Ziv complexity in Cython runs in:
3.9 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  of sizes 100, Lempel-Ziv complexity in Cython runs in:
25.8 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  of sizes 1000, Lempel-Ziv complexity in Cython runs in:
276 µs ± 50.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  of sizes 10000, Lempel-Ziv complexity in Cython runs in:
2.43 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  of sizes 100000, Lempel-Ziv complexity in Cython runs in:
28 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For random binary sequences...
  of sizes 10, Lempel-Ziv complexity in Cython runs in:
4.06 µs ± 444 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  of sizes 100, Lempel-Ziv complexity in Cython runs in:
29 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  of sizes 1000, Lempel-Ziv complexity in Cython runs in:
270 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  of sizes 10000, Lempel-Ziv complexity in Cython runs in:
2.74 ms ± 405 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  of sizes 100000, Lempel-Ziv complexity in Cython runs in:
30.9 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

$\implies$ Yay! It seems faster indeed! but only x2 times faster...


Numba implementation

As this blog post explains it, we can also try to use Numba in a notebook cell.

In [48]:
from numba import jit
In [79]:
@jit
def lempel_ziv_complexity_numba(sequence : str) -> int:
    """Lempel-Ziv complexity for a sequence, in Python code using numba.jit() for automatic speedup (hopefully)."""

    sub_strings = set()
    n : int= len(sequence)

    ind : int = 0
    inc : int = 1
    while True:
        if ind + inc > len(sequence):
            break
        sub_str : str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

Let try it!

In [80]:
s = '1001111011000010'
lempel_ziv_complexity_numba(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010
<ipython-input-79-307ea407f1ba>:1: NumbaWarning: 
Compilation is falling back to object mode WITH looplifting enabled because Function "lempel_ziv_complexity_numba" failed type inference due to: Internal error at <numba.typeinfer.CallConstraint object at 0x7f381acc52d0>.

[1] During: resolving callee type: BoundFunction(set.add for set(undefined))
[2] During: typing of call at <ipython-input-79-307ea407f1ba> (17)

Enable logging at debug level for details.

File "<ipython-input-79-307ea407f1ba>", line 17:
def lempel_ziv_complexity_numba(sequence : str) -> int:
    <source elided>
        else:
            sub_strings.add(sub_str)
            ^

  @jit
<ipython-input-79-307ea407f1ba>:1: NumbaWarning: 
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "lempel_ziv_complexity_numba" failed type inference due to: cannot determine Numba type of <class 'numba.dispatcher.LiftedLoop'>

File "<ipython-input-79-307ea407f1ba>", line 9:
def lempel_ziv_complexity_numba(sequence : str) -> int:
    <source elided>
    ind : int = 0
    inc : int = 1
    ^

  @jit
/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:178: NumbaWarning: Function "lempel_ziv_complexity_numba" was compiled in object mode without forceobj=True, but has lifted loops.

File "<ipython-input-79-307ea407f1ba>", line 2:
@jit
def lempel_ziv_complexity_numba(sequence : str) -> int:
^

  state.func_ir.loc))
/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:187: NumbaDeprecationWarning: 
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "<ipython-input-79-307ea407f1ba>", line 2:
@jit
def lempel_ziv_complexity_numba(sequence : str) -> int:
^

  warnings.warn(errors.NumbaDeprecationWarning(msg, state.func_ir.loc))
<ipython-input-79-307ea407f1ba>:1: NumbaWarning: 
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "lempel_ziv_complexity_numba" failed type inference due to: non-precise type pyobject
[1] During: typing of argument at <ipython-input-79-307ea407f1ba> (9)

File "<ipython-input-79-307ea407f1ba>", line 9:
def lempel_ziv_complexity_numba(sequence : str) -> int:
    <source elided>
    ind : int = 0
    inc : int = 1
    ^

  @jit
/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:178: NumbaWarning: Function "lempel_ziv_complexity_numba" was compiled in object mode without forceobj=True.

File "<ipython-input-79-307ea407f1ba>", line 9:
def lempel_ziv_complexity_numba(sequence : str) -> int:
    <source elided>
    ind : int = 0
    inc : int = 1
    ^

  state.func_ir.loc))
/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:187: NumbaDeprecationWarning: 
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "<ipython-input-79-307ea407f1ba>", line 9:
def lempel_ziv_complexity_numba(sequence : str) -> int:
    <source elided>
    ind : int = 0
    inc : int = 1
    ^

  warnings.warn(errors.NumbaDeprecationWarning(msg, state.func_ir.loc))
Out[80]:
8
In [53]:
%timeit lempel_ziv_complexity_numba(s)
43.7 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [54]:
lempel_ziv_complexity_numba('1010101010101010')  # 1, 0, 10, 101, 01, 010, 1010
Out[54]:
7
In [55]:
lempel_ziv_complexity_numba('1001111011000010000010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000
    9
Out[55]:
9
In [57]:
lempel_ziv_complexity_numba('100111101100001000001010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
Out[57]:
10
In [58]:
%timeit lempel_ziv_complexity_numba('100111101100001000001010')
48.1 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

$\implies$ Well... It doesn't seem that much faster from the naive Python code. We specified the signature when calling @numba.jit, and used the more appropriate data structure (string is probably the smaller, numpy array are probably faster). But even these tricks didn't help that much.

I tested, and without specifying the signature, the fastest approach is using string, compared to using lists or numpy arrays. Note that the @jit-powered function is compiled at runtime when first being called, so the signature used for the first call is determining the signature used by the compile function


Tests (2/2)

To test more robustly, let us generate some (uniformly) random binary sequences.

In [59]:
from numpy.random import binomial

def bernoulli(p, size=1):
    """One or more samples from a Bernoulli of probability p."""
    return binomial(1, p, size)
In [60]:
bernoulli(0.5, 20)
Out[60]:
array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

That's probably not optimal, but we can generate a string with:

In [61]:
''.join(str(i) for i in bernoulli(0.5, 20))
Out[61]:
'11110010000011111101'
In [62]:
def random_binary_sequence(n, p=0.5):
    """Uniform random binary sequence of size n, with rate of 0/1 being p."""
    return ''.join(str(i) for i in bernoulli(p, n))
In [63]:
random_binary_sequence(50)
random_binary_sequence(50, p=0.1)
random_binary_sequence(50, p=0.25)
random_binary_sequence(50, p=0.5)
random_binary_sequence(50, p=0.75)
random_binary_sequence(50, p=0.9)
Out[63]:
'10111010011001100110111110001111100001100111111010'
Out[63]:
'00000001000000001000000000001000100100000000000000'
Out[63]:
'00011000000000101010110000010011000100001000100001'
Out[63]:
'00000111111111000101000100100001100000101100000110'
Out[63]:
'11111100111011110010110111111101110011111011111111'
Out[63]:
'11111011111110111111111111111111111111111111111110'

And so, this function can test to check that the three implementations (naive, Cython-powered, Numba-powered) always give the same result.

In [64]:
def tests_3_functions(n, p=0.5, debug=True):
    s = random_binary_sequence(n, p=p)
    c1 = lempel_ziv_complexity(s)
    if debug:
        print("Sequence s = {} ==> complexity C = {}".format(s, c1))
    c2 = lempel_ziv_complexity_cython(s)
    c3 = lempel_ziv_complexity_numba(s)
    assert c1 == c2 == c3, "Error: the sequence {} gave different values of the Lempel-Ziv complexity from 3 functions ({}, {}, {})...".format(s, c1, c2, c3)
    return c1
In [65]:
tests_3_functions(5)
Sequence s = 11111 ==> complexity C = 2
Out[65]:
2
In [66]:
tests_3_functions(20)
Sequence s = 11000110100011110101 ==> complexity C = 9
Out[66]:
9
In [67]:
tests_3_functions(50)
Sequence s = 01000001001110111101111110011100101001011001111001 ==> complexity C = 17
Out[67]:
17
In [68]:
tests_3_functions(500)
Sequence s = 10110110111101100010011010110001001001101100110000110110010100111001110100001110001111110111100011011011001010000001111011110000101101011110011010111000111111101110111110010100110011000111011101100101101101000000111001010100100010000010011011100111010101001100001110101100101000000111010110011001010000100001111110110010011111100100001000011010011001001010100001111011101101000001100001011101010101100101100011000101101010010101001010011011011010000010100101101000110100100000111001100100110011011101 ==> complexity C = 99
Out[68]:
99
In [69]:
tests_3_functions(5000)
Sequence s = 10100100011000100101101111111001010001101101010100011101100101101100101111100110010110001010010101010000011101110011101111110110010101001011010010101000000111001011001000000101001010111001110000100101001111100001100000010111001100000010100010011010110001111111001111001101001101111111001111111110010110101100110010111001110101101010111010011000010100000000000111010100110000001110010101001101000100100111101100110110111001101011110101110011000001001101100101010100101100101010000100111111110100011101111110001110110011011001110100101000101000110010101010010000100101111010001111000100001110001000101011110111010000000110110000111110101110110010000110110110010110101110100011100010011011100101010001011111101001100100000100001101011010000011010001101101001011001100010101101110011100011111111111001001111100110101011110011100000010010100001110010111011011111011000101111001011101000001100000000101000010010110101010000011100110100101100111110101011010101100111010000100011111000111110001011100010011111110000001010010010000001101101001111110010100000101101111110101110001110001110001000001001011010101000011101000000110001001110001110101110011100111000011001100111011011000111110001111011100001001110011101100001000010010100011001101001001110110101101010111001001001001011000001011001111010001000110111011101101001100100001100001100010001011011010001001000101010001011101011111001010011000010011000110111111100110010101100111010010111101110110111011101110001101001000010010001001111001111111101000101111010101110100011110100001101100000100100010101011001101001111110001111010111101001100101000100100011000100011100101001110110010100011100001011010110110111000100011000101101000010100100101100100001000110010000110100100011010011000100001101011011010111010010110100110011111101011010101011101111111101011111001011010111010100110111001101001101011011001100001110000001011110001010100110101000010100101111000110111001101110000000011010101100010111100111011111100111110001110101010001000101111111110011111100010110010111011110111110111011110001110001101101101111110111100110101000001101111110111100010001110111110001001100011100000010111100100101011001101101011010110110011010110100111001011011011110000000010001110000011111111100011011001001001100100001110011000111110010110100100001011100001001000101000101001001101001000000110001100100001101111101001100100101000011011011111101000110101101101010000110001111011110011111001110111001100111010101010011110111001110111000000110101001010010111101000100100100101001001000111010100010111101011110100100111100110101010011101110011101110110101101111000111001111001010100001111110000100000001011000110001110101000001001001010001001000101110011010100100000110010000011100111001011111001001110010001011100101011101101011101111001000101101101011011110100001101100001011001101011000101000000001000110011011100010011010000111010011100010100010010010101000111101110110101011110100000011011010100011010100101011001100100000000110101011000100111001110001101001101001001110111011100111001001001000101011011111110100100000010111011100000101001001101110100010111100000000010101010000010100110011111001000010000111101111100100001000110000111110011001011011111011101111111011010001101111100010010010010010101000001000011110000000000101110111000001110000000111111011111001000110000000001010000010010110111000100101100111111010011101101111101111111011101101011100000011101111111111001111001100101001011000111010000110001001010111010100011111100011001101011110000100000010001100001000011111111111010111100111001010110000001011111110101000000100101010001001001111011110010011110000001100100110011110110110001110100100001000100001010110011110000100001110111011101101110100001010110111001101101010000011001101001100001111011000100110000101110100011011001000001100100101001110001000010111101110000111100101100011011011100110100001110000010111111001100011010001101110010000011000001111011010010100110101010011011101000111000100011010110101111011011001111100010001010110001111001000001101111010010001001010110111010000011110110101000010011001100100100000111011110001011110111000101001011010100000111111011100100101001001001100000010110110001100101011110001011000000010110111010111000000000110110100010111000111011101001110011101001100011000000101001111000101011101000010111111010100001000100000000110101100010110000111000110001101001010010100111010101000111101000110010100110010001001110110101000010010101001011001100001000100000101110100011000010000100010101100111001010100010111111010110000001000011100110000110110110010111111011010110001011000011011111000001001100100110110001100100110111110000101011000100001000110110100000001101110001100010101001011111111110100010001001001101101101101011000110000011000100010011111000110011011011100000101001001110011010100001011001011110100110001011011010001100111001011001101111110101101011101000111010111000000001010001001011000001001000101001010011010001110000001000010001110011001001110110100011101100110101110001100101010000010001000101101 ==> complexity C = 654
Out[69]:
654

Benchmarks

On two example of strings (binary sequences), we can compare our three implementation.

In [70]:
%timeit lempel_ziv_complexity('100111101100001000001010')
%timeit lempel_ziv_complexity_cython('100111101100001000001010')
%timeit lempel_ziv_complexity_numba('100111101100001000001010')
6.28 µs ± 295 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.22 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
50.1 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [71]:
%timeit lempel_ziv_complexity('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_cython('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_numba('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
25.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
5.43 µs ± 442 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
84.4 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Let check the time used by all the three functions, for longer and longer sequences:

In [72]:
%timeit tests_3_functions(10, debug=False)
%timeit tests_3_functions(20, debug=False)
%timeit tests_3_functions(40, debug=False)
%timeit tests_3_functions(80, debug=False)
%timeit tests_3_functions(160, debug=False)
%timeit tests_3_functions(320, debug=False)
229 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
148 µs ± 52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
123 µs ± 6.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
205 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
288 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
819 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [73]:
def test_cython(n):
    s = random_binary_sequence(n)
    c = lempel_ziv_complexity_cython(s)
    return c
In [74]:
%timeit test_cython(10)
%timeit test_cython(20)
%timeit test_cython(40)
%timeit test_cython(80)
%timeit test_cython(160)
%timeit test_cython(320)
23.9 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
53.4 µs ± 5.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
207 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
369 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [75]:
%timeit test_cython(640)
%timeit test_cython(1280)
%timeit test_cython(2560)
%timeit test_cython(5120)
679 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.85 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.97 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.47 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [76]:
%timeit test_cython(10240)
%timeit test_cython(20480)
10.7 ms ± 891 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
20.9 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Complexity ?

$\implies$ The function lempel_ziv_complexity_cython seems to be indeed (almost) linear in $n$, the length of the binary sequence $S$.

But let check more precisely, as it could also have a complexity of $\mathcal{O}(n \log n)$.

In [95]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(context="notebook", style="darkgrid", palette="hls", font="sans-serif", font_scale=1.4)
In [96]:
import numpy as np
import timeit
In [109]:
sizes = np.array(np.trunc(np.logspace(1, 6, 30)), dtype=int)

times = np.array([
    timeit.timeit(
        stmt="lempel_ziv_complexity_cython(random_string({}))".format(n),
        globals=globals(),
        number=10,
    )
    for n in sizes
])
In [110]:
plt.figure(figsize=(15, 10))
plt.plot(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity")
plt.show()
Out[110]:
<Figure size 1080x720 with 0 Axes>
Out[110]:
[<matplotlib.lines.Line2D at 0x7f3808d3e4d0>]
Out[110]:
Text(0.5, 0, 'Length $n$ of the binary sequence $S$')
Out[110]:
Text(0, 0.5, 'Time in $\\mu\\;\\mathrm{s}$')
Out[110]:
Text(0.5, 1.0, 'Time complexity of Lempel-Ziv complexity')
In [111]:
plt.figure(figsize=(15, 10))
plt.loglog(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, loglog scale")
plt.show()
Out[111]:
<Figure size 1080x720 with 0 Axes>
Out[111]:
[<matplotlib.lines.Line2D at 0x7f3808ccbf50>]
Out[111]:
Text(0.5, 0, 'Length $n$ of the binary sequence $S$')
Out[111]:
Text(0, 0.5, 'Time in $\\mu\\;\\mathrm{s}$')
Out[111]:
Text(0.5, 1.0, 'Time complexity of Lempel-Ziv complexity, loglog scale')

It is linear in $\log\log$ scale, so indeed the algorithm seems to have a linear complexity.

To sum-up, for a sequence $S$ of length $n$, it takes $\mathcal{O}(n)$ basic operations to compute its Lempel-Ziv complexity $\mathrm{Lempel}-\mathrm{Ziv}(S)$.


Conclusion

  • The Lempel-Ziv complexity is not too hard to implement, and it indeed represents a certain complexity of a binary sequence, capturing the regularity and reproducibility of the sequence.

  • Using the Cython was quite useful to have a $\simeq \times 100$ speed up on our manual naive implementation !

  • The algorithm is not easy to analyze, we have a trivial $\mathcal{O}(n^2)$ bound but experiments showed it is more likely to be $\mathcal{O}(n \log n)$ in the worst case, and $\mathcal{O}(n)$ in practice for "not too complicated sequences" (or in average, for random sequences).


(Experimental) Julia implementation

I want to (quickly) try to see if I can use Julia to write a faster version of this function. See issue #1.

In [118]:
%%time
%%script julia

"""Lempel-Ziv complexity for a sequence, in simple Julia code."""
function lempel_ziv_complexity(sequence)
    sub_strings = Set()
    n = length(sequence)

    ind = 1
    inc = 1
    while true
        if ind + inc > n
            break
        end
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings
            inc += 1
        else
            push!(sub_strings, sub_str)
            ind += inc
            inc = 1
        end
    end
    return length(sub_strings)
end

s = "1001111011000010"
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010

M = 1000;
N = 10000;
for _ in 1:M
    s = join(rand(0:1, N));
    lempel_ziv_complexity(s);
end
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010
"Lempel-Ziv complexity for a sequence, in simple Julia code."
lempel_ziv_complexity (generic function with 1 method)
"1001111011000010"
9
1000
10000
9
CPU times: user 6.89 ms, sys: 8 ms, total: 14.9 ms
Wall time: 3.85 s

And to compare it fairly, let us use Pypy for comparison.

In [122]:
%%time
%%pypy

def lempel_ziv_complexity(sequence):
    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""
    sub_strings = set()
    n = len(sequence)

    ind = 0
    inc = 1
    while True:
        if ind + inc > len(sequence):
            break
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

s = "1001111011000010"
lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010

from random import random

M = 1000
N = 10000
for _ in range(M):
    s = ''.join(str(int(random() < 0.5)) for _ in range(N))
    lempel_ziv_complexity(s)
CPU times: user 4.24 ms, sys: 8 ms, total: 12.2 ms
Wall time: 1.39 s

So we can check that on these 1000 random trials on strings of size 10000, the naive Julia version is slower than the naive Python version (executed by Pypy for speedup).


Ending notes

Thanks for reading! My implementation is now open-source and available on GitHub, on https://github.com/Naereen/Lempel-Ziv_Complexity.

It will be available from PyPi very soon, see https://pypi.python.org/pypi/lempel_ziv_complexity.

See this repo on GitHub for more notebooks, or on nbviewer.jupyter.org.

That's it for this demo! See you, folks!