# Short study of the Lempel-Ziv complexity¶

In this short Jupyter notebook aims at defining and explaining the Lempel-Ziv complexity.

I will give examples, and benchmarks of different implementations.

• Reference: Abraham Lempel and Jacob Ziv, « On the Complexity of Finite Sequences », IEEE Trans. on Information Theory, January 1976, p. 75–81, vol. 22, n°1.

## Short definition¶

The Lempel-Ziv complexity is defined as the number of different substrings encountered as the stream is viewed from begining to the end.

As an example:

>>> s = '1001111011000010'
>>> lempel_ziv_complexity(s)  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
8


Marking in the different substrings, this sequence $s$ has complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(s) = 6$ because $s = 1001111011000010 = 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010$.

Other examples:

>>> lempel_ziv_complexity('1010101010101010')  # 1, 0, 10, 101, 01, 010, 1010
7
>>> lempel_ziv_complexity('1001111011000010000010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000
9
>>> lempel_ziv_complexity('100111101100001000001010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
10


## Python implementation¶

In [112]:
def lempel_ziv_complexity(sequence):
"""Lempel-Ziv complexity for a binary sequence, in simple Python code."""
sub_strings = set()
n = len(sequence)
ind = 0
inc = 1
# this while loop runs at most n times
while True:
if ind + inc > len(sequence):
break
# this can take some time, takes O(inc)
sub_str = sequence[ind : ind + inc]
# and this also, takes a O(log |size set|) in worst case
# max value for inc = n / size set at the end
# so worst case is that the set contains sub strings of the same size
# and the worst loop takes a O(n / |S| * log(|S|))
# ==> so if n/|S| is constant, it gives O(n log(n)) at the end
# but if n/|S| = O(n) then it gives O(n^2)
if sub_str in sub_strings:
inc += 1
else:
ind += inc
inc = 1
return len(sub_strings)


## Tests (1/2)¶

In [4]:
s = '1001111011000010'
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010

Out[4]:
8
In [5]:
%timeit lempel_ziv_complexity(s)

6.47 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]:
lempel_ziv_complexity('1010101010101010')  # 1, 0, 10, 101, 01, 010, 1010

Out[6]:
7
In [7]:
lempel_ziv_complexity('1001111011000010000010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000

Out[7]:
9
In [8]:
lempel_ziv_complexity('100111101100001000001010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101

Out[8]:
10
In [9]:
%timeit lempel_ziv_complexity('100111101100001000001010')

8.82 µs ± 795 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]:
import random

def random_string(size, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
return "".join(random.choices(alphabet, k=size))

def random_binary_sequence(size):
return random_string(size, alphabet="01")

In [13]:
random_string(100)
random_binary_sequence(100)

Out[13]:
'JYLRJBDBFGTBMLFKNYQMJXIZJKKEVONGVKUCHNSSJCYROTATJDACWKCLWDEULMZWSQHJFFCQGMRCINHRIOLMEWWEPTOUUECJWAAN'
Out[13]:
'1110000010101011101000110010001100000011101110011010100101100110100010110110111000111000101100001010'
In [43]:
for (r, name) in zip(
[random_string, random_binary_sequence],
["random strings in A..Z", "random binary sequences"]
):
print("\nFor {}...".format(name))
for n in [10, 100, 1000, 10000, 100000]:
print("  of sizes {}, Lempel-Ziv complexity runs in:".format(n))
%timeit lempel_ziv_complexity(r(n))

For random strings in A..Z...
of sizes 10, Lempel-Ziv complexity runs in:
7.64 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
of sizes 100, Lempel-Ziv complexity runs in:
49.6 µs ± 6.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
of sizes 1000, Lempel-Ziv complexity runs in:
591 µs ± 78.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
of sizes 10000, Lempel-Ziv complexity runs in:
5.2 ms ± 770 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
of sizes 100000, Lempel-Ziv complexity runs in:
52.2 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For random binary sequences...
of sizes 10, Lempel-Ziv complexity runs in:
6.04 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
of sizes 100, Lempel-Ziv complexity runs in:
46.8 µs ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
of sizes 1000, Lempel-Ziv complexity runs in:
491 µs ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
of sizes 10000, Lempel-Ziv complexity runs in:
5.76 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
of sizes 100000, Lempel-Ziv complexity runs in:
65.3 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can start to see that the time complexity of this function seems to grow linearly as the size grows.

## Cython implementation¶

As this blog post explains it, we can easily try to use Cython in a notebook cell.

In [16]:
%load_ext cython

In [45]:
%%cython
import cython

ctypedef unsigned int DTYPE_t

@cython.boundscheck(False) # turn off bounds-checking for entire function, quicker but less safe
def lempel_ziv_complexity_cython(str sequence not None):
"""Lempel-Ziv complexity for a string, in simple Cython code (C extension)."""

cdef set sub_strings = set()
cdef str sub_str = ""
cdef DTYPE_t n = len(sequence)
cdef DTYPE_t ind = 0
cdef DTYPE_t inc = 1
while True:
if ind + inc > len(sequence):
break
sub_str = sequence[ind : ind + inc]
if sub_str in sub_strings:
inc += 1
else:
ind += inc
inc = 1
return len(sub_strings)


Let try it!

In [37]:
s = '1001111011000010'
lempel_ziv_complexity_cython(s)  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010

Out[37]:
8
In [29]:
%timeit lempel_ziv_complexity(s)
%timeit lempel_ziv_complexity_cython(s)

4.97 µs ± 590 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
843 ns ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [30]:
lempel_ziv_complexity_cython('1010101010101010')  # 1, 0, 10, 101, 01, 010, 1010

Out[30]:
7
In [31]:
lempel_ziv_complexity_cython('1001111011000010000010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000

Out[31]:
9
In [32]:
lempel_ziv_complexity_cython('100111101100001000001010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101

Out[32]:
10

Now for a test of the speed?

In [46]:
for (r, name) in zip(
[random_string, random_binary_sequence],
["random strings in A..Z", "random binary sequences"]
):
print("\nFor {}...".format(name))
for n in [10, 100, 1000, 10000, 100000]:
print("  of sizes {}, Lempel-Ziv complexity in Cython runs in:".format(n))
%timeit lempel_ziv_complexity_cython(r(n))

For random strings in A..Z...
of sizes 10, Lempel-Ziv complexity in Cython runs in:
3.9 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
of sizes 100, Lempel-Ziv complexity in Cython runs in:
25.8 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
of sizes 1000, Lempel-Ziv complexity in Cython runs in:
276 µs ± 50.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
of sizes 10000, Lempel-Ziv complexity in Cython runs in:
2.43 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
of sizes 100000, Lempel-Ziv complexity in Cython runs in:
28 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For random binary sequences...
of sizes 10, Lempel-Ziv complexity in Cython runs in:
4.06 µs ± 444 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
of sizes 100, Lempel-Ziv complexity in Cython runs in:
29 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
of sizes 1000, Lempel-Ziv complexity in Cython runs in:
270 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
of sizes 10000, Lempel-Ziv complexity in Cython runs in:
2.74 ms ± 405 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
of sizes 100000, Lempel-Ziv complexity in Cython runs in:
30.9 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


$\implies$ Yay! It seems faster indeed! but only x2 times faster...

## Numba implementation¶

As this blog post explains it, we can also try to use Numba in a notebook cell.

In [48]:
from numba import jit

In [79]:
@jit
def lempel_ziv_complexity_numba(sequence : str) -> int:
"""Lempel-Ziv complexity for a sequence, in Python code using numba.jit() for automatic speedup (hopefully)."""

sub_strings = set()
n : int= len(sequence)

ind : int = 0
inc : int = 1
while True:
if ind + inc > len(sequence):
break
sub_str : str = sequence[ind : ind + inc]
if sub_str in sub_strings:
inc += 1
else:
ind += inc
inc = 1
return len(sub_strings)


Let try it!

In [80]:
s = '1001111011000010'
lempel_ziv_complexity_numba(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010

<ipython-input-79-307ea407f1ba>:1: NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "lempel_ziv_complexity_numba" failed type inference due to: Internal error at <numba.typeinfer.CallConstraint object at 0x7f381acc52d0>.

[1] During: resolving callee type: BoundFunction(set.add for set(undefined))
[2] During: typing of call at <ipython-input-79-307ea407f1ba> (17)

Enable logging at debug level for details.

File "<ipython-input-79-307ea407f1ba>", line 17:
def lempel_ziv_complexity_numba(sequence : str) -> int:
<source elided>
else:
^

@jit
<ipython-input-79-307ea407f1ba>:1: NumbaWarning:
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "lempel_ziv_complexity_numba" failed type inference due to: cannot determine Numba type of <class 'numba.dispatcher.LiftedLoop'>

File "<ipython-input-79-307ea407f1ba>", line 9:
def lempel_ziv_complexity_numba(sequence : str) -> int:
<source elided>
ind : int = 0
inc : int = 1
^

@jit
/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:178: NumbaWarning: Function "lempel_ziv_complexity_numba" was compiled in object mode without forceobj=True, but has lifted loops.

File "<ipython-input-79-307ea407f1ba>", line 2:
@jit
def lempel_ziv_complexity_numba(sequence : str) -> int:
^

state.func_ir.loc))
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

File "<ipython-input-79-307ea407f1ba>", line 2:
@jit
def lempel_ziv_complexity_numba(sequence : str) -> int:
^

<ipython-input-79-307ea407f1ba>:1: NumbaWarning:
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "lempel_ziv_complexity_numba" failed type inference due to: non-precise type pyobject
[1] During: typing of argument at <ipython-input-79-307ea407f1ba> (9)

File "<ipython-input-79-307ea407f1ba>", line 9:
def lempel_ziv_complexity_numba(sequence : str) -> int:
<source elided>
ind : int = 0
inc : int = 1
^

@jit
/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:178: NumbaWarning: Function "lempel_ziv_complexity_numba" was compiled in object mode without forceobj=True.

File "<ipython-input-79-307ea407f1ba>", line 9:
def lempel_ziv_complexity_numba(sequence : str) -> int:
<source elided>
ind : int = 0
inc : int = 1
^

state.func_ir.loc))
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

File "<ipython-input-79-307ea407f1ba>", line 9:
def lempel_ziv_complexity_numba(sequence : str) -> int:
<source elided>
ind : int = 0
inc : int = 1
^


Out[80]:
8
In [53]:
%timeit lempel_ziv_complexity_numba(s)

43.7 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [54]:
lempel_ziv_complexity_numba('1010101010101010')  # 1, 0, 10, 101, 01, 010, 1010

Out[54]:
7
In [55]:
lempel_ziv_complexity_numba('1001111011000010000010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000
9

Out[55]:
9
In [57]:
lempel_ziv_complexity_numba('100111101100001000001010')  # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101

Out[57]:
10
In [58]:
%timeit lempel_ziv_complexity_numba('100111101100001000001010')

48.1 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


$\implies$ Well... It doesn't seem that much faster from the naive Python code. We specified the signature when calling @numba.jit, and used the more appropriate data structure (string is probably the smaller, numpy array are probably faster). But even these tricks didn't help that much.

I tested, and without specifying the signature, the fastest approach is using string, compared to using lists or numpy arrays. Note that the @jit-powered function is compiled at runtime when first being called, so the signature used for the first call is determining the signature used by the compile function

## Tests (2/2)¶

To test more robustly, let us generate some (uniformly) random binary sequences.

In [59]:
from numpy.random import binomial

def bernoulli(p, size=1):
"""One or more samples from a Bernoulli of probability p."""
return binomial(1, p, size)

In [60]:
bernoulli(0.5, 20)

Out[60]:
array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

That's probably not optimal, but we can generate a string with:

In [61]:
''.join(str(i) for i in bernoulli(0.5, 20))

Out[61]:
'11110010000011111101'
In [62]:
def random_binary_sequence(n, p=0.5):
"""Uniform random binary sequence of size n, with rate of 0/1 being p."""
return ''.join(str(i) for i in bernoulli(p, n))

In [63]:
random_binary_sequence(50)
random_binary_sequence(50, p=0.1)
random_binary_sequence(50, p=0.25)
random_binary_sequence(50, p=0.5)
random_binary_sequence(50, p=0.75)
random_binary_sequence(50, p=0.9)

Out[63]:
'10111010011001100110111110001111100001100111111010'
Out[63]:
'00000001000000001000000000001000100100000000000000'
Out[63]:
'00011000000000101010110000010011000100001000100001'
Out[63]:
'00000111111111000101000100100001100000101100000110'
Out[63]:
'11111100111011110010110111111101110011111011111111'
Out[63]:
'11111011111110111111111111111111111111111111111110'

And so, this function can test to check that the three implementations (naive, Cython-powered, Numba-powered) always give the same result.

In [64]:
def tests_3_functions(n, p=0.5, debug=True):
s = random_binary_sequence(n, p=p)
c1 = lempel_ziv_complexity(s)
if debug:
print("Sequence s = {} ==> complexity C = {}".format(s, c1))
c2 = lempel_ziv_complexity_cython(s)
c3 = lempel_ziv_complexity_numba(s)
assert c1 == c2 == c3, "Error: the sequence {} gave different values of the Lempel-Ziv complexity from 3 functions ({}, {}, {})...".format(s, c1, c2, c3)
return c1

In [65]:
tests_3_functions(5)

Sequence s = 11111 ==> complexity C = 2

Out[65]:
2
In [66]:
tests_3_functions(20)

Sequence s = 11000110100011110101 ==> complexity C = 9

Out[66]:
9
In [67]:
tests_3_functions(50)

Sequence s = 01000001001110111101111110011100101001011001111001 ==> complexity C = 17

Out[67]:
17
In [68]:
tests_3_functions(500)

Sequence s = 10110110111101100010011010110001001001101100110000110110010100111001110100001110001111110111100011011011001010000001111011110000101101011110011010111000111111101110111110010100110011000111011101100101101101000000111001010100100010000010011011100111010101001100001110101100101000000111010110011001010000100001111110110010011111100100001000011010011001001010100001111011101101000001100001011101010101100101100011000101101010010101001010011011011010000010100101101000110100100000111001100100110011011101 ==> complexity C = 99

Out[68]:
99
In [69]:
tests_3_functions(5000)

Sequence s = 10100100011000100101101111111001010001101101010100011101100101101100101111100110010110001010010101010000011101110011101111110110010101001011010010101000000111001011001000000101001010111001110000100101001111100001100000010111001100000010100010011010110001111111001111001101001101111111001111111110010110101100110010111001110101101010111010011000010100000000000111010100110000001110010101001101000100100111101100110110111001101011110101110011000001001101100101010100101100101010000100111111110100011101111110001110110011011001110100101000101000110010101010010000100101111010001111000100001110001000101011110111010000000110110000111110101110110010000110110110010110101110100011100010011011100101010001011111101001100100000100001101011010000011010001101101001011001100010101101110011100011111111111001001111100110101011110011100000010010100001110010111011011111011000101111001011101000001100000000101000010010110101010000011100110100101100111110101011010101100111010000100011111000111110001011100010011111110000001010010010000001101101001111110010100000101101111110101110001110001110001000001001011010101000011101000000110001001110001110101110011100111000011001100111011011000111110001111011100001001110011101100001000010010100011001101001001110110101101010111001001001001011000001011001111010001000110111011101101001100100001100001100010001011011010001001000101010001011101011111001010011000010011000110111111100110010101100111010010111101110110111011101110001101001000010010001001111001111111101000101111010101110100011110100001101100000100100010101011001101001111110001111010111101001100101000100100011000100011100101001110110010100011100001011010110110111000100011000101101000010100100101100100001000110010000110100100011010011000100001101011011010111010010110100110011111101011010101011101111111101011111001011010111010100110111001101001101011011001100001110000001011110001010100110101000010100101111000110111001101110000000011010101100010111100111011111100111110001110101010001000101111111110011111100010110010111011110111110111011110001110001101101101111110111100110101000001101111110111100010001110111110001001100011100000010111100100101011001101101011010110110011010110100111001011011011110000000010001110000011111111100011011001001001100100001110011000111110010110100100001011100001001000101000101001001101001000000110001100100001101111101001100100101000011011011111101000110101101101010000110001111011110011111001110111001100111010101010011110111001110111000000110101001010010111101000100100100101001001000111010100010111101011110100100111100110101010011101110011101110110101101111000111001111001010100001111110000100000001011000110001110101000001001001010001001000101110011010100100000110010000011100111001011111001001110010001011100101011101101011101111001000101101101011011110100001101100001011001101011000101000000001000110011011100010011010000111010011100010100010010010101000111101110110101011110100000011011010100011010100101011001100100000000110101011000100111001110001101001101001001110111011100111001001001000101011011111110100100000010111011100000101001001101110100010111100000000010101010000010100110011111001000010000111101111100100001000110000111110011001011011111011101111111011010001101111100010010010010010101000001000011110000000000101110111000001110000000111111011111001000110000000001010000010010110111000100101100111111010011101101111101111111011101101011100000011101111111111001111001100101001011000111010000110001001010111010100011111100011001101011110000100000010001100001000011111111111010111100111001010110000001011111110101000000100101010001001001111011110010011110000001100100110011110110110001110100100001000100001010110011110000100001110111011101101110100001010110111001101101010000011001101001100001111011000100110000101110100011011001000001100100101001110001000010111101110000111100101100011011011100110100001110000010111111001100011010001101110010000011000001111011010010100110101010011011101000111000100011010110101111011011001111100010001010110001111001000001101111010010001001010110111010000011110110101000010011001100100100000111011110001011110111000101001011010100000111111011100100101001001001100000010110110001100101011110001011000000010110111010111000000000110110100010111000111011101001110011101001100011000000101001111000101011101000010111111010100001000100000000110101100010110000111000110001101001010010100111010101000111101000110010100110010001001110110101000010010101001011001100001000100000101110100011000010000100010101100111001010100010111111010110000001000011100110000110110110010111111011010110001011000011011111000001001100100110110001100100110111110000101011000100001000110110100000001101110001100010101001011111111110100010001001001101101101101011000110000011000100010011111000110011011011100000101001001110011010100001011001011110100110001011011010001100111001011001101111110101101011101000111010111000000001010001001011000001001000101001010011010001110000001000010001110011001001110110100011101100110101110001100101010000010001000101101 ==> complexity C = 654

Out[69]:
654

## Benchmarks¶

On two example of strings (binary sequences), we can compare our three implementation.

In [70]:
%timeit lempel_ziv_complexity('100111101100001000001010')
%timeit lempel_ziv_complexity_cython('100111101100001000001010')
%timeit lempel_ziv_complexity_numba('100111101100001000001010')

6.28 µs ± 295 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.22 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
50.1 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [71]:
%timeit lempel_ziv_complexity('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_cython('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_numba('10011110110000100000101000100100101010010111111011001111111110101001010110101010')

25.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
5.43 µs ± 442 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
84.4 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Let check the time used by all the three functions, for longer and longer sequences:

In [72]:
%timeit tests_3_functions(10, debug=False)
%timeit tests_3_functions(20, debug=False)
%timeit tests_3_functions(40, debug=False)
%timeit tests_3_functions(80, debug=False)
%timeit tests_3_functions(160, debug=False)
%timeit tests_3_functions(320, debug=False)

229 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
148 µs ± 52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
123 µs ± 6.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
205 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
288 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
819 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [73]:
def test_cython(n):
s = random_binary_sequence(n)
c = lempel_ziv_complexity_cython(s)
return c

In [74]:
%timeit test_cython(10)
%timeit test_cython(20)
%timeit test_cython(40)
%timeit test_cython(80)
%timeit test_cython(160)
%timeit test_cython(320)

23.9 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
53.4 µs ± 5.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
207 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
369 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [75]:
%timeit test_cython(640)
%timeit test_cython(1280)
%timeit test_cython(2560)
%timeit test_cython(5120)

679 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.85 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.97 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.47 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [76]:
%timeit test_cython(10240)
%timeit test_cython(20480)

10.7 ms ± 891 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
20.9 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Complexity ?¶

$\implies$ The function lempel_ziv_complexity_cython seems to be indeed (almost) linear in $n$, the length of the binary sequence $S$.

But let check more precisely, as it could also have a complexity of $\mathcal{O}(n \log n)$.

In [95]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(context="notebook", style="darkgrid", palette="hls", font="sans-serif", font_scale=1.4)

In [96]:
import numpy as np
import timeit

In [109]:
sizes = np.array(np.trunc(np.logspace(1, 6, 30)), dtype=int)

times = np.array([
timeit.timeit(
stmt="lempel_ziv_complexity_cython(random_string({}))".format(n),
globals=globals(),
number=10,
)
for n in sizes
])

In [110]:
plt.figure(figsize=(15, 10))
plt.plot(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity")
plt.show()

Out[110]:
<Figure size 1080x720 with 0 Axes>
Out[110]:
[<matplotlib.lines.Line2D at 0x7f3808d3e4d0>]
Out[110]:
Text(0.5, 0, 'Length $n$ of the binary sequence $S$')
Out[110]:
Text(0, 0.5, 'Time in $\\mu\\;\\mathrm{s}$')
Out[110]:
Text(0.5, 1.0, 'Time complexity of Lempel-Ziv complexity')
In [111]:
plt.figure(figsize=(15, 10))
plt.loglog(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, loglog scale")
plt.show()

Out[111]:
<Figure size 1080x720 with 0 Axes>
Out[111]:
[<matplotlib.lines.Line2D at 0x7f3808ccbf50>]
Out[111]:
Text(0.5, 0, 'Length $n$ of the binary sequence $S$')
Out[111]:
Text(0, 0.5, 'Time in $\\mu\\;\\mathrm{s}$')
Out[111]:
Text(0.5, 1.0, 'Time complexity of Lempel-Ziv complexity, loglog scale')

It is linear in $\log\log$ scale, so indeed the algorithm seems to have a linear complexity.

To sum-up, for a sequence $S$ of length $n$, it takes $\mathcal{O}(n)$ basic operations to compute its Lempel-Ziv complexity $\mathrm{Lempel}-\mathrm{Ziv}(S)$.

## Conclusion¶

• The Lempel-Ziv complexity is not too hard to implement, and it indeed represents a certain complexity of a binary sequence, capturing the regularity and reproducibility of the sequence.

• Using the Cython was quite useful to have a $\simeq \times 100$ speed up on our manual naive implementation !

• The algorithm is not easy to analyze, we have a trivial $\mathcal{O}(n^2)$ bound but experiments showed it is more likely to be $\mathcal{O}(n \log n)$ in the worst case, and $\mathcal{O}(n)$ in practice for "not too complicated sequences" (or in average, for random sequences).

## (Experimental) Julia implementation¶

I want to (quickly) try to see if I can use Julia to write a faster version of this function. See issue #1.

In [118]:
%%time
%%script julia

"""Lempel-Ziv complexity for a sequence, in simple Julia code."""
function lempel_ziv_complexity(sequence)
sub_strings = Set()
n = length(sequence)

ind = 1
inc = 1
while true
if ind + inc > n
break
end
sub_str = sequence[ind : ind + inc]
if sub_str in sub_strings
inc += 1
else
push!(sub_strings, sub_str)
ind += inc
inc = 1
end
end
return length(sub_strings)
end

s = "1001111011000010"
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010

M = 1000;
N = 10000;
for _ in 1:M
s = join(rand(0:1, N));
lempel_ziv_complexity(s);
end
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010

"Lempel-Ziv complexity for a sequence, in simple Julia code."
lempel_ziv_complexity (generic function with 1 method)
"1001111011000010"
9
1000
10000
9
CPU times: user 6.89 ms, sys: 8 ms, total: 14.9 ms
Wall time: 3.85 s


And to compare it fairly, let us use Pypy for comparison.

In [122]:
%%time
%%pypy

def lempel_ziv_complexity(sequence):
"""Lempel-Ziv complexity for a binary sequence, in simple Python code."""
sub_strings = set()
n = len(sequence)

ind = 0
inc = 1
while True:
if ind + inc > len(sequence):
break
sub_str = sequence[ind : ind + inc]
if sub_str in sub_strings:
inc += 1
else:
ind += inc
inc = 1
return len(sub_strings)

s = "1001111011000010"
lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010

from random import random

M = 1000
N = 10000
for _ in range(M):
s = ''.join(str(int(random() < 0.5)) for _ in range(N))
lempel_ziv_complexity(s)

CPU times: user 4.24 ms, sys: 8 ms, total: 12.2 ms
Wall time: 1.39 s


So we can check that on these 1000 random trials on strings of size 10000, the naive Julia version is slower than the naive Python version (executed by Pypy for speedup).

## Ending notes¶

Thanks for reading! My implementation is now open-source and available on GitHub, on https://github.com/Naereen/Lempel-Ziv_Complexity.

It will be available from PyPi very soon, see https://pypi.python.org/pypi/lempel_ziv_complexity.

See this repo on GitHub for more notebooks, or on nbviewer.jupyter.org.

That's it for this demo! See you, folks!