Comparing the performance of btables in memory and on cached disk

In this tutorial we are going to look at the performance of the btable container using different compressors and compression levels. The only difference with "Comparing speedups when using compression with blz btables stored in disk" is that this one compares working in memory against working with disk data that is cached by the operating system.

Reading data from disk is a really slow operation. To improve loading times, the operating system keeps a copy of the data it reads in memory, so every subsequent read is faster than the first one. This is what is called caching, and it is what I want to compare against working only with in-memory BLZ containers. Cached disk access is the normal scenario when working with data, and this is why both are being compared.

In [1]:
import pandas as pd
import numpy as np
import pylab as plt
import blz
import sys
import csv
from time import time
from shutil import rmtree
from collections import defaultdict

Let's create a btable on disk and a btable in memory.

In [2]:
t1 = time()
df = pd.read_hdf('h5/msft-18-SEP-2013-blosc9.h5', '/msft')
t2 = time()

print "Reading h5: " + str(t2 - t1)

dt = np.dtype([(k, v if v.kind!='O' else 'S49') for k,v in df.dtypes.iteritems()])

t1 = time()
btm = blz.fromiter((i[1:] for i in df.itertuples()), dtype=dt, count=len(df), 
                  bparams=blz.bparams(clevel=0))
t2 = time()

print "Converting h5 to btable in memory: " + str(t2 - t1)

#If the blz already exists, remove it
rmtree('blz/msft-18-SEP-2013-blosc9.blz', ignore_errors=True)

t1 = time()
btd = btm.copy(rootdir='blz/msft-18-SEP-2013-blosc9.blz', bparams=blz.bparams(clevel=0))
t2 = time()

print "Copying btable from memory to disk: " + str(t2-t1)
Reading h5: 4.99775981903
Converting h5 to btable in memory: 7.79994988441
Copying btable from memory to disk: 10.7896511555

We will compute the Open-High-Low-Close points of the trade prices, for every compression level and for every compressor. This will be our "heavy" operation.

In [3]:
def get_OHLC_points(src):

    windows = []
    window_size = 79
    sample_size = 2455

    #Lazy iterator over the trade rows only
    it = src.where('(Type == \'Trade\')')

    for i in xrange(sample_size):

        local_max = - np.inf
        local_min = np.inf
        average_price = 0

        for j in xrange(window_size):

            current = it.next()
            price = current[7]
            average_price += price

            if price > local_max:
                local_max = price

            if price < local_min:
                local_min = price

            if j == 0:
                local_open = price

            if j == (window_size - 1):
                local_close = price

        windows.append([average_price/float(window_size), local_open, local_max, local_min, local_close])

    return windows

Let's define the copy function.

In [4]:
def copy(src, memory, clevel=5, shuffle=True, cname="blosclz"):

    """
    Copy a btable, either in memory or on disk (under 'blz/temp.blz').

    Parameters
    ----------
    src : btable
        The btable to copy.
    memory : bool
        If True the copy is kept in memory, otherwise it is written to disk.
    clevel : int (0 <= clevel < 10)
        The compression level.
    shuffle : bool
        Whether the shuffle filter is active or not.
    cname : string ('blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', others?)
        Select the compressor to use inside Blosc.

    """
    if memory:
        copied = src.copy(bparams=blz.bparams(clevel=clevel, shuffle=shuffle, cname=cname))
    else:
        copied = src.copy(rootdir='blz/temp.blz', bparams=blz.bparams(clevel=clevel, shuffle=shuffle, cname=cname))

    copied.flush()

    return copied

And now we just need a benchmark function.

In [7]:
def benchmark(data, cmethods, memory):
    
    if memory:
        base_name = 'csv/bbtablem_'
    else:
        base_name = 'csv/bbtable_'

    for method in cmethods:

        myfile = open(base_name + method + '.csv', 'wb')
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        wr.writerow(['Compression level',
                     'Compressed size',
                     'Compression ratio',
                     'Compression time',
                     'OHLC time',
                     'Writing speed (MiB/s)',
                     'OHLC speed (MiB/s)'])

        for compression_level in xrange(0,10):

            #Rebind data so only one copy is kept in memory at a time
            tc1 = time()
            data = copy(data, memory, compression_level, True, method)
            tc2 = time()

            #Now get the OHLC points and measure the time
            t1 = time()
            get_OHLC_points(data)
            t2 = time()
            
            #Uncompressed size in MiB
            uncompressed = data.nbytes/(2**20)

            #Store the results for this compression level
            row = [compression_level, data.cbytes,
                   round(data.nbytes/float(data.cbytes),3),
                   str(tc2 - tc1),
                   str(t2 - t1),
                   uncompressed/(tc2 - tc1),
                   uncompressed/(t2 - t1)]

            #Add it to the csv
            wr.writerow(row)
            rmtree('blz/temp.blz', ignore_errors=True)

        myfile.close()

We have all the ingredients now; let's do some benchmarking.

In [8]:
cmethods = ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']

t1 = time()
benchmark(btd, cmethods, False)
t2 = time()

t3 = time()
benchmark(btm, cmethods, True)
t4 = time()

print 'Cached disk: ' + str(t2-t1)
print 'Memory only: ' + str(t4-t3)
Cached disk: 748.564900875
Memory only: 564.818388939

We can see all the gathered data now.

In [15]:
print 'blosclz data'
print 'Memory'
df =  pd.read_csv('csv/bbtablem_blosclz.csv')
df
blosclz data
Memory
Out[15]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 0.681122 2.263231 675.356186 203.249245
1 1 62562582 7.716 0.804646 2.227441 571.679958 206.515003
2 2 60615376 7.964 0.828417 2.152618 555.276023 213.693286
3 3 60114378 8.030 0.841646 2.186049 546.548102 210.425294
4 4 28904306 16.701 0.960289 2.399268 479.022460 191.725150
5 5 27821399 17.351 1.134386 2.458255 405.505687 187.124603
6 6 27450920 17.585 1.142648 2.458005 402.573677 187.143643
7 7 27136857 17.789 1.140152 2.495144 403.454902 184.358105
8 8 27091198 17.819 1.157194 2.430481 397.513330 189.262952
9 9 27033395 17.857 1.172375 2.481087 392.365936 185.402592

10 rows × 7 columns

In [16]:
print 'blosclz data'
print 'Disk'
df =  pd.read_csv('csv/bbtable_blosclz.csv')
df
blosclz data
Disk
Out[16]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 10.990172 3.124110 41.855578 147.241935
1 1 62562582 7.716 1.679541 2.767619 273.884374 166.207841
2 2 60615376 7.964 1.737121 2.741565 264.805984 167.787378
3 3 60114378 8.030 1.763058 2.736637 260.910313 168.089513
4 4 28904306 16.701 1.857233 2.958073 247.680279 155.506635
5 5 27821399 17.351 1.860526 2.963732 247.241898 155.209715
6 6 27450920 17.585 1.860842 2.876680 247.199925 159.906551
7 7 27136857 17.789 1.854447 2.967175 248.052367 155.029615
8 8 27091198 17.819 1.871443 3.153615 245.799627 145.864349
9 9 27033395 17.857 1.910575 2.970001 240.765226 154.882104

10 rows × 7 columns

In [18]:
print 'lz4 data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_lz4.csv')
df
lz4 data
Memory
Out[18]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 0.709622 2.190826 648.232525 209.966452
1 1 31210120 15.467 0.699199 2.303638 657.895714 199.684153
2 2 31210120 15.467 0.757916 2.341326 606.927438 196.469864
3 3 31210120 15.467 0.761211 2.227152 604.300129 206.541820
4 4 31210120 15.467 0.766530 2.341871 600.106947 196.424139
5 5 31210120 15.467 0.759634 2.428854 605.554766 189.389730
6 6 31210120 15.467 0.770773 2.244157 596.803520 204.976739
7 7 31210120 15.467 0.766620 2.323864 600.036399 197.946181
8 8 31210120 15.467 0.764220 2.250569 601.920913 204.392768
9 9 31210120 15.467 0.762074 2.331698 603.615927 197.281108

10 rows × 7 columns

In [17]:
print 'lz4 data'
print 'Disk'
df = pd.read_csv('csv/bbtable_lz4.csv')
df
lz4 data
Disk
Out[17]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 9.310785 2.775670 49.405071 165.725750
1 1 31210120 15.467 1.642699 2.819540 280.026955 163.147179
2 2 31210120 15.467 1.629550 2.790243 282.286524 164.860184
3 3 31210120 15.467 1.624146 2.801503 283.225772 164.197579
4 4 31210120 15.467 1.620733 2.847433 283.822193 161.549011
5 5 31210120 15.467 1.620822 2.820628 283.806578 163.084253
6 6 31210120 15.467 1.630892 2.913791 282.054231 157.869940
7 7 31210120 15.467 1.651789 2.862687 278.485941 160.688186
8 8 31210120 15.467 1.661538 2.805433 276.851948 163.967557
9 9 31210120 15.467 1.624195 2.863090 283.217207 160.665572

10 rows × 7 columns

In [19]:
print 'lz4hc data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_lz4hc.csv')
df
lz4hc data
Memory
Out[19]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 0.601254 2.203485 765.067693 208.760213
1 1 27466793 17.575 3.476439 2.412721 132.319307 190.656116
2 2 23961748 20.146 4.512663 2.281846 101.935373 201.591164
3 3 21717259 22.228 6.930293 2.223672 66.375262 206.865027
4 4 20475242 23.576 13.367165 2.200643 34.412683 209.029809
5 5 19908867 24.247 28.572774 2.280474 16.099242 201.712435
6 6 19752345 24.439 49.155402 2.253255 9.358076 204.149119
7 7 19711733 24.489 56.906914 2.264541 8.083376 203.131680
8 8 19710950 24.490 57.256365 2.243897 8.034041 205.000479
9 9 19710833 24.491 57.443949 2.261611 8.007806 203.394817

10 rows × 7 columns

In [20]:
print 'lz4hc data'
print 'Disk'
df = pd.read_csv('csv/bbtable_lz4hc.csv')
df
lz4hc data
Disk
Out[20]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 10.022066 3.445173 45.898720 133.520154
1 1 27466793 17.575 4.448363 2.818698 103.408825 163.195919
2 2 23961748 20.146 5.389405 2.779829 85.352650 165.477803
3 3 21717259 22.228 8.024240 2.839225 57.326301 162.016040
4 4 20475242 23.576 14.552245 2.809012 31.610243 163.758649
5 5 19908867 24.247 29.689302 2.799546 15.493796 164.312356
6 6 19752345 24.439 50.277322 2.786857 9.149254 165.060503
7 7 19711733 24.489 57.907075 2.704130 7.943762 170.110154
8 8 19710950 24.490 58.364596 2.765320 7.881490 166.346039
9 9 19710833 24.491 58.455211 2.771683 7.869273 165.964147

10 rows × 7 columns

In [21]:
print 'snappy data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_snappy.csv')
df
snappy data
Memory
Out[21]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 0.604864 2.201925 760.501383 208.908111
1 1 410696358 1.175 0.851744 2.334054 540.068418 197.081988
2 2 410696358 1.175 0.853806 2.296802 538.764063 200.278470
3 3 410696358 1.175 0.862138 2.373132 533.557252 193.836669
4 4 410696358 1.175 0.869969 2.308243 528.754394 199.285774
5 5 410696358 1.175 0.869898 2.290146 528.797580 200.860547
6 6 410696358 1.175 0.860055 2.290641 534.849534 200.817166
7 7 410696358 1.175 0.872584 2.316078 527.169958 198.611623
8 8 410696358 1.175 0.865230 2.266900 531.650492 202.920282
9 9 410696358 1.175 0.865937 2.305180 531.216478 199.550571

10 rows × 7 columns

In [23]:
print 'snappy data'
print 'Disk'
df = pd.read_csv('csv/bbtable_snappy.csv')
df
snappy data
Disk
Out[23]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 10.879534 5.507512 42.281223 83.522286
1 1 410696358 1.175 7.985037 2.960860 57.607749 155.360266
2 2 410696358 1.175 11.170788 2.792554 41.178832 164.723768
3 3 410696358 1.175 11.086721 2.931666 41.491077 156.907375
4 4 410696358 1.175 10.727255 2.943897 42.881427 156.255466
5 5 410696358 1.175 7.165172 2.955973 64.199437 155.617123
6 6 410696358 1.175 7.418792 3.115795 62.004704 147.634867
7 7 410696358 1.175 7.775094 3.132531 59.163272 146.846116
8 8 410696358 1.175 7.291722 2.939524 63.085237 156.487925
9 9 410696358 1.175 6.927933 2.878196 66.397871 159.822333

10 rows × 7 columns

In [25]:
print 'zlib data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_zlib.csv')
df
zlib data
Memory
Out[25]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 0.662771 2.162894 694.055729 212.678013
1 1 21562571 22.387 3.683232 3.536478 124.890311 130.072913
2 2 19985468 24.154 4.939962 3.484861 93.118127 131.999530
3 3 19364324 24.929 5.524262 3.473563 83.269038 132.428865
4 4 17011833 28.376 8.573870 3.662094 53.651385 125.611198
5 5 16773528 28.779 9.709417 3.655160 47.376687 125.849486
6 6 15251112 31.652 12.199962 3.398706 37.705035 135.345630
7 7 15156305 31.850 14.717500 3.480614 31.255308 132.160592
8 8 14295434 33.768 29.362271 3.468930 15.666363 132.605731
9 9 14114077 34.202 45.618963 3.359124 10.083526 136.940457

10 rows × 7 columns

In [26]:
print 'zlib data'
print 'Disk'
df = pd.read_csv('csv/bbtable_zlib.csv')
df
zlib data
Disk
Out[26]:
Compression level Compressed size Compression ratio Compression time OHLC time Writing speed (MiB/s) OHLC speed (MiB/s)
0 0 485063744 0.995 9.947921 3.326485 46.240817 138.284098
1 1 21562571 22.387 4.717685 4.090439 97.505446 112.457371
2 2 19985468 24.154 4.840078 4.022526 95.039793 114.356003
3 3 19364324 24.929 5.575667 3.940279 82.501342 116.743002
4 4 17011833 28.376 8.509122 4.047293 54.059632 113.656216
5 5 16773528 28.779 9.763632 4.058696 47.113615 113.336894
6 6 15251112 31.652 12.165651 3.965459 37.811375 116.001708
7 7 15156305 31.850 14.992578 3.990026 30.681848 115.287469
8 8 14295434 33.768 29.675020 3.967705 15.501253 115.936038
9 9 14114077 34.202 45.902231 4.076327 10.021299 112.846692

10 rows × 7 columns

Let's do some more plotting. For that I first need to turn this data into dictionaries. I will create dictionaries for the disk benchmark too, so we can compare them easily.

In [2]:
def get_dict(filename):

    #Read a benchmark csv into a dict mapping column index -> list of values
    columns = defaultdict(list)
    with open(filename) as f:
        reader = csv.reader(f)
        reader.next()   #Skip the header row
        for row in reader:
            for (i,v) in enumerate(row):
                if i == 1:
                    #Rescale the compressed size column
                    columns[i].append(float(v)/131072)
                    continue
                columns[i].append(v)
    return columns

#In memory csv
blosclzm = get_dict('csv/bbtablem_blosclz.csv')
lz4m = get_dict('csv/bbtablem_lz4.csv')
lz4hcm = get_dict('csv/bbtablem_lz4hc.csv')
snappym = get_dict('csv/bbtablem_snappy.csv')
zlibm = get_dict('csv/bbtablem_zlib.csv')

#In disk
blosclz = get_dict('csv/bbtable_blosclz.csv')
lz4 = get_dict('csv/bbtable_lz4.csv')
lz4hc = get_dict('csv/bbtable_lz4hc.csv')
snappy = get_dict('csv/bbtable_snappy.csv')
zlib = get_dict('csv/bbtable_zlib.csv')

Now we are ready to plot.

In [3]:
%matplotlib inline

#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_xlabel('Compression ratio')
ax.set_ylabel('Compression level')


#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[2][1:], blosclzm[0][1:], 'bo-')
ax1.plot(blosclz[2][1:], blosclz[0][1:], 'ro-')

#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[2][1:], lz4m[0][1:], 'bo-')
ax2.plot(lz4[2][1:], lz4[0][1:], 'ro-')

#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[2][1:], lz4hcm[0][1:], 'bo-')
ax3.plot(lz4hc[2][1:], lz4hc[0][1:], 'ro-')

#snappy
ax4.set_title('snappy')
ax4.plot(snappym[2][1:], snappym[0][1:], 'bo-')
ax4.plot(snappy[2][1:], snappy[0][1:], 'ro-')

#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[2][1:], zlibm[0][1:], 'bo-', label='Memory')
ax5.plot(zlib[2][1:], zlib[0][1:], 'ro-', label='Disk')

#Legend
ax5.legend(bbox_to_anchor=(2, 1))

plt.show()

Above we can see that both methods (disk and memory) achieve exactly the same compression ratios. This was expected, because the ratio only depends on the compressor and its level. I only included this plot to show that everything works the same regardless of where the data is stored.
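
This claim is easy to sanity-check with the btables created above. The following is only a sketch (the 'blz/check.blz' path is just a scratch location for this example):

In [ ]:
#Sketch: the compression ratio does not depend on where the copy lives
params = blz.bparams(clevel=5, cname='blosclz')
in_mem = btm.copy(bparams=params)
on_disk = btm.copy(rootdir='blz/check.blz', bparams=params)

#Both copies use the same codec settings, so both ratios should match
print 'Memory ratio: ' + str(in_mem.nbytes/float(in_mem.cbytes))
print 'Disk ratio:   ' + str(on_disk.nbytes/float(on_disk.cbytes))

rmtree('blz/check.blz', ignore_errors=True)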

In [4]:
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('Compression time (s)')
ax.set_xlabel('Compression ratio')

#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[2][1:], blosclzm[3][1:], 'bo-')
ax1.plot(blosclz[2][1:], blosclz[3][1:], 'ro-')

#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[2][1:], lz4m[3][1:], 'bo-')
ax2.plot(lz4[2][1:], lz4[3][1:], 'ro-')

#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[2][1:], lz4hcm[3][1:], 'bo-')
ax3.plot(lz4hc[2][1:], lz4hc[3][1:], 'ro-')

#snappy
ax4.set_title('snappy')
ax4.plot(snappym[2][1:], snappym[3][1:], 'bo-')
ax4.plot(snappy[2][1:], snappy[3][1:], 'ro-')

#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[2][1:], zlibm[3][1:], 'bo-', label='Memory')
ax5.plot(zlib[2][1:], zlib[3][1:], 'ro-', label='Disk')

#Legend
ax5.legend(bbox_to_anchor=(2, 1))

plt.show()

Here we can see that making the compressed copy in memory is faster than making it on disk. Due to the much larger scale of the zlib and lz4hc panels, the difference is harder to appreciate for those compressors.

In [12]:
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('OHLC time (s)')
ax.set_xlabel('Compression level')
plt.suptitle('OHLC/Compression')

#Average clevel=0 OHLC time for the horizontal reference lines (Y: disk, Ym: memory)
Y = (float(blosclz[4][0]) + float(lz4[4][0]) + float(lz4hc[4][0]) + float(snappy[4][0]) + float(zlib[4][0]))/len(cmethods)
Ym = (float(blosclzm[4][0]) + float(lz4m[4][0]) + float(lz4hcm[4][0]) + float(snappym[4][0]) + float(zlibm[4][0]))/len(cmethods)

#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[0][1:], blosclzm[4][1:], 'bo-')
ax1.axhline(Ym, color = 'b')
ax1.plot(blosclz[0][1:], blosclz[4][1:], 'ro-')
ax1.axhline(Y, color = 'r')
ax1.set_ylim(ymin=2, ymax=4.5)

#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[0][1:], lz4m[4][1:], 'bo-')
ax2.axhline(Ym, color = 'b')
ax2.plot(lz4[0][1:], lz4[4][1:], 'ro-')
ax2.axhline(Y, color = 'r')
ax2.set_ylim(ymin=2, ymax=4.5)

#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[0][1:], lz4hcm[4][1:], 'bo-')
ax3.axhline(Ym, color = 'b')
ax3.plot(lz4hc[0][1:], lz4hc[4][1:], 'ro-')
ax3.axhline(Y, color = 'r')
ax3.set_ylim(ymin=2, ymax=4.5)

#snappy
ax4.set_title('snappy')
ax4.plot(snappym[0][1:], snappym[4][1:], 'bo-')
ax4.axhline(Ym, color = 'b')
ax4.plot(snappy[0][1:], snappy[4][1:], 'ro-')
ax4.axhline(Y, color = 'r')
ax4.set_ylim(ymin=2, ymax=4.5)

#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[0][1:], zlibm[4][1:], 'bo-', label='Memory')
ax5.axhline(Ym, color = 'b')
ax5.plot(zlib[0][1:], zlib[4][1:], 'ro-', label='Disk')
ax5.axhline(Y, color = 'r')
ax5.set_ylim(ymin=2, ymax=4.5)

#Legend
ax5.legend(bbox_to_anchor=(2, 1))

plt.show()

This may be one of the most interesting plots. The two horizontal lines represent the average time of the OHLC operation on the uncompressed dataset (blue for memory, red for disk). That means that any compressed configuration (represented as dots) lying below its line can be processed faster than the uncompressed dataset.

There are two important points here:

  • You should compress data if you are going to store it on disk.
  • Sometimes, even when working in memory, it is faster to work with compressed data (see the sketch below).
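
A quick way to check the second point on your own workload is to time the heavy operation on an uncompressed and on a compressed in-memory copy. This is only a sketch reusing the objects defined above; the codec and level chosen here are arbitrary:

In [ ]:
#Sketch: does compression pay off for the OHLC scan on this machine?
plain = btm.copy(bparams=blz.bparams(clevel=0))
packed = btm.copy(bparams=blz.bparams(clevel=5, cname='blosclz'))

t1 = time()
get_OHLC_points(plain)
t2 = time()
get_OHLC_points(packed)
t3 = time()

print 'Uncompressed OHLC time: ' + str(t2 - t1)
print 'Compressed OHLC time:   ' + str(t3 - t2)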

Let's see how many MiB/s we get.

In [37]:
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('MiB/s')
ax.set_xlabel('Compression level')

#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[0][1:], blosclzm[5][1:], 'bo-')
ax1.plot(blosclzm[0][1:], blosclzm[6][1:], 'go-')
ax1.plot(blosclz[0][1:], blosclz[5][1:], 'ro-')
ax1.plot(blosclz[0][1:], blosclz[6][1:], 'mo-')

#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[0][1:], lz4m[5][1:], 'bo-')
ax2.plot(lz4m[0][1:], lz4m[6][1:], 'go-')
ax2.plot(lz4[0][1:], lz4[5][1:], 'ro-')
ax2.plot(lz4[0][1:], lz4[6][1:], 'mo-')

#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[0][1:], lz4hcm[5][1:], 'bo-')
ax3.plot(lz4hcm[0][1:], lz4hcm[6][1:], 'go-')
ax3.plot(lz4hc[0][1:], lz4hc[5][1:], 'ro-')
ax3.plot(lz4hc[0][1:], lz4hc[6][1:], 'mo-')

#snappy
ax4.set_title('snappy')
ax4.plot(snappym[0][1:], snappym[5][1:], 'bo-')
ax4.plot(snappym[0][1:], snappym[6][1:], 'go-')
ax4.plot(snappy[0][1:], snappy[5][1:], 'ro-')
ax4.plot(snappy[0][1:], snappy[6][1:], 'mo-')

#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[0][1:], zlibm[5][1:], 'bo-', label='Compression speed (Memory)')
ax5.plot(zlibm[0][1:], zlibm[6][1:], 'go-', label='Computational speed (Memory)')
ax5.plot(zlib[0][1:], zlib[5][1:], 'ro-', label='Compression speed (Disk)')
ax5.plot(zlib[0][1:], zlib[6][1:], 'mo-', label='Computational speed (Disk)')

#Legend
ax5.legend(bbox_to_anchor=(2.7, 1))

plt.show()

This is, in my opinion, the hardest plot to interpret. Higher lines mean higher speeds, so we can see how working with the datasets in memory is faster than the disk counterpart.

Conclusions

In general we have seen that on disk it is faster to work with compressed data rather than with uncompressed data. What about memory? In the OHLC/Compression plot we can see that working with compressed data is nearly as fast as (and sometimes even faster than) working with uncompressed data. This is a very important point, because you are not only working "at the same speed" but you are also saving memory.

Regarding speed

Reading and writing data from disk are really slow operations. Because processors are very fast, in most cases it is actually quicker to compress the data before writing it to disk (so less data goes through the slow operation) than to write the uncompressed data; the same holds when reading it back and decompressing. As shown above, this is starting to become true for memory too.
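
A back-of-the-envelope calculation with the blosclz on-disk figures from the In [16] table above illustrates the point:

In [ ]:
#Writing speeds taken from the blosclz disk table above (In [16])
speed_uncompressed = 41.855578   #clevel=0: the raw data goes straight to disk
speed_level1 = 273.884374        #clevel=1: roughly 7.7x less data hits the disk

#Compressing first makes the whole write roughly 6.5x faster here
print 'Speedup: ' + str(speed_level1/speed_uncompressed)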

Regarding memory

When working with big data I have realized that RAM is a big constraint: you can't, or shouldn't, try to fit all the data into it. That is why BLZ is so useful: it allows you to work with big datasets on disk without actually loading them entirely into memory. And, as I have shown in previous tutorials, those containers are really fast.
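
For instance, the on-disk btable btd created at the beginning can be scanned with a query without ever materializing the whole table in RAM; only small blocks are decompressed at a time. A minimal sketch (the price is accessed by position, exactly as in get_OHLC_points):

In [ ]:
#Sketch: out-of-core scan over the on-disk btable
total = 0.0
count = 0
for row in btd.where('(Type == \'Trade\')'):
    total += row[7]   #Same price field used in get_OHLC_points
    count += 1

print 'Average trade price: ' + str(total/count)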

Also, we are getting compression ratios of up to 34x, which means we are storing 34 times less data on disk, and that allows us to store even more data!
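
That figure comes straight from the zlib table above (In [26]). A quick check (the small difference with the 34.202 ratio reported in the table comes from the level 0 copy being slightly larger than the raw data):

In [ ]:
#Sizes taken from the zlib tables above (In [26])
level0_bytes = 485063744   #size of the clevel=0 copy
level9_bytes = 14114077    #size of the clevel=9 copy

print 'Compression ratio: ' + str(level0_bytes/float(level9_bytes))   #~34.4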