In this tutorial we are going to look at the performance of the btable container using different compressors and compression levels. The only difference with "Comparing speedups when using compression with blz btables stored in disk" is that this one compares working in memory against working with cached disk.
Reading data from disk is a really slow operation. To improve loading times, the operating system keeps a copy of recently read data in memory, so every subsequent read of the same data is faster than the first one. This is what is called caching, and it is what I want to compare against working purely with in-memory BLZ containers. Cached disk is the normal scenario when working with data repeatedly, which is why the two are being compared.
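To see the cache in action, here is a minimal sketch that times two consecutive reads of the same h5 file used below; the second read is served from the page cache. (Note that if the file was touched recently, the first read may already be warm.)
from time import time

def timed_read(path):
    #Read the whole file in 1 MiB chunks and return the elapsed time
    t0 = time()
    with open(path, 'rb') as f:
        while f.read(2**20):
            pass
    return time() - t0

#The second read should be much faster, served from the page cache
print 'cold read: %.2f s' % timed_read('h5/msft-18-SEP-2013-blosc9.h5')
print 'warm read: %.2f s' % timed_read('h5/msft-18-SEP-2013-blosc9.h5')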
import pandas as pd
import numpy as np
import pylab as plt
import blz
import sys
import csv
from time import time
from shutil import rmtree
from collections import defaultdict
Let's create a btable on disk and a btable in memory.
t1 = time()
df = pd.read_hdf('h5/msft-18-SEP-2013-blosc9.h5', '/msft')
t2 = time()
print "Reading h5: " + str(t2 - t1)
dt = np.dtype([(k, v if v.kind!='O' else 'S49') for k,v in df.dtypes.iteritems()])
t1 = time()
btm = blz.fromiter((i[1:] for i in df.itertuples()), dtype=dt, count=len(df),
                   bparams=blz.bparams(clevel=0))
t2 = time()
print "Converting h5 to btable in memory: " + str(t2 - t1)
#If the blz directory already exists, remove it
rmtree('blz/msft-18-SEP-2013-blosc9.blz', ignore_errors=True)
t1 = time()
btd = btm.copy(rootdir='blz/msft-18-SEP-2013-blosc9.blz', bparams=blz.bparams(clevel=0))
t2 = time()
print "Copying btable from memory to disk: " + str(t2-t1)
Reading h5: 4.99775981903
Converting h5 to btable in memory: 7.79994988441
Copying btable from memory to disk: 10.7896511555
We will compute the Open-High-Low-Close (OHLC) points (as in the companion disk tutorial), for every compression level and for every compressor. This will be our "heavy" operation.
def get_OHLC_points(src):
    windows = []
    window_size = 79
    sample_size = 2455
    it = src.where('(Type == \'Trade\')')
    for i in xrange(sample_size):
        local_max = -np.inf
        local_min = np.inf
        average_price = 0
        for j in xrange(window_size):
            current = it.next()
            #The price field is at position 7 in the row
            price = current[7]
            average_price += price
            if price > local_max:
                local_max = price
            if price < local_min:
                local_min = price
            if j == 0:
                local_open = price
            if j == (window_size - 1):
                local_close = price
        windows.append([average_price/float(window_size), local_open,
                        local_max, local_min, local_close])
    return windows
Now let's define the copy function.
def copy(src, memory, clevel=5, shuffle=True, cname="blosclz"):
    """
    Parameters
    ----------
    clevel : int (0 <= clevel < 10)
        The compression level.
    shuffle : bool
        Whether the shuffle filter is active or not.
    cname : string ('blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', others?)
        Select the compressor to use inside Blosc.
    """
    if memory:
        copied = src.copy(bparams=blz.bparams(clevel=clevel, shuffle=shuffle, cname=cname))
    else:
        copied = src.copy(rootdir='blz/temp.blz',
                          bparams=blz.bparams(clevel=clevel, shuffle=shuffle, cname=cname))
        copied.flush()
    return copied
And now we just need a benchmark function.
def benchmark(data, cmethods, memory):
    if memory:
        base_name = 'csv/bbtablem_'
    else:
        base_name = 'csv/bbtable_'
    for method in cmethods:
        myfile = open(base_name + method + '.csv', 'wb')
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        wr.writerow(['Compression level',
                     'Compressed size (bytes)',
                     'Compression ratio',
                     'Compression time (s)',
                     'OHLC time (s)',
                     'Writing speed (MiB/s)',
                     'OHLC speed (MiB/s)'])
        for compression_level in xrange(0, 10):
            #Reassign data so only one copy is kept in memory
            tc1 = time()
            data = copy(data, memory, compression_level, True, method)
            tc2 = time()
            #Now get the points and measure the time
            t1 = time()
            get_OHLC_points(data)
            t2 = time()
            #Uncompressed size in MiB
            uncompressed = data.nbytes/(2**20)
            #Store the results
            row = [compression_level, data.cbytes,
                   round(data.nbytes/float(data.cbytes), 3),
                   str(tc2 - tc1),
                   str(t2 - t1),
                   uncompressed/(tc2 - tc1),
                   uncompressed/(t2 - t1)]
            #Add it to the csv
            wr.writerow(row)
            rmtree('blz/temp.blz', ignore_errors=True)
        myfile.close()
We have all the ingredients now; let's do some benchmarking.
cmethods = ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
t1 = time()
benchmark(btd, cmethods, False)
t2 = time()
t3 = time()
benchmark(btm, cmethods, True)
t4 = time()
print 'Cached disk: ' + str(t2-t1)
print 'Memory only: ' + str(t4-t3)
Cached disk: 748.564900875
Memory only: 564.818388939
The whole benchmark ran about 1.3x faster in memory (564.8 s) than against cached disk (748.6 s). We can look at all the gathered data now.
print 'blosclz data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_blosclz.csv')
df
blosclz data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.681122 | 2.263231 | 675.356186 | 203.249245 |
1 | 1 | 62562582 | 7.716 | 0.804646 | 2.227441 | 571.679958 | 206.515003 |
2 | 2 | 60615376 | 7.964 | 0.828417 | 2.152618 | 555.276023 | 213.693286 |
3 | 3 | 60114378 | 8.030 | 0.841646 | 2.186049 | 546.548102 | 210.425294 |
4 | 4 | 28904306 | 16.701 | 0.960289 | 2.399268 | 479.022460 | 191.725150 |
5 | 5 | 27821399 | 17.351 | 1.134386 | 2.458255 | 405.505687 | 187.124603 |
6 | 6 | 27450920 | 17.585 | 1.142648 | 2.458005 | 402.573677 | 187.143643 |
7 | 7 | 27136857 | 17.789 | 1.140152 | 2.495144 | 403.454902 | 184.358105 |
8 | 8 | 27091198 | 17.819 | 1.157194 | 2.430481 | 397.513330 | 189.262952 |
9 | 9 | 27033395 | 17.857 | 1.172375 | 2.481087 | 392.365936 | 185.402592 |
10 rows × 7 columns
print 'blosclz data'
print 'Disk'
df = pd.read_csv('csv/bbtable_blosclz.csv')
df
blosclz data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 10.990172 | 3.124110 | 41.855578 | 147.241935 |
1 | 1 | 62562582 | 7.716 | 1.679541 | 2.767619 | 273.884374 | 166.207841 |
2 | 2 | 60615376 | 7.964 | 1.737121 | 2.741565 | 264.805984 | 167.787378 |
3 | 3 | 60114378 | 8.030 | 1.763058 | 2.736637 | 260.910313 | 168.089513 |
4 | 4 | 28904306 | 16.701 | 1.857233 | 2.958073 | 247.680279 | 155.506635 |
5 | 5 | 27821399 | 17.351 | 1.860526 | 2.963732 | 247.241898 | 155.209715 |
6 | 6 | 27450920 | 17.585 | 1.860842 | 2.876680 | 247.199925 | 159.906551 |
7 | 7 | 27136857 | 17.789 | 1.854447 | 2.967175 | 248.052367 | 155.029615 |
8 | 8 | 27091198 | 17.819 | 1.871443 | 3.153615 | 245.799627 | 145.864349 |
9 | 9 | 27033395 | 17.857 | 1.910575 | 2.970001 | 240.765226 | 154.882104 |
10 rows × 7 columns
print 'lz4 data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_lz4.csv')
df
lz4 data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.709622 | 2.190826 | 648.232525 | 209.966452 |
1 | 1 | 31210120 | 15.467 | 0.699199 | 2.303638 | 657.895714 | 199.684153 |
2 | 2 | 31210120 | 15.467 | 0.757916 | 2.341326 | 606.927438 | 196.469864 |
3 | 3 | 31210120 | 15.467 | 0.761211 | 2.227152 | 604.300129 | 206.541820 |
4 | 4 | 31210120 | 15.467 | 0.766530 | 2.341871 | 600.106947 | 196.424139 |
5 | 5 | 31210120 | 15.467 | 0.759634 | 2.428854 | 605.554766 | 189.389730 |
6 | 6 | 31210120 | 15.467 | 0.770773 | 2.244157 | 596.803520 | 204.976739 |
7 | 7 | 31210120 | 15.467 | 0.766620 | 2.323864 | 600.036399 | 197.946181 |
8 | 8 | 31210120 | 15.467 | 0.764220 | 2.250569 | 601.920913 | 204.392768 |
9 | 9 | 31210120 | 15.467 | 0.762074 | 2.331698 | 603.615927 | 197.281108 |
10 rows × 7 columns
print 'lz4 data'
print 'Disk'
df = pd.read_csv('csv/bbtable_lz4.csv')
df
lz4 data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 9.310785 | 2.775670 | 49.405071 | 165.725750 |
1 | 1 | 31210120 | 15.467 | 1.642699 | 2.819540 | 280.026955 | 163.147179 |
2 | 2 | 31210120 | 15.467 | 1.629550 | 2.790243 | 282.286524 | 164.860184 |
3 | 3 | 31210120 | 15.467 | 1.624146 | 2.801503 | 283.225772 | 164.197579 |
4 | 4 | 31210120 | 15.467 | 1.620733 | 2.847433 | 283.822193 | 161.549011 |
5 | 5 | 31210120 | 15.467 | 1.620822 | 2.820628 | 283.806578 | 163.084253 |
6 | 6 | 31210120 | 15.467 | 1.630892 | 2.913791 | 282.054231 | 157.869940 |
7 | 7 | 31210120 | 15.467 | 1.651789 | 2.862687 | 278.485941 | 160.688186 |
8 | 8 | 31210120 | 15.467 | 1.661538 | 2.805433 | 276.851948 | 163.967557 |
9 | 9 | 31210120 | 15.467 | 1.624195 | 2.863090 | 283.217207 | 160.665572 |
10 rows × 7 columns
print 'lz4hc data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_lz4hc.csv')
df
lz4hc data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.601254 | 2.203485 | 765.067693 | 208.760213 |
1 | 1 | 27466793 | 17.575 | 3.476439 | 2.412721 | 132.319307 | 190.656116 |
2 | 2 | 23961748 | 20.146 | 4.512663 | 2.281846 | 101.935373 | 201.591164 |
3 | 3 | 21717259 | 22.228 | 6.930293 | 2.223672 | 66.375262 | 206.865027 |
4 | 4 | 20475242 | 23.576 | 13.367165 | 2.200643 | 34.412683 | 209.029809 |
5 | 5 | 19908867 | 24.247 | 28.572774 | 2.280474 | 16.099242 | 201.712435 |
6 | 6 | 19752345 | 24.439 | 49.155402 | 2.253255 | 9.358076 | 204.149119 |
7 | 7 | 19711733 | 24.489 | 56.906914 | 2.264541 | 8.083376 | 203.131680 |
8 | 8 | 19710950 | 24.490 | 57.256365 | 2.243897 | 8.034041 | 205.000479 |
9 | 9 | 19710833 | 24.491 | 57.443949 | 2.261611 | 8.007806 | 203.394817 |
10 rows × 7 columns
print 'lz4hc data'
print 'Disk'
df = pd.read_csv('csv/bbtable_lz4hc.csv')
df
lz4hc data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 10.022066 | 3.445173 | 45.898720 | 133.520154 |
1 | 1 | 27466793 | 17.575 | 4.448363 | 2.818698 | 103.408825 | 163.195919 |
2 | 2 | 23961748 | 20.146 | 5.389405 | 2.779829 | 85.352650 | 165.477803 |
3 | 3 | 21717259 | 22.228 | 8.024240 | 2.839225 | 57.326301 | 162.016040 |
4 | 4 | 20475242 | 23.576 | 14.552245 | 2.809012 | 31.610243 | 163.758649 |
5 | 5 | 19908867 | 24.247 | 29.689302 | 2.799546 | 15.493796 | 164.312356 |
6 | 6 | 19752345 | 24.439 | 50.277322 | 2.786857 | 9.149254 | 165.060503 |
7 | 7 | 19711733 | 24.489 | 57.907075 | 2.704130 | 7.943762 | 170.110154 |
8 | 8 | 19710950 | 24.490 | 58.364596 | 2.765320 | 7.881490 | 166.346039 |
9 | 9 | 19710833 | 24.491 | 58.455211 | 2.771683 | 7.869273 | 165.964147 |
10 rows × 7 columns
print 'snappy data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_snappy.csv')
df
snappy data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.604864 | 2.201925 | 760.501383 | 208.908111 |
1 | 1 | 410696358 | 1.175 | 0.851744 | 2.334054 | 540.068418 | 197.081988 |
2 | 2 | 410696358 | 1.175 | 0.853806 | 2.296802 | 538.764063 | 200.278470 |
3 | 3 | 410696358 | 1.175 | 0.862138 | 2.373132 | 533.557252 | 193.836669 |
4 | 4 | 410696358 | 1.175 | 0.869969 | 2.308243 | 528.754394 | 199.285774 |
5 | 5 | 410696358 | 1.175 | 0.869898 | 2.290146 | 528.797580 | 200.860547 |
6 | 6 | 410696358 | 1.175 | 0.860055 | 2.290641 | 534.849534 | 200.817166 |
7 | 7 | 410696358 | 1.175 | 0.872584 | 2.316078 | 527.169958 | 198.611623 |
8 | 8 | 410696358 | 1.175 | 0.865230 | 2.266900 | 531.650492 | 202.920282 |
9 | 9 | 410696358 | 1.175 | 0.865937 | 2.305180 | 531.216478 | 199.550571 |
10 rows × 7 columns
print 'snappy data'
print 'Disk'
df = pd.read_csv('csv/bbtable_snappy.csv')
df
snappy data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 10.879534 | 5.507512 | 42.281223 | 83.522286 |
1 | 1 | 410696358 | 1.175 | 7.985037 | 2.960860 | 57.607749 | 155.360266 |
2 | 2 | 410696358 | 1.175 | 11.170788 | 2.792554 | 41.178832 | 164.723768 |
3 | 3 | 410696358 | 1.175 | 11.086721 | 2.931666 | 41.491077 | 156.907375 |
4 | 4 | 410696358 | 1.175 | 10.727255 | 2.943897 | 42.881427 | 156.255466 |
5 | 5 | 410696358 | 1.175 | 7.165172 | 2.955973 | 64.199437 | 155.617123 |
6 | 6 | 410696358 | 1.175 | 7.418792 | 3.115795 | 62.004704 | 147.634867 |
7 | 7 | 410696358 | 1.175 | 7.775094 | 3.132531 | 59.163272 | 146.846116 |
8 | 8 | 410696358 | 1.175 | 7.291722 | 2.939524 | 63.085237 | 156.487925 |
9 | 9 | 410696358 | 1.175 | 6.927933 | 2.878196 | 66.397871 | 159.822333 |
10 rows × 7 columns
print 'zlib data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_zlib.csv')
df
zlib data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.662771 | 2.162894 | 694.055729 | 212.678013 |
1 | 1 | 21562571 | 22.387 | 3.683232 | 3.536478 | 124.890311 | 130.072913 |
2 | 2 | 19985468 | 24.154 | 4.939962 | 3.484861 | 93.118127 | 131.999530 |
3 | 3 | 19364324 | 24.929 | 5.524262 | 3.473563 | 83.269038 | 132.428865 |
4 | 4 | 17011833 | 28.376 | 8.573870 | 3.662094 | 53.651385 | 125.611198 |
5 | 5 | 16773528 | 28.779 | 9.709417 | 3.655160 | 47.376687 | 125.849486 |
6 | 6 | 15251112 | 31.652 | 12.199962 | 3.398706 | 37.705035 | 135.345630 |
7 | 7 | 15156305 | 31.850 | 14.717500 | 3.480614 | 31.255308 | 132.160592 |
8 | 8 | 14295434 | 33.768 | 29.362271 | 3.468930 | 15.666363 | 132.605731 |
9 | 9 | 14114077 | 34.202 | 45.618963 | 3.359124 | 10.083526 | 136.940457 |
10 rows × 7 columns
print 'zlib data'
print 'Disk'
df = pd.read_csv('csv/bbtable_zlib.csv')
df
zlib data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 9.947921 | 3.326485 | 46.240817 | 138.284098 |
1 | 1 | 21562571 | 22.387 | 4.717685 | 4.090439 | 97.505446 | 112.457371 |
2 | 2 | 19985468 | 24.154 | 4.840078 | 4.022526 | 95.039793 | 114.356003 |
3 | 3 | 19364324 | 24.929 | 5.575667 | 3.940279 | 82.501342 | 116.743002 |
4 | 4 | 17011833 | 28.376 | 8.509122 | 4.047293 | 54.059632 | 113.656216 |
5 | 5 | 16773528 | 28.779 | 9.763632 | 4.058696 | 47.113615 | 113.336894 |
6 | 6 | 15251112 | 31.652 | 12.165651 | 3.965459 | 37.811375 | 116.001708 |
7 | 7 | 15156305 | 31.850 | 14.992578 | 3.990026 | 30.681848 | 115.287469 |
8 | 8 | 14295434 | 33.768 | 29.675020 | 3.967705 | 15.501253 | 115.936038 |
9 | 9 | 14114077 | 34.202 | 45.902231 | 4.076327 | 10.021299 | 112.846692 |
10 rows × 7 columns
Let's do some more plotting. For that I first need to turn this data into dictionaries. I will create dictionaries for the disk benchmark too, so we can compare the two easily.
def get_dict(filename):
    columns = defaultdict(list)
    with open(filename) as f:
        reader = csv.reader(f)
        reader.next()  #skip the header row
        for row in reader:
            for (i, v) in enumerate(row):
                if i == 1:
                    #Column 1 is the compressed size in bytes; store it scaled down (by 2**17)
                    columns[i].append(float(v)/131072)
                    continue
                columns[i].append(v)
    return columns
#In memory csv
blosclzm = get_dict('csv/bbtablem_blosclz.csv')
lz4m = get_dict('csv/bbtablem_lz4.csv')
lz4hcm = get_dict('csv/bbtablem_lz4hc.csv')
snappym = get_dict('csv/bbtablem_snappy.csv')
zlibm = get_dict('csv/bbtablem_zlib.csv')
#In disk
blosclz = get_dict('csv/bbtable_blosclz.csv')
lz4 = get_dict('csv/bbtable_lz4.csv')
lz4hc = get_dict('csv/bbtable_lz4hc.csv')
snappy = get_dict('csv/bbtable_snappy.csv')
zlib = get_dict('csv/bbtable_zlib.csv')
Now we are ready to plot.
%matplotlib inline
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_xlabel('Compression ratio')
ax.set_ylabel('Compression level')
#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[2][1:], blosclzm[0][1:], 'bo-')
ax1.plot(blosclz[2][1:], blosclz[0][1:], 'ro-')
#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[2][1:], lz4m[0][1:], 'bo-')
ax2.plot(lz4[2][1:], lz4[0][1:], 'ro-')
#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[2][1:], lz4hcm[0][1:], 'bo-')
ax3.plot(lz4hc[2][1:], lz4hc[0][1:], 'ro-')
#snappy
ax4.set_title('snappy')
ax4.plot(snappym[2][1:], snappym[0][1:], 'bo-')
ax4.plot(snappy[2][1:], snappy[0][1:], 'ro-')
#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[2][1:], zlibm[0][1:], 'bo-', label='Memory')
ax5.plot(zlib[2][1:], zlib[0][1:], 'ro-', label='Disk')
#Legend
ax5.legend(bbox_to_anchor=(2, 1))
plt.show()
Above we can see that both methods (disk and memory) get the same compression ratio. This was expected, because the ratio depends only on the compressor; I include this plot to show that everything works the same regardless of where the data is stored.
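As a quick sanity check, here is a minimal sketch reusing the dictionaries built above: the compression ratio columns should be identical between the memory and disk runs.
#The compression ratio column (index 2) should match exactly
assert blosclzm[2] == blosclz[2]
assert lz4m[2] == lz4[2]
assert zlibm[2] == zlib[2]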
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('Compression time (s)')
ax.set_xlabel('Compression ratio')
#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[2][1:], blosclzm[3][1:], 'bo-')
ax1.plot(blosclz[2][1:], blosclz[3][1:], 'ro-')
#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[2][1:], lz4m[3][1:], 'bo-')
ax2.plot(lz4[2][1:], lz4[3][1:], 'ro-')
#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[2][1:], lz4hcm[3][1:], 'bo-')
ax3.plot(lz4hc[2][1:], lz4hc[3][1:], 'ro-')
#snappy
ax4.set_title('snappy')
ax4.plot(snappym[2][1:], snappym[3][1:], 'bo-')
ax4.plot(snappy[2][1:], snappy[3][1:], 'ro-')
#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[2][1:], zlibm[3][1:], 'bo-', label='Memory')
ax5.plot(zlib[2][1:], zlib[3][1:], 'ro-', label='Disk')
#Legend
ax5.legend(bbox_to_anchor=(2, 1))
plt.show()
Here we can see that working with the dataset in memory is faster than going through disk. Because of the larger y-axis scale for zlib and lz4hc, the difference is harder to appreciate for those compressors.
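To quantify the gap, here is a minimal sketch (again reusing the dictionaries above) that averages the ratio between the disk and memory compression times per compressor:
#How much longer does the disk copy take, on average, per compressor?
for name, mem, disk in [('blosclz', blosclzm, blosclz),
                        ('lz4', lz4m, lz4),
                        ('snappy', snappym, snappy)]:
    ratios = [float(d)/float(m) for m, d in zip(mem[3][1:], disk[3][1:])]
    print '%s: disk takes %.1fx longer' % (name, sum(ratios)/len(ratios))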
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('OHLC time (s)')
ax.set_xlabel('Compression level')
plt.suptitle('OHLC/Compression')
#Average OHLC time over all compressors for the uncompressed data
#(Y: disk, red line; Ym: memory, blue line)
Y = (float(blosclz[4][0]) + float(lz4[4][0]) + float(lz4hc[4][0]) + float(snappy[4][0]) + float(zlib[4][0]))/len(cmethods)
Ym = (float(blosclzm[4][0]) + float(lz4m[4][0]) + float(lz4hcm[4][0]) + float(snappym[4][0]) + float(zlibm[4][0]))/len(cmethods)
#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[0][1:], blosclzm[4][1:], 'bo-')
ax1.axhline(Ym, color = 'b')
ax1.plot(blosclz[0][1:], blosclz[4][1:], 'ro-')
ax1.axhline(Y, color = 'r')
ax1.set_ylim(ymin=2, ymax=4.5)
#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[0][1:], lz4m[4][1:], 'bo-')
ax2.axhline(Ym, color = 'b')
ax2.plot(lz4[0][1:], lz4[4][1:], 'ro-')
ax2.axhline(Y, color = 'r')
ax2.set_ylim(ymin=2, ymax=4.5)
#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[0][1:], lz4hcm[4][1:], 'bo-')
ax3.axhline(Ym, color = 'b')
ax3.plot(lz4hc[0][1:], lz4hc[4][1:], 'ro-')
ax3.axhline(Y, color = 'r')
ax3.set_ylim(ymin=2, ymax=4.5)
#snappy
ax4.set_title('snappy')
ax4.plot(snappym[0][1:], snappym[4][1:], 'bo-')
ax4.axhline(Ym, color = 'b')
ax4.plot(snappy[0][1:], snappy[4][1:], 'ro-')
ax4.axhline(Y, color = 'r')
ax4.set_ylim(ymin=2, ymax=4.5)
#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[0][1:], zlibm[4][1:], 'bo-', label='Memory')
ax5.axhline(Ym, color = 'b')
ax5.plot(zlib[0][1:], zlib[4][1:], 'ro-', label='Disk')
ax5.axhline(Y, color = 'r')
ax5.set_ylim(ymin=2, ymax=4.5)
#Legend
ax5.legend(bbox_to_anchor=(2, 1))
plt.show()
This may be one of the most interesting plots. The two horizontal lines represent the average OHLC time over the uncompressed dataset (blue for memory, red for disk); any compressed configuration (represented as dots) that falls below its line can be processed faster than the uncompressed dataset.
There are two important points here, which are developed in the conclusions below ("Regarding speed" and "Regarding memory").
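To read the plot programmatically, here is a minimal sketch (reusing the dictionaries and the Y/Ym baselines computed above) that lists the compression levels whose OHLC time falls below the uncompressed average:
def levels_below_baseline(columns, baseline):
    #Column 0 holds the compression level, column 4 the OHLC time
    return [int(l) for l, t in zip(columns[0][1:], columns[4][1:])
            if float(t) < baseline]

print 'blosclz (memory):', levels_below_baseline(blosclzm, Ym)
print 'blosclz (disk):', levels_below_baseline(blosclz, Y)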
Let's see how many MiB/s we get.
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('MiB/s')
ax.set_xlabel('Compression level')
#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[0][1:], blosclzm[5][1:], 'bo-')
ax1.plot(blosclzm[0][1:], blosclzm[6][1:], 'go-')
ax1.plot(blosclz[0][1:], blosclz[5][1:], 'ro-')
ax1.plot(blosclz[0][1:], blosclz[6][1:], 'mo-')
#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[0][1:], lz4m[5][1:], 'bo-')
ax2.plot(lz4m[0][1:], lz4m[6][1:], 'go-')
ax2.plot(lz4[0][1:], lz4[5][1:], 'ro-')
ax2.plot(lz4[0][1:], lz4[6][1:], 'mo-')
#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[0][1:], lz4hcm[5][1:], 'bo-')
ax3.plot(lz4hcm[0][1:], lz4hcm[6][1:], 'go-')
ax3.plot(lz4hc[0][1:], lz4hc[5][1:], 'ro-')
ax3.plot(lz4hc[0][1:], lz4hc[6][1:], 'mo-')
#snappy
ax4.set_title('snappy')
ax4.plot(snappym[0][1:], snappym[5][1:], 'bo-')
ax4.plot(snappym[0][1:], snappym[6][1:], 'go-')
ax4.plot(snappy[0][1:], snappy[5][1:], 'ro-')
ax4.plot(snappy[0][1:], snappy[6][1:], 'mo-')
#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[0][1:], zlibm[5][1:], 'bo-', label='Compression speed (Memory)')
ax5.plot(zlibm[0][1:], zlibm[6][1:], 'go-', label='Computational speed (Memory)')
ax5.plot(zlib[0][1:], zlib[5][1:], 'ro-', label='Compression speed (Disk)')
ax5.plot(zlib[0][1:], zlib[6][1:], 'mo-', label='Computational speed (Disk)')
#Legend
ax5.legend(bbox_to_anchor=(2.7, 1))
plt.show()
This is the least intuitive plot, in my opinion. Higher lines mean better speed, and we can see how working with the datasets in memory is faster than their disk counterparts.
In general we have seen that, on disk, it is faster to work with compressed data than with uncompressed data. What about memory? In the OHLC/Compression plot we can see that working with compressed data is nearly as fast as (and sometimes even faster than) working with uncompressed data. This is a very important point, because you are not only working "at the same speed" but also saving memory.
Regarding speed
Reading data from and writing data to disk are really slow operations. Processors are so fast by comparison that, in most cases, it is actually quicker to compress the data before writing it (so that less data goes through the slow operation) than to write it uncompressed, and the same applies when reading it back. As shown above, this is becoming true for memory too.
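To make that concrete, a quick back-of-envelope with the blosclz disk numbers from the tables above:
#Writing speed at clevel=0 vs clevel=1 (blosclz, disk, from the table)
uncompressed_speed = 41.86   #MiB/s, clevel=0
compressed_speed = 273.88    #MiB/s, clevel=1
print 'Compress-then-write is %.1fx faster' % (compressed_speed/uncompressed_speed)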
Regarding memory
When working with big data I have realized that RAM is a big constraint: you can't, or shouldn't, fit all the data in it. That is why BLZ is so useful; it allows you to work with big data on disk without actually trying to load it all into memory. And I have shown in previous tutorials that those containers are really fast.
Also, we are getting compression ratios of up to 34x; that means the same data takes 34 times less space on disk, which allows us to store even more of it!