In this tutorial we are going to look at the performance of the btable container using different compressors and compression levels. The only difference with "Comparing speedups when using compression with blz btables stored in disk" is that this one compares working in memory against working with cached disk.
Reading data from disk is a really slow operation. To improve loading times, the operating system keeps a copy of recently read data in memory, so every subsequent read of the same data is faster than the first one. This is what is called caching, and it is what I want to compare against working purely with in-memory BLZ containers. Cached disk is the normal scenario when working with data repeatedly, which is why the two are being compared.
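To see the cache in action, here is a minimal sketch that times two consecutive reads of the same h5 file used below; the second read is served from the page cache. (Note that if the file was touched recently, the first read may already be warm.)
from time import time

def timed_read(path):
    #Read the whole file in 1 MiB chunks and return the elapsed time
    t0 = time()
    with open(path, 'rb') as f:
        while f.read(2**20):
            pass
    return time() - t0

#The second read should be much faster, served from the page cache
print 'cold read: %.2f s' % timed_read('h5/msft-18-SEP-2013-blosc9.h5')
print 'warm read: %.2f s' % timed_read('h5/msft-18-SEP-2013-blosc9.h5')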
import pandas as pd
import numpy as np
import pylab as plt
import blz
import sys
import csv
from time import time
from shutil import rmtree
from collections import defaultdict
Let's create a btable on disk and a btable in memory.
t1 = time()
df = pd.read_hdf('h5/msft-18-SEP-2013-blosc9.h5', '/msft')
t2 = time()
print "Reading h5: " + str(t2 - t1)
dt = np.dtype([(k, v if v.kind!='O' else 'S49') for k,v in df.dtypes.iteritems()])
t1 = time()
btm = blz.fromiter((i[1:] for i in df.itertuples()), dtype=dt, count=len(df),
                   bparams=blz.bparams(clevel=0))
t2 = time()
print "Converting h5 to btable in memory: " + str(t2 - t1)
#If the blz directory already exists, remove it
rmtree('blz/msft-18-SEP-2013-blosc9.blz', ignore_errors=True)
t1 = time()
btd = btm.copy(rootdir='blz/msft-18-SEP-2013-blosc9.blz', bparams=blz.bparams(clevel=0))
t2 = time()
print "Copying btable from memory to disk: " + str(t2-t1)
Reading h5: 4.99775981903
Converting h5 to btable in memory: 7.79994988441
Copying btable from memory to disk: 10.7896511555
We will compute the Open-High-Low-Close (OHLC) points (as in the companion disk tutorial), for every compression level and for every compressor. This will be our "heavy" operation.
def get_OHLC_points(src):
    windows = []
    window_size = 79
    sample_size = 2455
    it = src.where('(Type == \'Trade\')')
    for i in xrange(sample_size):
        local_max = -np.inf
        local_min = np.inf
        average_price = 0
        for j in xrange(window_size):
            current = it.next()
            #The price field is at position 7 in the row
            price = current[7]
            average_price += price
            if price > local_max:
                local_max = price
            if price < local_min:
                local_min = price
            if j == 0:
                local_open = price
            if j == (window_size - 1):
                local_close = price
        windows.append([average_price/float(window_size), local_open,
                        local_max, local_min, local_close])
    return windows
Now let's define the copy function.
def copy(src, memory, clevel=5, shuffle=True, cname="blosclz"):
    """
    Parameters
    ----------
    clevel : int (0 <= clevel < 10)
        The compression level.
    shuffle : bool
        Whether the shuffle filter is active or not.
    cname : string ('blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', others?)
        Select the compressor to use inside Blosc.
    """
    if memory:
        copied = src.copy(bparams=blz.bparams(clevel=clevel, shuffle=shuffle, cname=cname))
    else:
        copied = src.copy(rootdir='blz/temp.blz',
                          bparams=blz.bparams(clevel=clevel, shuffle=shuffle, cname=cname))
        copied.flush()
    return copied
And now we just need a benchmark function.
def benchmark(data, cmethods, memory):
    if memory:
        base_name = 'csv/bbtablem_'
    else:
        base_name = 'csv/bbtable_'
    for method in cmethods:
        myfile = open(base_name + method + '.csv', 'wb')
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        wr.writerow(['Compression level',
                     'Compressed size (bytes)',
                     'Compression ratio',
                     'Compression time (s)',
                     'OHLC time (s)',
                     'Writing speed (MiB/s)',
                     'OHLC speed (MiB/s)'])
        for compression_level in xrange(0, 10):
            #Reassign data so only one copy is kept in memory
            tc1 = time()
            data = copy(data, memory, compression_level, True, method)
            tc2 = time()
            #Now get the points and measure the time
            t1 = time()
            get_OHLC_points(data)
            t2 = time()
            #Uncompressed size in MiB
            uncompressed = data.nbytes/(2**20)
            #Store the results
            row = [compression_level, data.cbytes,
                   round(data.nbytes/float(data.cbytes), 3),
                   str(tc2 - tc1),
                   str(t2 - t1),
                   uncompressed/(tc2 - tc1),
                   uncompressed/(t2 - t1)]
            #Add it to the csv
            wr.writerow(row)
            rmtree('blz/temp.blz', ignore_errors=True)
        myfile.close()
We have all the ingredients now; let's do some benchmarking.
cmethods = ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
t1 = time()
benchmark(btd, cmethods, False)
t2 = time()
t3 = time()
benchmark(btm, cmethods, True)
t4 = time()
print 'Cached disk: ' + str(t2-t1)
print 'Memory only: ' + str(t4-t3)
Cached disk: 748.564900875
Memory only: 564.818388939
The whole benchmark ran about 1.3x faster in memory (564.8 s) than against cached disk (748.6 s). We can look at all the gathered data now.
print 'blosclz data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_blosclz.csv')
df
blosclz data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.681122 | 2.263231 | 675.356186 | 203.249245 |
1 | 1 | 62562582 | 7.716 | 0.804646 | 2.227441 | 571.679958 | 206.515003 |
2 | 2 | 60615376 | 7.964 | 0.828417 | 2.152618 | 555.276023 | 213.693286 |
3 | 3 | 60114378 | 8.030 | 0.841646 | 2.186049 | 546.548102 | 210.425294 |
4 | 4 | 28904306 | 16.701 | 0.960289 | 2.399268 | 479.022460 | 191.725150 |
5 | 5 | 27821399 | 17.351 | 1.134386 | 2.458255 | 405.505687 | 187.124603 |
6 | 6 | 27450920 | 17.585 | 1.142648 | 2.458005 | 402.573677 | 187.143643 |
7 | 7 | 27136857 | 17.789 | 1.140152 | 2.495144 | 403.454902 | 184.358105 |
8 | 8 | 27091198 | 17.819 | 1.157194 | 2.430481 | 397.513330 | 189.262952 |
9 | 9 | 27033395 | 17.857 | 1.172375 | 2.481087 | 392.365936 | 185.402592 |
10 rows × 7 columns
print 'blosclz data'
print 'Disk'
df = pd.read_csv('csv/bbtable_blosclz.csv')
df
blosclz data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 10.990172 | 3.124110 | 41.855578 | 147.241935 |
1 | 1 | 62562582 | 7.716 | 1.679541 | 2.767619 | 273.884374 | 166.207841 |
2 | 2 | 60615376 | 7.964 | 1.737121 | 2.741565 | 264.805984 | 167.787378 |
3 | 3 | 60114378 | 8.030 | 1.763058 | 2.736637 | 260.910313 | 168.089513 |
4 | 4 | 28904306 | 16.701 | 1.857233 | 2.958073 | 247.680279 | 155.506635 |
5 | 5 | 27821399 | 17.351 | 1.860526 | 2.963732 | 247.241898 | 155.209715 |
6 | 6 | 27450920 | 17.585 | 1.860842 | 2.876680 | 247.199925 | 159.906551 |
7 | 7 | 27136857 | 17.789 | 1.854447 | 2.967175 | 248.052367 | 155.029615 |
8 | 8 | 27091198 | 17.819 | 1.871443 | 3.153615 | 245.799627 | 145.864349 |
9 | 9 | 27033395 | 17.857 | 1.910575 | 2.970001 | 240.765226 | 154.882104 |
10 rows × 7 columns
print 'lz4 data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_lz4.csv')
df
lz4 data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.709622 | 2.190826 | 648.232525 | 209.966452 |
1 | 1 | 31210120 | 15.467 | 0.699199 | 2.303638 | 657.895714 | 199.684153 |
2 | 2 | 31210120 | 15.467 | 0.757916 | 2.341326 | 606.927438 | 196.469864 |
3 | 3 | 31210120 | 15.467 | 0.761211 | 2.227152 | 604.300129 | 206.541820 |
4 | 4 | 31210120 | 15.467 | 0.766530 | 2.341871 | 600.106947 | 196.424139 |
5 | 5 | 31210120 | 15.467 | 0.759634 | 2.428854 | 605.554766 | 189.389730 |
6 | 6 | 31210120 | 15.467 | 0.770773 | 2.244157 | 596.803520 | 204.976739 |
7 | 7 | 31210120 | 15.467 | 0.766620 | 2.323864 | 600.036399 | 197.946181 |
8 | 8 | 31210120 | 15.467 | 0.764220 | 2.250569 | 601.920913 | 204.392768 |
9 | 9 | 31210120 | 15.467 | 0.762074 | 2.331698 | 603.615927 | 197.281108 |
10 rows × 7 columns
print 'lz4 data'
print 'Disk'
df = pd.read_csv('csv/bbtable_lz4.csv')
df
lz4 data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 9.310785 | 2.775670 | 49.405071 | 165.725750 |
1 | 1 | 31210120 | 15.467 | 1.642699 | 2.819540 | 280.026955 | 163.147179 |
2 | 2 | 31210120 | 15.467 | 1.629550 | 2.790243 | 282.286524 | 164.860184 |
3 | 3 | 31210120 | 15.467 | 1.624146 | 2.801503 | 283.225772 | 164.197579 |
4 | 4 | 31210120 | 15.467 | 1.620733 | 2.847433 | 283.822193 | 161.549011 |
5 | 5 | 31210120 | 15.467 | 1.620822 | 2.820628 | 283.806578 | 163.084253 |
6 | 6 | 31210120 | 15.467 | 1.630892 | 2.913791 | 282.054231 | 157.869940 |
7 | 7 | 31210120 | 15.467 | 1.651789 | 2.862687 | 278.485941 | 160.688186 |
8 | 8 | 31210120 | 15.467 | 1.661538 | 2.805433 | 276.851948 | 163.967557 |
9 | 9 | 31210120 | 15.467 | 1.624195 | 2.863090 | 283.217207 | 160.665572 |
10 rows × 7 columns
print 'lz4hc data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_lz4hc.csv')
df
lz4hc data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.601254 | 2.203485 | 765.067693 | 208.760213 |
1 | 1 | 27466793 | 17.575 | 3.476439 | 2.412721 | 132.319307 | 190.656116 |
2 | 2 | 23961748 | 20.146 | 4.512663 | 2.281846 | 101.935373 | 201.591164 |
3 | 3 | 21717259 | 22.228 | 6.930293 | 2.223672 | 66.375262 | 206.865027 |
4 | 4 | 20475242 | 23.576 | 13.367165 | 2.200643 | 34.412683 | 209.029809 |
5 | 5 | 19908867 | 24.247 | 28.572774 | 2.280474 | 16.099242 | 201.712435 |
6 | 6 | 19752345 | 24.439 | 49.155402 | 2.253255 | 9.358076 | 204.149119 |
7 | 7 | 19711733 | 24.489 | 56.906914 | 2.264541 | 8.083376 | 203.131680 |
8 | 8 | 19710950 | 24.490 | 57.256365 | 2.243897 | 8.034041 | 205.000479 |
9 | 9 | 19710833 | 24.491 | 57.443949 | 2.261611 | 8.007806 | 203.394817 |
10 rows × 7 columns
print 'lz4hc data'
print 'Disk'
df = pd.read_csv('csv/bbtable_lz4hc.csv')
df
lz4hc data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 10.022066 | 3.445173 | 45.898720 | 133.520154 |
1 | 1 | 27466793 | 17.575 | 4.448363 | 2.818698 | 103.408825 | 163.195919 |
2 | 2 | 23961748 | 20.146 | 5.389405 | 2.779829 | 85.352650 | 165.477803 |
3 | 3 | 21717259 | 22.228 | 8.024240 | 2.839225 | 57.326301 | 162.016040 |
4 | 4 | 20475242 | 23.576 | 14.552245 | 2.809012 | 31.610243 | 163.758649 |
5 | 5 | 19908867 | 24.247 | 29.689302 | 2.799546 | 15.493796 | 164.312356 |
6 | 6 | 19752345 | 24.439 | 50.277322 | 2.786857 | 9.149254 | 165.060503 |
7 | 7 | 19711733 | 24.489 | 57.907075 | 2.704130 | 7.943762 | 170.110154 |
8 | 8 | 19710950 | 24.490 | 58.364596 | 2.765320 | 7.881490 | 166.346039 |
9 | 9 | 19710833 | 24.491 | 58.455211 | 2.771683 | 7.869273 | 165.964147 |
10 rows × 7 columns
print 'snappy data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_snappy.csv')
df
snappy data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.604864 | 2.201925 | 760.501383 | 208.908111 |
1 | 1 | 410696358 | 1.175 | 0.851744 | 2.334054 | 540.068418 | 197.081988 |
2 | 2 | 410696358 | 1.175 | 0.853806 | 2.296802 | 538.764063 | 200.278470 |
3 | 3 | 410696358 | 1.175 | 0.862138 | 2.373132 | 533.557252 | 193.836669 |
4 | 4 | 410696358 | 1.175 | 0.869969 | 2.308243 | 528.754394 | 199.285774 |
5 | 5 | 410696358 | 1.175 | 0.869898 | 2.290146 | 528.797580 | 200.860547 |
6 | 6 | 410696358 | 1.175 | 0.860055 | 2.290641 | 534.849534 | 200.817166 |
7 | 7 | 410696358 | 1.175 | 0.872584 | 2.316078 | 527.169958 | 198.611623 |
8 | 8 | 410696358 | 1.175 | 0.865230 | 2.266900 | 531.650492 | 202.920282 |
9 | 9 | 410696358 | 1.175 | 0.865937 | 2.305180 | 531.216478 | 199.550571 |
10 rows × 7 columns
print 'snappy data'
print 'Disk'
df = pd.read_csv('csv/bbtable_snappy.csv')
df
snappy data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 10.879534 | 5.507512 | 42.281223 | 83.522286 |
1 | 1 | 410696358 | 1.175 | 7.985037 | 2.960860 | 57.607749 | 155.360266 |
2 | 2 | 410696358 | 1.175 | 11.170788 | 2.792554 | 41.178832 | 164.723768 |
3 | 3 | 410696358 | 1.175 | 11.086721 | 2.931666 | 41.491077 | 156.907375 |
4 | 4 | 410696358 | 1.175 | 10.727255 | 2.943897 | 42.881427 | 156.255466 |
5 | 5 | 410696358 | 1.175 | 7.165172 | 2.955973 | 64.199437 | 155.617123 |
6 | 6 | 410696358 | 1.175 | 7.418792 | 3.115795 | 62.004704 | 147.634867 |
7 | 7 | 410696358 | 1.175 | 7.775094 | 3.132531 | 59.163272 | 146.846116 |
8 | 8 | 410696358 | 1.175 | 7.291722 | 2.939524 | 63.085237 | 156.487925 |
9 | 9 | 410696358 | 1.175 | 6.927933 | 2.878196 | 66.397871 | 159.822333 |
10 rows × 7 columns
print 'zlib data'
print 'Memory'
df = pd.read_csv('csv/bbtablem_zlib.csv')
df
zlib data
Memory
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 0.662771 | 2.162894 | 694.055729 | 212.678013 |
1 | 1 | 21562571 | 22.387 | 3.683232 | 3.536478 | 124.890311 | 130.072913 |
2 | 2 | 19985468 | 24.154 | 4.939962 | 3.484861 | 93.118127 | 131.999530 |
3 | 3 | 19364324 | 24.929 | 5.524262 | 3.473563 | 83.269038 | 132.428865 |
4 | 4 | 17011833 | 28.376 | 8.573870 | 3.662094 | 53.651385 | 125.611198 |
5 | 5 | 16773528 | 28.779 | 9.709417 | 3.655160 | 47.376687 | 125.849486 |
6 | 6 | 15251112 | 31.652 | 12.199962 | 3.398706 | 37.705035 | 135.345630 |
7 | 7 | 15156305 | 31.850 | 14.717500 | 3.480614 | 31.255308 | 132.160592 |
8 | 8 | 14295434 | 33.768 | 29.362271 | 3.468930 | 15.666363 | 132.605731 |
9 | 9 | 14114077 | 34.202 | 45.618963 | 3.359124 | 10.083526 | 136.940457 |
10 rows × 7 columns
print 'zlib data'
print 'Disk'
df = pd.read_csv('csv/bbtable_zlib.csv')
df
zlib data
Disk
 | Compression level | Compressed size (bytes) | Compression ratio | Compression time (s) | OHLC time (s) | Writing speed (MiB/s) | OHLC speed (MiB/s) |
---|---|---|---|---|---|---|---|
0 | 0 | 485063744 | 0.995 | 9.947921 | 3.326485 | 46.240817 | 138.284098 |
1 | 1 | 21562571 | 22.387 | 4.717685 | 4.090439 | 97.505446 | 112.457371 |
2 | 2 | 19985468 | 24.154 | 4.840078 | 4.022526 | 95.039793 | 114.356003 |
3 | 3 | 19364324 | 24.929 | 5.575667 | 3.940279 | 82.501342 | 116.743002 |
4 | 4 | 17011833 | 28.376 | 8.509122 | 4.047293 | 54.059632 | 113.656216 |
5 | 5 | 16773528 | 28.779 | 9.763632 | 4.058696 | 47.113615 | 113.336894 |
6 | 6 | 15251112 | 31.652 | 12.165651 | 3.965459 | 37.811375 | 116.001708 |
7 | 7 | 15156305 | 31.850 | 14.992578 | 3.990026 | 30.681848 | 115.287469 |
8 | 8 | 14295434 | 33.768 | 29.675020 | 3.967705 | 15.501253 | 115.936038 |
9 | 9 | 14114077 | 34.202 | 45.902231 | 4.076327 | 10.021299 | 112.846692 |
10 rows × 7 columns
Let's do some more plotting. For that I first need to turn this data into dictionaries. I will create dictionaries for the disk benchmark too, so we can compare the two easily.
def get_dict(filename):
    columns = defaultdict(list)
    with open(filename) as f:
        reader = csv.reader(f)
        reader.next()  #skip the header row
        for row in reader:
            for (i, v) in enumerate(row):
                if i == 1:
                    #Column 1 is the compressed size in bytes; store it scaled down (by 2**17)
                    columns[i].append(float(v)/131072)
                    continue
                columns[i].append(v)
    return columns
#In memory csv
blosclzm = get_dict('csv/bbtablem_blosclz.csv')
lz4m = get_dict('csv/bbtablem_lz4.csv')
lz4hcm = get_dict('csv/bbtablem_lz4hc.csv')
snappym = get_dict('csv/bbtablem_snappy.csv')
zlibm = get_dict('csv/bbtablem_zlib.csv')
#In disk
blosclz = get_dict('csv/bbtable_blosclz.csv')
lz4 = get_dict('csv/bbtable_lz4.csv')
lz4hc = get_dict('csv/bbtable_lz4hc.csv')
snappy = get_dict('csv/bbtable_snappy.csv')
zlib = get_dict('csv/bbtable_zlib.csv')
Now we are ready to plot.
%matplotlib inline
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_xlabel('Compression ratio')
ax.set_ylabel('Compression level')
#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[2][1:], blosclzm[0][1:], 'bo-')
ax1.plot(blosclz[2][1:], blosclz[0][1:], 'ro-')
#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[2][1:], lz4m[0][1:], 'bo-')
ax2.plot(lz4[2][1:], lz4[0][1:], 'ro-')
#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[2][1:], lz4hcm[0][1:], 'bo-')
ax3.plot(lz4hc[2][1:], lz4hc[0][1:], 'ro-')
#snappy
ax4.set_title('snappy')
ax4.plot(snappym[2][1:], snappym[0][1:], 'bo-')
ax4.plot(snappy[2][1:], snappy[0][1:], 'ro-')
#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[2][1:], zlibm[0][1:], 'bo-', label='Memory')
ax5.plot(zlib[2][1:], zlib[0][1:], 'ro-', label='Disk')
#Legend
ax5.legend(bbox_to_anchor=(2, 1))
plt.show()
Above we can see that both methods (disk and memory) get the same compression ratio. This was expected, because the ratio depends only on the compressor; I include this plot to show that everything works the same regardless of where the data is stored.
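As a quick sanity check, here is a minimal sketch reusing the dictionaries built above: the compression ratio columns should be identical between the memory and disk runs.
#The compression ratio column (index 2) should match exactly
assert blosclzm[2] == blosclz[2]
assert lz4m[2] == lz4[2]
assert zlibm[2] == zlib[2]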
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('Compression time (s)')
ax.set_xlabel('Compression ratio')
#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[2][1:], blosclzm[3][1:], 'bo-')
ax1.plot(blosclz[2][1:], blosclz[3][1:], 'ro-')
#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[2][1:], lz4m[3][1:], 'bo-')
ax2.plot(lz4[2][1:], lz4[3][1:], 'ro-')
#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[2][1:], lz4hcm[3][1:], 'bo-')
ax3.plot(lz4hc[2][1:], lz4hc[3][1:], 'ro-')
#snappy
ax4.set_title('snappy')
ax4.plot(snappym[2][1:], snappym[3][1:], 'bo-')
ax4.plot(snappy[2][1:], snappy[3][1:], 'ro-')
#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[2][1:], zlibm[3][1:], 'bo-', label='Memory')
ax5.plot(zlib[2][1:], zlib[3][1:], 'ro-', label='Disk')
#Legend
ax5.legend(bbox_to_anchor=(2, 1))
plt.show()
Here we can see that working with the dataset in memory is faster than going through disk. Because of the larger y-axis scale for zlib and lz4hc, the difference is harder to appreciate for those compressors.
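To quantify the gap, here is a minimal sketch (again reusing the dictionaries above) that averages the ratio between the disk and memory compression times per compressor:
#How much longer does the disk copy take, on average, per compressor?
for name, mem, disk in [('blosclz', blosclzm, blosclz),
                        ('lz4', lz4m, lz4),
                        ('snappy', snappym, snappy)]:
    ratios = [float(d)/float(m) for m, d in zip(mem[3][1:], disk[3][1:])]
    print '%s: disk takes %.1fx longer' % (name, sum(ratios)/len(ratios))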
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('OHLC time (s)')
ax.set_xlabel('Compression level')
plt.suptitle('OHLC/Compression')
#Average OHLC time over all compressors for the uncompressed data
#(Y: disk, red line; Ym: memory, blue line)
Y = (float(blosclz[4][0]) + float(lz4[4][0]) + float(lz4hc[4][0]) + float(snappy[4][0]) + float(zlib[4][0]))/len(cmethods)
Ym = (float(blosclzm[4][0]) + float(lz4m[4][0]) + float(lz4hcm[4][0]) + float(snappym[4][0]) + float(zlibm[4][0]))/len(cmethods)
#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[0][1:], blosclzm[4][1:], 'bo-')
ax1.axhline(Ym, color = 'b')
ax1.plot(blosclz[0][1:], blosclz[4][1:], 'ro-')
ax1.axhline(Y, color = 'r')
ax1.set_ylim(ymin=2, ymax=4.5)
#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[0][1:], lz4m[4][1:], 'bo-')
ax2.axhline(Ym, color = 'b')
ax2.plot(lz4[0][1:], lz4[4][1:], 'ro-')
ax2.axhline(Y, color = 'r')
ax2.set_ylim(ymin=2, ymax=4.5)
#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[0][1:], lz4hcm[4][1:], 'bo-')
ax3.axhline(Ym, color = 'b')
ax3.plot(lz4hc[0][1:], lz4hc[4][1:], 'ro-')
ax3.axhline(Y, color = 'r')
ax3.set_ylim(ymin=2, ymax=4.5)
#snappy
ax4.set_title('snappy')
ax4.plot(snappym[0][1:], snappym[4][1:], 'bo-')
ax4.axhline(Ym, color = 'b')
ax4.plot(snappy[0][1:], snappy[4][1:], 'ro-')
ax4.axhline(Y, color = 'r')
ax4.set_ylim(ymin=2, ymax=4.5)
#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[0][1:], zlibm[4][1:], 'bo-', label='Memory')
ax5.axhline(Ym, color = 'b')
ax5.plot(zlib[0][1:], zlib[4][1:], 'ro-', label='Disk')
ax5.axhline(Y, color = 'r')
ax5.set_ylim(ymin=2, ymax=4.5)
#Legend
ax5.legend(bbox_to_anchor=(2, 1))
plt.show()
This may be one of the most interesting plots. The two horizontal lines represent the average OHLC time over the uncompressed dataset (blue for memory, red for disk); any compressed configuration (represented as dots) that falls below its line can be processed faster than the uncompressed dataset.
There are two important points here, which are developed in the conclusions below ("Regarding speed" and "Regarding memory").
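To read the plot programmatically, here is a minimal sketch (reusing the dictionaries and the Y/Ym baselines computed above) that lists the compression levels whose OHLC time falls below the uncompressed average:
def levels_below_baseline(columns, baseline):
    #Column 0 holds the compression level, column 4 the OHLC time
    return [int(l) for l, t in zip(columns[0][1:], columns[4][1:])
            if float(t) < baseline]

print 'blosclz (memory):', levels_below_baseline(blosclzm, Ym)
print 'blosclz (disk):', levels_below_baseline(blosclz, Y)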
Let's see how many MiB/s we get.
#Matplotlib magic
fig = plt.figure(num=None, figsize=(18, 9), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
plt.subplots_adjust(wspace=0.6, hspace=0.6)
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.spines['right'].set_color('none')
ax.tick_params(labelcolor='w', top='off', bottom='off', left='off', right='off')
ax.set_ylabel('MiB/s')
ax.set_xlabel('Compression level')
#blosclz
ax1.set_title('blosclz')
ax1.plot(blosclzm[0][1:], blosclzm[5][1:], 'bo-')
ax1.plot(blosclzm[0][1:], blosclzm[6][1:], 'go-')
ax1.plot(blosclz[0][1:], blosclz[5][1:], 'ro-')
ax1.plot(blosclz[0][1:], blosclz[6][1:], 'mo-')
#lz4
ax2.set_title('lz4')
ax2.plot(lz4m[0][1:], lz4m[5][1:], 'bo-')
ax2.plot(lz4m[0][1:], lz4m[6][1:], 'go-')
ax2.plot(lz4[0][1:], lz4[5][1:], 'ro-')
ax2.plot(lz4[0][1:], lz4[6][1:], 'mo-')
#lz4hc
ax3.set_title('lz4hc')
ax3.plot(lz4hcm[0][1:], lz4hcm[5][1:], 'bo-')
ax3.plot(lz4hcm[0][1:], lz4hcm[6][1:], 'go-')
ax3.plot(lz4hc[0][1:], lz4hc[5][1:], 'ro-')
ax3.plot(lz4hc[0][1:], lz4hc[6][1:], 'mo-')
#snappy
ax4.set_title('snappy')
ax4.plot(snappym[0][1:], snappym[5][1:], 'bo-')
ax4.plot(snappym[0][1:], snappym[6][1:], 'go-')
ax4.plot(snappy[0][1:], snappy[5][1:], 'ro-')
ax4.plot(snappy[0][1:], snappy[6][1:], 'mo-')
#zlib
ax5.set_title('zlib')
ax5.plot(zlibm[0][1:], zlibm[5][1:], 'bo-', label='Compression speed (Memory)')
ax5.plot(zlibm[0][1:], zlibm[6][1:], 'go-', label='Computational speed (Memory)')
ax5.plot(zlib[0][1:], zlib[5][1:], 'ro-', label='Compression speed (Disk)')
ax5.plot(zlib[0][1:], zlib[6][1:], 'mo-', label='Computational speed (Disk)')
#Legend
ax5.legend(bbox_to_anchor=(2.7, 1))
plt.show()
This is the least intuitive plot, in my opinion. Higher lines mean better speed, and we can see how working with the datasets in memory is faster than their disk counterparts.
In general we have seen that, on disk, it is faster to work with compressed data than with uncompressed data. What about memory? In the OHLC/Compression plot we can see that working with compressed data is nearly as fast as (and sometimes even faster than) working with uncompressed data. This is a very important point, because you are not only working "at the same speed" but also saving memory.
Regarding speed
Reading data from and writing data to disk are really slow operations. Processors are so fast by comparison that, in most cases, it is actually quicker to compress the data before writing it (so that less data goes through the slow operation) than to write it uncompressed, and the same applies when reading it back. As shown above, this is becoming true for memory too.
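To make that concrete, a quick back-of-envelope with the blosclz disk numbers from the tables above:
#Writing speed at clevel=0 vs clevel=1 (blosclz, disk, from the table)
uncompressed_speed = 41.86   #MiB/s, clevel=0
compressed_speed = 273.88    #MiB/s, clevel=1
print 'Compress-then-write is %.1fx faster' % (compressed_speed/uncompressed_speed)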
Regarding memory
When working with big data I have realized that RAM is a big constraint: you can't, or shouldn't, fit all the data in it. That is why BLZ is so useful; it allows you to work with big data on disk without actually trying to load it all into memory. And I have shown in previous tutorials that those containers are really fast.
Also, we are getting compression ratios of up to 34x; that means the same data takes 34 times less space on disk, which allows us to store even more of it!