Apache Avro as serialization of TF data

Apache Avro is a data serialization format used in Big Data use cases.

pip3 install avro-python3

It looks attractive from the specifications, but how will it perform?

As a simple test, we take the feature data for g_word_utf8. It is a map from the numbers 1 to 426584 to Hebrew word occurrences (Unicode strings).

In Text-Fabric we have a representation in plain text and a compressed, pickled representation.

Outcome

Text-Fabric is much faster in loading this kind of data: ~ 0.2 s from its binary form, versus ~ 3.2 s for Avro.

The Avro binary serialization (8.3 MB) is considerably bigger than the TF plain text representation (5.4 MB).

The gzipped Avro serialization and the gzipped, pickled TF serialization are both 3.2 MB.
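The gzipped, pickled TF representation mentioned above can be sketched roughly as follows. This is a minimal illustration of the idea, not Text-Fabric's actual code; the sample data is made up.

```python
import gzip
import pickle

# Toy stand-in for a TF feature: a map from node numbers to strings
data = {i: f'word{i}' for i in range(1, 1001)}

# Serialize: pickle first, then gzip at a low compression level
# (Text-Fabric uses level 2, trading some size for speed)
blob = gzip.compress(
  pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL),
  compresslevel=2,
)

# Deserialize: the round trip restores the integer-keyed dict exactly,
# no key conversion needed (unlike Avro maps, which require string keys)
restored = pickle.loads(gzip.decompress(blob))
assert restored == data
```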

Detailed comparison

name                 kind                    size    load time
g_word_utf8.tf       tf: plain unicode text  5.4 MB  1.6 s
g_word_utf8.tfx      tf: gzipped binary      3.2 MB  0.2 s
g_word_utf8.avdt     avro: binary            8.3 MB  3.2 s
g_word_utf8.avdt.gz  avro: gzipped binary    3.2 MB  4.7 s
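For reference, the relative sizes follow directly from the table (the numbers below are copied from the measurements above):

```python
# File sizes in MB, copied from the comparison table
sizes_mb = {
  'g_word_utf8.tf': 5.4,       # TF plain unicode text
  'g_word_utf8.tfx': 3.2,      # TF gzipped binary
  'g_word_utf8.avdt': 8.3,     # Avro binary
  'g_word_utf8.avdt.gz': 3.2,  # Avro gzipped binary
}

# Avro binary is roughly 1.5x the TF text representation
ratio_plain = sizes_mb['g_word_utf8.avdt'] / sizes_mb['g_word_utf8.tf']
# The gzipped variants come out the same size
ratio_gz = sizes_mb['g_word_utf8.avdt.gz'] / sizes_mb['g_word_utf8.tfx']
print(f'{ratio_plain:.2f}x, {ratio_gz:.2f}x')  # → 1.54x, 1.00x
```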

Conclusion

We see no reason to replace the TF feature data serialization by Avro.

In [1]:
import os
import gzip
import avro
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

from tf.fabric import Fabric

GZIP_LEVEL = 2 # same as used in Text-Fabric

Load from the textual data

In [3]:
VERSION = 'c'
BHSA = f'BHSA/tf/{VERSION}'
PARA = f'parallels/tf/{VERSION}'

TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])
api = TF.load('')
api.makeAvailableIn(globals())
This is Text-Fabric 5.5.22
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/
Tutorial      : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/annotation/text-fabric-data

117 features found and 0 ignored
  0.00s loading features ...
   |     1.60s T g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/c
  6.02s All features loaded/computed - for details use loadLog()

The load time is ~ 1.6 seconds.

But during this time, the textual data has been compiled and written to a binary form. Let's load again.
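This compile-once behaviour can be sketched as follows. It is a hypothetical simplification of what Text-Fabric does; load_feature and the toy file layout are illustrative only.

```python
import gzip
import os
import pickle
import tempfile

def load_feature(textPath, binPath):
  """Load a feature: use the binary cache if present, else compile it."""
  if os.path.exists(binPath):
    with gzip.open(binPath, 'rb') as f:
      return pickle.load(f)
  # first load: parse the textual form (here: one word per line, 1-based keys)
  with open(textPath, encoding='utf-8') as f:
    data = {i: line.rstrip('\n') for (i, line) in enumerate(f, start=1)}
  # write the compiled form so that subsequent loads are fast
  with gzip.open(binPath, 'wb', compresslevel=2) as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
  return data

tmp = tempfile.mkdtemp()
textPath = os.path.join(tmp, 'toy.tf')
binPath = os.path.join(tmp, 'toy.tfx')
with open(textPath, 'w', encoding='utf-8') as f:
  f.write('alpha\nbeta\ngamma\n')

first = load_feature(textPath, binPath)   # compiles and writes the cache
second = load_feature(textPath, binPath)  # loads the cache
assert first == second == {1: 'alpha', 2: 'beta', 3: 'gamma'}
```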

Load from binary data

In [4]:
TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])
api = TF.load('')
api.makeAvailableIn(globals())
This is Text-Fabric 5.5.22
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/
Tutorial      : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/annotation/text-fabric-data

117 features found and 0 ignored
  0.00s loading features ...
  5.28s All features loaded/computed - for details use loadLog()
In [5]:
loadLog()
   |     0.03s B otype                from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.55s B oslots               from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.01s B book                 from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.01s B chapter              from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.00s B verse                from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.18s B g_cons               from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.27s B g_cons_utf8          from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.18s B g_lex                from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.25s B g_lex_utf8           from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.17s B g_word               from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.21s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.12s B lex0                 from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.18s B lex_utf8             from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.00s B qere                 from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.00s B qere_trailer         from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.00s B qere_trailer_utf8    from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.00s B qere_utf8            from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.07s B trailer              from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.08s B trailer_utf8         from /Users/dirk/github/etcbc/BHSA/tf/c
   |     0.00s B __levels__           from otype, oslots, otext
   |     0.03s B __order__            from otype, oslots, __levels__
   |     0.03s B __rank__             from otype, __order__
   |     1.36s B __levUp__            from otype, oslots, __rank__
   |     1.12s B __levDown__          from otype, __levUp__, __rank__
   |     0.39s B __boundary__         from otype, oslots, __rank__
   |     0.01s B __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
   |     0.00s B book@<lang>          from /Users/dirk/github/etcbc/BHSA/tf/c  (26 per-language book-name features; names mangled in this export)

The load time of the feature g_word_utf8 is ~ 0.2 seconds.

Make an Avro feature data file

In [6]:
tempDir = os.path.expanduser('~/github/annotation/text-fabric/_temp/avro')
os.makedirs(tempDir, exist_ok=True)

The data is a map from integers to strings, but Avro map keys are always strings. We have to represent the integers as strings.

In [7]:
feature = 'g_word_utf8'
tfData =  TF.features[feature].data
print(len(tfData))
#data = {str(i): w for (i,w) in tfData.items() if i < 12}
data = {str(i): w for (i,w) in tfData.items()}
print(len(data))
print(data['2'])
426584
426584
רֵאשִׁ֖ית

We define a record schema, where each record consists of the name of a feature and its data. We will put only one record in each file, because we want to load features individually.

In [8]:
schemaJSON = '''
{
  "name": "tffeature",
  "type": "record",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    { "name": "data",
      "type": {
        "type": "map",
        "values": "string"
      }
    }
  ]
}
'''
schemaFile = 'tf.avsc'
schemaPath = f'{tempDir}/{schemaFile}'
with open(schemaPath, 'w') as sf:
  sf.write(schemaJSON)
In [9]:
with open(schemaPath) as sf:
  schema = avro.schema.Parse(sf.read())

We write the feature data to an Avro data file.

In [10]:
dataFile = f'{tempDir}/{feature}.avdt'
writer = DataFileWriter(open(dataFile, "wb"), DatumWriter(), schema)
In [11]:
indent(reset=True)
info('start writing')
writer.append({"name": feature, "data": data})
writer.close()
info('done')
  0.00s start writing
  2.55s done

We also make a gzipped version of the data file.

In [12]:
dataFileZ = f'{dataFile}.gz'
with open(dataFile, 'rb') as ah:
  avroData = ah.read()
with gzip.open(dataFileZ, 'wb', compresslevel=GZIP_LEVEL) as ah:
  ah.write(avroData)

Load from Avro binary

In [13]:
indent(reset=True)
info('start reading')
reader = DataFileReader(open(dataFile, "rb"), DatumReader())
for x in reader:
  print(f'{x["name"]} - {x["data"]["2"]}')
reader.close()
info('done')
  0.00s start reading
g_word_utf8 - רֵאשִׁ֖ית
  3.24s done

Load time ~ 3.2 seconds.

Load from Avro binary (gzipped)

In [14]:
indent(reset=True)
info('start reading')
reader = DataFileReader(gzip.open(dataFileZ, "rb"), DatumReader())
for x in reader:
  print(f'{x["name"]} - {x["data"]["2"]}')
reader.close()
info('done')
  0.00s start reading
g_word_utf8 - רֵאשִׁ֖ית
  4.72s done