(C) 2016 Steve Phelps
In this section we will use Python's struct.Struct class and the happybase API.
When building scalable systems, we need to parallelize our computations.
This requires multiple processors (and/or multiple cores).
How do we distribute the data to each processor?
| Scheme | Description |
|---|---|
| Shared memory (SM) | Multiple processors share a common central memory. |
| Shared disk (SD) | Multiple processors with private memory share a common collection of disks. |
| Shared nothing (SN) | Neither memory nor peripheral storage is shared among processors. |
\cite{Stonebraker1986}
One approach is to manually partition the data, storing different subsets on different RDBMS servers.
We could use, e.g., a hash function to determine which server a piece of data should reside on, as sketched below.
Each of these subsets is called a shard.
With a traditional relational database, sharding is very costly in terms of duplicated resources and the complexity of the configuration.
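As an illustration of this manual approach (not how HBase itself works), here is a minimal sketch of hash-based sharding; the server names and key are hypothetical:

import hashlib

servers = ['db-server-0', 'db-server-1', 'db-server-2']   # hypothetical shard servers

def shard_for(key):
    # Hash the key and map the digest onto one of the available servers.
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return servers[int(digest, 16) % len(servers)]

shard_for('user-42')   # e.g. 'db-server-2'

Note that adding or removing a server changes the mapping for most keys, which is one reason manual sharding of an RDBMS is costly to operate.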
HBase provides automatic sharding, with minimal duplication of state.
Tables are partitioned horizontally into different regions, based on ranges of row keys.
Different region servers are responsible for one or more regions.
The HBase Master process performs coordination and load balancing across the region servers.
In a production configuration HBase stores its underlying data on a file-system called HDFS.
HDFS uses a cluster of computers to simulate a single file-system.
It provides fault-tolerance through replication, together with high aggregate throughput.
It is based on the philosophy that moving computation is often cheaper than moving data.
In a relational database, missing data is represented by NULL values.
Data containing many NULL values is called sparse data.
In a column-oriented database, NULL values can be represented by the absence of a mapping.
We can define an ordering over arbitrary binary data.
For example, the following Python code determines whether $x < y$
according to lexicographic ordering:
def lexicographic_le(x, y):
    # Compare element-wise; the first differing element decides the ordering.
    for a, b in zip(x, y):
        if a != b:
            return a < b
    # All compared elements are equal: x < y only if x is a proper prefix of y.
    return len(x) < len(y)
lexicographic_le('steve', 'smith')
False
import numpy as np
x = np.array([15, 9, 5, 16, 4])
y = np.array([15, 16, 1, 18, 1])
lexicographic_le(x, y)
True
There is no automatic enforcement of referential integrity in a column-oriented database.
If we want to retrieve the value associated with a particular key from another table, then
that will be a fast operation.
In general, however, the set-theoretic operations available in SQL can be very expensive.
Joins are not supported; we "pre-join" the data through de-normalisation.
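As a hypothetical sketch of pre-joining: instead of keeping, say, users and messages in separate tables and joining on a user id at query time, the attributes a query needs are copied into every message row.

# Normalised form (two tables, joined on user_id at query time):
users = {'u1': {'name': 'Alice'}}
messages = {'m1': {'user_id': 'u1', 'text': 'hello'}}

# De-normalised ("pre-joined") form: the user's name is duplicated into each
# message row, so reading a message never requires a second lookup.
messages_denormalised = {
    'm1': {'user:name': 'Alice', 'msg:text': 'hello'}
}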
Row keys and column keys can consist of arbitrary binary data.
Often we will use strings (which HBase will see as an array of 8-bit ASCII values).
Column-keys containing the character ":" specify membership of a column-family:
<column family name>:<column qualifier>
The column family name must consist of printable text.
For example, the triangle attribute from the earlier example would be written as:
shape:triangle
The fully-qualified key is stored on disk.
Therefore, the name of a column-family should generally be short, in order to conserve storage capacity and I/O throughput.
data = \
{
'first':
{
'color:red': 0xf00, 'color:blue': 0x00f,
'color:yellow': 0xff0, 'shape:square': 4
},
'second':
{
'shape:square': 4, 'shape:triangle': 3
}
}
data['second']
{'shape:square': 4, 'shape:triangle': 3}
data['second']['shape:triangle']
3
BigTable was originally designed to house web data (it originated from Google).
It is very common to use a URL as a row-key.
We would typically reverse the domain-name components of the URL before using it as a key.
For example, keats.kcl.ac.uk becomes uk.ac.kcl.keats.
Each cell value in HBase is stored with a timestamp. The current Unix time in seconds can be obtained from the Unix shell with:
date +%s
There are four core CRUD operations which we can perform on tables: Get, Put, Scan and Delete.
We execute them either interactively from the HBase shell, or programmatically through an API.
On a big project, it is typically more practical to use the API than the shell.
Get returns attributes for a specified row-key and column(s).
It can be used in several ways; by default it will retrieve up to three versions of the data.
Put can be used to insert new rows, or update existing rows.
We always specify a row key.
Where there is no existing value at a cell, new data is created.
If there is existing data, it is retained under the old time stamp.
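As a sketch of this behaviour using the happybase API introduced later (assuming table is a happybase Table whose column-family was created with max_versions greater than one):

# Two puts to the same cell: the newer value does not destroy the older one.
table.put('first', {'color:red': b'\x00'})
table.put('first', {'color:red': b'\x01'})

# Retrieve up to two stored versions of the cell, newest first.
table.cells('first', 'color:red', versions=2, include_timestamp=True)
# e.g. [(b'\x01', 1455633600002), (b'\x00', 1455633600001)]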
Scan is used to retrieve a range of rows in one operation.
In its simplest form it will iterate over all rows in the table.
We can also specify a minimum and maximum row-key.
The ordering is given by the lexicographic ordering of the row-key itself.
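For example, using the happybase API introduced below, a range scan might look like the following sketch (the row keys are hypothetical; in happybase the row_stop key is exclusive):

# Iterate over all rows with 'apple' <= row key < 'banana'.
for row_key, data in table.scan(row_start=b'apple', row_stop=b'banana'):
    print(row_key, data)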
There is no equivalent of SQL's ORDER BY clause.
In a column-oriented database we need to think carefully about the structure of the table with respect to the kinds of queries we want to support.
We can choose a tall-narrow design as opposed to a flat-wide design.
A tall-narrow design is more readily shard-able.
A flat-wide design can be transformed into a tall-narrow design by storing additional data values in the row-key.
These attributes can be retrieved by scanning with a partial-key.
Because keys are sorted lexicographically, we can index several attributes simultaneously through concatenation.
For example, suppose we use row-keys with the following format:
<userId>-<date>-<messageId>-<attachmentId>
We can then scan the table by specifying a partial-key.
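As a sketch, we could build such composite keys by concatenation and then retrieve all messages for one user with a partial-key scan; the field values below are hypothetical:

def message_key(user_id, date, message_id, attachment_id):
    # Concatenate the attributes into a single row-key.
    return '-'.join([user_id, date, message_id, attachment_id])

key = message_key('user42', '2016-02-16', 'msg001', 'att01')
# key == 'user42-2016-02-16-msg001-att01'

# All rows for user42 could then be fetched with a prefix scan, e.g. using happybase:
# messages_table.scan(row_prefix=b'user42-')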
We can perform create, read, update and delete (CRUD) operations from the HBase shell.
To start the HBase shell type a command similar to the following from the Unix shell:
hbase shell
Once in the shell, we can check the status of the cluster by typing status.
HBase does not have any concept of data types.
All data is treated as binary data.
We can think of binary data as a sequence of bytes.
If we use utf8 or 8-bit ASCII, then we can represent binary data as a Python string.
Byte values which do not have an associated ASCII character are shown as hexadecimal values.
It is very easy to convert between hexadecimal and binary.
chr(97)
'a'
ord('a')
97
chr(0)
'\x00'
Such byte values are displayed using the notation \x<value>.
ord('\x00')
0
ord('\x10')
16
We can represent any 8-bit value as a character in Python.
We can represent arbitrary binary data as an ordered sequence of characters.
This is simply a string.
For example, the sequence of bytes 0x0010
can be represented:
binary_data = '\x00\x10'
binary_data
'\x00\x10'
binary_data[0]
'\x00'
'alpha' < 'beta'
True
'smith' < 'steve'
True
x = '\x0f\x09\x05\x10\x04'
y = '\x0f\x10\x01\x12\x01'
type(x)
str
type(y)
str
x < y
True
Binary data can be considered as a sequence of bits.
Therefore it can be considered as a sequence of bytes.
Therefore binary data can be represented as a string.
Therefore binary data can be compared and sorted.
Therefore we can order binary data and store it in, e.g. a tree.
Therefore binary data can be indexed.
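As a small illustration of this chain of reasoning, we can keep binary keys in sorted (lexicographic) order and locate any key with a binary search, which is essentially what an ordered index does:

import bisect

# Binary row-keys kept in sorted order.
keys = sorted([b'\x0f\x10', b'\x00\x01', b'\x0f\x09', b'\xff\x00'])

# Locate a key (or the start of a range of keys) in O(log n).
bisect.bisect_left(keys, b'\x0f\x09')
1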
All values in Python are ultimately stored as binary data.
This is usually hidden from the programmer.
When working with HBase we will often need to deal explicitly with the binary representation.
This is sometimes called serialization.
We can use Python's struct module, and in particular its Struct class, to do this.
import struct
import binascii
data = 3.141592653589793
float64s = struct.Struct('d')
pi_in_binary = float64s.pack(data)
pi_in_binary
b'\x18-DT\xfb!\t@'
Because some of the bytes happen to correspond to printable characters, the string above is hard to interpret.
When dealing with binary data we often want to see the sequence of bits, or, more conveniently, the sequence of bytes represented in hexadecimal.
The hexlify()
function in the binascii
module will do this:
binascii.hexlify(pi_in_binary)
b'182d4454fb210940'
import struct
import binascii
data = 5
integer64s = struct.Struct('l')
binary_data = integer64s.pack(data)
binascii.hexlify(binary_data)
b'0500000000000000'
data = -5
integer64s = struct.Struct('l')
binary_data = integer64s.pack(data)
binascii.hexlify(binary_data)
b'fbffffffffffffff'
original_data = float64s.unpack(pi_in_binary)
original_data
(3.141592653589793,)
original_value = original_data[0]
original_value
3.141592653589793
When we use decimal, by convention we write the most-significant digit on the left, and proceed towards the least-significant digit on the right.
This is called a Big-endian format.
We can also use the opposite convention, which is Little-endian.
Different CPUs use different formats.
Intel processors use the Little-endian format.
The convention used by the CPU is called the native format.
If we want to make our data portable between machines, e.g. transmit it over the network, we may have to specify a particular convention.
big_endian_short_int = struct.Struct('>h')
v = big_endian_short_int.pack(1)
binascii.hexlify(v)
b'0001'
little_endian_short_int = struct.Struct('<h')
v = little_endian_short_int.pack(1)
binascii.hexlify(v)
b'0100'
big_endian_short_int = struct.Struct('>h')
v = big_endian_short_int.pack(-2)
binascii.hexlify(v)
b'fffe'
little_endian_short_int = struct.Struct('<h')
v = little_endian_short_int.pack(-2)
binascii.hexlify(v)
b'feff'
type(original_value)
float
import struct
import binascii
values = (1, 'hello'.encode('UTF-8'), 2.7)
s = struct.Struct('I 5s d')
serialized = s.pack(*values)
print('Original values:', values)
print('Format string :', s.format)
print('Uses :', s.size, 'bytes')
print('Packed Value :', binascii.hexlify(serialized))
Original values: (1, b'hello', 2.7)
Format string : I 5s d
Uses : 24 bytes
Packed Value : b'0100000068656c6c6f000000000000009a99999999990540'
s.unpack(serialized)
(1, b'hello', 2.7)
import happybase
host = '127.0.0.1'
connection = happybase.Connection(host)
table_name = 'my_table'
connection.create_table(table_name,
{ 'color': dict(max_versions=10),
'shape': dict(max_versions=1)
})
integer16s = struct.Struct('h')
four_as_bytes = integer16s.pack(4)
four_as_bytes
b'\x04\x00'
table = connection.table('my_table')
table.put('first', {'shape:square': four_as_bytes} )
b = table.batch()
b.put('first', {
'color:red': '\xff\x00\x00',
'color:blue': '\x00\x00\xff',
'color:yellow': '\xff\xff\x00'
})
b.put('second', {'shape:triangle': integer16s.pack(3)})
b.send()
To retrieve a particular cell:
result = table.row('first', columns=['shape:square'])
result
{b'shape:square': b'\x04\x00'}
Notice the b
character in front of each string.
This denotes that the type here is not in fact a string, but a bytes
sequence.
We can convert a string into a sequence of bytes using the bytes() function; e.g.
my_string = 'my_string'
my_string
'my_string'
type(my_string)
str
my_bytes = bytes(my_string, encoding='utf')
my_bytes
b'my_string'
type(my_bytes)
bytes
def value(d, key):
    return d[bytes(key, encoding='utf')]
binary_data = value(result, 'shape:square')
binary_data
b'\x04\x00'
number_of_sides = integer16s.unpack(binary_data)[0]
number_of_sides
4
type(number_of_sides)
int
result = table.row('first', columns=['shape:square', 'color:blue'])
result
{b'color:blue': b'\x00\x00\xc3\xbf', b'shape:square': b'\x04\x00'}
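Notice that color:blue comes back as b'\x00\x00\xc3\xbf' rather than the three bytes we supplied. This is because the colour values were passed to put() as str objects rather than bytes, and so they were encoded as UTF-8 on the way to HBase: the character '\xff' becomes the two bytes b'\xc3\xbf'. To store the exact byte values we would pass bytes literals such as b'\xff\x00\x00' instead.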
result = table.row('first', columns=['color'])
result
{b'color:blue': b'\x00\x00\xc3\xbf', b'color:red': b'\xc3\xbf\x00\x00', b'color:yellow': b'\xc3\xbf\xc3\xbf\x00'}
table.row('first')
{b'color:blue': b'\x00\x00\xc3\xbf', b'color:red': b'\xc3\xbf\x00\x00', b'color:yellow': b'\xc3\xbf\xc3\xbf\x00', b'shape:square': b'\x04\x00'}
for row_key, data in table.scan():
    print(row_key, data)
b'first' {b'color:blue': b'\x00\x00\xc3\xbf', b'color:red': b'\xc3\xbf\x00\x00', b'color:yellow': b'\xc3\xbf\xc3\xbf\x00', b'shape:square': b'\x04\x00'}
b'second' {b'shape:triangle': b'\x03\x00'}
table.row('first', columns=['shape:square', 'shape:triangle'])
{b'shape:square': b'\x04\x00'}
It is very common to use a URL as the row key.
urls = [
"http://www.google.com/",
"http://www.baidu.com/",
"http://www.facebook.com/",
"http://www.youtube.com/",
"http://www.yahoo.com/",
"http://www.wikipedia.org/",
"http://www.taobao.com/",
"http://www.qq.com/",
"http://www.amazon.com/",
"http://www.live.com/",
"http://www.twitter.com/",
"http://www.weibo.com/",
"http://www.google.co.in/",
"http://www.tmall.com/",
"http://www.linkedin.com/",
"http://www.blogspot.com/",
"http://www.google.co.jp/",
"http://www.google.de/"
]
It is very useful to reverse the order of the domain names when storing them in an ordered-sequence.
This will allow us to retrieve, e.g., all of the ".com" URLs using a scan on a partial key.
Let's parse the URLs, reverse the domain names, and store the result in a list of (reversed domain, original URL) pairs.
from urllib.parse import urlparse
from functools import reduce
def reverse_domain(dom):
    result = dom.split('.')
    result.reverse()
    return reduce(lambda x, y: x + '.' + y, result)
reversed_domains = \
[(reverse_domain(urlparse(i).netloc), i) for i in urls]
reversed_domains
[('com.google.www', 'http://www.google.com/'),
 ('com.baidu.www', 'http://www.baidu.com/'),
 ('com.facebook.www', 'http://www.facebook.com/'),
 ('com.youtube.www', 'http://www.youtube.com/'),
 ('com.yahoo.www', 'http://www.yahoo.com/'),
 ('org.wikipedia.www', 'http://www.wikipedia.org/'),
 ('com.taobao.www', 'http://www.taobao.com/'),
 ('com.qq.www', 'http://www.qq.com/'),
 ('com.amazon.www', 'http://www.amazon.com/'),
 ('com.live.www', 'http://www.live.com/'),
 ('com.twitter.www', 'http://www.twitter.com/'),
 ('com.weibo.www', 'http://www.weibo.com/'),
 ('in.co.google.www', 'http://www.google.co.in/'),
 ('com.tmall.www', 'http://www.tmall.com/'),
 ('com.linkedin.www', 'http://www.linkedin.com/'),
 ('com.blogspot.www', 'http://www.blogspot.com/'),
 ('jp.co.google.www', 'http://www.google.co.jp/'),
 ('de.google.www', 'http://www.google.de/')]
We create a table called pages with a single column family named d:
connection.create_table('pages', { 'd': dict() })
We will store the details of each URL on each row.
We will use a single column times_crawled
to represent the number of times we have crawled the URL.
integer64s = struct.Struct('I')
pages_table = connection.table('pages')
b = pages_table.batch()
for (reversed_domain, url) in reversed_domains:
    b.put(reversed_domain, {'d:url': str(url), 'd:times_crawled': integer64s.pack(0)})
b.send()
def to_int(bytes):
    return integer64s.unpack(bytes)[0]
for (url, data) in pages_table.scan():
    print(value(data, 'd:url'), "was crawled", \
          to_int(value(data, 'd:times_crawled')), "times")
b'http://www.amazon.com/' was crawled 0 times
b'http://www.baidu.com/' was crawled 0 times
b'http://www.blogspot.com/' was crawled 0 times
b'http://www.facebook.com/' was crawled 0 times
b'http://www.google.com/' was crawled 0 times
b'http://www.linkedin.com/' was crawled 0 times
b'http://www.live.com/' was crawled 0 times
b'http://www.qq.com/' was crawled 0 times
b'http://www.taobao.com/' was crawled 0 times
b'http://www.tmall.com/' was crawled 0 times
b'http://www.twitter.com/' was crawled 0 times
b'http://www.weibo.com/' was crawled 0 times
b'http://www.yahoo.com/' was crawled 0 times
b'http://www.youtube.com/' was crawled 0 times
b'http://www.google.de/' was crawled 0 times
b'http://www.google.co.in/' was crawled 0 times
b'http://www.google.co.jp/' was crawled 0 times
b'http://www.wikipedia.org/' was crawled 0 times
We can now retrieve URLs for a particular top-level domain by specifying a partial key.
Here we specify a row_prefix parameter of "com.".
Because we have reversed the domain names this will allow us to retrieve particular top-level domains.
for url, data in pages_table.scan(row_prefix=bytes("com.", encoding='utf')):
    print(url)
    print(value(data, 'd:url'), "was crawled", \
          to_int(value(data, 'd:times_crawled')), "times")
b'com.amazon.www'
b'http://www.amazon.com/' was crawled 0 times
b'com.baidu.www'
b'http://www.baidu.com/' was crawled 0 times
b'com.blogspot.www'
b'http://www.blogspot.com/' was crawled 0 times
b'com.facebook.www'
b'http://www.facebook.com/' was crawled 0 times
b'com.google.www'
b'http://www.google.com/' was crawled 0 times
b'com.linkedin.www'
b'http://www.linkedin.com/' was crawled 0 times
b'com.live.www'
b'http://www.live.com/' was crawled 0 times
b'com.qq.www'
b'http://www.qq.com/' was crawled 0 times
b'com.taobao.www'
b'http://www.taobao.com/' was crawled 0 times
b'com.tmall.www'
b'http://www.tmall.com/' was crawled 0 times
b'com.twitter.www'
b'http://www.twitter.com/' was crawled 0 times
b'com.weibo.www'
b'http://www.weibo.com/' was crawled 0 times
b'com.yahoo.www'
b'http://www.yahoo.com/' was crawled 0 times
b'com.youtube.www'
b'http://www.youtube.com/' was crawled 0 times
for url, data in pages_table.scan(row_prefix=bytes("de.", encoding='utf')):
    print(value(data, 'd:url'), "was crawled", \
          to_int(value(data, 'd:times_crawled')), "times")
b'http://www.google.de/' was crawled 0 times
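Finally, a sketch of how we might update the counter after crawling a page: read the current value, unpack it, increment it, and write it back (a simple read-modify-write using the helpers defined above; the row key is one of those created earlier):

# Increment the times_crawled counter for a single page.
row_key = 'com.google.www'
current = to_int(value(pages_table.row(row_key), 'd:times_crawled'))
pages_table.put(row_key, {'d:times_crawled': integer64s.pack(current + 1)})

A read-modify-write like this is not atomic; for a real crawler, HBase's atomic counters (exposed in happybase as Table.counter_inc) would be a safer choice, although they use their own 8-byte encoding rather than the struct format used here.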