(C) 2016 Steve Phelps
In this section we will use Python's struct.Struct class and the happybase API.
When building scalable systems, we need to parallelize our computations.
This requires multiple processors (and/or multiple cores).
How do we distribute the data to each processor?
| Scheme | Description |
|---|---|
| Shared memory (SM) | Multiple processors share a common central memory. |
| Shared disk (SD) | Multiple processors with private memory share a common collection of disks. |
| Shared nothing (SN) | Neither memory nor peripheral storage is shared among processors. |
\cite{Stonebraker1986}
One approach is to manually partition the data, storing different subsets on different RDBMS servers.
We could use, e.g., a hash function to determine which server a piece of data should reside on, as sketched below.
Each of these subsets is called a shard.
With a traditional relational database, sharding is very costly in terms of duplicated resources and the complexity of the configuration.
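As an illustration of this manual approach (not how HBase itself works), here is a minimal sketch of hash-based sharding; the server names and key are hypothetical:

import hashlib

servers = ['db-server-0', 'db-server-1', 'db-server-2']   # hypothetical shard servers

def shard_for(key):
    # Hash the key and map the digest onto one of the available servers.
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return servers[int(digest, 16) % len(servers)]

shard_for('user-42')   # e.g. 'db-server-2'

Note that adding or removing a server changes the mapping for most keys, which is one reason manual sharding of an RDBMS is costly to operate.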
HBase provides automatic sharding, with minimal duplication of state.
Tables are partitioned horizontally into different regions, based on ranges of row keys.
Different region servers are responsible for one or more regions.
The HBase Master process performs coordination and load balancing across the region servers.
In a production configuration HBase stores its underlying data on a file-system called HDFS.
HDFS uses a cluster of computers to simulate a single file-system.
It provides fault-tolerance through replication, together with high aggregate throughput.
It is based on the philosophy that moving computation is often cheaper than moving data.
In a relational database, missing data is represented by NULL values.
Data containing many NULL values is called sparse data.
In a column-oriented database, NULL values can be represented by the absence of a mapping.
We can define an ordering over arbitrary binary data.
For example, the following Python code determines whether $x < y$
according to lexicographic ordering:
def lexicographic_le(x, y):
    # Compare element-wise; the first differing element decides the ordering.
    for a, b in zip(x, y):
        if a != b:
            return a < b
    # All compared elements are equal: x < y only if x is a proper prefix of y.
    return len(x) < len(y)
lexicographic_le('steve', 'smith')
False
import numpy as np
x = np.array([15, 9, 5, 16, 4])
y = np.array([15, 16, 1, 18, 1])
lexicographic_le(x, y)
True
There is no automatic enforcement of referential integrity in a column-oriented database.
If we want to retrieve the value associated with a particular key from another table, then
that will be a fast operation.
In general, however, the set-theoretic operations available in SQL can be very expensive.
Joins are not supported; we "pre-join" the data through de-normalisation.
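As a hypothetical sketch of pre-joining: instead of keeping, say, users and messages in separate tables and joining on a user id at query time, the attributes a query needs are copied into every message row.

# Normalised form (two tables, joined on user_id at query time):
users = {'u1': {'name': 'Alice'}}
messages = {'m1': {'user_id': 'u1', 'text': 'hello'}}

# De-normalised ("pre-joined") form: the user's name is duplicated into each
# message row, so reading a message never requires a second lookup.
messages_denormalised = {
    'm1': {'user:name': 'Alice', 'msg:text': 'hello'}
}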
Row keys and column keys can consist of arbitrary binary data.
Often we will use strings (which HBase will see as an array of 8-bit ASCII values).
Column-keys containing the character ":" specify membership of a column-family:
<column family name>:<column qualifier>
The column family name must consist of printable text.
For example, the triangle attribute from the earlier example would be written as:
shape:triangle
The fully-qualified key is stored on disk.
Therefore, the name of a column-family should generally be short, in order to conserve storage capacity and I/O throughput.
data = \
{
'first':
{
'color:red': 0xf00, 'color:blue': 0x00f,
'color:yellow': 0xff0, 'shape:square': 4
},
'second':
{
'shape:square': 4, 'shape:triangle': 3
}
}
data['second']
{'shape:square': 4, 'shape:triangle': 3}
data['second']['shape:triangle']
3
BigTable was originally designed to house web data (it originated from Google).
It is very common to use a URL as a row-key.
We would typically reverse the domain-name components of the URL before using it as a key.
For example, keats.kcl.ac.uk becomes uk.ac.kcl.keats.
Each cell value in HBase is stored with a timestamp. The current Unix time in seconds can be obtained from the Unix shell with:
date +%s
There are four core CRUD operations which we can perform on tables: Get, Put, Scan and Delete.
We execute them either interactively from the HBase shell, or programmatically through an API.
On a big project, it is typically more practical to use the API than the shell.
Get returns attributes for a specified row-key and column(s).
It can be used in several ways; by default it will retrieve up to three versions of the data.
Put can be used to insert new rows, or update existing rows.
We always specify a row key.
Where there is no existing value at a cell, new data is created.
If there is existing data, it is retained under the old time stamp.
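As a sketch of this behaviour using the happybase API introduced later (assuming table is a happybase Table whose column-family was created with max_versions greater than one):

# Two puts to the same cell: the newer value does not destroy the older one.
table.put('first', {'color:red': b'\x00'})
table.put('first', {'color:red': b'\x01'})

# Retrieve up to two stored versions of the cell, newest first.
table.cells('first', 'color:red', versions=2, include_timestamp=True)
# e.g. [(b'\x01', 1455633600002), (b'\x00', 1455633600001)]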
Scan is used to retrieve a range of rows in one operation.
In its simplest form it will iterate over all rows in the table.
We can also specify a minimum and maximum row-key.
The ordering is given by the lexicographic ordering of the row-key itself.
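For example, using the happybase API introduced below, a range scan might look like the following sketch (the row keys are hypothetical; in happybase the row_stop key is exclusive):

# Iterate over all rows with 'apple' <= row key < 'banana'.
for row_key, data in table.scan(row_start=b'apple', row_stop=b'banana'):
    print(row_key, data)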
There is no equivalent of SQL's ORDER BY clause.
In a column-oriented database we need to think carefully about the structure of the table with respect to the kinds of queries we want to support.
We can choose a tall-narrow design as opposed to a flat-wide design.
A tall-narrow design is more readily shard-able.
A flat-wide design can be transformed into a tall-narrow design by storing additional data values in the row-key.
These attributes can be retrieved by scanning with a partial-key.
Because keys are sorted lexicographically, we can index several attributes simultaneously through concatenation.
For example, suppose we use row-keys with the following format:
<userId>-<date>-<messageId>-<attachmentId>
We can then scan the table by specifying a partial-key.
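As a sketch, we could build such composite keys by concatenation and then retrieve all messages for one user with a partial-key scan; the field values below are hypothetical:

def message_key(user_id, date, message_id, attachment_id):
    # Concatenate the attributes into a single row-key.
    return '-'.join([user_id, date, message_id, attachment_id])

key = message_key('user42', '2016-02-16', 'msg001', 'att01')
# key == 'user42-2016-02-16-msg001-att01'

# All rows for user42 could then be fetched with a prefix scan, e.g. using happybase:
# messages_table.scan(row_prefix=b'user42-')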
We can perform create, read, update and delete (CRUD) operations from the HBase shell.
To start the HBase shell type a command similar to the following from the Unix shell:
hbase shell
Once in the shell, we can check the status of the cluster by typing status.
HBase does not have any concept of data types.
All data is treated as binary data.
We can think of binary data as a sequence of bytes.
If we use utf8 or 8-bit ASCII, then we can represent binary data as a Python string.
Byte values which do not have an associated ASCII character are shown as hexadecimal values.
It is very easy to convert between hexadecimal and binary.
chr(97)
'a'
ord('a')
97
chr(0)
'\x00'
Such byte values are displayed using the notation \x<value>.
ord('\x00')
0
ord('\x10')
16
We can represent any 8-bit value as a character in Python.
We can represent arbitrary binary data as an ordered sequence of characters.
This is simply a string.
For example, the sequence of bytes 0x0010
can be represented:
binary_data = '\x00\x10'
binary_data
'\x00\x10'
binary_data[0]
'\x00'
'alpha' < 'beta'
True
'smith' < 'steve'
True
x = '\x0f\x09\x05\x10\x04'
y = '\x0f\x10\x01\x12\x01'
type(x)
str
type(y)
str
x < y
True
Binary data can be considered as a sequence of bits.
Therefore it can be considered as a sequence of bytes.
Therefore binary data can be represented as a string.
Therefore binary data can be compared and sorted.
Therefore we can order binary data and store it in, e.g. a tree.
Therefore binary data can be indexed.
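As a small illustration of this chain of reasoning, we can keep binary keys in sorted (lexicographic) order and locate any key with a binary search, which is essentially what an ordered index does:

import bisect

# Binary row-keys kept in sorted order.
keys = sorted([b'\x0f\x10', b'\x00\x01', b'\x0f\x09', b'\xff\x00'])

# Locate a key (or the start of a range of keys) in O(log n).
bisect.bisect_left(keys, b'\x0f\x09')
1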
All values in Python are ultimately stored as binary data.
This is usually hidden from the programmer.
When working with HBase we will often need to deal explicitly with the binary representation.
This is sometimes called serialization.
We can use Python's struct module, and in particular its Struct class, to do this.
import struct
import binascii
data = 3.141592653589793
float64s = struct.Struct('d')
pi_in_binary = float64s.pack(data)
pi_in_binary
b'\x18-DT\xfb!\t@'
Because some of the bytes happen to correspond to printable characters, the string above is hard to interpret.
When dealing with binary data we often want to see the sequence of bits, or, more conveniently, the sequence of bytes represented in hexadecimal.
The hexlify()
function in the binascii
module will do this:
binascii.hexlify(pi_in_binary)
b'182d4454fb210940'
import struct
import binascii
data = 5
integer64s = struct.Struct('l')
binary_data = integer64s.pack(data)
binascii.hexlify(binary_data)
b'0500000000000000'
data = -5
integer64s = struct.Struct('l')
binary_data = integer64s.pack(data)
binascii.hexlify(binary_data)
b'fbffffffffffffff'
original_data = float64s.unpack(pi_in_binary)
original_data
(3.141592653589793,)
original_value = original_data[0]
original_value
3.141592653589793
When we use decimal, by convention we write the most-significant digit on the left, and proceed towards the least-significant digit on the right.
This is called a Big-endian format.
We can also use the opposite convention, which is Little-endian.
Different CPUs use different formats.
Intel processors use the Little-endian format.
The convention used by the CPU is called the native format.
If we want to make our data portable between machines, e.g. transmit it over the network, we may have to specify a particular convention.
big_endian_short_int = struct.Struct('>h')
v = big_endian_short_int.pack(1)
binascii.hexlify(v)
b'0001'
little_endian_short_int = struct.Struct('<h')
v = little_endian_short_int.pack(1)
binascii.hexlify(v)
b'0100'
big_endian_short_int = struct.Struct('>h')
v = big_endian_short_int.pack(-2)
binascii.hexlify(v)
b'fffe'
little_endian_short_int = struct.Struct('<h')
v = little_endian_short_int.pack(-2)
binascii.hexlify(v)
b'feff'
type(original_value)
float
import struct
import binascii
values = (1, 'hello'.encode('UTF-8'), 2.7)
s = struct.Struct('I 5s d')
serialized = s.pack(*values)
print('Original values:', values)
print('Format string :', s.format)
print('Uses :', s.size, 'bytes')
print('Packed Value :', binascii.hexlify(serialized))
Original values: (1, b'hello', 2.7)
Format string : I 5s d
Uses : 24 bytes
Packed Value : b'0100000068656c6c6f000000000000009a99999999990540'
s.unpack(serialized)
(1, b'hello', 2.7)
import happybase
host = '127.0.0.1'
connection = happybase.Connection(host)
table_name = 'my_table'
connection.create_table(table_name,
{ 'color': dict(max_versions=10),
'shape': dict(max_versions=1)
})
integer16s = struct.Struct('h')
four_as_bytes = integer16s.pack(4)
four_as_bytes
b'\x04\x00'
table = connection.table('my_table')
table.put('first', {'shape:square': four_as_bytes} )
b = table.batch()
b.put('first', {
'color:red': '\xff\x00\x00',
'color:blue': '\x00\x00\xff',
'color:yellow': '\xff\xff\x00'
})
b.put('second', {'shape:triangle': integer16s.pack(3)})
b.send()
To retrieve a particular cell:
result = table.row('first', columns=['shape:square'])
result
{b'shape:square': b'\x04\x00'}
Notice the b
character in front of each string.
This denotes that the type here is not in fact a string, but a bytes
sequence.
We can convert a string into a sequence of bytes using the bytes() function; e.g.
my_string = 'my_string'
my_string
'my_string'
type(my_string)
str
my_bytes = bytes(my_string, encoding='utf')
my_bytes
b'my_string'
type(my_bytes)
bytes
def value(d, key):
    return d[bytes(key, encoding='utf')]
binary_data = value(result, 'shape:square')
binary_data
b'\x04\x00'
number_of_sides = integer16s.unpack(binary_data)[0]
number_of_sides
4
type(number_of_sides)
int
result = table.row('first', columns=['shape:square', 'color:blue'])
result
{b'color:blue': b'\x00\x00\xc3\xbf', b'shape:square': b'\x04\x00'}
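Notice that color:blue comes back as b'\x00\x00\xc3\xbf' rather than the three bytes we supplied. This is because the colour values were passed to put() as str objects rather than bytes, and so they were encoded as UTF-8 on the way to HBase: the character '\xff' becomes the two bytes b'\xc3\xbf'. To store the exact byte values we would pass bytes literals such as b'\xff\x00\x00' instead.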
result = table.row('first', columns=['color'])
result
{b'color:blue': b'\x00\x00\xc3\xbf', b'color:red': b'\xc3\xbf\x00\x00', b'color:yellow': b'\xc3\xbf\xc3\xbf\x00'}
table.row('first')
{b'color:blue': b'\x00\x00\xc3\xbf', b'color:red': b'\xc3\xbf\x00\x00', b'color:yellow': b'\xc3\xbf\xc3\xbf\x00', b'shape:square': b'\x04\x00'}
for row_key, data in table.scan():
    print(row_key, data)
b'first' {b'color:blue': b'\x00\x00\xc3\xbf', b'color:red': b'\xc3\xbf\x00\x00', b'color:yellow': b'\xc3\xbf\xc3\xbf\x00', b'shape:square': b'\x04\x00'}
b'second' {b'shape:triangle': b'\x03\x00'}
table.row('first', columns=['shape:square', 'shape:triangle'])
{b'shape:square': b'\x04\x00'}
It is very common to use a URL as the row key.
urls = [
"http://www.google.com/",
"http://www.baidu.com/",
"http://www.facebook.com/",
"http://www.youtube.com/",
"http://www.yahoo.com/",
"http://www.wikipedia.org/",
"http://www.taobao.com/",
"http://www.qq.com/",
"http://www.amazon.com/",
"http://www.live.com/",
"http://www.twitter.com/",
"http://www.weibo.com/",
"http://www.google.co.in/",
"http://www.tmall.com/",
"http://www.linkedin.com/",
"http://www.blogspot.com/",
"http://www.google.co.jp/",
"http://www.google.de/"
]
It is very useful to reverse the order of the domain names when storing them in an ordered-sequence.
This will allow us to retrieve, e.g., all of the ".com" URLs using a scan on a partial key.
Let's parse the URLs, reverse the domain names, and store the result in a list of (reversed domain, original URL) pairs.
from urllib.parse import urlparse
from functools import reduce
def reverse_domain(dom):
    result = dom.split('.')
    result.reverse()
    return reduce(lambda x, y: x + '.' + y, result)
reversed_domains = \
[(reverse_domain(urlparse(i).netloc), i) for i in urls]
reversed_domains
[('com.google.www', 'http://www.google.com/'),
 ('com.baidu.www', 'http://www.baidu.com/'),
 ('com.facebook.www', 'http://www.facebook.com/'),
 ('com.youtube.www', 'http://www.youtube.com/'),
 ('com.yahoo.www', 'http://www.yahoo.com/'),
 ('org.wikipedia.www', 'http://www.wikipedia.org/'),
 ('com.taobao.www', 'http://www.taobao.com/'),
 ('com.qq.www', 'http://www.qq.com/'),
 ('com.amazon.www', 'http://www.amazon.com/'),
 ('com.live.www', 'http://www.live.com/'),
 ('com.twitter.www', 'http://www.twitter.com/'),
 ('com.weibo.www', 'http://www.weibo.com/'),
 ('in.co.google.www', 'http://www.google.co.in/'),
 ('com.tmall.www', 'http://www.tmall.com/'),
 ('com.linkedin.www', 'http://www.linkedin.com/'),
 ('com.blogspot.www', 'http://www.blogspot.com/'),
 ('jp.co.google.www', 'http://www.google.co.jp/'),
 ('de.google.www', 'http://www.google.de/')]
We create a table called pages with a single column family named d:
connection.create_table('pages', { 'd': dict() })
We will store the details of each URL on each row.
We will use a single column times_crawled
to represent the number of times we have crawled the URL.
integer64s = struct.Struct('I')
pages_table = connection.table('pages')
b = pages_table.batch()
for (reversed_domain, url) in reversed_domains:
    b.put(reversed_domain, {'d:url': str(url), 'd:times_crawled': integer64s.pack(0)})
b.send()
def to_int(bytes):
    return integer64s.unpack(bytes)[0]
for (url, data) in pages_table.scan():
    print(value(data, 'd:url'), "was crawled", \
          to_int(value(data, 'd:times_crawled')), "times")
b'http://www.amazon.com/' was crawled 0 times
b'http://www.baidu.com/' was crawled 0 times
b'http://www.blogspot.com/' was crawled 0 times
b'http://www.facebook.com/' was crawled 0 times
b'http://www.google.com/' was crawled 0 times
b'http://www.linkedin.com/' was crawled 0 times
b'http://www.live.com/' was crawled 0 times
b'http://www.qq.com/' was crawled 0 times
b'http://www.taobao.com/' was crawled 0 times
b'http://www.tmall.com/' was crawled 0 times
b'http://www.twitter.com/' was crawled 0 times
b'http://www.weibo.com/' was crawled 0 times
b'http://www.yahoo.com/' was crawled 0 times
b'http://www.youtube.com/' was crawled 0 times
b'http://www.google.de/' was crawled 0 times
b'http://www.google.co.in/' was crawled 0 times
b'http://www.google.co.jp/' was crawled 0 times
b'http://www.wikipedia.org/' was crawled 0 times
We can now retrieve URLs for a particular top-level domain by specifying a partial key.
Here we specify a row_prefix parameter of "com.".
Because we have reversed the domain names this will allow us to retrieve particular top-level domains.
for url, data in pages_table.scan(row_prefix=bytes("com.", encoding='utf')):
    print(url)
    print(value(data, 'd:url'), "was crawled", \
          to_int(value(data, 'd:times_crawled')), "times")
b'com.amazon.www'
b'http://www.amazon.com/' was crawled 0 times
b'com.baidu.www'
b'http://www.baidu.com/' was crawled 0 times
b'com.blogspot.www'
b'http://www.blogspot.com/' was crawled 0 times
b'com.facebook.www'
b'http://www.facebook.com/' was crawled 0 times
b'com.google.www'
b'http://www.google.com/' was crawled 0 times
b'com.linkedin.www'
b'http://www.linkedin.com/' was crawled 0 times
b'com.live.www'
b'http://www.live.com/' was crawled 0 times
b'com.qq.www'
b'http://www.qq.com/' was crawled 0 times
b'com.taobao.www'
b'http://www.taobao.com/' was crawled 0 times
b'com.tmall.www'
b'http://www.tmall.com/' was crawled 0 times
b'com.twitter.www'
b'http://www.twitter.com/' was crawled 0 times
b'com.weibo.www'
b'http://www.weibo.com/' was crawled 0 times
b'com.yahoo.www'
b'http://www.yahoo.com/' was crawled 0 times
b'com.youtube.www'
b'http://www.youtube.com/' was crawled 0 times
for url, data in pages_table.scan(row_prefix=bytes("de.", encoding='utf')):
    print(value(data, 'd:url'), "was crawled", \
          to_int(value(data, 'd:times_crawled')), "times")
b'http://www.google.de/' was crawled 0 times
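Finally, a sketch of how we might update the counter after crawling a page: read the current value, unpack it, increment it, and write it back (a simple read-modify-write using the helpers defined above; the row key is one of those created earlier):

# Increment the times_crawled counter for a single page.
row_key = 'com.google.www'
current = to_int(value(pages_table.row(row_key), 'd:times_crawled'))
pages_table.put(row_key, {'d:times_crawled': integer64s.pack(current + 1)})

A read-modify-write like this is not atomic; for a real crawler, HBase's atomic counters (exposed in happybase as Table.counter_inc) would be a safer choice, although they use their own 8-byte encoding rather than the struct format used here.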