  • Author: Tim Hopper
  • Twitter: @tdhopper
  • Email: [email protected]

Content of talk given at PyCarolinas 2012.

This material is on my GitHub page: https://github.com/tdhopper/Pickle-and-Redis.

Persistent Data in Python

Pickle

Pickle is a Python module for "serializing and de-serializing a Python object structure."

Basically, Python lets you save objects to disk.

In [64]:
import pickle

Examples

Dump the text string "abcdefg" to a file called "pickle_test".

In [2]:
pickle.dump("abcdefg", open("pickle_test", "wb"))

Pickle dumps are binary files. They're not designed to be read as text.

In [3]:
print open("pickle_test", "r").read()
S'abcdefg'
p0
.

Pickling other built-in classes (and combinations thereof)

In [4]:
data1 = {'a': [1, 2.0, 3, 4+6j],
         'b': ('string', u'Unicode string'),
         'c': None}

pickle.dump(data1, open('data.pkl', 'wb'))

data2 = pickle.load(open('data.pkl', 'rb'))
In [5]:
data1 == data2
Out[5]:
True
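
The cells above leave it to the interpreter to close the files. Here is a minimal sketch of the same round trip using `with` blocks, so each file is closed as soon as the block ends. The temporary directory and the `protocol=pickle.HIGHEST_PROTOCOL` argument are my additions for illustration, not part of the talk.

```python
import os
import pickle
import tempfile

data1 = {'a': [1, 2.0, 3, 4 + 6j],
         'b': ('string', u'Unicode string'),
         'c': None}

# Write into a throwaway directory (an assumption made for this sketch).
path = os.path.join(tempfile.mkdtemp(), 'data.pkl')

# 'wb' and 'rb': pickle data is bytes, so open the file in binary mode.
with open(path, 'wb') as out_file:
    pickle.dump(data1, out_file, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, 'rb') as in_file:
    data2 = pickle.load(in_file)

print(data1 == data2)
```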

What can be pickled?

  • None, True, and False
  • integers, long integers, floating point numbers, complex numbers
  • normal and Unicode strings
  • tuples, lists, sets, and dictionaries containing only picklable objects
  • functions defined at the top level of a module
  • built-in functions defined at the top level of a module
  • classes that are defined at the top level of a module
  • instances of such classes whose __dict__ or __setstate__() is picklable

(From the official documentation)

Pickling Custom Classes

Pickle can handle much more than built-in classes:

In [68]:
class PicklePerson(object):
    def __init__(self, name, age, location):
        self.name = name
        self.age = age
        self.location = location
    
    def __repr__(self):
        return "name: " + self.name + "\n" + "age: " + self.age + \
            "\n" + "location: " + self.location
In [8]:
todd = PicklePerson("Todd", "30", "Raleigh")
print todd
name: Todd
age: 30
location: Raleigh
In [10]:
pickle.dump(todd, open("pickle_todd", "wb"))
recovered_todd = pickle.load(open("pickle_todd", "rb"))
In [11]:
recovered_todd
Out[11]:
name: Todd
age: 30
location: Raleigh

What can't be pickled?

In [13]:
def f(x): return x+1
pickle.dump(f, open("pickle_good","wb"))
In [14]:
try:
    # Use a name other than f so the function f defined above is not shadowed.
    with open("pickle_bad", "wb") as out_file:
        pickle.dump(lambda x: x+1, out_file)
except pickle.PicklingError:
    print "Can't pickle :-("
Can't pickle :-(
In [19]:
class NotPickable(object):
    def __init__(self, x):
        self.attr = x

# An open file handle is an attribute that pickle cannot serialize.
o = NotPickable(open('Pickle and Redis.ipynb', 'r+'))

try:
    pickle.dumps(o)
except TypeError:
    print "Can't pickle :-("
Can't pickle :-(
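
The two failures above raise different exception types. A small helper can test any object before you commit to pickling it. This is a sketch: `is_picklable` is a hypothetical name of mine, and the set of exceptions it catches is an assumption that covers the common cases rather than an exhaustive list.

```python
import os
import pickle

def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True

handle = open(os.devnull, 'r')  # any open file handle will do
results = {
    'list of numbers': is_picklable([1, 2.0, 3]),
    'lambda': is_picklable(lambda x: x + 1),
    'open file': is_picklable(handle),
}
handle.close()

for label in sorted(results):
    print("%s: %s" % (label, results[label]))
```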

Pickling errors can cause problems when using the multiprocessing module for parallelization. I have an example here and here's some discussion on StackOverflow.

Pickle Security
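
The official documentation warns that the pickle module is not secure against malicious data: a pickle can name a callable that gets run while it is being loaded, so never unpickle data from a source you don't trust. Here is a harmless sketch of the mechanism. The class `Innocent` is hypothetical and exists only for this demonstration; it asks pickle to call `int('42')` when the data is loaded.

```python
import pickle

class Innocent(object):
    """Hypothetical class used only for this demonstration."""
    def __reduce__(self):
        # Tell pickle: "to rebuild me, call int('42')".
        # A hostile pickle could name a far more dangerous callable here.
        return (int, ('42',))

payload = pickle.dumps(Innocent())
result = pickle.loads(payload)

# The loader ran the callable chosen by the payload, not by the loading code.
print("%s %s" % (type(result).__name__, result))
```

What comes back is not an `Innocent` instance at all. The bytes being loaded decided what was called.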

cPickle

"cPickle can be up to 1000 times faster than pickle because the former is implemented in C."

In [20]:
import cPickle, os
In [21]:
%timeit pickle.dump([data1 for x in range(1000)], open("pickle_todd", "wb"))
100 loops, best of 3: 7.8 ms per loop
In [22]:
%timeit cPickle.dump([data1 for x in range(1000)], open("pickle_todd", "wb"))
100 loops, best of 3: 2.76 ms per loop

For reference, the size of the pickle, in bytes, is:

In [23]:
os.path.getsize('/Users/tdhopper/Dropbox/PyCarolinas 2012/pickle_todd')
Out[23]:
4112

However, "in the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses."
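
A common idiom is to import cPickle when it exists and fall back to pickle otherwise. The sketch below does that. On Python 3 there is no cPickle module, and the standard pickle module uses its C implementation automatically when one is available, so the fallback branch is the one that runs there.

```python
try:
    import cPickle as pickle  # Python 2: the C implementation
except ImportError:
    import pickle  # Python 3: uses a C accelerator when available

data = {'a': [1, 2.0, 3], 'b': ('string', u'Unicode string'), 'c': None}

# dumps/loads work the same way whichever module was imported.
restored = pickle.loads(pickle.dumps(data))
print(restored == data)
```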

Redis

Basics

"Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets." (http://redis.io/)

Some advantages:

  • Unlike memcached, Redis can save its state to the disk.
  • "Data from any Redis server can replicate to any number of slaves. A slave may be a master to another slave." (Wikipedia)
  • Optionally durable.
  • Holds data set in memory.
  • Fast, fast, fast.

Installation

Redis is easy to install on *nix systems:

$ wget http://redis.googlecode.com/files/redis-2.4.17.tar.gz
$ tar xzf redis-2.4.17.tar.gz
$ cd redis-2.4.17
$ make

(There's an unofficial Windows port.)

Start a redis server with:

 $ redis-server

The Redis server can be accessed directly from the Redis Command Line Interface (CLI):

 $ redis-cli

Setting and getting keys is easy:

redis> set foo bar
OK
redis> get foo
"bar"

The Redis website (http://redis.io/) provides helpful documentation on all the "redis-cli" commands.

Data Types

Strings

"Strings are the most basic kind of Redis value. Redis Strings are binary safe, this means that a Redis string can contain any kind of data...."

Including:

  • An image
  • A Pickled Python object!

A string must be less than 512 megabytes.

Counters
redis> SET mykey "10"
OK
redis> INCR mykey
(integer) 11
redis> GET mykey
"11"

("Note: this is a string operation because Redis does not have a dedicated integer type.")

Appends
redis> EXISTS mykey
(integer) 0
redis> APPEND mykey "Hello"
(integer) 5
redis> APPEND mykey " World"
(integer) 11
redis> GET mykey
"Hello World"

This gives a fast way to store a time series.

For counters, also see DECR and INCRBY.

Slices

Slicing strings is easy:

redis> SET mykey "This is a string"
OK
redis> GETRANGE mykey 0 3
"This"
redis> GETRANGE mykey -3 -1
"ing"
redis> GETRANGE mykey 0 -1
"This is a string"
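
Note that both GETRANGE offsets are inclusive, unlike a Python slice, whose end index is exclusive. The sketch below models that behavior on an ordinary Python string. `getrange` is a hypothetical helper of mine, not part of redis-py, and it assumes one character per byte.

```python
def getrange(value, start, end):
    """Model of GETRANGE: both offsets inclusive, negatives count from the end."""
    length = len(value)
    if start < 0:
        start = max(length + start, 0)
    if end < 0:
        end = max(length + end, 0)
    end = min(end, length - 1)  # offsets past the end are clamped
    if length == 0 or start > end:
        return ""
    return value[start:end + 1]  # +1 because Python's slice end is exclusive

text = "This is a string"
print(getrange(text, 0, 3))
print(getrange(text, -3, -1))
print(getrange(text, 0, -1))
```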

Lists

  • "Redis Lists are simply lists of strings, sorted by insertion order."
  • "It is possible to add elements to a Redis List pushing new elements on the head (on the left) or on the tail (on the right) of the list."
redis> RPUSH mylist "hello"
(integer) 1
redis> RPUSH mylist "world"
(integer) 2
redis> RPUSH mylist "HELLO" "PyCarolinas"
(integer) 4
redis> LRANGE mylist 0 -1
1) "hello"
2) "world"
3) "HELLO"
4) "PyCarolinas"

The maximum list size is $2^{32}-1 \approx 4\text{ billion}$.

Combine RPUSH, LPUSH, RPOP, and LPOP to create your favorite queue!
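
For example, RPUSH plus LPOP gives a first-in, first-out queue. Here is a minimal sketch. The functions take any client object with `rpush` and `lpop` methods (redis-py's client has both). `InMemoryLists` is a hypothetical stand-in I wrote so the example runs without a server, and the names `enqueue`, `dequeue`, and `jobs` are mine as well.

```python
from collections import deque

class InMemoryLists(object):
    """Hypothetical stand-in for a Redis client; supports only rpush and lpop."""
    def __init__(self):
        self._lists = {}

    def rpush(self, name, *values):
        items = self._lists.setdefault(name, deque())
        items.extend(values)
        return len(items)  # like Redis, report the new list length

    def lpop(self, name):
        items = self._lists.get(name)
        return items.popleft() if items else None

def enqueue(client, queue_name, item):
    """Add an item at the tail (right) of the list."""
    return client.rpush(queue_name, item)

def dequeue(client, queue_name):
    """Remove and return the item at the head (left), or None if empty."""
    return client.lpop(queue_name)

client = InMemoryLists()  # swap in a real redis-py client to use a server
for job in ['hello', 'world', 'PyCarolinas']:
    enqueue(client, 'jobs', job)

first = dequeue(client, 'jobs')
second = dequeue(client, 'jobs')
print("%s %s" % (first, second))
```

Pairing RPUSH with RPOP instead would give a stack.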

Sets

  • "Redis Sets are an unordered collection of Strings."
  • "...you can do unions, intersections, differences of sets in very short time."
  • Like lists, the maximum size is about 4 billion.
redis> SADD myset "Hello"
(integer) 1
redis> SADD myset "World"
(integer) 1
redis> SADD myset "World"
(integer) 0
redis> SMEMBERS myset
1) "World"
2) "Hello"

Get a random set item with SPOP (which also removes the item from the set) or SRANDMEMBER (which leaves the set unchanged).

Hashes

"Redis Hashes are maps between string fields and string values."

redis> HMSET myhash field1 "Hello" field2 "World"
OK
redis> HGET myhash field1
"Hello"
redis> HGET myhash field2
"World"

Sorted Sets

"...every member of a Sorted Set is associated with score, that is used in order to take the sorted set ordered, from the smallest to the greatest score."

redis> ZADD myzset 1 "one"
(integer) 1
redis> ZADD myzset 1 "uno"
(integer) 1
redis> ZADD myzset 2 "two"
(integer) 1
redis> ZADD myzset 3 "two"
(integer) 0
redis> ZRANGE myzset 0 -1 WITHSCORES
1) "one"
2) "1"
3) "uno"
4) "1"
5) "two"
6) "3"
redis> ZRANGE myzset 0 -1
1) "one"
2) "uno"
3) "two"

Notice that "two" only appears once. When ZADD myzset 3 "two" is called, the score of "two" is updated from 2 to 3.

"While members are unique, scores may be repeated."
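
The behavior in that session can be modeled in a few lines of plain Python. This is only a sketch of the observable semantics, not how Redis implements sorted sets, and `SortedSetModel` is a hypothetical class of mine. Each member maps to one score, and members with equal scores are ordered lexicographically.

```python
class SortedSetModel(object):
    """Toy model of a sorted set: one score per unique member."""
    def __init__(self):
        self._scores = {}

    def zadd(self, score, member):
        """Return 1 if the member is new, 0 if only its score was updated."""
        is_new = member not in self._scores
        self._scores[member] = score
        return 1 if is_new else 0

    def zrange_withscores(self):
        """All (member, score) pairs, ordered by score and then by member."""
        return sorted(self._scores.items(),
                      key=lambda pair: (pair[1], pair[0]))

myzset = SortedSetModel()
replies = [myzset.zadd(1, "one"), myzset.zadd(1, "uno"),
           myzset.zadd(2, "two"), myzset.zadd(3, "two")]
print(replies)
print(myzset.zrange_withscores())
```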

Redis and Python

A Python interface to Redis is available at https://github.com/andymccurdy/redis-py

$ sudo pip install redis

Using redis from Python is as easy as importing the package and connecting to a server:

In [24]:
import redis
r = redis.StrictRedis(host='localhost', port=6379, db=0)

Setting and getting keys is easy:

In [25]:
r.set('foo', 'bar')
Out[25]:
True
In [26]:
r.get('foo')
Out[26]:
'bar'
In [28]:
%timeit r.set('foo', 'bar')
10000 loops, best of 3: 146 us per loop
In [27]:
%timeit r.get('foo')
10000 loops, best of 3: 160 us per loop

Times are on the order of 100 microseconds.

In general, the StrictRedis class implements commands identically to the redis-cli commands.

Let's create a set of words from a paragraph in the Wikipedia page on Redis:

In [65]:
import string

text = """
    Redis typically holds the whole dataset in RAM. Versions up to 2.4 could be configured 
to use virtual memory but this is now deprecated. Persistence is reached in two different 
ways: One is called snapshotting, and is a semi-persistent durability mode where the dataset 
is asynchronously transferred from memory to disk from time to time, written in RDB dump format. 
Since version 1.1 the safer alternative is AOF, an append-only file (a journal) that is written 
as operations modifying the dataset in memory are processed. Redis is able to rewrite the 
append-only file in the background in order to avoid an indefinite growth of the journal."""

# Strip punctuation: http://stackoverflow.com/a/2402306/982745 
text_list = [word.translate(None, string.punctuation) for word in text.split()] 

for word in text_list: r.delete(word) # In case these words are already in redis, delete them.
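
The two-argument `word.translate(None, string.punctuation)` call above works only on Python 2 byte strings. On Python 3, a sketch of the equivalent builds a deletion table with `str.maketrans`. The `sample` sentence is a fragment of the paragraph above, and `remove_punctuation` is a name I chose.

```python
import string

# Translation table that deletes every ASCII punctuation character (Python 3).
remove_punctuation = str.maketrans('', '', string.punctuation)

sample = "ways: One is called snapshotting, and is a semi-persistent durability mode"
words = [word.translate(remove_punctuation) for word in sample.split()]
print(words)
```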

Create a set of the words:

In [38]:
for word in text_list:
    r.sadd("persistence", word)
    
print [r.srandmember('persistence') for i in range(10)] # Get ten random words
print [r.srandmember('persistence') for i in range(10)] # Get ten more random words
['of', 'avoid', 'semipersistent', 'from', 'rewrite', 'RAM', 'Persistence', 'memory', 'file', 'Since']
['a', 'virtual', 'alternative', 'where', 'semipersistent', 'modifying', 'in', 'an', 'asynchronously', 'rewrite']

Count the frequency of each word in this text:

In [44]:
for word in text_list:
    r.incr(word)

# Print most used words in this document:
    
for word in set(text_list):
    if int(r.get(word)) > 2:
        print word
        print "\t\t", r.get(word)
dataset
		3
in
		6
to
		6
memory
		3
is
		8
the
		7

The best part is that all this data will persist across your Python sessions!

Storing Python Objects in Redis

Direct Picklin'
In [66]:
bob = pickle.dumps(PicklePerson("bob","50","durham"))
print bob
ccopy_reg
_reconstructor
p0
(c__main__
PicklePerson
p1
c__builtin__
object
p2
Ntp3
Rp4
(dp5
S'age'
p6
S'50'
p7
sS'name'
p8
S'bob'
p9
sS'location'
p10
S'durham'
p11
sb.
In [49]:
r.set("bob", bob)
Out[49]:
True
In [50]:
pickle.loads(r.get("bob"))
Out[50]:
name: bob
age: 50
location: durham
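
The set and get calls above can be wrapped in a pair of helpers. This is a sketch under a few assumptions: `save_object` and `load_object` are hypothetical names, they accept any client with `set` and `get` methods, and `InMemoryStrings` is a stand-in I wrote so the example runs without a server. With a real redis-py client on Python 3, the connection should return raw bytes (not decoded text), since a pickle is binary data.

```python
import pickle

class InMemoryStrings(object):
    """Hypothetical stand-in for a Redis client; supports only set and get."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)

def save_object(client, key, obj):
    """Pickle obj and store the resulting bytes under key."""
    return client.set(key, pickle.dumps(obj))

def load_object(client, key):
    """Fetch and unpickle the value under key; None if the key is missing."""
    raw = client.get(key)
    return None if raw is None else pickle.loads(raw)

client = InMemoryStrings()  # swap in a real redis-py client to use a server
person = {'name': 'bob', 'age': '50', 'location': 'durham'}
save_object(client, 'bob', person)
print(load_object(client, 'bob') == person)
```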

Redisco

Redisco is a library built on redis-py that allows you to store objects in Redis.

In [57]:
import redisco
from redisco import connection_setup, models
redisco.connection_setup(host='localhost', port=6379, db=0)
In [58]:
class Person(models.Model):
    name = models.Attribute(required=True)
    age = models.Attribute(required=False)
    location = models.Attribute(required=False)
    
for x in Person.objects.filter(name="Tim"):
    x.delete()
In [59]:
tim_hopper = Person(name="Tim",age="26",location="Morrisville")
tim_smith = Person(name="Tim",age="75",location="Chapel Hill")
In [60]:
tim_hopper.save()
tim_smith.save()
Out[60]:
True
In [61]:
Person.objects.filter(name="Tim")
Out[61]:
[<Person:12 {'age': u'26', 'name': u'Tim', 'location': u'Morrisville'}>, <Person:13 {'age': u'75', 'name': u'Tim', 'location': u'Chapel Hill'}>]
In [62]:
Person.objects.filter(name="Tim", age="26")[0] == tim_hopper
Out[62]:
True

Redisco is at version 0.1.4 and hasn't been updated recently. Nevertheless, it gives you an idea of what redis-py is capable of.

