PyStore Tutorial

PyStore is a utility that provides developers with fast data storage for Pandas dataframes. It's built on top of Pandas, NumPy, and Dask, and stores the data in the Parquet file format (via Fastparquet) in a hierarchical directory structure. Files are compressed using Snappy, a fast and efficient compression/decompression library from Google.

The end result is a fast, powerful, and Pythonic datastore for Pandas dataframes that can easily query millions of rows at sub-second speeds.

PyStore was designed with storing timeseries data in mind. It provides namespaced collections of data. These collections allow bucketing data by source, user, or some other criterion (for example, frequency: end-of-day bars, minute bars, etc.). Each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g. symbol).

Let's get started
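
PyStore and Quandl are both available on PyPI, so if needed they can be installed with pip (note: PyStore's Snappy dependency may require the Snappy C library to be installed on your system first):

pip install pystore quandl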

We'll start by importing the necessary libraries:

In [1]:
import pystore
import quandl

Next, let's get some market data to work with. We'll use Quandl's API to download 37+ years' worth of historical data for Apple's stock.

In [2]:
aapl = quandl.get('WIKI/AAPL')
aapl.head()
Out[2]:
Open High Low Close Volume Ex-Dividend Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
Date
1980-12-12 28.75 28.87 28.75 28.75 2093900.0 0.0 1.0 0.422706 0.424470 0.422706 0.422706 117258400.0
1980-12-15 27.38 27.38 27.25 27.25 785200.0 0.0 1.0 0.402563 0.402563 0.400652 0.400652 43971200.0
1980-12-16 25.37 25.37 25.25 25.25 472000.0 0.0 1.0 0.373010 0.373010 0.371246 0.371246 26432000.0
1980-12-17 25.87 26.00 25.87 25.87 385900.0 0.0 1.0 0.380362 0.382273 0.380362 0.380362 21610400.0
1980-12-18 26.63 26.75 26.63 26.63 327900.0 0.0 1.0 0.391536 0.393300 0.391536 0.391536 18362400.0

We can take a look at the storage path

In [3]:
pystore.get_path()
Out[3]:
'~/.pystore'

This path can be changed by calling the set_path() method:

In [4]:
# Set storage path
pystore.set_path('./pystore_demo')

# Show the new storage path
pystore.get_path()
Out[4]:
'./pystore_demo'

We can get a list of the datastores found in this location. Since we're just getting started, all we'll get is an empty list.

In [5]:
# List stores
pystore.list_stores()
Out[5]:
[]

Creating / connecting to our datastore

When connecting to a datastore, if it doesn't exist, it will be automatically created.

In [6]:
store = pystore.store('mydatastore')
store
Out[6]:
PyStore.datastore <./pystore_demo/mydatastore>

Now when we call pystore.list_stores() we'll get a list with our new datastore listed.

In [7]:
pystore.list_stores()
Out[7]:
['mydatastore']

Creating / connecting to a Collection

Before we can save our AAPL time-series data, we need to create a Collection. As mentioned earlier, each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g. symbol).

When connecting to a collection, if it doesn't exist, it will be automatically created.

In [8]:
# Access a collection (create it if not exist)
collection = store.collection('NASDAQ.EOD')
collection
Out[8]:
PyStore.collection <NASDAQ.EOD>

Now, when we list all collections in the datastore, we can see our newly created collection:

In [9]:
store.list_collections()
Out[9]:
['NASDAQ.EOD']

Working with collection items

Before saving our data, let's see if there are any existing items in the collection:

In [10]:
collection.list_items()
Out[10]:
[]

Now we're ready to store our data. For demo purposes, we won't be storing the last row; we'll append it later. We'll also attach some metadata indicating the data source.

In [11]:
collection.write('AAPL', aapl[:-1], metadata={'source': 'Quandl'})

Now, when we list all items in the collection, we can see our newly created item:

In [12]:
collection.list_items()
Out[12]:
['AAPL']
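
At this point, the hierarchical directory structure described earlier looks something like this on disk (an illustrative sketch; the exact files PyStore and Fastparquet create inside the item directory may vary by version):

./pystore_demo/                  # storage path
└── mydatastore/                 # datastore
    └── NASDAQ.EOD/              # collection (namespace)
        └── AAPL/                # item: partitioned parquet files + metadata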

Now let's read the item from the datastore's collection:

In [13]:
# Read the item's data
item = collection.item('AAPL')
item
Out[13]:
PyStore.item <NASDAQ.EOD/AAPL>

The item object has two main properties: data, which returns a Dask dataframe, and metadata, which returns the metadata we attached to our item, along with an "updated" timestamp. To learn more about Dask dataframes and their capabilities, visit http://dask.pydata.org/en/latest/dataframe.html.

In [14]:
data = item.data
data
Out[14]:
Dask DataFrame Structure:
Open High Low Close Volume Ex-Dividend Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
npartitions=1
1980-12-12 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64
2018-03-26 ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: read-parquet, 1 tasks
In [15]:
item.metadata
Out[15]:
{'source': 'Quandl', '_updated': '2018-06-06 13:01:07.746784'}
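
Since data is a Dask dataframe, operations on it are lazy: nothing is loaded into memory until we explicitly ask for a result. As a minimal sketch (using the item above), we can run an out-of-core aggregation like this:

# Build a lazy computation graph, then execute it with .compute()
item.data['Close'].max().compute()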

To load the item's data as a Pandas dataframe, we call the to_pandas() method.

In [16]:
df = item.to_pandas()
df.tail()
Out[16]:
Open High Low Close Volume Ex-Dividend Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
Date
2018-03-20 175.24 176.80 174.94 175.240 19314039.0 0.0 1.0 175.24 176.80 174.94 175.240 19314039.0
2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0
2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0
2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0
2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0

Let's append the last day (row) to our item:

In [17]:
collection.append('AAPL', aapl[-1:])

df = collection.item('AAPL').to_pandas()
df.tail()
Out[17]:
Open High Low Close Volume Ex-Dividend Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
Date
2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0
2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0
2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0
2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0
2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0

Querying Collections

After a while, you'll have many items stored, and you may want to look some of them up by metadata. To do this, simply pass your metadata key as a keyword argument to the list_items() method:

In [18]:
collection.list_items(source='Quandl')
Out[18]:
['AAPL']

Working with Snapshots

When working with data, there will be times when you'll accidentally mess up the data, making it unusable. For that reason, PyStore allows you to create snapshots: a named, point-in-time reference to all the items currently in a collection.

First, let's see if we have any existing snapshots:

In [19]:
collection.list_snapshots()
Out[19]:
[]

Creating a snapshot is done using the create_snapshot method:

In [20]:
collection.create_snapshot('snapshot_name')

collection.list_snapshots()
Out[20]:
['snapshot_name']

To see how snapshots work, let's overwrite our original AAPL item so it includes only the Close and Volume columns.

In [21]:
collection.write('AAPL', aapl[['Close', 'Volume']],
                 metadata={'source': 'Quandl'},
                 overwrite=True)

# Load the "new" item
df = collection.item('AAPL').to_pandas()
df.tail()
Out[21]:
Close Volume
Date
2018-03-21 171.270 35247358.0
2018-03-22 168.845 41051076.0
2018-03-23 164.940 40248954.0
2018-03-26 172.770 36272617.0
2018-03-27 168.340 38962839.0

To load the item from a previous snapshot, we just need to specify it when using the item() method:

In [22]:
snap_df = collection.item('AAPL', snapshot='snapshot_name')
snap_df.to_pandas().tail()
Out[22]:
Open High Low Close Volume Ex-Dividend Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
Date
2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0
2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0
2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0
2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0
2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0

We can, of course, restore our data from the snapshot:

In [23]:
collection.write('AAPL', snap_df.to_pandas(),
                 metadata={'source': 'Quandl'},
                 overwrite=True)

df = collection.item('AAPL').to_pandas()
df.tail()
Out[23]:
Open High Low Close Volume Ex-Dividend Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
Date
2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0
2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0
2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0
2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0
2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0

Once we're sure we no longer need this snapshot, we can delete it.

In [24]:
# Delete a collection snapshot
collection.delete_snapshot('snapshot_name')

# To delete all snapshots, use:
# collection.delete_snapshots()
Out[24]:
True

Deleting items, collections and stores

In [25]:
# Delete the item from the current version
collection.delete_item('AAPL')
Out[25]:
True
In [26]:
# Delete the collection
store.delete_collection('NASDAQ.EOD')
Out[26]:
True
In [27]:
# Delete the datastore
pystore.delete_store('mydatastore')

# to delete all datastores in the path, use:
# pystore.delete_stores()
Out[27]:
True