This tutorial demonstrates how to model a dataframe as a TileDB array and ingest a large CSV file into a 1D dense array.
TileDB can model dataframes as either dense or sparse arrays. Storing a dataframe as a 1D dense array allows rapid slicing on row indices. Alternatively, storing it as an ND sparse array, with any subset of the columns acting as dimensions, allows rapid slicing with range predicates on those column dimensions.
In either case, and in addition to the slicing predicate, TileDB supports very fast subselection of columns. This is because it implements a "columnar" format and, therefore, fetches from persistent storage only the data of the requested columns.
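To build intuition about why a columnar layout makes column subselection cheap, here is a toy, library-free sketch (this is illustrative Python, not TileDB code): each column lives in its own contiguous buffer, so reading a subset of columns never touches the bytes of the others.

```python
# Toy columnar store: each column is kept in a separate buffer (here, a list).
# Fetching a subset of columns only reads those buffers.
table = {
    "trip_distance": [1.2, 0.6, 3.4],
    "fare_amount":   [6.0, 6.0, 12.5],
    "tip_amount":    [1.47, 1.0, 2.0],
}

def select(columns, lo, hi):
    """Return only the requested columns, sliced on the inclusive row range [lo, hi]."""
    return {c: table[c][lo:hi + 1] for c in columns}

result = select(["fare_amount"], 0, 1)
print(result)  # {'fare_amount': [6.0, 6.0]}
```

In a row-oriented format, by contrast, the bytes of all 18 taxi columns are interleaved, so even a single-column query must scan past the data of the other 17.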
This notebook was run on a 2.3 GHz Intel Core i9, 8 cores, 16GB RAM, running MacOS Mojave.
We will use the NYC Taxi Trip dataset, specifically the yellow taxi trip records, which have this schema. We will focus on ingesting the data from January 2020, namely file yellow_tripdata_2020-01.csv. The file is about 560MB.
!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
!ls -alh yellow_tripdata_2020-01.csv
-rw-rw---- 1 stavros staff 566M Jul 30 00:07 yellow_tripdata_2020-01.csv
You need to install TileDB-Py, the Python wrapper of TileDB Embedded, as follows:
# Pip:
$ pip install tiledb
# Or Conda:
$ conda install -c conda-forge tiledb-py
The notebook was run using Pandas 1.1.0.
Note that the TileDB core is a C++ library. To boost performance when integrating with pandas, we use Apache Arrow to achieve zero-copy when returning results from TileDB into pandas dataframes. You need to install pyarrow to take advantage of this optimization:
# Pip:
$ pip install pyarrow
# Or Conda:
$ conda install -c conda-forge pyarrow
Import TileDB and check the versions of the C++ core and TileDB-Py respectively.
import tiledb, numpy as np
# Version of TileDB core (C++ library)
tiledb.libtiledb.version()
(2, 1, 3)
# Version of TileDB-Py (Python wrapper)
tiledb.__version__
'0.7.4'
Before we start, we create the TileDB context, passing a configuration parameter that controls memory allocation during read queries (explained in a later tutorial). This needs to be set at the very beginning of the code, before any other TileDB function is called.
cfg = tiledb.Ctx().config()
cfg.update(
{
'py.init_buffer_bytes': 1024**2 * 50
}
)
tiledb.default_ctx(cfg)
tiledb.Ctx() [see Ctx.config() for configuration]
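For reference, the value assigned to py.init_buffer_bytes above is simply 50 MiB expressed in bytes:

```python
# 1024**2 bytes = 1 MiB; the configuration above sets the initial
# read buffer allocation to 50 MiB.
init_buffer_bytes = 1024**2 * 50
print(init_buffer_bytes)  # 52428800
```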
We also enable the TileDB stats so that we can get some insight into performance.
tiledb.stats_enable()
We ingest the yellow_tripdata_2020-01.csv CSV file into a TileDB dense array as shown below. The command takes the taxi CSV file and ingests it into a 1D dense array called taxi_dense_array. It sets the tile extent to 100K, which means that groups of 100K rows across every column will comprise the atomic unit of compression and IO (i.e., a data tile). Two of the columns (tpep_dropoff_datetime and tpep_pickup_datetime) are dates, so we make sure to parse them as such. Finally, one of the columns (store_and_fwd_flag) may have nulls, so we explicitly set a null value.
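Before running the ingestion, it helps to see what this tile extent implies. With roughly 6.4M rows (the exact count is reported later by the non-empty domain), each column is partitioned into the following number of data tiles:

```python
import math

rows = 6_405_008       # total ingested rows, as reported by the non-empty domain
tile_extent = 100_000  # rows per data tile, as set in from_csv

# Each attribute is chunked independently into tiles of `tile_extent` rows.
tiles_per_column = math.ceil(rows / tile_extent)
print(tiles_per_column)  # 65
```

Each of those 65 tiles per attribute is compressed and written independently, which is what enables TileDB to read back only the tiles overlapping a requested row range.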
%%time
tiledb.stats_reset()
tiledb.from_csv("taxi_dense_array", "yellow_tripdata_2020-01.csv",
tile = 100000,
parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'],
fillna={'store_and_fwd_flag': ''})
tiledb.stats_dump()
/opt/miniconda3/envs/tiledb/lib/python3.8/site-packages/IPython/core/magic.py:187: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False. call = lambda f, *a, **k: f(*a, **k)
TileDB Embedded Version: (2, 1, 3)
TileDB-Py Version: 0.7.4

==== WRITE ====

- Number of write queries: 1
- Number of attributes written: 18
  * Number of fixed-sized attributes written: 17
  * Number of var-sized attributes written: 1
- Number of bytes written: 141115071 bytes (0.131424 GB)
- Number of write operations: 1362
- Number of bytes filtered: 942532872 bytes (0.877802 GB)
- Filtering deflation factor: 6.67918x
- Total metadata written: 13304 bytes (1.23903e-05 GB)
  * Array schema: 1129 bytes (1.05146e-06 GB)
  * Fragment metadata footer: 846 bytes (7.87899e-07 GB)
  * R-tree: 99 bytes (9.22009e-08 GB)
  * Fixed-sized tile offsets: 6838 bytes (6.36838e-06 GB)
  * Var-sized tile offsets: 2307 bytes (2.14856e-06 GB)
  * Var-sized tile sizes: 2085 bytes (1.94181e-06 GB)
- Time to write array metadata: 0.00435116 secs
  * Array metadata size: 55 bytes (5.12227e-08 GB)
- Write time: 0.711282 secs
  * Time to split the coordinates buffer: 9.6e-08 secs
  * Time to check out-of-bounds coordinates: 3.8e-08 secs
  * Time to initialize dense cell range iterators: 3.4535e-05 secs
  * Time to compute cell ranges: 5.2313e-05 secs
  * Time to prepare and filter tiles: 0.624271 secs
  * Time to write tiles: 0.0603356 secs
  * Time to write fragment metadata: 0.0119379 secs
- Time to finalize write query: 1.37e-07 secs

CPU times: user 18.1 s, sys: 4.42 s, total: 22.6 s
Wall time: 16 s
From the stats, the actual write in TileDB took under 1 second (the rest was mostly parsing the CSV in pandas). The raw uncompressed CSV data was about 870 MB in binary format, which TileDB compressed down to about 131 MB. There are 18 columns written as attributes, one of which is var-sized (of string type, as we will see in the schema below).
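We can verify the reported deflation factor directly from the byte counts in the stats dump:

```python
# Byte counts taken from the write stats above.
bytes_filtered = 942_532_872  # uncompressed bytes passed through the filter pipeline
bytes_written  = 141_115_071  # compressed bytes actually written to storage

deflation = bytes_filtered / bytes_written
print(round(deflation, 5))  # 6.67918, matching "Filtering deflation factor"
```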
Next, let's open the written array and inspect the TileDB schema.
A = tiledb.open("taxi_dense_array")
print(A.schema)
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 6405007), tile=100000, dtype='uint64'),
  ]),
  attrs=[
    Attr(name='VendorID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tpep_pickup_datetime', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tpep_dropoff_datetime', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='passenger_count', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='trip_distance', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='RatecodeID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='store_and_fwd_flag', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='PULocationID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='DOLocationID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='payment_type', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='fare_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='extra', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='mta_tax', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tip_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tolls_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='improvement_surcharge', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='total_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='congestion_surcharge', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=False,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)
That shows the 18 columns being stored as attributes, along with their types and filters (e.g., zstd compression, which is the default). There is a single dimension __tiledb_rows, which corresponds to the row indices. This essentially means that you will be able to slice fast across the row indices.
In order to see the number of rows ingested into this array, you can use the non-empty domain. The range below is inclusive and states that there are 6,405,008 rows in the array.
print(A.nonempty_domain())
((array(0, dtype=uint64), array(6405007, dtype=uint64)),)
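Since the non-empty domain range is inclusive, the row count follows directly from its bounds:

```python
# Bounds printed by nonempty_domain() above; both ends are inclusive.
lo, hi = 0, 6405007
num_rows = hi - lo + 1
print(num_rows)  # 6405008
```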
Let's reset the stats and perform a full read of the array (all rows and all columns). The result is stored directly in a pandas dataframe. Note that ranges with df are always inclusive.
%%time
tiledb.stats_reset()
df = A.df[0:6405007]
df
CPU times: user 3.14 s, sys: 5.12 s, total: 8.26 s Wall time: 1.59 s
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0 | 1.20 | 1.0 | N | 238 | 239 | 1.0 | 6.00 | 3.00 | 0.5 | 1.47 | 0.00 | 0.3 | 11.27 | 2.5 |
1 | 1.0 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0 | 1.20 | 1.0 | N | 239 | 238 | 1.0 | 7.00 | 3.00 | 0.5 | 1.50 | 0.00 | 0.3 | 12.30 | 2.5 |
2 | 1.0 | 2020-01-01 00:47:41 | 2020-01-01 00:53:52 | 1.0 | 0.60 | 1.0 | N | 238 | 238 | 1.0 | 6.00 | 3.00 | 0.5 | 1.00 | 0.00 | 0.3 | 10.80 | 2.5 |
3 | 1.0 | 2020-01-01 00:55:23 | 2020-01-01 01:00:14 | 1.0 | 0.80 | 1.0 | N | 238 | 151 | 1.0 | 5.50 | 0.50 | 0.5 | 1.36 | 0.00 | 0.3 | 8.16 | 0.0 |
4 | 2.0 | 2020-01-01 00:01:58 | 2020-01-01 00:04:16 | 1.0 | 0.00 | 1.0 | N | 193 | 193 | 2.0 | 3.50 | 0.50 | 0.5 | 0.00 | 0.00 | 0.3 | 4.80 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6405003 | NaN | 2020-01-31 22:51:00 | 2020-01-31 23:22:00 | NaN | 3.24 | NaN |