This tutorial will cover the following topics:
- Modeling dataframes as dense TileDB arrays
- Ingesting a CSV file into a 1D dense array
- Inspecting the array schema and slicing results into pandas dataframes
TileDB can model dataframes as either dense or sparse arrays. Storing a dataframe as a (1D) dense array allows for rapid slicing on row indices. On the other hand, storing the dataframe as an N-D sparse array, designating any subset of the columns to act as the dimensions, allows for rapid slicing with range predicates on those column dimensions.
In either case, and in addition to the slicing predicate, TileDB allows for very fast subselection of columns. This is because it implements a "columnar" format and, therefore, fetches from persistent storage only the data of the requested columns.
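As a quick, hypothetical sketch of the two options (the array names here are illustrative, and the sparse variant picks an example dimension column):
import tiledb

# Dense (the default): a 1D array indexed by row position, suited to
# fast slicing on row ranges.
tiledb.from_csv("dense_example", "yellow_tripdata_2020-01.csv")

# Sparse: designate one or more columns as dimensions, enabling fast
# range predicates on those columns.
tiledb.from_csv("sparse_example", "yellow_tripdata_2020-01.csv",
                sparse=True, index_col=['tpep_pickup_datetime'])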
This notebook was run on a 2.3 GHz Intel Core i9 with 8 cores and 16 GB of RAM, running macOS Mojave.
We will use the NYC Taxi Trip dataset, specifically the yellow taxi trip records, which have this schema.
We will focus on ingesting the data from January 2020, namely file yellow_tripdata_2020-01.csv. The file is about 560 MB.
!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
!ls -alh yellow_tripdata_2020-01.csv
You need to install TileDB-Py, the Python wrapper of TileDB Embedded, as follows:
# Pip:
$ pip install tiledb
# Or Conda:
$ conda install -c conda-forge tiledb-py
The notebook was run using pandas 1.1.0.
Note that the TileDB core is a C++ library. To boost performance when integrating with pandas, we use Apache Arrow to achieve zero-copy transfer of results from TileDB into pandas dataframes. You need to install pyarrow to take advantage of this optimization:
# Pip:
$ pip install pyarrow
# Or Conda:
$ conda install -c conda-forge pyarrow
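With pyarrow installed, TileDB-Py uses Arrow under the hood when returning pandas dataframes. If you want an Arrow table directly, a sketch like the following should work, assuming your TileDB-Py version supports the return_arrow query flag:
import tiledb

# Open the array (created later in this tutorial) and read a row range
# back as a pyarrow.Table instead of a pandas DataFrame.
A = tiledb.open("taxi_dense_array")
tbl = A.query(return_arrow=True).df[0:999]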
Import TileDB and check the versions of the C++ core and TileDB-Py respectively.
import tiledb, numpy as np
# Version of TileDB core (C++ library)
tiledb.libtiledb.version()
# Version of TileDB-Py (Python wrapper)
tiledb.__version__
Before we start, we create the TileDB context, passing a configuration parameter that controls memory allocation during read queries (explained in a later tutorial). This needs to be set at the very beginning of the code, before any other TileDB function is called.
# Allocate an initial 50 MB of buffer space for read queries
cfg = tiledb.Ctx().config()
cfg.update(
    {'py.init_buffer_bytes': 1024**2 * 50}
)
tiledb.default_ctx(cfg)
We also enable the TileDB stats so that we can get some insight into performance.
tiledb.stats_enable()
We ingest the yellow_tripdata_2020-01.csv CSV file into a TileDB dense array as shown below. The command takes the taxi CSV file and ingests it into a 1D dense array called taxi_dense_array. It sets the tile extent to 100K, which means that groups of 100K rows across every column will comprise the atomic unit of compression and IO (i.e., a data tile). Two of the columns (tpep_dropoff_datetime and tpep_pickup_datetime) are dates, so we make sure to parse them as such. Finally, one of the columns (store_and_fwd_flag) may contain nulls, so we explicitly set a fill value for it (here, the empty string).
%%time
tiledb.stats_reset()
tiledb.from_csv("taxi_dense_array", "yellow_tripdata_2020-01.csv",
                tile=100000,
                parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'],
                fillna={'store_and_fwd_flag': ''})
tiledb.stats_dump()
From the stats, the actual write in TileDB took under 1 second (the rest of the time was mostly spent parsing the CSV file in pandas). The raw uncompressed data was about 870 MB in binary format, which TileDB compressed down to about 131 MB. There are 18 columns written as attributes, one of which is var-sized (of string type, as we will see in the schema below).
Next, let's open the written array and inspect the TileDB schema.
A = tiledb.open("taxi_dense_array")
print(A.schema)
That shows the 18 columns being stored as attributes, along with their types and filters (e.g., zstd compression, which is the default). There is a single dimension __tiledb_rows, which corresponds to the row indices. This essentially means that you will be able to slice fast across the row indices.
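Thanks to the columnar format, row slicing pairs well with column subselection. A minimal sketch, assuming TileDB-Py's query(attrs=...) interface and two attribute names from the schema above:
# Fetch only two attributes over a row range; only the data tiles of the
# requested columns are read from storage.
subset = A.query(attrs=['fare_amount', 'trip_distance']).df[0:99999]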
To see the number of rows ingested into this array, you can use the non-empty domain. The range below is inclusive, indicating that there are 6,405,008 rows in the array.
print(A.nonempty_domain())
Let's reset the stats and perform a full read of the array (all rows and all columns). The result is stored directly in a pandas dataframe. Note that ranges with df are always inclusive.
%%time
tiledb.stats_reset()
df = A.df[0:6405007]
df