Dataframes: The Basics

This tutorial will cover the following topics:

  • Storing a dataframe as a TileDB 1D dense array to allow fast (out-of-core) slicing on rows
  • Storing a dataframe as a TileDB ND sparse array to allow fast (out-of-core) execution of column range predicates
  • Interoperating with Pandas and Apache Arrow
  • Fast subselection on columns
  • Running SQL queries on the stored dataframes
  • Measuring performance in TileDB
  • Running on different storage backends
  • Some basic virtual file system (VFS) operations with TileDB

TileDB can model dataframes either as dense or sparse arrays. Storing a dataframe as a (1D) dense array allows for rapid slicing on row indices. On the other hand, storing the dataframe as an ND sparse array, with any subset of the columns acting as the dimensions, allows for rapid slicing with range predicates on those columns.

In either case, and in addition to the slicing predicate, TileDB allows for very fast subselection of columns. This is because it implements a columnar format and therefore fetches only the requested columns' data from persistent storage.

This notebook was run on a 2.3 GHz Intel Core i9 (8 cores) with 16GB RAM, running macOS Mojave.

Getting Started

Dataset

We will use the NYC Taxi Trip dataset, specifically the yellow taxi trip records, which have this schema.

We will focus on ingesting the data from January 2020, namely file yellow_tripdata_2020-01.csv. The file is about 560MB.

In [ ]:
!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv
In [1]:
!ls -alh yellow_tripdata_2020-01.csv
-rw-rw----  1 stavros  staff   566M Jul 30 00:07 yellow_tripdata_2020-01.csv

Installation

You need to install TileDB-Py, the Python wrapper of TileDB Embedded, as follows:

# Pip:
$ pip install tiledb

# Or Conda:
$ conda install -c conda-forge tiledb-py

The notebook was run using Pandas 1.1.0.

Note that the TileDB core is a C++ library. To boost performance when integrating with pandas, we use Apache Arrow to achieve zero-copy when returning results from TileDB into pandas dataframes. You need to install pyarrow to take advantage of this optimization:

# Pip:
$ pip install pyarrow

# Or Conda:
$ conda install -c conda-forge pyarrow
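
To quickly confirm that the Arrow-based path can be used, you can check that pyarrow imports cleanly (a minimal sanity check, nothing TileDB-specific):

# pyarrow must be importable for the zero-copy optimization described above
import pyarrow
print(pyarrow.__version__)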

Setup

Import TileDB and check the versions of the C++ core and TileDB-Py respectively.

In [2]:
import tiledb, numpy as np
In [3]:
# Version of TileDB core (C++ library)
tiledb.libtiledb.version()
Out[3]:
(2, 1, 3)
In [4]:
# Version of TileDB-Py (Python wrapper)
tiledb.__version__
Out[4]:
'0.7.4'

Before we start, we create the TileDB context, passing a configuration parameter that controls memory allocation during read queries (it will be explained in a later tutorial). This needs to be set at the very beginning of the code, before any other TileDB function is called.

In [5]:
cfg = tiledb.Ctx().config()
cfg.update(
  {
    'py.init_buffer_bytes': 1024**2 * 50
  }
)
tiledb.default_ctx(cfg)
Out[5]:
tiledb.Ctx() [see Ctx.config() for configuration]

We also enable the TileDB stats so that we can get some insight into performance.

In [6]:
tiledb.stats_enable()

The Dense Case

We ingest the yellow_tripdata_2020-01.csv CSV file into a TileDB dense array as shown below. The command takes the taxi CSV file and ingests it into a 1D dense array called taxi_dense_array. It sets the tile extent to 100K, which means that each group of 100K rows, across every column, will comprise the atomic unit of compression and IO (i.e., a data tile). Two of the columns (tpep_dropoff_datetime and tpep_pickup_datetime) are dates, so we make sure to parse them as such. Finally, one of the columns (store_and_fwd_flag) may contain nulls, so we explicitly set a fill value for them (here, the empty string).

In [7]:
%%time
tiledb.stats_reset()
tiledb.from_csv("taxi_dense_array", "yellow_tripdata_2020-01.csv", 
                tile = 100000, 
                parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], 
                fillna={'store_and_fwd_flag': ''})
tiledb.stats_dump()
/opt/miniconda3/envs/tiledb/lib/python3.8/site-packages/IPython/core/magic.py:187: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.
  call = lambda f, *a, **k: f(*a, **k)
TileDB Embedded Version: (2, 1, 3)
TileDB-Py Version: 0.7.4
==== WRITE ====

- Number of write queries: 1

- Number of attributes written: 18
  * Number of fixed-sized attributes written: 17
  * Number of var-sized attributes written: 1

- Number of bytes written: 141115071 bytes (0.131424 GB) 
- Number of write operations: 1362
- Number of bytes filtered: 942532872 bytes (0.877802 GB) 
- Filtering deflation factor: 6.67918x

- Total metadata written: 13304 bytes (1.23903e-05 GB) 
  * Array schema: 1129 bytes (1.05146e-06 GB) 
  * Fragment metadata footer: 846 bytes (7.87899e-07 GB) 
  * R-tree: 99 bytes (9.22009e-08 GB) 
  * Fixed-sized tile offsets: 6838 bytes (6.36838e-06 GB) 
  * Var-sized tile offsets: 2307 bytes (2.14856e-06 GB) 
  * Var-sized tile sizes: 2085 bytes (1.94181e-06 GB) 

- Time to write array metadata: 0.00435116 secs
  * Array metadata size: 55 bytes (5.12227e-08 GB) 


- Write time: 0.711282 secs
  * Time to split the coordinates buffer: 9.6e-08 secs
  * Time to check out-of-bounds coordinates: 3.8e-08 secs
  * Time to initialize dense cell range iterators: 3.4535e-05 secs
  * Time to compute cell ranges: 5.2313e-05 secs
  * Time to prepare and filter tiles: 0.624271 secs
  * Time to write tiles: 0.0603356 secs
  * Time to write fragment metadata: 0.0119379 secs

- Time to finalize write query: 1.37e-07 secs
CPU times: user 18.1 s, sys: 4.42 s, total: 22.6 s
Wall time: 16 s

From the stats, the actual write in TileDB took under 1 second (the rest was mostly parsing the CSV in pandas). The raw uncompressed CSV data was about 870 MB in binary format, which compressed down to about 131 MB in TileDB. There are 18 columns written as attributes, one of which is var-sized (of string type, as we will see in the schema below).

Next, let's open the written array and inspect the TileDB schema.

In [8]:
A = tiledb.open("taxi_dense_array")
print(A.schema)
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 6405007), tile=100000, dtype='uint64'),
  ]),
  attrs=[
    Attr(name='VendorID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tpep_pickup_datetime', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tpep_dropoff_datetime', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='passenger_count', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='trip_distance', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='RatecodeID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='store_and_fwd_flag', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='PULocationID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='DOLocationID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='payment_type', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='fare_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='extra', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='mta_tax', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tip_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tolls_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='improvement_surcharge', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='total_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='congestion_surcharge', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=False,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)

The schema shows the 18 columns stored as attributes, along with their types and filters (e.g., Zstd compression, which is the default). There is a single dimension, __tiledb_rows, which corresponds to the row indices. This means that you will be able to slice fast across the row indices.

In order to see the number of rows ingested into this array, you can use the non-empty domain. The range below is inclusive and states that there are 6,405,008 rows in the array.

In [9]:
print(A.nonempty_domain())
((array(0, dtype=uint64), array(6405007, dtype=uint64)),)

Let's reset the stats and perform a full read of the array (all rows and all columns). The result is stored directly in a pandas dataframe. Note that ranges with df are always inclusive.

In [10]:
%%time
tiledb.stats_reset()
df = A.df[0:6405007]
df
CPU times: user 3.14 s, sys: 5.12 s, total: 8.26 s
Wall time: 1.59 s
Out[10]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 1.20 1.0 N 238 239 1.0 6.00 3.00 0.5 1.47 0.00 0.3 11.27 2.5
1 1.0 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 1.20 1.0 N 239 238 1.0 7.00 3.00 0.5 1.50 0.00 0.3 12.30 2.5
2 1.0 2020-01-01 00:47:41 2020-01-01 00:53:52 1.0 0.60 1.0 N 238 238 1.0 6.00 3.00 0.5 1.00 0.00 0.3 10.80 2.5
3 1.0 2020-01-01 00:55:23 2020-01-01 01:00:14 1.0 0.80 1.0 N 238 151 1.0 5.50 0.50 0.5 1.36 0.00 0.3 8.16 0.0
4 2.0 2020-01-01 00:01:58 2020-01-01 00:04:16 1.0 0.00 1.0 N 193 193 2.0 3.50 0.50 0.5 0.00 0.00 0.3 4.80 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6405003 NaN 2020-01-31 22:51:00 2020-01-31 23:22:00 NaN 3.24 NaN 237 234 NaN 17.59 2.75 0.5 0.00 0.00 0.3 21.14 0.0
6405004 NaN 2020-01-31 22:10:00 2020-01-31 23:26:00 NaN 22.13 NaN 259 45 NaN 46.67 2.75 0.5 0.00 12.24 0.3 62.46 0.0
6405005 NaN 2020-01-31 22:50:07 2020-01-31 23:17:57 NaN 10.51 NaN 137 169 NaN 48.85 2.75 0.0 0.00 0.00 0.3 51.90 0.0
6405006 NaN 2020-01-31 22:25:53 2020-01-31 22:48:32 NaN 5.49 NaN 50 42 NaN 27.17 2.75 0.0 0.00 0.00 0.3 30.22 0.0
6405007 NaN 2020-01-31 22:44:00 2020-01-31 23:06:00 NaN 11.60 NaN 179 205 NaN 54.56 2.75 0.5 0.00 0.00 0.3 58.11 0.0

6405008 rows × 18 columns

In [11]:
tiledb.stats_dump()
TileDB Embedded Version: (2, 1, 3)
TileDB-Py Version: 0.7.4
==== READ ====

- Number of read queries: 1
- Number of attempts until results are found: 1

- Number of attributes read: 18
  * Number of fixed-sized attributes read: 17
  * Number of var-sized attributes read: 1
- Number of dimensions read: 1
  * Number of fixed-sized dimensions read: 1

- Number of logical tiles overlapping the query: 65
- Number of physical tiles read: 1300
  * Number of physical fixed-sized tiles read: 1170
  * Number of physical var-sized tiles read: 130
- Number of cells read: 6500000
- Number of result cells: 6405008
- Percentage of useful cells read: 98.5386%

- Number of bytes read: 141109657 bytes (0.131419 GB) 
- Number of read operations: 87
- Number of bytes unfiltered: 942510623 bytes (0.877781 GB) 
- Unfiltering inflation factor: 6.67928x

- Time to compute estimated result size: 0.00081386 secs
  * Time to compute tile overlap: 0.000311342 secs
    > Time to compute relevant fragments: 1.8264e-05 secs
    > Time to load relevant fragment R-trees: 0.000177244 secs
    > Time to compute relevant fragment tile overlap: 0.000106599 secs

- Total metadata read: 10568 bytes (9.84222e-06 GB) 
  * R-tree: 8 bytes (7.45058e-09 GB) 
  * Fixed-sized tile offsets: 9504 bytes (8.85129e-06 GB) 
  * Var-sized tile offsets: 528 bytes (4.91738e-07 GB) 
  * Var-sized tile sizes: 528 bytes (4.91738e-07 GB) 

- Time to load array metadata: 0.00107096 secs
  * Array metadata size: 55 bytes (5.12227e-08 GB) 

- Time to initialize the read state: 6.5081e-05 secs

- Read time: 0.853632 secs
  * Time to compute next partition: 0.000322538 secs
  * Time to compute tile coordinates: 3.8468e-05 secs
  * Time to compute result coordinates: 8.112e-06 secs
    > Time to compute sparse result tiles: 4.263e-06 secs
  * Time to compute dense result cell slabs: 0.000260793 secs
  * Time to copy result attribute values: 0.838509 secs
    > Time to read attribute tiles: 0.132942 secs
    > Time to unfilter attribute tiles: 0.209192 secs
    > Time to copy fixed-sized attribute values: 0.352938 secs
    > Time to copy var-sized attribute values: 0.0787757 secs
  * Time to fill dense coordinates: 0.0144067 secs

- Total read query time (array open + init state + read): 0.853698 secs
==== Python Stats ====

- TileDB-Py Indexing Time: 1.49114
  * TileDB-Py query execution time: 0.861424
    > TileDB C++ Core initial query submit time: 0.853751
  * TileDB-Py buffer conversion time: 0.627318

This operation fetches the entire array / dataframe from disk, decompresses all tiles and creates a pandas dataframe with the result. The whole process takes about 0.85 seconds in TileDB core (C++) and about 0.6 seconds on the Python wrapper side for buffer conversion.

The stats are quite informative. They break down how long it took to read the data from storage and unfilter (i.e., decompress) it, how many cells were fetched, the percentage of useful cells read, etc.

However, note that you do not need to read the entire dataframe into main memory in order to process it. You can efficiently slice any subset of rows directly from storage as follows, with TileDB making very lightweight use of main memory to process the result. Note that df[] works with multi-index semantics and can therefore take multi-range subarrays as well (see the sketch after the next example).

In [12]:
%%time
df = A.df[0:999]
df
CPU times: user 19.1 ms, sys: 137 ms, total: 156 ms
Wall time: 74.2 ms
Out[12]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 1.20 1.0 N 238 239 1.0 6.0 3.0 0.5 1.47 0.0 0.3 11.27 2.5
1 1.0 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 1.20 1.0 N 239 238 1.0 7.0 3.0 0.5 1.50 0.0 0.3 12.30 2.5
2 1.0 2020-01-01 00:47:41 2020-01-01 00:53:52 1.0 0.60 1.0 N 238 238 1.0 6.0 3.0 0.5 1.00 0.0 0.3 10.80 2.5
3 1.0 2020-01-01 00:55:23 2020-01-01 01:00:14 1.0 0.80 1.0 N 238 151 1.0 5.5 0.5 0.5 1.36 0.0 0.3 8.16 0.0
4 2.0 2020-01-01 00:01:58 2020-01-01 00:04:16 1.0 0.00 1.0 N 193 193 2.0 3.5 0.5 0.5 0.00 0.0 0.3 4.80 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 2.0 2020-01-01 00:17:02 2020-01-01 00:22:02 2.0 0.62 1.0 N 237 141 1.0 5.0 0.5 0.5 2.20 0.0 0.3 11.00 2.5
996 2.0 2020-01-01 00:26:22 2020-01-01 00:52:31 2.0 7.09 1.0 N 140 244 2.0 25.5 0.5 0.5 0.00 0.0 0.3 29.30 2.5
997 2.0 2020-01-01 00:54:56 2020-01-01 01:10:51 2.0 2.59 1.0 N 244 42 1.0 12.5 0.5 0.5 0.00 0.0 0.3 13.80 0.0
998 1.0 2020-01-01 00:34:17 2020-01-01 00:42:20 1.0 0.80 1.0 N 48 142 1.0 6.5 3.0 0.5 2.05 0.0 0.3 12.35 2.5
999 1.0 2020-01-01 00:44:24 2020-01-01 01:16:17 2.0 4.00 1.0 N 142 224 2.0 21.0 3.0 0.5 0.00 0.0 0.3 24.80 2.5

1000 rows × 18 columns

Notice how much faster that operation was, taking only about 74 milliseconds.
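
As a sketch of the multi-range capability mentioned above (df shares the semantics of the multi_index indexer), you can pass a list of ranges to fetch several disjoint row segments in a single query; the row numbers below are arbitrary:

# Fetch rows 0-999 and 5000-5999 (both ranges inclusive) in one query
df = A.df[[slice(0, 999), slice(5000, 5999)]]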

Finally, you can slice any subset of columns without first fetching all the columns into a pandas dataframe.

In [13]:
%%time
df = A.query(attrs=['tpep_dropoff_datetime', 'fare_amount']).df[0:6405007]
df
CPU times: user 423 ms, sys: 614 ms, total: 1.04 s
Wall time: 176 ms
Out[13]:
tpep_dropoff_datetime fare_amount
0 2020-01-01 00:33:03 6.00
1 2020-01-01 00:43:04 7.00
2 2020-01-01 00:53:52 6.00
3 2020-01-01 01:00:14 5.50
4 2020-01-01 00:04:16 3.50
... ... ...
6405003 2020-01-31 23:22:00 17.59
6405004 2020-01-31 23:26:00 46.67
6405005 2020-01-31 23:17:57 48.85
6405006 2020-01-31 22:48:32 27.17
6405007 2020-01-31 23:06:00 54.56

6405008 rows × 2 columns

Once again, that operation was much faster than fetching the entire dataframe into main memory. The stats also inform you about how many attributes (i.e., columns) were retrieved, which is two in this example.

Remember to close the array when you are done.

In [14]:
A.close()

The Sparse Case

Storing the dataframe as a 1D dense array allowed us to rapidly slice on row indices. But what if we wish to slice fast on predicates applied to column values, such as dropoff time and fare amount? For such scenarios, if you know that the majority of your workloads apply a range (or equality) predicate on a specific subset of columns, you can create a sparse array with those columns set as the dimensions.

This can be done as follows. Instead of the tile argument we used for dense arrays, we use capacity to determine how many rows to group in a data tile (read about the difference between dense and sparse data tiles). The index_col argument determines which columns will act as dimensions.

In [15]:
%%time
tiledb.stats_reset()
tiledb.from_csv("taxi_sparse_array", "yellow_tripdata_2020-01.csv", 
                capacity=100000, 
                sparse=True, 
                index_col=['tpep_dropoff_datetime', 'fare_amount'], 
                parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], 
                fillna={'store_and_fwd_flag': ''})
tiledb.stats_dump()
/opt/miniconda3/envs/tiledb/lib/python3.8/site-packages/IPython/core/magic.py:187: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.
  call = lambda f, *a, **k: f(*a, **k)
/opt/miniconda3/envs/tiledb/lib/python3.8/site-packages/numpy/lib/arraysetops.py:580: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
TileDB Embedded Version: (2, 1, 3)
TileDB-Py Version: 0.7.4
==== WRITE ====

- Number of write queries: 1

- Number of attributes written: 16
  * Number of fixed-sized attributes written: 15
  * Number of var-sized attributes written: 1
- Number of dimensions written: 2
  * Number of fixed-sized dimensions written: 2

- Number of bytes written: 117477982 bytes (0.10941 GB) 
- Number of write operations: 1356
- Number of bytes filtered: 928759802 bytes (0.864975 GB) 
- Filtering deflation factor: 7.90582x

- Total metadata written: 14124 bytes (1.3154e-05 GB) 
  * Array schema: 1080 bytes (1.00583e-06 GB) 
  * Fragment metadata footer: 822 bytes (7.65547e-07 GB) 
  * R-tree: 1261 bytes (1.1744e-06 GB) 
  * Fixed-sized tile offsets: 6770 bytes (6.30505e-06 GB) 
  * Var-sized tile offsets: 2205 bytes (2.05357e-06 GB) 
  * Var-sized tile sizes: 1986 bytes (1.84961e-06 GB) 

- Time to write array metadata: 0.000813311 secs
  * Array metadata size: 98 bytes (9.12696e-08 GB) 

- Number of logical cells written: 6405008
- Number of logical tiles written: 65
  * Number of fixed-sized physical tiles written: 1105
  * Number of var-sized physical tiles written: 130

- Write time: 1.87926 secs
  * Time to split the coordinates buffer: 1.02e-07 secs
  * Time to check out-of-bounds coordinates: 0.0182705 secs
  * Time to sort coordinates: 0.821675 secs
  * Time to check coordinate duplicates: 1.3e-07 secs
  * Time to prepare tiles: 0.336857 secs
  * Time to compute coordinate metadata (e.g., MBRs): 0.00437902 secs
  * Time to filter tiles: 0.602146 secs
  * Time to write tiles: 0.0639578 secs
  * Time to write fragment metadata: 0.010609 secs

- Time to finalize write query: 4.93e-07 secs
CPU times: user 28.1 s, sys: 4.63 s, total: 32.8 s
Wall time: 17.4 s

Once again, most of the total ingestion time is spent parsing the CSV on the pandas side. Notice that the R-tree (which is now 2D) is slightly larger this time, as it is the main indexing structure in sparse arrays. It is still tiny, though, relative to the entire array size of ~110 MB.

Note that you can choose any subset of columns as the dimensions (any number with different types, even strings).
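
For instance, the following sketch (with a hypothetical array name, not used in the rest of this tutorial) would index the same CSV on the pickup and dropoff location IDs instead:

# Hypothetical alternative: use the two location ID columns as the dimensions
tiledb.from_csv("taxi_sparse_array_by_location", "yellow_tripdata_2020-01.csv",
                capacity=100000,
                sparse=True,
                index_col=['PULocationID', 'DOLocationID'],
                parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'],
                fillna={'store_and_fwd_flag': ''})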

Let's open the array and print the schema.

In [16]:
A = tiledb.open("taxi_sparse_array")
print(A.schema)
ArraySchema(
  domain=Domain(*[
    Dim(name='tpep_dropoff_datetime', domain=(numpy.datetime64('2003-01-01T14:16:59.000000000'), numpy.datetime64('2021-01-02T01:25:01.000000000')), tile=1000 nanoseconds, dtype='datetime64[ns]'),
    Dim(name='fare_amount', domain=(-1238.0, 4265.0), tile=1000.0, dtype='float64'),
  ]),
  attrs=[
    Attr(name='VendorID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tpep_pickup_datetime', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='passenger_count', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='trip_distance', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='RatecodeID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='store_and_fwd_flag', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='PULocationID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='DOLocationID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='payment_type', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='extra', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='mta_tax', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tip_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='tolls_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='improvement_surcharge', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='total_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='congestion_surcharge', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=True,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)

Observe that the array is now sparse, with 16 attributes and 2 dimensions. Also notice that, by default, the array allows duplicates. This can be turned off by passing allows_duplicates=False to from_csv, in which case the write returns an error if the CSV contains rows with identical coordinates along the array dimensions, as sketched below.
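
The sketch below shows what such a call would look like; the array name is hypothetical, and the write errors out if any two rows share the same (tpep_dropoff_datetime, fare_amount) pair:

# Hypothetical ingestion that disallows duplicate coordinates
tiledb.from_csv("taxi_sparse_array_nodups", "yellow_tripdata_2020-01.csv",
                capacity=100000,
                sparse=True,
                allows_duplicates=False,
                index_col=['tpep_dropoff_datetime', 'fare_amount'],
                parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'],
                fillna={'store_and_fwd_flag': ''})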

Let's print the non-empty domain for the sparse array.

In [17]:
A.nonempty_domain()
Out[17]:
((array('2003-01-01T14:16:59.000000000', dtype='datetime64[ns]'),
  array('2021-01-02T01:25:01.000000000', dtype='datetime64[ns]')),
 (array(-1238.), array(4265.)))

The first range corresponds to tpep_dropoff_datetime and the second to fare_amount.

Now let's slice the whole array into a pandas dataframe.

In [18]:
%%time
tiledb.stats_reset()
df = A.query().df[:]
df
CPU times: user 4.36 s, sys: 6.17 s, total: 10.5 s
Wall time: 3.05 s
Out[18]:
VendorID tpep_pickup_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
tpep_dropoff_datetime fare_amount
2003-01-01 14:16:59 0.0 2.0 2003-01-01 00:07:17 1.0 0.00 1.0 N 193 193 2.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0
2008-12-31 23:32:32 9.5 2.0 2008-12-31 23:17:15 1.0 2.34 1.0 N 162 263 2.0 0.5 0.5 0.00 0.0 0.3 13.30 2.5
2008-12-31 23:36:53 3.5 2.0 2008-12-31 23:35:00 1.0 0.42 1.0 N 263 263 2.0 0.5 0.5 0.00 0.0 0.3 7.30 2.5
2008-12-31 23:37:14 16.5 2.0 2008-12-31 23:06:13 1.0 3.48 1.0 N 144 230 2.0 0.5 0.5 0.00 0.0 0.3 20.30 2.5
2008-12-31 23:55:35 12.5 2.0 2008-12-31 23:39:37 1.0 2.69 1.0 N 263 143 2.0 0.5 0.5 0.00 0.0 0.3 16.30 2.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-07-10 11:42:41 7.0 2.0 2020-07-10 11:34:11 1.0 1.07 1.0 N 236 262 1.0 1.0 0.5 2.26 0.0 0.3 13.56 2.5
2020-07-31 18:54:12 5.0 2.0 2020-07-31 18:50:41 1.0 0.72 1.0 N 236 43 2.0 1.0 0.5 0.00 0.0 0.3 9.30 2.5
2021-01-02 00:36:50 10.5 2.0 2021-01-02 00:22:00 1.0 1.56 1.0 N 142 161 2.0 1.0 0.5 0.00 0.0 0.3 14.80 2.5
2021-01-02 00:58:56 11.5 2.0 2021-01-02 00:44:08 1.0 2.32 1.0 N 170 148 2.0 1.0 0.5 0.00 0.0 0.3 15.80 2.5
2021-01-02 01:25:01 10.0 2.0 2021-01-02 01:12:10 1.0 1.87 1.0 N 90 148 1.0 1.0 0.5 2.86 0.0 0.3 17.16 2.5

6405008 rows × 16 columns

In [19]:
tiledb.stats_dump()
TileDB Embedded Version: (2, 1, 3)
TileDB-Py Version: 0.7.4
==== READ ====

- Number of read queries: 1
- Number of attempts until results are found: 1

- Number of attributes read: 16
  * Number of fixed-sized attributes read: 15
  * Number of var-sized attributes read: 1
- Number of dimensions read: 2
  * Number of fixed-sized dimensions read: 2

- Number of logical tiles overlapping the query: 65
- Number of physical tiles read: 1235
  * Number of physical fixed-sized tiles read: 1105
  * Number of physical var-sized tiles read: 130
- Number of cells read: 6405008
- Number of result cells: 6405008
- Percentage of useful cells read: 100%

- Number of bytes read: 117472852 bytes (0.109405 GB) 
- Number of read operations: 86
- Number of bytes unfiltered: 928739186 bytes (0.864956 GB) 
- Unfiltering inflation factor: 7.90599x

- Time to compute estimated result size: 0.000917319 secs
  * Time to compute tile overlap: 0.000401086 secs
    > Time to compute relevant fragments: 1.2216e-05 secs
    > Time to load relevant fragment R-trees: 0.00025591 secs
    > Time to compute relevant fragment tile overlap: 0.000130426 secs

- Total metadata read: 12928 bytes (1.20401e-05 GB) 
  * R-tree: 2368 bytes (2.20537e-06 GB) 
  * Fixed-sized tile offsets: 9504 bytes (8.85129e-06 GB) 
  * Var-sized tile offsets: 528 bytes (4.91738e-07 GB) 
  * Var-sized tile sizes: 528 bytes (4.91738e-07 GB) 

- Time to load array metadata: 0.000244082 secs
  * Array metadata size: 98 bytes (9.12696e-08 GB) 

- Time to initialize the read state: 6.9947e-05 secs

- Read time: 1.36381 secs
  * Time to compute next partition: 0.000467399 secs
  * Time to compute result coordinates: 0.537777 secs
    > Time to compute sparse result tiles: 0.000178318 secs
    > Time to read coordinate tiles: 0.0154381 secs
    > Time to unfilter coordinate tiles: 0.0532034 secs
    > Time to compute range result coordinates: 0.319006 secs
  * Time to compute sparse result cell slabs: 0.0198344 secs
  * Time to copy result attribute values: 0.749709 secs
    > Time to read attribute tiles: 0.0899356 secs
    > Time to unfilter attribute tiles: 0.168391 secs
    > Time to copy fixed-sized attribute values: 0.352634 secs
    > Time to copy var-sized attribute values: 0.0858673 secs
  * Time to copy result coordinates: 0.0470755 secs
    > Time to copy fixed-sized coordinates: 0.0293159 secs

- Total read query time (array open + init state + read): 1.36388 secs
==== Python Stats ====

- TileDB-Py Indexing Time: 2.94222
  * TileDB-Py query execution time: 1.37205
    > TileDB C++ Core initial query submit time: 1.36393
  * TileDB-Py buffer conversion time: 1.56823

Notice that this takes longer than the dense case, because the sparse case involves more advanced indexing and copying operations. However, the real benefit of sparse dataframe modeling is the ability to slice rapidly with range conditions on the indexed dimensions, without having to fetch the entire dataframe into main memory.

In [20]:
%%time
df = A.df[np.datetime64("2020-07-01"):np.datetime64("2020-10-01"), 5.5:12.5]
df
CPU times: user 14.7 ms, sys: 83.8 ms, total: 98.4 ms
Wall time: 92.2 ms
Out[20]:
VendorID tpep_pickup_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
tpep_dropoff_datetime fare_amount
2020-07-10 10:52:46 5.5 2.0 2020-07-10 10:46:19 1.0 0.64 1.0 N 68 164 1.0 1.0 0.5 1.96 0.0 0.3 11.76 2.5
2020-07-10 11:02:42 5.5 2.0 2020-07-10 10:56:46 1.0 0.93 1.0 N 170 163 1.0 1.0 0.5 1.96 0.0 0.3 11.76 2.5
2020-07-10 11:21:14 12.5 2.0 2020-07-10 11:03:49 1.0 2.41 1.0 N 163 262 1.0 1.0 0.5 3.36 0.0 0.3 20.16 2.5
2020-07-10 11:42:41 7.0 2.0 2020-07-10 11:34:11 1.0 1.07 1.0 N 236 262 1.0 1.0 0.5 2.26 0.0 0.3 13.56 2.5

This is truly rapid. In the dense case, you would have to load the whole dataframe into main memory and then slice using pandas, as sketched below.
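
For comparison, the rough dense-array equivalent is sketched here (it assumes taxi_dense_array still exists on disk): read all rows, then filter in pandas.

# Fetch the full dense array, then filter in pandas
with tiledb.open("taxi_dense_array") as D:
    df_all = D.df[0:6405007]
df = df_all[
    (df_all.tpep_dropoff_datetime >= np.datetime64("2020-07-01"))
    & (df_all.tpep_dropoff_datetime <= np.datetime64("2020-10-01"))
    & (df_all.fare_amount >= 5.5)
    & (df_all.fare_amount <= 12.5)
]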

You can subset on attributes as follows.

In [21]:
%%time
df = A.query(attrs=['trip_distance']).df[:]
df
CPU times: user 1.65 s, sys: 798 ms, total: 2.45 s
Wall time: 1.61 s
Out[21]:
trip_distance
tpep_dropoff_datetime fare_amount
2003-01-01 14:16:59 0.0 0.00
2008-12-31 23:32:32 9.5 2.34
2008-12-31 23:36:53 3.5 0.42
2008-12-31 23:37:14 16.5 3.48
2008-12-31 23:55:35 12.5 2.69
... ... ...
2020-07-10 11:42:41 7.0 1.07
2020-07-31 18:54:12 5.0 0.72
2021-01-02 00:36:50 10.5 1.56
2021-01-02 00:58:56 11.5 2.32
2021-01-02 01:25:01 10.0 1.87

6405008 rows × 1 columns

By default, TileDB also fetches the coordinate values and sets them as the pandas index. To disable this, you can run:

In [22]:
%%time
df = A.query(dims=False, attrs=['trip_distance']).df[:]
df
CPU times: user 787 ms, sys: 533 ms, total: 1.32 s
Wall time: 655 ms
Out[22]:
trip_distance
0 0.00
1 2.34
2 0.42
3 3.48
4 2.69
... ...
6405003 1.07
6405004 0.72
6405005 1.56
6405006 2.32
6405007 1.87

6405008 rows × 1 columns

We can also subselect on dimensions:

In [23]:
%%time
df = A.query(dims=['tpep_dropoff_datetime'], attrs=['trip_distance']).df[:]
df
CPU times: user 822 ms, sys: 690 ms, total: 1.51 s
Wall time: 662 ms
Out[23]:
trip_distance
tpep_dropoff_datetime
2003-01-01 14:16:59 0.00
2008-12-31 23:32:32 2.34
2008-12-31 23:36:53 0.42
2008-12-31 23:37:14 3.48
2008-12-31 23:55:35 2.69
... ...
2020-07-10 11:42:41 1.07
2020-07-31 18:54:12 0.72
2021-01-02 00:36:50 1.56
2021-01-02 00:58:56 2.32
2021-01-02 01:25:01 1.87

6405008 rows × 1 columns

Finally, you can even choose attributes to act as dataframe indices using the index_col argument.

In [24]:
%%time
df = A.query(index_col=['trip_distance'], attrs=['passenger_count', 'trip_distance']).df[:]
df
CPU times: user 1.02 s, sys: 1.3 s, total: 2.32 s
Wall time: 811 ms
Out[24]:
tpep_dropoff_datetime fare_amount passenger_count
trip_distance
0.00 2003-01-01 14:16:59 0.0 1.0
2.34 2008-12-31 23:32:32 9.5 1.0
0.42 2008-12-31 23:36:53 3.5 1.0
3.48 2008-12-31 23:37:14 16.5 1.0
2.69 2008-12-31 23:55:35 12.5 1.0
... ... ... ...
1.07 2020-07-10 11:42:41 7.0 1.0
0.72 2020-07-31 18:54:12 5.0 1.0
1.56 2021-01-02 00:36:50 10.5 1.0
2.32 2021-01-02 00:58:56 11.5 1.0
1.87 2021-01-02 01:25:01 10.0 1.0

6405008 rows × 3 columns

For convenience, TileDB can also return dataframe results as an Arrow Table as follows:

In [25]:
%%time
df = A.query(return_arrow=True, index_col=['trip_distance'], attrs=['passenger_count', 'trip_distance']).df[:]
df
CPU times: user 1 s, sys: 972 ms, total: 1.97 s
Wall time: 742 ms
Out[25]:
pyarrow.Table
tpep_dropoff_datetime: timestamp[ns]
fare_amount: double
passenger_count: double
trip_distance: double
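
If needed, the Arrow Table can be converted to a pandas dataframe at any point using pyarrow's standard conversion:

# Standard pyarrow conversion back to pandas
df.to_pandas()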

Since we are done, we can close the array.

In [26]:
A.close()

Storing Pandas Dataframes in TileDB Arrays

You can also store a pandas dataframe that already exists in main memory into a TileDB array. The following creates a new TileDB array and writes the contents of the pandas dataframe to it.

In [27]:
# First read some data into a pandas dataframe
A = tiledb.open("taxi_sparse_array")
df = A.query(attrs=['passenger_count', 'trip_distance']).df[:]
df

# Create and write into a TileDB array
tiledb.from_pandas("sliced_taxi_sparse_array", df)

Let's inspect the schema.

In [28]:
A2 = tiledb.open("sliced_taxi_sparse_array")
A2.schema
Out[28]:
ArraySchema(
  domain=Domain(*[
    Dim(name='tpep_dropoff_datetime', domain=(numpy.datetime64('2003-01-01T14:16:59.000000000'), numpy.datetime64('2021-01-02T01:25:01.000000000')), tile=1000 nanoseconds, dtype='datetime64[ns]'),
    Dim(name='fare_amount', domain=(-1238.0, 4265.0), tile=1000.0, dtype='float64'),
  ]),
  attrs=[
    Attr(name='passenger_count', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='trip_distance', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=True,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)

Reading the array back:

In [29]:
A2.df[:]
Out[29]:
passenger_count trip_distance
tpep_dropoff_datetime fare_amount
2003-01-01 14:16:59 0.0 1.0 0.00
2008-12-31 23:32:32 9.5 1.0 2.34
2008-12-31 23:36:53 3.5 1.0 0.42
2008-12-31 23:37:14 16.5 1.0 3.48
2008-12-31 23:55:35 12.5 1.0 2.69
... ... ... ...
2020-07-10 11:42:41 7.0 1.0 1.07
2020-07-31 18:54:12 5.0 1.0 0.72
2021-01-02 00:36:50 10.5 1.0 1.56
2021-01-02 00:58:56 11.5 1.0 2.32
2021-01-02 01:25:01 10.0 1.0 1.87

6405008 rows × 2 columns

Lastly, we close the opened arrays.

In [30]:
A.close()
A2.close()

Running SQL Queries

One of the cool things about TileDB is that it offers a powerful integration with embedded MariaDB. This allows for execution of arbitrary SQL queries directly on TileDB arrays (both dense and sparse). We took appropriate care to push the fast slicing and attribute subsetting portions of the query down to TileDB, leaving the rest of the SQL execution to MariaDB. In other words, we made MariaDB take advantage of the multi-dimensional indexing and columnar format of TileDB!

To install this capability, run:

conda install -c conda-forge libtiledb-sql-py

The usage is very simple and intuitive. All results are returned directly as pandas dataframes.

In [31]:
import tiledb.sql, pandas as pd
In [32]:
db = tiledb.sql.connect()
In [33]:
%%time
pd.read_sql(sql="SELECT AVG(trip_distance) FROM taxi_dense_array WHERE __tiledb_rows >= 0 AND __tiledb_rows <1000", con=db)
CPU times: user 10.5 ms, sys: 21.6 ms, total: 32.1 ms
Wall time: 24.7 ms
Out[33]:
AVG(trip_distance)
0 2.81232
In [34]:
%%time
pd.read_sql(sql="SELECT AVG(trip_distance) FROM taxi_sparse_array WHERE tpep_dropoff_datetime <= '2019-07-31' AND fare_amount < 5.5", con=db)
CPU times: user 14.4 ms, sys: 106 ms, total: 121 ms
Wall time: 47.6 ms
Out[34]:
AVG(trip_distance)
0 0.448333
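
As another sketch (output omitted), you can combine column subselection with the same dimension predicates; only the referenced columns are read from the array:

pd.read_sql(
    sql="SELECT passenger_count, trip_distance FROM taxi_sparse_array "
        "WHERE tpep_dropoff_datetime <= '2019-07-31' AND fare_amount < 5.5",
    con=db)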

Other Backends

So far we have explained how to store TileDB arrays on local disk. TileDB is optimized for numerous storage backends, including AWS S3, Azure Blob Storage and more. The entire functionality shown above (including SQL queries with embedded MariaDB) "just works" by replacing the array names taxi_dense_array and taxi_sparse_array with a URI that points to another backend, e.g., s3://<my_bucket>/<path>/array_name. The TileDB data format is cloud-native (based on immutable objects for fast updates and time traveling, to be covered in later tutorials), and the storage engine performs highly parallel IO while minimizing communication with the backend wherever possible.
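
For example, assuming AWS credentials are available in the environment and my-bucket is a placeholder bucket name, the dense ingestion from earlier becomes:

# Same call as before, targeting S3 instead of local disk (bucket/path are placeholders)
tiledb.from_csv("s3://my-bucket/taxi/taxi_dense_array", "yellow_tripdata_2020-01.csv",
                tile=100000,
                parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'],
                fillna={'store_and_fwd_flag': ''})
A = tiledb.open("s3://my-bucket/taxi/taxi_dense_array")
df = A.df[0:999]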

Simple VFS Operations

In order to support numerous storage backends, we abstracted all IO operations (e.g., read, write, remove, move, list) behind a Virtual File System (VFS) class, which we expose in our APIs since it is useful beyond the array internals. Everything described below in this section "just works" for any other storage backend URI.

For example, you can use the VFS functionality to list the contents of an array folder:

In [35]:
vfs = tiledb.VFS()
In [36]:
vfs.ls("taxi_sparse_array")
Out[36]:
['file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__1608324905947_1608324905947_176244b5e6f342fb941ca3b6f519349c_6',
 'file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__1608324905947_1608324905947_176244b5e6f342fb941ca3b6f519349c_6.ok',
 'file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__array_schema.tdb',
 'file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__lock.tdb',
 'file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__meta']
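
The VFS object supports other basic operations as well, for example (indicative return values in the comments):

vfs.is_dir("taxi_sparse_array")               # True, the array is a folder
vfs.file_size("yellow_tripdata_2020-01.csv")  # size of the CSV in bytes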

You can also remove the arrays we created.

In [37]:
vfs.remove_dir("taxi_dense_array")
In [38]:
vfs.remove_dir("taxi_sparse_array")
In [39]:
vfs.remove_dir("sliced_taxi_sparse_array")

Finally, you can remove the CSV file as follows.

In [ ]:
vfs.remove_file('yellow_tripdata_2020-01.csv')