TileDB Quickstart Notebook¶

This notebook will walk through usage of TileDB Cloud including User Defined Functions and Serverless SQL

We will start off using the quickstart example arrays and then move to a real work dataset based on transit data from the Boston metro area.

In [1]:

import tiledb, tiledb.cloud
import pandas
import numpy

In [2]:

# No login needed! You are automatically signed in to tiledb cloud python client
#tiledb.cloud.login(username="xxx", password="yyy")

First let's access the quickstart sparse array through the cloud service.

In [3]:

with tiledb.open("tiledb://TileDB-Inc/quickstart_sparse", ctx=tiledb.cloud.Ctx()) as A:
    df = pandas.DataFrame(A[:])

df

Out[3]:

	coords	a
0	(1, 1)	1
1	(2, 3)	3
2	(2, 4)	2

Next let's run User Defined Function. This will compute the mean of the "a" attribute.

In [14]:

def mean(data):
    import numpy
    return numpy.mean(data["a"])

# Run UDF on entire dataset
with tiledb.open("tiledb://TileDB-Inc/quickstart_sparse", ctx=tiledb.cloud.Ctx()) as A:
    results = A.apply(mean, [(1,4), (1,4)])

results

Out[14]:

2.0

A serverless SQL query can also be run to to compute the mean. Serverless SQL returns a pandas dataframe of the results.

In [16]:

# Run sql
results = tiledb.cloud.sql.exec("select AVG(`a`) as avg from `tiledb://TileDB-Inc/quickstart_sparse` WHERE `rows` between 1 and 4 and `cols` between 1 and 4")
results

Out[16]:

	avg
0	2

Realworld Example¶

Using ridership data from the MBTA[1] (Boston-area Transit Authority) we will run a sql query to extract ridership data for all services exclusing buses and then create a line graph.

The dataset has already been loaded into a tiledb array for this example.

[1] Data taken from MBTA data dashboard https://mbtabackontrack.com/performance/

Serverless SQL To Filter and Fetch Data¶

In [7]:

# Run a serverless sql query to extract the total monthly ridership exclusing any buses
# SQL here allows us to convert the full datetime to YEAR_MONTH format, since the granularity is only monthly
data = tiledb.cloud.sql.exec("""
SELECT
    EXTRACT(YEAR_MONTH FROM service_date) as service_date,
    route_or_line,
    total_monthly_ridership
FROM `tiledb://TileDB-Inc/MBTA_Average_Monthly_Ridership_by_Mode`
WHERE route_or_line != 'Bus'""")
data

Out[7]:

	route_or_line	service_date	total_monthly_ridership
0	Silver Line	201704	681787
1	Commuter Rail	201704	2438211
2	Boat-F1	201704	61364
3	Boat-F4	201704	38199
4	Blue Line	201704	1457503
...	...	...	...
411	Blue Line	201910	1588641
412	Green Line	201910	3601913
413	Orange Line	201910	4775264
414	Red Line	201910	5620037
415	The RIDE	201910	138048

416 rows × 3 columns

Next we will use Pandas to plot the ridership by route and month

In [8]:

# First we tell pandas to index the dataframe by the data and route line
data2 = data.set_index(['service_date', 'route_or_line'])
data2.groupby('route_or_line')['total_monthly_ridership'].plot(legend=True, figsize=(15,8))

Out[8]:

route_or_line
Blue Line        AxesSubplot(0.125,0.125;0.775x0.755)
Boat-F1          AxesSubplot(0.125,0.125;0.775x0.755)
Boat-F4          AxesSubplot(0.125,0.125;0.775x0.755)
Commuter Rail    AxesSubplot(0.125,0.125;0.775x0.755)
Green Line       AxesSubplot(0.125,0.125;0.775x0.755)
Orange Line      AxesSubplot(0.125,0.125;0.775x0.755)
Red Line         AxesSubplot(0.125,0.125;0.775x0.755)
Silver Line      AxesSubplot(0.125,0.125;0.775x0.755)
The RIDE         AxesSubplot(0.125,0.125;0.775x0.755)
Name: total_monthly_ridership, dtype: object

User Defined Function for Median Computation¶

Next we will use a User Define Function to compute the median monthly rideship. MariaDB does not have a Median function so this is an easy way to compute this median without loading the entire dataset into the notebooks memory

In [9]:

# median_by_route_or_line takes the numpy ordered dictionary and performs a pandas group by and apply of numpy median
def median_by_month(data):
    import pandas
    import numpy
    res = pandas.DataFrame(data)
    res['month'] = res["service_date"].apply(lambda d: d.month)
    # Get the median rideship by month
    median_results = res.groupby('month')['total_monthly_ridership'].apply(numpy.median)
    return median_results

# Run UDF on entire dataset but only fetch service_date and total_monthly_ridership columns
with tiledb.open("tiledb://TileDB-Inc/MBTA_Average_Monthly_Ridership_by_Mode", ctx=tiledb.cloud.Ctx()) as A:
    ned = A.nonempty_domain()
    median_rides_by_month = A.apply(median_by_month, ned, attrs=["service_date", "total_monthly_ridership"])
median_rides_by_month

Out[9]:

month
1     1940541.5
2     1805384.5
3     2047487.0
4     1599512.0
5     2116218.0
6     2118354.5
7     2095768.5
8     1717782.0
9     1936200.0
10    2252478.0
11    2034947.0
12    1905606.0
Name: total_monthly_ridership, dtype: float64

In [10]:

# Let's plot the months on a bar graph
median_rides_by_month.plot.bar(figsize=(15,8))

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f21416068d0>

In [ ]: