This notebook will walk through usage of TileDB Cloud including User Defined Functions and Serverless SQL
We will start off using the quickstart example arrays and then move to a real work dataset based on transit data from the Boston metro area.
import tiledb, tiledb.cloud
import pandas
import numpy
# No login needed! You are automatically signed in to tiledb cloud python client
#tiledb.cloud.login(username="xxx", password="yyy")
First let's access the quickstart sparse array through the cloud service.
with tiledb.open("tiledb://TileDB-Inc/quickstart_sparse", ctx=tiledb.cloud.Ctx()) as A:
df = pandas.DataFrame(A[:])
df
coords | a | |
---|---|---|
0 | (1, 1) | 1 |
1 | (2, 3) | 3 |
2 | (2, 4) | 2 |
Next let's run User Defined Function. This will compute the mean of the "a" attribute.
def mean(data):
import numpy
return numpy.mean(data["a"])
# Run UDF on entire dataset
with tiledb.open("tiledb://TileDB-Inc/quickstart_sparse", ctx=tiledb.cloud.Ctx()) as A:
results = A.apply(mean, [(1,4), (1,4)])
results
2.0
A serverless SQL query can also be run to to compute the mean. Serverless SQL returns a pandas dataframe of the results.
# Run sql
results = tiledb.cloud.sql.exec("select AVG(`a`) as avg from `tiledb://TileDB-Inc/quickstart_sparse` WHERE `rows` between 1 and 4 and `cols` between 1 and 4")
results
avg | |
---|---|
0 | 2 |
Using ridership data from the MBTA[1] (Boston-area Transit Authority) we will run a sql query to extract ridership data for all services exclusing buses and then create a line graph.
The dataset has already been loaded into a tiledb array for this example.
[1] Data taken from MBTA data dashboard https://mbtabackontrack.com/performance/
# Run a serverless sql query to extract the total monthly ridership exclusing any buses
# SQL here allows us to convert the full datetime to YEAR_MONTH format, since the granularity is only monthly
data = tiledb.cloud.sql.exec("""
SELECT
EXTRACT(YEAR_MONTH FROM service_date) as service_date,
route_or_line,
total_monthly_ridership
FROM `tiledb://TileDB-Inc/MBTA_Average_Monthly_Ridership_by_Mode`
WHERE route_or_line != 'Bus'""")
data
route_or_line | service_date | total_monthly_ridership | |
---|---|---|---|
0 | Silver Line | 201704 | 681787 |
1 | Commuter Rail | 201704 | 2438211 |
2 | Boat-F1 | 201704 | 61364 |
3 | Boat-F4 | 201704 | 38199 |
4 | Blue Line | 201704 | 1457503 |
... | ... | ... | ... |
411 | Blue Line | 201910 | 1588641 |
412 | Green Line | 201910 | 3601913 |
413 | Orange Line | 201910 | 4775264 |
414 | Red Line | 201910 | 5620037 |
415 | The RIDE | 201910 | 138048 |
416 rows × 3 columns
Next we will use Pandas to plot the ridership by route and month
# First we tell pandas to index the dataframe by the data and route line
data2 = data.set_index(['service_date', 'route_or_line'])
data2.groupby('route_or_line')['total_monthly_ridership'].plot(legend=True, figsize=(15,8))
route_or_line Blue Line AxesSubplot(0.125,0.125;0.775x0.755) Boat-F1 AxesSubplot(0.125,0.125;0.775x0.755) Boat-F4 AxesSubplot(0.125,0.125;0.775x0.755) Commuter Rail AxesSubplot(0.125,0.125;0.775x0.755) Green Line AxesSubplot(0.125,0.125;0.775x0.755) Orange Line AxesSubplot(0.125,0.125;0.775x0.755) Red Line AxesSubplot(0.125,0.125;0.775x0.755) Silver Line AxesSubplot(0.125,0.125;0.775x0.755) The RIDE AxesSubplot(0.125,0.125;0.775x0.755) Name: total_monthly_ridership, dtype: object
Next we will use a User Define Function to compute the median monthly rideship. MariaDB does not have a Median function so this is an easy way to compute this median without loading the entire dataset into the notebooks memory
# median_by_route_or_line takes the numpy ordered dictionary and performs a pandas group by and apply of numpy median
def median_by_month(data):
import pandas
import numpy
res = pandas.DataFrame(data)
res['month'] = res["service_date"].apply(lambda d: d.month)
# Get the median rideship by month
median_results = res.groupby('month')['total_monthly_ridership'].apply(numpy.median)
return median_results
# Run UDF on entire dataset but only fetch service_date and total_monthly_ridership columns
with tiledb.open("tiledb://TileDB-Inc/MBTA_Average_Monthly_Ridership_by_Mode", ctx=tiledb.cloud.Ctx()) as A:
ned = A.nonempty_domain()
median_rides_by_month = A.apply(median_by_month, ned, attrs=["service_date", "total_monthly_ridership"])
median_rides_by_month
month 1 1940541.5 2 1805384.5 3 2047487.0 4 1599512.0 5 2116218.0 6 2118354.5 7 2095768.5 8 1717782.0 9 1936200.0 10 2252478.0 11 2034947.0 12 1905606.0 Name: total_monthly_ridership, dtype: float64
# Let's plot the months on a bar graph
median_rides_by_month.plot.bar(figsize=(15,8))
<matplotlib.axes._subplots.AxesSubplot at 0x7f21416068d0>