Data science workflows that leverage CARTO
Many data scientists do their analysis in the de facto standard environment: pandas in a Jupyter notebook. We want to support that by creating a Python module that allows these users to develop analyses while seamlessly interacting with CARTO. The workflows below show what we aim to support.
You'll need the following for this: your CARTO username, your API key, and the name of a table in your account. Paste these values between the quotes ('') below.
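If you'd rather not hard-code credentials in the notebook, one option is to read them from environment variables instead; the variable names below are illustrative, not a cartoframes convention:

```python
import os

# Read CARTO credentials from environment variables rather than
# pasting them inline (falls back to empty strings if unset)
username = os.environ.get('CARTO_USERNAME', '')
api_key = os.environ.get('CARTO_API_KEY', '')
base_url = 'https://{}.carto.com/'.format(username)
```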
import pandas as pd
import cartoframes
username = '' # <-- insert your username here
api_key = '' # <-- insert your API key here
tablename = '' # <-- insert your tablename here
cc = cartoframes.CartoContext('https://{}.carto.com/'.format(username),
                              api_key)
df = cc.read(tablename)
df.head()
df['favorite_cookie'] = 'pecan'
# use .loc for the boolean-mask assignment to avoid
# pandas' chained-assignment pitfall
df.loc[df.index % 2 == 0, 'favorite_cookie'] = 'oatmeal'
cc.write(df, tablename, overwrite=True)
from cartoframes import Layer
cc.map(layers=Layer(tablename, color='favorite_cookie'))
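The alternating-assignment step above can be sketched on a toy DataFrame to see the `.loc` pattern in isolation (the data here is made up):

```python
import pandas as pd

# Toy DataFrame standing in for the CARTO table (hypothetical data)
df = pd.DataFrame({'favorite_cookie': ['pecan'] * 4})

# Assign 'oatmeal' to every even-indexed row; .loc with a boolean
# mask modifies the frame in place without chained-assignment issues
df.loc[df.index % 2 == 0, 'favorite_cookie'] = 'oatmeal'

print(df['favorite_cookie'].tolist())
# → ['oatmeal', 'pecan', 'oatmeal', 'pecan']
```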
Query your CARTO account and create a table from the query. Finally, pull that new table into a pandas DataFrame.
df_buffer = cc.query(query='''
SELECT ST_Buffer(the_geom::geography, 10000)::geometry as the_geom,
cartodb_id, mag, depth, place
FROM all_month_3
LIMIT 100
''',
tablename='buffered_earthquakes')
df_buffer.head()
print(df_buffer.get_carto_datapage())
Let's recreate the workflow from https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/, where the author explores dask for splitting computations across multiple cores in a machine to complete tasks more quickly.
from dask import dataframe as dd
import pandas as pd
columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)
with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')
starbucks = with_name[is_starbucks].compute()
dunkin = with_name[is_dunkin].compute()
starbucks['type'] = 'starbucks'
dunkin['type'] = 'dunkin'
coffee_places = pd.concat([starbucks, dunkin])
coffee_places.head(20)
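The same filter, label, and concat pattern can be exercised in plain pandas on a few invented rows (no dask needed at this scale); the sample data below is made up:

```python
import pandas as pd

# Hypothetical mini POI table mirroring the columns used above
data = pd.DataFrame({
    'name': ['Starbucks #12', 'Dunkin Donuts', 'starbucks reserve', None],
    'amenity': ['cafe', 'fast_food', 'cafe', 'fast_food'],
})

# Drop rows without a name, then match each chain by regex
with_name = data[data.name.notnull()]
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')

# Label each subset and stack them into one frame
starbucks = with_name[is_starbucks].copy()
dunkin = with_name[is_dunkin].copy()
starbucks['type'] = 'starbucks'
dunkin['type'] = 'dunkin'
coffee_places = pd.concat([starbucks, dunkin])

print(coffee_places['type'].value_counts().to_dict())
# → {'starbucks': 2, 'dunkin': 1}
```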
import pandas as pd
import cartoframes

username = 'eschbacher'
api_key = 'abcdefghijklmnopqrstuvwxyz'
cc = cartoframes.CartoContext('https://{}.carto.com/'.format(username),
                              api_key)

# specify columns for lng/lat so CARTO will create a geometry
# (CARTO lowercases column names on import, hence 'longitude'/'latitude')
cc.write(coffee_places,
         tablename='coffee_places',
         lnglat=('longitude', 'latitude'))
Make a category map of Dunkin' Donuts vs. Starbucks locations (i.e., color by 'type')
from cartoframes import Layer
cc.map(layers=Layer('coffee_places', color='type', size=5),
zoom=9, lng=-71.0637, lat=36.4275,
interactive=False)
is_fastfood = with_amenity.amenity.str.contains('fast_food')
fastfood = with_amenity[is_fastfood]
fastfood.name.value_counts().head(12)
ff = fastfood.compute()
cc.write(ff,
         tablename='fastfood_dask',
         lnglat=('longitude', 'latitude'))
len(ff)
cc.map(layers=Layer('fastfood_dask', size=2, color='#FFF'))
This method relies on your having the do_augment_table
function that John had you load into your account. It might be slow given the number of rows in the table.
# Data Observatory (DO) measures:
#  - total population
#  - children under 18 years of age
#  - median income
data_obs_measures = [{'numer_id': 'us.census.acs.B01003001'},
{'numer_id': 'us.census.acs.B17001001'},
{'numer_id': 'us.census.acs.B19013001'}]
cc.data_augment('coffee_places', data_obs_measures)