#!/usr/bin/env python
# coding: utf-8

# ## Data Observatory
#
# CARTO's Data Observatory is a spatial data platform that enables data scientists to augment their data and broaden their analyses by using thousands of datasets from around the globe.
#
# This guide is intended for those who want to start augmenting their data using CARTOframes and wish to explore CARTO's public Data Observatory catalog to find datasets that best fit their use cases and analyses.
# For further learning you can also check out the [Data Observatory examples](/developers/cartoframes/examples/#example-data-observatory).

# ### Data Discovery
#
# The Data Observatory catalog comprises thousands of curated spatial datasets. When searching for data, the easiest way to find what you are looking for is to use a faceted search. A faceted (or hierarchical) search allows you to narrow down search results by applying multiple filters based on the faceted classification of catalog datasets. For more information, check the [data discovery example](/developers/cartoframes/examples/#example-discover-a-dataset).
#
# Datasets are organized in four main hierarchies: country, category, provider and geography (or spatial resolution). The cell below shows how to filter the catalog by these facets.
#
# > The catalog is public and you don't need a CARTO account to search for available datasets. You can access the web version of the catalog [here](https://carto.com/spatial-data-catalog).
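# As a quick illustration, here is a minimal sketch of a faceted search. The `'usa'` and `'demographics'` facet values are just example filters; any country, category or provider available in the catalog works the same way:

# In[ ]:


from cartoframes.data.observatory import Catalog

# Narrow the catalog down facet by facet: first country, then category.
# Each step returns a filtered view of the catalog; `datasets` lists the
# datasets matching every filter applied so far.
catalog = Catalog()
datasets = catalog.country('usa').category('demographics').datasets
datasets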
# #### Dataset and variables metadata
#
# The Data Observatory catalog is not only a repository of curated spatial datasets; it also contains valuable information that helps you better understand the underlying data of every dataset, so you can make an informed decision on what data best fits your problem.
#
# Some of the augmented metadata you can find for each dataset in the catalog is:
#
# - `head` and `tail` methods to get a glimpse of the actual data. This helps you to understand the available columns, data types, etc., to start modelling your problem right away.
# - `geom_coverage` to visualize on a map the geographical coverage of the data in the `Dataset`.
# - `counts`, `fields_by_type` and a full `describe` method with stats of the actual values in the dataset, such as average, stdev, quantiles, min, max and median for each of the variables of the dataset.
#
# You don't need a subscription to a dataset to be able to query the augmented metadata; it's publicly available to anyone exploring the Data Observatory catalog.
#
# Let's review some of that information, starting by getting a glimpse of the first or last ten rows of the actual data of the dataset:

# In[1]:


from cartoframes.data.observatory import Dataset

dataset = Dataset.get('ags_sociodemogr_a7e14220')
dataset.head()


# Alternatively, you can get the last ten rows with `dataset.tail()`.

# An overview of the coverage of the dataset:

# In[2]:


dataset.geom_coverage()


# Some stats about the dataset:

# In[3]:


dataset.counts()


# In[4]:


dataset.fields_by_type()


# In[5]:


dataset.describe()


# Every `Dataset` instance in the catalog contains other useful metadata:

# In[6]:


dataset.to_dict()


# When exploring datasets in the Data Observatory catalog it's very important that you understand clearly what variables are available to enrich your own data.
#
# For each `Variable` in each dataset, the Data Observatory provides (as it does for datasets) a set of methods and attributes to understand their underlying data.
#
# Some of them are:
#
# - `head` and `tail` methods to get a glimpse of the actual data and start modelling your problem right away.
# - `counts`, `quantiles` and a full `describe` method with stats of the actual values in the dataset, such as average, stdev, quantiles, min, max and median for each of the variables of the dataset.
# - a `histogram` plot with the distribution of the values of each variable.

# Let's review some of that augmented metadata for the variables in the AGS population dataset.

# In[7]:


from cartoframes.data.observatory import Variable

variable = Variable.get('POPCY_4534fac4')
variable


# In[8]:


variable.to_dict()


# There are also some utility methods to understand the underlying data for each variable:

# In[9]:


variable.head()


# In[10]:


variable.counts()


# In[11]:


variable.quantiles()


# In[12]:


variable.histogram()


# In[13]:


variable.describe()


# #### Subscribe to a Dataset in the catalog
#
# Once you have explored the catalog and identified a dataset with the variables you need for your analysis, at the right spatial resolution, you can check `is_public_data` to know whether the dataset is freely accessible or you first need to purchase a license. Subscriptions are available for CARTO's Enterprise plan users.
#
# Subscriptions to datasets allow you to either use them from CARTOframes to enrich your own data or to download them. See the [enrichment guide](/developers/cartoframes/guides/Data-Observatory/#data-enrichment) for more information.
#
# Let's check out the dataset from our previous example:

# In[14]:


dataset = Dataset.get('ags_sociodemogr_a7e14220')


# In[15]:


dataset.is_public_data


# This `dataset` is not public data, which means that you need a subscription to be able to use it to enrich your own data.
#
# > To subscribe to premium data in the Data Observatory catalog you need an Enterprise CARTO account with access to the Data Observatory.

# In[16]:


from cartoframes.auth import set_default_credentials

set_default_credentials('creds.json')


# In[17]:


dataset.subscribe()


# **Licenses to data in the Data Observatory grant you the right to use the data for a period of one year. Every non-public dataset or geography you want to use to enrich your own data requires a valid license.**
#
# You can check the actual status of your subscriptions directly from the catalog:

# In[18]:


from cartoframes.data.observatory import Catalog

Catalog().subscriptions()


# ### Data Access
#
# Now that we have explored some basic information about the Dataset, we will proceed to download a sample of the Dataset into a dataframe so we can operate on it using Python.
#
# _Note: You'll need your [CARTO Account](https://carto.com/signup) credentials to perform this action._

# In[19]:


from cartoframes.auth import set_default_credentials

set_default_credentials('creds.json')


# In[20]:


from cartoframes.data.observatory import Dataset

dataset = Dataset.get('ags_sociodemogr_a7e14220')


# In[21]:


# Filter by SQL query
query = "SELECT * FROM $dataset$ LIMIT 50"
dataset_df = dataset.to_dataframe(sql_query=query)


# **Note about SQL filters**
#
# Our SQL filtering queries allow any PostgreSQL and PostGIS operation, so you can filter the rows (with a WHERE condition) or the columns (using the SELECT). Some common examples are filtering the Dataset by bounding box or filtering by column value:
#
# ```
# SELECT * FROM $dataset$ WHERE ST_IntersectsBox(geom, -74.044467,40.706128,-73.891345,40.837690)
# ```
#
# ```
# SELECT total_pop, geom FROM $dataset$
# ```
#
# A good tool to get the bounding box of a specific area is [bboxfinder.com](http://bboxfinder.com/#0.000000,0.000000,0.000000,0.000000). A sketch of a bounding-box filter applied from Python follows below.
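# For instance, here is a minimal sketch that combines the bounding-box filter above with `to_dataframe`. The coordinates are the same example bounding box used in the note:

# In[ ]:


# Keep only the rows whose geometry intersects the given bounding box
# (west, south, east, north), then download them into a dataframe.
bbox_query = """
SELECT * FROM $dataset$
WHERE ST_IntersectsBox(geom, -74.044467, 40.706128, -73.891345, 40.837690)
"""
bbox_df = dataset.to_dataframe(sql_query=bbox_query)
bbox_df.head()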
# In[22]:


# First rows of the Dataset sample
dataset_df.head()


# You can also download the dataset directly to a CSV file:

# In[23]:


query = "SELECT * FROM $dataset$ LIMIT 50"
dataset.to_csv('my_dataset.csv', sql_query=query)


# ### Data Enrichment
#
# We define enrichment as the process of augmenting your data with new variables by means of a spatial join between your data and a `Dataset` in CARTO's Data Observatory, aggregated at a given spatial resolution. In other words:
#
# "*Enrichment is the process of adding variables to a geometry, which we call the target (point, line, polygon…), from a spatial (polygon) dataset, which we call the source.*"
#
# We recommend you also check out the [CARTOframes quickstart guide](/developers/cartoframes/guides/Quickstart/), since it offers a complete example of data discovery and enrichment and also helps you build a simple dashboard to draw conclusions from the resulting data.
#
# _Note: You'll need your [CARTO Account](https://carto.com/signup) credentials to perform this action._

# In[24]:


from cartoframes.auth import set_default_credentials

set_default_credentials('creds.json')


# In[25]:


from cartoframes.data.observatory import Dataset

dataset = Dataset.get('ags_sociodemogr_a7e14220')
variables = dataset.variables
variables


# The `ags_sociodemogr_a7e14220` dataset contains socio-demographic variables aggregated at the Census block group level.
#
# Let's try to find a variable for total population:

# In[26]:


vdf = variables.to_dataframe()
vdf[vdf['name'].str.contains('pop', case=False, na=False)]


# We can store the variable instance we need by searching the Catalog by its `slug`, in this case `POPCY_4534fac4`:

# In[27]:


variable = Variable.get('POPCY_4534fac4')
variable.to_dict()


# The `POPCY` variable contains the `SUM` of the population per block group for the year 2019. Let's enrich our stores DataFrame with that variable.

# #### Enrich points
#
# Let's start by loading the geocoded Starbucks stores:

# In[28]:


from geopandas import read_file

stores_gdf = read_file('http://libs.cartocdn.com/cartoframes/files/starbucks_brooklyn_geocoded.geojson')
stores_gdf.head()


# Alternatively, you can load data in any geospatial format supported by GeoPandas or CARTO.
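# For example, here is a minimal sketch of building a GeoDataFrame from a plain CSV with longitude/latitude columns. The file name and the `lng`/`lat` column names are hypothetical, for illustration only:

# In[ ]:


import pandas as pd
import geopandas as gpd

# Read a hypothetical CSV of store locations and build point geometries
# from its longitude/latitude columns.
df = pd.read_csv('my_stores.csv')
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['lng'], df['lat']), crs='EPSG:4326')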
# As we can see, for each store we have its name, address, the total revenue by year and a `geometry` column indicating the location of the store. This is important because, for the enrichment service to work, we need a DataFrame with a geometry column encoded as a [shapely](https://pypi.org/project/Shapely/) object.
#
# We can now create a new `Enrichment` instance, and since the `stores_gdf` dataset represents store locations (points), we can use the `enrich_points` function, passing as arguments the stores DataFrame and a list of `Variables` for which we have a valid subscription from the Data Observatory catalog.
#
# In this case we are only enriching with one variable (the total population), but we could enrich with a list of them.

# In[29]:


from cartoframes.data.observatory import Enrichment

enriched_stores_gdf = Enrichment().enrich_points(stores_gdf, [variable])
enriched_stores_gdf.head()


# Once the enrichment finishes, we can see there is a new column in our DataFrame called `POPCY`, with the population projected for the year 2019 in the US Census block group that contains each one of our Starbucks stores. The enrichment process also provides an extra column called `do_area` with the area, in square meters, covered by the polygons in the source dataset we are using to enrich our data.

# #### Enrich polygons
#
# Next, let's do a second enrichment, but this time using a DataFrame with areas of influence calculated using the [CARTOframes isochrones](/developers/cartoframes/reference/#heading-Isolines) service to obtain the polygon around each store that covers the area within an 8, 17 and 25 minute walk.

# In[30]:


aoi_gdf = read_file('http://libs.cartocdn.com/cartoframes/files/starbucks_brooklyn_isolines.geojson')
aoi_gdf.head()


# In this case we have a DataFrame which, for each index in the `stores_gdf`, contains a polygon of the area of influence around each store at 8, 17 and 25 minute walking intervals. Again, the `geometry` is encoded as a `shapely` object.
#
# In this case, the `Enrichment` service provides an `enrich_polygons` function which, in its basic version, works in the same way as the `enrich_points` function. It just needs a DataFrame with polygon geometries and a list of variables to enrich with:

# In[31]:


from cartoframes.data.observatory import Enrichment

enriched_aoi_gdf = Enrichment().enrich_polygons(aoi_gdf, [variable])
enriched_aoi_gdf.head()


# We now have a new column in our areas of influence DataFrame, `SUM_POPCY`, which represents the `SUM` of the total population in the Census block groups that intersect with each polygon in our DataFrame.

# #### How enrichment works
#
# Let's take a deeper look into what happens under the hood when you execute a polygon enrichment.
#
# Imagine we have polygons representing municipalities, in blue, each of which has a population attribute, and we want to find out the population inside the green circle.
#
# [Figure: polygon enrichment]
#
# We don't know how the population is distributed inside these municipalities. They are probably concentrated in cities somewhere but, since we don't know where they are, our best guess is to assume that the population is evenly distributed in the municipality (i.e. every point inside the municipality has the same population density).
#
# Population is an extensive property (it grows with area), so we can subset it (a region inside the municipality will always have a smaller population than the whole municipality) and also aggregate it by summing.
#
# In this case, we'd calculate the population inside each part of the circle that intersects with a municipality: each municipality contributes its population multiplied by the fraction of its area that falls inside the circle, and those contributions are summed. A numeric sketch of this computation appears at the end of this guide.

# **Default aggregation methods**
#
# In the Data Observatory, we suggest a default aggregation method for certain fields. However, some fields don't have a clear best method, and some just can't be aggregated. In these cases, we leave the `agg_method` field blank and let the user choose the method that best fits their needs, as sketched in the second example below.
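# To make the area-weighted computation from the "How enrichment works" section concrete, here is a minimal standalone sketch using shapely. The two square municipalities, their populations and the circle are all made up; this is the same idea in a few lines, not the service's actual implementation:

# In[ ]:


from shapely.geometry import Point, box

# Two hypothetical municipalities (squares) with known populations.
municipalities = [
    (box(0, 0, 10, 10), 50000),   # (polygon, population)
    (box(10, 0, 20, 10), 20000),
]

# The target geometry: a circle of radius 5 centered on the shared border.
circle = Point(10, 5).buffer(5)

# Assume population is evenly distributed: each municipality contributes
# its population weighted by the fraction of its area inside the circle.
total = 0.0
for polygon, population in municipalities:
    overlap = polygon.intersection(circle)
    total += population * (overlap.area / polygon.area)

print(round(total))  # estimated population inside the circle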
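# When you do want a different aggregation than the suggested default, `enrich_polygons` accepts an `aggregation` argument. A minimal sketch, assuming the `aoi_gdf` and `variable` objects from the polygon enrichment above and that an average makes sense for the variable in question:

# In[ ]:


from cartoframes.data.observatory import Enrichment

# Override the variable's suggested `agg_method` and aggregate by average
# instead of summing the intersecting block group values.
enriched_avg_gdf = Enrichment().enrich_polygons(aoi_gdf, [variable], aggregation='AVG')
enriched_avg_gdf.head()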