#!/usr/bin/env python
# coding: utf-8

# ## Data discovery: how to explore CARTO's Data Observatory catalog
#
# This notebook shows how to use CARTOframes to discover and explore datasets from CARTO's [Data Observatory](https://carto.com/spatial-data-catalog/).
#
# If you haven't installed CARTOframes yet, please visit our [installation guide](https://carto.com/developers/cartoframes/guides/Installation/).
#
# The notebook is organized in the following sections:
#
# 0. [Setup](#section0)
#
#     0.1. [Import packages](#section0.1)
#
#     0.2. [Set CARTO default credentials](#section0.2)
#
#
# 1. [Data discovery](#section1)
#
#     1.1. [Data catalog structure](#section1.1)
#
#     1.2. [Combining filters](#section1.2)
#
#     1.3. [Filter datasets by the type of geography](#section1.3)
#
#     1.4. [Get a first glimpse of a dataset](#section1.4)
#
#
# **Want to learn more?** Learn how to access and download datasets in the notebooks _Access Public Data_ and _Access Premium Data_.

# ### 0. Setup
#
# #### 0.1. Import packages

# In[1]:

import geopandas as gpd
import pandas as pd

from cartoframes.auth import set_default_credentials
from cartoframes.data.observatory import Catalog, Dataset

# Show every column when a DataFrame is displayed
pd.set_option('display.max_columns', None)

# #### 0.2. Set CARTO default credentials
#
# To use the Data Observatory via CARTOframes, you need to set your CARTO account credentials first.
#
# Please visit the [Authentication guide](https://carto.com/developers/cartoframes/guides/Authentication/) for further detail.

# In[2]:

set_default_credentials('creds.json')

# **Note about credentials**
#
# For security reasons, we recommend storing your credentials in an external file to prevent publishing them by accident when sharing your notebooks. You can get more information in the section _Setting your credentials_ of the [Authentication guide](https://carto.com/developers/cartoframes/guides/Authentication/).

# ### 1. Data discovery
#
# CARTO's data catalog consists of a set of datasets organized by:
# - Country
# - Category
# - Provider
# - Geography
#
# In addition, datasets are classified as public or premium.
#
# This classification can be used to explore the Catalog and narrow down your search for the dataset most suitable for your analysis.
#
# For example, you may start exploring by country, then filter by category, and finally select a dataset based on the provider. Alternatively, you could start exploring by category or provider.

# #### 1.1. Data catalog structure
#
# In this subsection we explore each of the four classification categories.
#
# **Note:** results can be displayed as a Python list or as a Pandas DataFrame.

# ##### 1.1.1. Countries

# In[3]:

Catalog().countries.to_dataframe().head(10)

# ##### 1.1.2. Categories

# In[4]:

Catalog().categories.to_dataframe()

# ##### 1.1.3. Providers

# In[5]:

Catalog().providers.to_dataframe().head(10)

# ##### 1.1.4. Geographies

# In[6]:

Catalog().geographies.to_dataframe().head()

# #### 1.2. Combining filters
#
# Let's now take a look at how to combine the country, category, and provider filters to find the public demographics datasets available in the US from ACS. Section 1.3 then narrows the result to the census tract level.

# Categories available in the US.

# In[7]:

Catalog().country('usa').categories.to_dataframe()

# Providers available in the US.

# In[8]:

Catalog().country('usa').providers.to_dataframe().head()

# List of providers available in the US offering demographics data.

# In[9]:

Catalog().country('usa').category('demographics').providers

# List of providers available in the US offering **public** demographics data.

# In[10]:

Catalog().country('usa').category('demographics').public().providers

# Let's now take a look at all the demographics datasets provided by ACS in the US.

# In[11]:

Catalog().country('usa').category('demographics').provider('usa_acs').datasets.to_dataframe().head()

# #### 1.3. Filter datasets by the type of geography
#
# We can explore the types of geographies for which the datasets are available. To filter by a specific type of geography, we apply a filter to the `geography_name` column, just like we would for any string column of a Pandas DataFrame.

# In[12]:

datasets_acs_df = Catalog().country('usa').category('demographics').provider('usa_acs').datasets.to_dataframe()

# In[13]:

datasets_acs_df['geography_name'].unique()

# In[14]:

datasets_acs_df[datasets_acs_df['geography_name'].str.contains('Census Tract')]

# #### 1.4. Get a first glimpse of a dataset
#
# We select the dataset `acs_sociodemogr_496a0675` from the list of datasets above because it is the one with the latest data update.
#
# CARTOframes allows you to get a first glimpse of the dataset so that you can make sure it's the right dataset for your analysis. This includes:
# - Information about the dataset: a description, provider, temporal aggregation, whether it is public or premium, etc.
# - Information about its variables: a name, description, aggregation method, etc.
# - Access to the first ten rows of the dataset.
# - A statistical description of all numerical variables, just like the `describe()` function in Pandas.
# - A map with the geometric coverage of the dataset.

# In[15]:

sample_ds = Dataset.get('acs_sociodemogr_496a0675')

# ##### 1.4.1. Information about the dataset

# In[16]:

sample_ds.to_dict()

# ##### 1.4.2. Information about the dataset variables

# In[17]:

sample_ds.variables.to_dataframe().head(5)

# ##### 1.4.3. Access to the first ten rows of the dataset

# In[18]:

sample_ds.head()

# ##### 1.4.4. Summary of different counts over the actual dataset data

# In[19]:

sample_ds.counts()

# ##### 1.4.5. Fields by type

# In[20]:

sample_ds.fields_by_type()

# ##### 1.4.6. Statistical description of numerical variables

# In[21]:

sample_ds.describe()

# ##### 1.4.7. Visualization of dataset coverage

# In[22]:

sample_ds.geom_coverage()
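
# **A note on the geography filter from section 1.3**
#
# The `str.contains` filter used in section 1.3 is case-sensitive and, by default, treats its argument as a regular expression. It also returns missing values for rows without a geography name, and such a mask cannot be used for boolean indexing. The cell below is a minimal sketch of a more forgiving variant. It runs on a made-up stand-in DataFrame rather than the real catalog (the ids and geography names are placeholders, not actual Data Observatory entries), and `filter_by_geography` is a hypothetical helper, not part of CARTOframes.

```python
import pandas as pd

# Placeholder rows standing in for a catalog listing; these are NOT
# real Data Observatory datasets, only an illustration of the column layout.
datasets_df = pd.DataFrame({
    "id": ["dataset_a", "dataset_b", "dataset_c", "dataset_d"],
    "geography_name": [
        "Census Tract - United States",
        "County - United States",
        "census tract - United States (2015)",
        None,  # a row with a missing geography name
    ],
})


def filter_by_geography(df, pattern, column="geography_name"):
    """Hypothetical helper: keep rows whose `column` contains `pattern`.

    The match is literal (regex=False) and case-insensitive (case=False);
    rows with a missing value are treated as non-matching (na=False).
    """
    mask = df[column].str.contains(pattern, case=False, na=False, regex=False)
    return df[mask]


tract_df = filter_by_geography(datasets_df, "census tract")
print(tract_df["id"].tolist())
```

# Assuming the catalog listing keeps a `geography_name` column as in section 1.3, the same helper could be applied to `datasets_acs_df` in place of the stand-in DataFrame.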