#!/usr/bin/env python
# coding: utf-8
# ## Data discovery: how to explore CARTO's Data Observatory catalog.
#
# This notebook shows how to use CARTOframes to discover and explore datasets from CARTO's [Data Observatory](https://carto.com/spatial-data-catalog/).
#
# If you haven't installed CARTOframes yet, please visit our [installation guide](https://carto.com/developers/cartoframes/guides/Installation/).
#
# The notebook is organized in the following sections:
#
# 0. [Setup](#section0)
#
# 0.1. [Import packages](#section0.1)
#
# 0.2. [Set CARTO default credentials](#section0.2)
#
#
# 1. [Data discovery](#section1)
#
# 1.1. [Data catalog structure](#section1.1)
#
# 1.2. [Combining filters](#section1.2)
#
# 1.3. [Filter datasets by the type of geography](#section1.3)
#
# 1.4. [Get a first glimpse of a dataset](#section1.4)
#
#
# **Want to learn more?** See the Access Public Data and Access Premium Data notebooks to learn how to access and download datasets.
#
# ### 0. Setup
#
# #### 0.1. Import packages
# In[1]:
import geopandas as gpd
import pandas as pd
from cartoframes.auth import set_default_credentials
from cartoframes.data.observatory import Catalog, Dataset

# Show every column when displaying DataFrames.
pd.set_option('display.max_columns', None)
#
# #### 0.2. Set CARTO default credentials
#
# To use the Data Observatory via CARTOframes, you first need to set your CARTO account credentials.
#
# See the [Authentication guide](https://carto.com/developers/cartoframes/guides/Authentication/) for further details.
# In[2]:
set_default_credentials('creds.json')
# **Note about credentials**
#
# For security reasons, we recommend storing your credentials in an external file to prevent publishing them by accident when sharing your notebooks. You can get more information in the section _Setting your credentials_ of the [Authentication guide](https://carto.com/developers/cartoframes/guides/Authentication/).
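#
# As a minimal sketch of what such an external file might look like, the cell below writes a placeholder credentials file using only the standard library. The `username` and `api_key` keys are an assumption about the layout that `set_default_credentials` reads (the Authentication guide is the authoritative source), and the values and the temporary path are hypothetical.
# In[ ]:
import json
import os
import tempfile

# Placeholder values only; never commit real credentials to version control.
example_creds = {'username': 'your_carto_user', 'api_key': 'your_api_key'}

# Write to a temporary directory so this sketch cannot overwrite a real creds.json.
example_creds_path = os.path.join(tempfile.mkdtemp(), 'creds.json')
with open(example_creds_path, 'w') as creds_file:
    json.dump(example_creds, creds_file, indent=2)

# Read the file back to confirm it is valid JSON with the expected keys.
with open(example_creds_path) as creds_file:
    loaded_creds = json.load(creds_file)
print(sorted(loaded_creds))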
#
# ### 1. Data discovery
#
# CARTO's data catalog consists of a set of datasets organized by:
# - Country
# - Category
# - Provider
# - Geography
#
# In addition, datasets are classified as either public or premium.
#
# This classification can be used to explore the Catalog and narrow down your search for the dataset most suitable for your analysis.
#
# For example, you may start exploring by country, then filter by category and finally select a dataset based on the provider. Alternatively, you could start exploring by category or provider.
#
# #### 1.1. Data catalog structure
#
# In this subsection we explore each of the four classification criteria.
#
# **Note:** results can be returned as a Python list or as a pandas DataFrame.
# ##### 1.1.1. Countries
# In[3]:
Catalog().countries.to_dataframe().head(10)
# ##### 1.1.2. Categories
# In[4]:
Catalog().categories.to_dataframe()
# ##### 1.1.3. Providers
# In[5]:
Catalog().providers.to_dataframe().head(10)
# ##### 1.1.4. Geographies
# In[6]:
Catalog().geographies.to_dataframe().head()
#
# #### 1.2. Combining filters
# Let's now combine the country, category, and provider filters to work toward the public demographics datasets available in the US from ACS at the census tract level.
# Categories available in the US.
# In[7]:
Catalog().country('usa').categories.to_dataframe()
# Providers available in the US.
# In[8]:
Catalog().country('usa').providers.to_dataframe().head()
# List of providers available in the US offering demographics data.
# In[9]:
Catalog().country('usa').category('demographics').providers
# List of providers available in the US offering **public** demographics data.
# In[10]:
Catalog().country('usa').category('demographics').public().providers
# Let's now take a look at all the demographics datasets provided by ACS in the US.
# In[11]:
Catalog().country('usa').category('demographics').provider('usa_acs').datasets.to_dataframe().head()
#
# #### 1.3. Filter datasets by the type of geography
#
# We can explore the types of geographies for which the datasets are available. To filter by a specific type of geography, we filter on the `geography_name` column, just as we would on any string column of a pandas DataFrame.
# In[12]:
datasets_acs_df = Catalog().country('usa').category('demographics').provider('usa_acs').datasets.to_dataframe()
# In[13]:
datasets_acs_df['geography_name'].unique()
# In[14]:
datasets_acs_df[datasets_acs_df['geography_name'].str.contains('Census Tract')]
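# The same kind of filter can be tried offline on a small stand-in table. This is a sketch with made-up rows: only the `geography_name` column name mirrors the catalog DataFrame above, and every value is hypothetical. Passing `case=False` and `na=False` to `str.contains` makes the match case-insensitive and treats missing names as non-matches.
# In[ ]:
import pandas as pd

# Hypothetical rows standing in for the catalog listing.
toy_datasets_df = pd.DataFrame({
    'id': ['toy_ds_a', 'toy_ds_b', 'toy_ds_c', 'toy_ds_d'],
    'geography_name': [
        'Census Tract - United States',
        'County - United States',
        None,
        'census tract (2015)',
    ],
})

# Boolean mask: True where the geography name mentions census tracts.
tract_mask = toy_datasets_df['geography_name'].str.contains('census tract', case=False, na=False)
toy_tract_df = toy_datasets_df[tract_mask]
print(toy_tract_df['id'].tolist())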
#
# #### 1.4. Get a first glimpse of a dataset
#
# We select the dataset `acs_sociodemogr_496a0675` from the list above because it has the most recent data update.
#
# CARTOframes allows you to get a first glimpse of the dataset so that you can make sure it's the right dataset for your analysis. This includes:
# - Information about the dataset: description, provider, temporal aggregation, whether it is public or premium, etc.
# - Information about its variables: name, description, aggregation method, etc.
# - Access to the first 10 rows of the dataset.
# - A statistical description of all numerical variables, just like the `describe()` function in Pandas.
# - A map with the geometric coverage of the dataset.
# In[15]:
sample_ds = Dataset.get('acs_sociodemogr_496a0675')
# ##### 1.4.1. Information about the dataset
# In[16]:
sample_ds.to_dict()
# ##### 1.4.2. Information about the dataset variables
# In[17]:
sample_ds.variables.to_dataframe().head(5)
# ##### 1.4.3. Access to the first ten rows of the dataset
# In[18]:
sample_ds.head()
# ##### 1.4.4. Summary of counts computed over the actual data
# In[19]:
sample_ds.counts()
# ##### 1.4.5. Fields by type
# In[20]:
sample_ds.fields_by_type()
# ##### 1.4.6. Statistical description of numerical variables
# In[21]:
sample_ds.describe()
# ##### 1.4.7. Visualization of dataset coverage
# In[22]:
sample_ds.geom_coverage()