#!/usr/bin/env python
# coding: utf-8

# # USDA Nutrients Visualization
# This notebook will visualize USDA nutrient data taken from the [Nutrient Explorer](http://bl.ocks.org/syntagmatic/raw/3150059/) using [Clustergrammer-widget](http://clustergrammer.readthedocs.io/clustergrammer_widget.html).
# 
# Below we use Clustergrammer's ``Network`` to load the data, replace missing values with zero, Z-score normalize the nutrient columns (so that they can be compared more easily), clip extreme values, and finally filter the foods (rows) to keep only the top 4,000 (out of ~7,000) based on their sum across nutrients. This row filtering makes the visualization easier to work with and tends to remove foods with a large amount of missing nutrient data.

# In[1]:


# import clustergrammer_widget and instantiate network instance 
from clustergrammer_widget import *
net = Network(clustergrammer_widget)

# load matrix file 
net.load_file('USDA_nutrients_clean.txt')

# swap missing values for zero
net.swap_nan_for_zero()

# normalize nutrient columns so they can be more easily compared
net.normalize(axis='col', norm_type='zscore', keep_orig=True)

# set the maximum absolute value of any matrix cell to 10
# since we do not care about extreme outliers (this also improves the look of the visualization)
net.clip(-10, 10)

# keep only the foods with the highest sum across all normalized nutrients
# (a rough proxy for overall nutrient content): top 4,000 foods out of ~7,000
net.filter_N_top('row', 4000, 'sum')

# cluster the data
net.cluster()
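

# The pre-processing steps above can be sketched in plain Python on a toy matrix. This is only an illustration of the general technique (column Z-scoring, clipping, and keeping the top rows by sum), not Clustergrammer's implementation. The names ``toy_foods``, ``toy_matrix``, ``zscore_columns``, ``clip_matrix``, and ``top_n_rows_by_sum`` are hypothetical, the values are made up, and the use of the sample standard deviation and of a plain signed row sum are assumptions.

# In[ ]:


from statistics import mean, stdev

# hypothetical toy matrix: rows are foods, columns are nutrients (made-up values)
toy_foods = ['food_a', 'food_b', 'food_c', 'food_d']
toy_matrix = [
    [8.0, 0.5, 0.1],
    [0.2, 1.0, 1.6],
    [3.0, 5.0, 2.7],
    [0.2, 60.0, 0.0],
]


def zscore_columns(matrix):
    """Z-score each column: subtract the column mean, divide by the column std."""
    columns = list(zip(*matrix))
    col_means = [mean(col) for col in columns]
    # sample standard deviation (assumption; a library may use a different convention)
    col_stds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, col_means, col_stds)]
        for row in matrix
    ]


def clip_matrix(matrix, lower, upper):
    """Limit every cell to the closed interval [lower, upper]."""
    return [[min(max(value, lower), upper) for value in row] for row in matrix]


def top_n_rows_by_sum(names, matrix, n):
    """Return the n (name, row) pairs with the largest row sums."""
    ranked = sorted(zip(names, matrix), key=lambda pair: sum(pair[1]), reverse=True)
    return ranked[:n]


# same order of operations as above: normalize, clip, then filter rows
toy_z = clip_matrix(zscore_columns(toy_matrix), -10, 10)
toy_top = top_n_rows_by_sum(toy_foods, toy_z, 2)
print([name for name, _ in toy_top])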


# # Visualizing as Interactive Heatmap with Clustergrammer
# Once we have loaded and pre-processed the data, we can visualize it with Clustergrammer (you can increase the opacity with the slider to improve visibility). We can see some broad trends:
# 
# - Nutrients (columns) cluster into two large clusters: 1) fiber, sugar, carbs etc. and 2) fat, calories, etc.
# - Foods (rows) in the same category tend to cluster together
# - Foods with low water content tend to have high sugars or fats

# In[2]:


# generate the widget visualization 
net.widget()


# Clustergrammer allows us to interactively explore the dataset:
# 
# - We can reorder rows based on nutrients to see which foods and which food categories are highest/lowest in particular nutrients
# - We can filter food-rows using the sliders to reduce the dimensionality of our dataset and find foods of particular interest (e.g. foods with the highest total nutrient levels)
# - Mousing over tiles shows the normalized and non-normalized nutrient values
# 
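
# Reordering rows by a single nutrient, as the widget does interactively, amounts to sorting the rows by one column. The cell below is a minimal sketch of that idea in plain Python. The table ``toy_nutrients`` and the helper ``order_rows_by_column`` are hypothetical and the values are made up; this is not how the widget is implemented.

# In[ ]:


# hypothetical toy data: food name -> {nutrient name: normalized value}
toy_nutrients = {
    'food_a': {'nutrient_x': 1.5, 'nutrient_y': -0.6},
    'food_b': {'nutrient_x': -0.5, 'nutrient_y': -0.5},
    'food_c': {'nutrient_x': -0.5, 'nutrient_y': 1.4},
}


def order_rows_by_column(table, column, descending=True):
    """Return the row names ordered by their value in one column."""
    return sorted(table, key=lambda name: table[name][column], reverse=descending)


# foods ordered from highest to lowest in the hypothetical 'nutrient_y' column
toy_order = order_rows_by_column(toy_nutrients, 'nutrient_y')
print(toy_order)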

# # Fast Foods
# Here we view the nutritional data for 'Fast Foods' only. Again, we see the same two clusters of foods characterized in part by high and low water content.

# In[3]:


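# keep only the rows whose first category is 'Fast Foods'
net.filter_cat('row', 1, 'Fast Foods')

# re-cluster the filtered data with the enrichrgram option turned off
net.cluster(enrichrgram=False)

# generate the widget visualization
net.widget()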


# We can double-click the saturated [fat] (g) nutrient column to find the fast food with the greatest level of saturated fat per serving, which is nachos with cinnamon and sugar.
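
# The same kind of lookup (filter to one category, then find the row with the largest value in one column) can be sketched in plain Python. The rows in ``toy_rows`` are made up and the helpers ``rows_in_category`` and ``highest_in_column`` are hypothetical; the sketch does not read the USDA matrix or use the widget.

# In[ ]:


# hypothetical rows: (food name, category, value in the column of interest)
toy_rows = [
    ('food_a', 'Fast Foods', 5.0),
    ('food_b', 'Fast Foods', 2.5),
    ('food_c', 'Fruits', 0.0),
    ('food_d', 'Dairy', 19.0),
]


def rows_in_category(rows, category):
    """Keep only the rows whose category matches exactly."""
    return [row for row in rows if row[1] == category]


def highest_in_column(rows):
    """Return the row with the largest value (third field)."""
    return max(rows, key=lambda row: row[2])


toy_fast_foods = rows_in_category(toy_rows, 'Fast Foods')
toy_highest = highest_in_column(toy_fast_foods)
print(toy_highest[0])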