Below we use Clustergrammer's Network object to load the data, replace missing values with zeros, normalize the nutrient columns (so they can be compared on the same scale), and finally filter the foods (rows) to keep only the top 4,000 foods (out of roughly 7,000) ranked by their sum across nutrients. This row filtering makes the visualization easier to work with and tends to remove foods with a large amount of missing nutrient data.
# import clustergrammer_widget and instantiate a network instance
import numpy as np
from clustergrammer_widget import *
net = Network()

# load matrix file
net.load_file('USDA_nutrients_clean.txt')

# swap missing values for zero
net.swap_nan_for_zero()

# normalize nutrient columns so they can be more easily compared
net.normalize(axis='col', norm_type='zscore', keep_orig=True)

# set the maximum absolute value of any matrix cell to 10,
# since we do not care about extreme outliers
# (this also improves the look of the visualization)
net.dat['mat'] = np.clip(net.dat['mat'], -10, 10)

# filter the foods to keep only those with the highest sum across
# all nutrients (e.g. most nutritious): top 4,000 out of ~7,000
net.filter_N_top('row', 4000, 'sum')
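The same preprocessing pipeline can be sketched directly in pandas on a toy matrix; this is illustrative only (the data and column names are made up, and Clustergrammer's z-score convention may differ slightly), but it shows what each step does to the values:

```python
# Toy sketch of the preprocessing steps: fill missing values,
# z-score columns, clip outliers, keep top rows by sum.
import numpy as np
import pandas as pd

# hypothetical nutrient matrix: 5 foods x 2 nutrients, one missing value
df = pd.DataFrame(
    {'protein': [10.0, np.nan, 30.0, 5.0, 50.0],
     'fat':     [1.0, 2.0, 3.0, 4.0, 100.0]},
    index=['food_a', 'food_b', 'food_c', 'food_d', 'food_e'])

# swap missing values for zero
df = df.fillna(0)

# z-score each nutrient column: (x - column mean) / column std
df = (df - df.mean()) / df.std(ddof=0)

# clip extreme outliers to the range [-10, 10]
df = df.clip(lower=-10, upper=10)

# keep the top 3 rows ranked by their sum across nutrients
top = df.loc[df.sum(axis=1).nlargest(3).index]
print(top.shape)  # (3, 2)
```

After z-scoring, each column has mean zero and unit variance, so nutrients measured in very different units (grams vs. micrograms) become comparable.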
# cluster the data
net.make_clust()
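Under the hood, make_clust hierarchically clusters rows and columns so that similar foods and similar nutrients end up adjacent in the heatmap. A minimal sketch of that idea using scipy (toy random data; Clustergrammer's own distance and linkage defaults may differ):

```python
# Illustrative hierarchical clustering of rows and columns with scipy.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
mat = rng.normal(size=(6, 4))  # toy matrix: 6 rows x 4 columns

# cluster rows: build a linkage tree, then read off the leaf order
row_order = leaves_list(linkage(mat, method='average', metric='cosine'))

# cluster columns the same way on the transpose
col_order = leaves_list(linkage(mat.T, method='average', metric='cosine'))

# reorder the matrix so similar rows/columns sit next to each other
ordered = mat[np.ix_(row_order, col_order)]
print(ordered.shape)  # (6, 4)
```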
Once we have loaded and pre-processed the data, we can visualize it with Clustergrammer (you can increase the opacity with the slider to improve visibility). We can see some broad trends:
# generate the widget visualization
clustergrammer_widget(network=net.widget())
Clustergrammer allows us to interactively explore the dataset:
df = net.export_df()
df = df.transpose()
foods = df.columns.tolist()
# keep only the columns belonging to the 'Fast Foods' group
# (column labels from export_df may include the food-group category,
# so match by substring rather than exact equality)
fast_foods = [i for i in foods if 'Fast Foods' in str(i)]
df = df[fast_foods]
df = df.transpose()

# load the filtered DataFrame back into the network and re-cluster
net.load_df(df)
net.make_clust()
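The column-filtering step above can be illustrated on a toy DataFrame; the labels here are hypothetical (Clustergrammer's exported labels encode the food name together with its category), but the substring-matching pattern is the same:

```python
# Toy sketch: select columns whose label mentions a category string.
import pandas as pd

# hypothetical columns labeled "name: category"
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=['protein', 'fat'],
    columns=['burger: Fast Foods', 'apple: Fruits', 'fries: Fast Foods'])

# keep only columns whose label mentions 'Fast Foods'
fast_cols = [c for c in df.columns if 'Fast Foods' in str(c)]
df_fast = df[fast_cols]
print(df_fast.shape)  # (2, 2)
```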
Here we are viewing the nutritional data of only 'Fast Foods'. We see the same two clusters of foods characterized in part by high and low water content.