Cocktail Recipe Network Analysis


One of my friends recently acquired a poster of 100 cocktails and the immediate question is where to start and how to proceed. Well, clearly it's easiest to move between recipes that share many ingredients. One way to visualize all of these interactions is using a network. A network in this conext is a collection of nodes (cocktails) connected by edges. Not all pairs of nodes need have edges. In this case, edges exist between two cocktails if they share any ingredients. Furthermore, we can weight these edges according to how many ingredients they share.

In [2]:
import numpy as np
import pandas as pd
import networkx as nx
from fa2 import ForceAtlas2

import graspy
from graspy import embed
from graspy import plot

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings

Load Data

In a csv file, all 100 of the drinks were entered along with their alcholic and non-alcoholic ingredients.

In [3]:
path = './recipes.csv'
with open('./.plotly_key.cred') as f:
    plotly_user,plotly_key = [l.strip().split('=')[1] for l in f.readlines()]
data = pd.read_csv(path, sep=',', encoding = "utf-8")
In [4]:
nodes = data.Name.values
alcohols = [[w.lower() for w in v.split('; ')] if type(v) == str else [v] for v in data.Alcohol.values]
ingredients = [[w.lower() for w in v.split('; ')] if type(v) == str else [''] for v in data.Ingredients.values]
ingredients_total = [i + j if not j == [''] else i for i,j in zip(alcohols, ingredients)]

Initial Ingredient Network and Plots

Once we load all of the data, we can easily make a network composed of nodes (cocktails) connected by edges with weight proportional to the number of shared ingredients. An easy network to construct is a simple undirected graph whre each edge weight is the number of shared ingredients. One issue is that such edges are symmetric and thus the ease of going from cocktail $A \rightarrow B$ is the same as the ease of $B \rightarrow A$. This is not necessarily true, as in the case where cocktail A has many more ingredients total. So, we use instead a directed graph with the edge weight of $A \rightarrow B$ being the fraction

$\frac{\text{Shared Ingredients} + \text{A Ingredients}}{\text{A Ingredients + B Ingredients}}$

Because the alcoholic ingredients are more pricey and harder to come by, this fraction was calculated separately for each set and the alcoholic fraction was weighted an order of magnitude greater before the two were summed.

In [39]:
G = nx.DiGraph()
for n,a,i in zip(nodes,alcohols,ingredients):
    G.nodes[n]['Alcohols'] = a
    G.nodes[n]['Ingredients'] = i
for n1,a1,i1 in zip(nodes,alcohols,ingredients):
    for n2,a2,i2 in zip(nodes,alcohols,ingredients):
        if n1 == n2:
        mutual_alc = (len(a1) + len(set(a1) & set(a2))) / (len(a2) + len(a1))
        mutual_ing = (len(i1) + len(set(i1) & set(i2))) / (len(i2) + len(i1))
        if 10*len(set(a1) & set(a2)) + len(set(i1) & set(i2)) == 0:
        G.add_edge(n1,n2,weight=10*mutual_alc + 1*mutual_ing,major=(len(set(a1) & set(a2)) > 0))

Now we draw our network...

In [21]:

...but this makes no discernable sense to us and provides little meaningful information as it is too cluttered. To imporve our visualization, we use the ForceAtlas2 package and its simulated forces based on edge weights to create a more meaningful visualization.

In [22]:
forceatlas2 = ForceAtlas2(scalingRatio=1,

pos = forceatlas2.forceatlas2_networkx_layout(G, pos=None, iterations=2000)
100%|█████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:03<00:00, 542.57it/s]
BarnesHut Approximation  took  0.63  seconds
Repulsion forces  took  2.41  seconds
Gravitational forces  took  0.05  seconds
Attraction forces  took  0.11  seconds
AdjustSpeedAndApplyForces step  took  0.24  seconds
In [23]:
## Plot the new graph
nx.draw(G ,pos,node_size=40,width=0.1, linewidths=0.8, zoom=1.5, arrows=False)

Community Detection

The ForceAtlas2 visualization looks nice, but what we want to know is what communities exist within the graph, and thus what cocktails are easy to transition between. A community is a set of nodes which are more strongly connected with each other than nodes among other communities, and hence have more ingredients in common.

A graph is just as nicely represented by an adjacency matrix. For a graph of $n$ nodes, in our case 100, the adjacency matrix is an $n \times n$ matrix where the index $i,j$ is the strength of the connection from node $i$ to node $j$. We can visualize this matrix using teh graph statistics package GrasPY.

In [24]:
A = nx.to_numpy_matrix(G)

But how do we translate these to community assignments? The answer lies in spectral embedding. We shall reduce the dimensionality of our data, projecting it to a lower dimensional subspace where it bext maximizes the variance.

In [25]:
from sklearn.cluster import KMeans
In [26]:
A = nx.to_numpy_matrix(G)
ase = embed.AdjacencySpectralEmbed()
Xhat = ase.fit_transform(A)[0]

Now, we can us a K-means clustering algorithm to detect communities among nodes. In the below plot, the data has been projected form $n$ dimensions to 4. In this 4-dimensional subspace, K-means has determined and labeled 8 different communities among our data. Once can fairly easily eyeball some of these communities in some of the subspace planes.

In [27]:
labels_kmeans = KMeans().fit_predict(Xhat)
         title='KMeans on embedded adjacency matrix',
         legend_name='Predicted label',
<seaborn.axisgrid.PairGrid at 0x16bf44e3d30>
In [28]:
idxs = np.argsort(labels_kmeans)
A = A[idxs,:][:,idxs]

We can sort the data according to labels and re-visualize it as a labeled heatmap. As well as plot our graph again, this time with colors.

In [29]:
title = 'Sorted Graph by Labels'
fig = plot.heatmap(A, title=title, inner_hier_labels=labels_kmeans[idxs])
In [30]:
nx.draw(G ,pos,node_size=40,width=0.1, linewidths=0.8, zoom=1.5,node_color=labels_kmeans+1,, arrows=False)