Interacting with Glue from IPython

Glue makes it easy to build linked, interactive statistical graphs from files and Python datasets. One of Glue's nice features for interactive data analysis is the ability to run Glue and a "normal" Python session in parallel. This lets you extract data, plots, and data selections from Glue, or send information back to Glue. Here's a demo using a catalog of FBI crime statistics.

Setting up IPython

Glue is a Qt program, and we need to run a special IPython magic function to properly set up interaction between Qt and IPython. Without it, IPython will be unresponsive while Glue is running.

In [1]:
from glue import qglue
import pandas as pd

# set up IPython/Qt integration
# NOTE: this cell takes a second to run. For some reason,
#       IPython will stall if you try to run the next cell before this one completes
%gui qt4 

Loading Data

In [2]:
states = pd.read_csv('state_crime.csv')
states.head()
Out[2]:
   Year  Population  Violent Crime rate  Murder  Rape  Robbery  Assault  Property  Burglary  Larceny  Vehicular    State
0  1960     3266740               186.6    12.4   8.6     27.5    138.1    1035.4     355.9    592.1       87.3  Alabama
1  1961     3302000               168.5    12.9   7.6     19.1    128.9     985.5     339.3    569.4       76.8  Alabama
2  1962     3358000               157.3     9.4   6.5     22.5    119.0    1067.0     349.1    634.5       83.4  Alabama
3  1963     3347000               182.7    10.2   5.7     24.7    142.1    1150.9     376.9    683.4       90.6  Alabama
4  1964     3407000               213.1     9.3  11.7     29.1    163.0    1358.7     466.6    784.1      108.0  Alabama

Sending to Glue

qglue is an easy way to send Python data structures (NumPy arrays, Pandas dataframes, Astropy tables, others) to Glue. It returns an application object which contains lots of state about the application. One of the most important pieces of this state is the data collection:

In [20]:
app = qglue(states=states)
dc = app.data_collection

Data Collections are list-like, and contain each dataset passed to Glue (only one in our case):

In [6]:
print(dc)
DataCollection (1 data set)
	  0: states
In [7]:
data = dc[0]
print(type(data))
print(data)
<class 'glue.core.data.Data'>
Data Set: states
Number of dimensions: 1
Shape: 2751
Components:
 0) Year
 1) Pixel Axis 0
 2) World 0
 3) Population
 4) Violent Crime rate
 5) Murder
 6) Rape
 7) Robbery
 8) Assault
 9) Property
 10) Burglary
 11) Larceny
 12) Vehicular
 13) State

Individual datasets in Glue are dictionary-like: we extract arrays using bracket notation.

In [8]:
data['Murder']
Out[8]:
array([ 12.4,  12.9,   9.4, ...,   4.8,   4.7,   4.7])

I've created a few basic graphs in Glue, which look like this:

Let's use scikit-learn to run a simple k-means clustering on the data, and send the cluster IDs back to Glue as new subsets.

In [9]:
from sklearn.cluster import KMeans
import numpy as np

# extract data into Numpy [N,3] array
X = np.column_stack((data['Robbery'], data['Rape'], data['Murder']))
clusters = KMeans(n_clusters=3).fit_predict(X)

# add cluster_id as a new attribute
c = data.add_component(clusters, 'cluster_id')

# create 3 new subsets that select each value in clusters
dc.new_subset_group(label='Cluster 1', subset_state=(c == 0))
dc.new_subset_group(label='Cluster 2', subset_state=(c == 1))
dc.new_subset_group(label='Cluster 3', subset_state=(c == 2))
Out[9]:
<glue.core.subset_group.SubsetGroup at 0x11e790550>

The plots update automatically, coloring the new clusters.
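
To make the subset definitions above more concrete, here is a minimal NumPy-only sketch of how integer cluster labels translate into one selection per cluster. The labels array is hypothetical and simply stands in for the output of fit_predict. In Glue the expression c == 0 builds a subset state rather than an array, so treat this as an analogy for which rows end up in each subset, not as Glue's internals.

```python
import numpy as np

# Hypothetical cluster labels, standing in for KMeans(...).fit_predict(X)
labels = np.array([0, 2, 1, 0, 2, 2])

# One boolean mask per cluster ID; True marks the rows assigned to that cluster
masks = {int(k): labels == k for k in np.unique(labels)}

for k in sorted(masks):
    print("Cluster %d: %d rows" % (k + 1, masks[k].sum()))
```

Every row belongs to exactly one cluster, so the masks never overlap and together cover the whole dataset.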

Extracting Data From Glue

Data objects also have a to_dataframe() method which converts the Glue data back to a DataFrame. Note that the new cluster_id attribute is included in the output.

In [12]:
df = data.to_dataframe()
print(df.columns)
cuts = pd.cut(df.Robbery, 10)
df.groupby(cuts).Murder.mean()
Index([u'Assault', u'Burglary', u'Larceny', u'Murder', u'Pixel Axis 0', u'Population', u'Property', u'Rape', u'Robbery', u'State', u'Vehicular', u'Violent Crime rate', u'World 0', u'Year', u'cluster_id'], dtype='object')
Out[12]:
Robbery
(0.267, 165.22]        5.288493
(165.22, 328.54]       8.553755
(328.54, 491.86]      11.371795
(491.86, 655.18]      16.909091
(655.18, 818.5]       31.235714
(818.5, 981.82]       37.100000
(981.82, 1145.14]     41.614286
(1145.14, 1308.46]    64.050000
(1308.46, 1471.78]    31.100000
(1471.78, 1635.1]     34.350000
Name: Murder, dtype: float64
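
The binning step in the cell above may be easier to follow on a tiny table. The sketch below uses a hypothetical six-row DataFrame (not the FBI data) and two equal-width bins instead of ten. The names toy, bins, and means are illustrative only.

```python
import pandas as pd

# Hypothetical toy table, used only to illustrate the pattern
toy = pd.DataFrame({
    'Robbery': [10.0, 20.0, 30.0, 110.0, 120.0, 200.0],
    'Murder': [1.0, 2.0, 3.0, 4.0, 8.0, 12.0],
})

# Split the Robbery range into 2 equal-width bins
bins = pd.cut(toy['Robbery'], 2)

# Average Murder within each Robbery bin
# (observed=False keeps empty bins in the result, if there are any)
means = toy.groupby(bins, observed=False)['Murder'].mean()
print(means)
```

The three low-Robbery rows fall into the first bin and the three high-Robbery rows into the second, so the result has one mean per bin.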

Subsets can also be extracted from Glue, and converted into boolean masks. This is useful for investigating selections defined by hand:

In [19]:
outliers = data.subsets[0].to_mask()
print("Selected %i rows" % outliers.sum())
print(outliers)
df[outliers].head()
Selected 189 rows
[False False False ..., False False False]
Out[19]:
     Assault  Burglary  Larceny  Murder  Pixel Axis 0  Population  Property  Rape  Robbery   State  Vehicular  Violent Crime rate  World 0  Year  cluster_id
53      45.1     332.1    970.5    10.2            53      226167    1544.9  20.8     28.3  Alaska      242.3               104.3       53  1960           1
55      54.5     351.6    985.4     4.5            55      246000    1564.6  18.7     13.8  Alaska      227.6                91.5       55  1962           1
56      66.1     381.5   1213.7     6.5            56      248000    1952.8  14.9     22.2  Alaska      357.7               109.7       56  1963           1
104    466.1     394.0   2052.1     4.1           104      723860    2637.8  60.2     79.6  Alaska      191.7               610.1      104  2011           1
105    433.2     403.3   2128.0     4.1           105      731449    2739.4  79.7     86.1  Alaska      208.1               603.2      105  2012           1
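
Once a subset is available as a boolean mask, the same mask can split a DataFrame into selected and unselected rows. The sketch below assumes a hypothetical four-row table and a hand-written mask standing in for the array returned by to_mask(). None of the values come from the FBI catalog.

```python
import numpy as np
import pandas as pd

# Hypothetical table and mask, standing in for df and data.subsets[0].to_mask()
toy = pd.DataFrame({
    'State': ['A', 'B', 'C', 'D'],
    'Murder': [2.0, 9.0, 4.0, 11.0],
})
mask = np.array([False, True, False, True])

selected = toy[mask]   # rows inside the selection
rest = toy[~mask]      # complement: rows outside the selection

print("Selected %i rows" % mask.sum())
print("Mean Murder, selected: %.1f" % selected['Murder'].mean())
print("Mean Murder, rest:     %.1f" % rest['Murder'].mean())
```

Comparing the two groups side by side is a quick way to check whether a hand-drawn selection really picks out unusual rows.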