Glue makes it easy to build linked, interactive statistical graphs from files and Python datasets. One of Glue's nice features for interactive data analysis is the ability to run Glue and a "normal" Python session in parallel. This lets you extract data, plots, and data selections from Glue, or send information back to Glue. Here's a demo, using a catalog of FBI crime statistics.
Glue is a Qt program, so we need to run a special IPython magic function to set up the interaction between Qt and IPython properly. Without it, IPython will be unresponsive while Glue is running.
from glue import qglue
import pandas as pd
# set up IPython/Qt integration
# NOTE: this cell takes a second to run. For some reason,
# IPython will stall if you try to run the next cell before this one completes
%gui qt4
states = pd.read_csv('state_crime.csv')
states.head()
| | Year | Population | Violent Crime rate | Murder | Rape | Robbery | Assault | Property | Burglary | Larceny | Vehicular | State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1960 | 3266740 | 186.6 | 12.4 | 8.6 | 27.5 | 138.1 | 1035.4 | 355.9 | 592.1 | 87.3 | Alabama |
| 1 | 1961 | 3302000 | 168.5 | 12.9 | 7.6 | 19.1 | 128.9 | 985.5 | 339.3 | 569.4 | 76.8 | Alabama |
| 2 | 1962 | 3358000 | 157.3 | 9.4 | 6.5 | 22.5 | 119.0 | 1067.0 | 349.1 | 634.5 | 83.4 | Alabama |
| 3 | 1963 | 3347000 | 182.7 | 10.2 | 5.7 | 24.7 | 142.1 | 1150.9 | 376.9 | 683.4 | 90.6 | Alabama |
| 4 | 1964 | 3407000 | 213.1 | 9.3 | 11.7 | 29.1 | 163.0 | 1358.7 | 466.6 | 784.1 | 108.0 | Alabama |
app = qglue(states=states)
dc = app.data_collection
Data Collections are list-like, and contain each dataset passed to Glue (only one in our case):
print dc
DataCollection (1 data set)
    0: states
data = dc[0]
print type(data)
print data
<class 'glue.core.data.Data'>
Data Set: states
Number of dimensions: 1
Shape: 2751
Components:
 0) Year
 1) Pixel Axis 0
 2) World 0
 3) Population
 4) Violent Crime rate
 5) Murder
 6) Rape
 7) Robbery
 8) Assault
 9) Property
 10) Burglary
 11) Larceny
 12) Vehicular
 13) State
Individual datasets in Glue are dictionary-like, so we extract arrays using bracket notation:
data['Murder']
array([ 12.4, 12.9, 9.4, ..., 4.8, 4.7, 4.7])
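Because bracket access hands back a plain NumPy array, the usual NumPy operations apply to it directly. Here is a small sketch of that idea. It does not need Glue: the array `murder` is a stand-in for `data['Murder']`, filled with only the first five Murder values from the `states.head()` output above, and the variable names are my own.

```python
import numpy as np

# stand-in for data['Murder']: the first five Murder values shown in states.head()
murder = np.array([12.4, 12.9, 9.4, 10.2, 9.3])

mean_rate = murder.mean()         # average over the rows
highest = murder.max()            # largest value
above_mean = murder > mean_rate   # boolean array, one entry per row

print("mean: %.2f, max: %.1f" % (mean_rate, highest))
print("rows above the mean: %d" % above_mean.sum())
```

The same calls should work unchanged on the full array extracted from Glue.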
I've created a few basic graphs in Glue, which look like this
Let's use scikit-learn to run a simple K-means clustering on the data, and send the cluster IDs back to Glue as new subsets:
from sklearn.cluster import KMeans
import numpy as np
# extract data into Numpy [N,3] array
X = np.column_stack((data['Robbery'], data['Rape'], data['Murder']))
clusters = KMeans(n_clusters=3).fit_predict(X)
# add cluster_id as a new attribute
c = data.add_component(clusters, 'cluster_id')
# create 3 new subsets, one for each value in clusters
dc.new_subset_group(label='Cluster 1', subset_state=(c == 0))
dc.new_subset_group(label='Cluster 2', subset_state=(c == 1))
dc.new_subset_group(label='Cluster 3', subset_state=(c == 2))
<glue.core.subset_group.SubsetGroup at 0x11e790550>
The plots update automatically, coloring the new clusters.
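To make it clearer what the cluster IDs are, here is a toy version of the K-means idea (Lloyd's algorithm) in plain NumPy. It is a sketch for illustration only, not scikit-learn's implementation, which among other things uses a smarter initialization. The function `kmeans_labels` and the six-point array `X_toy` are made up for this example.

```python
import numpy as np

def kmeans_labels(X, n_clusters, n_iter=20, seed=0):
    """Toy Lloyd's algorithm: return one integer cluster ID per row of X."""
    rng = np.random.default_rng(seed)
    # start from n_clusters distinct rows chosen at random (fancy indexing copies)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # squared distance from every row to every center, shape [N, k]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its rows (leave it alone if empty)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# two well-separated groups of made-up points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels = kmeans_labels(X_toy, n_clusters=2)

# one boolean mask per cluster ID, the NumPy analogue of (c == 0), (c == 1)
toy_masks = [toy_labels == k for k in range(2)]
print(toy_labels)
```

Each row ends up with an integer ID, and comparing those IDs against a value selects the rows of one cluster, which is the selection that the `c == 0` style subset states above express.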
Data objects also have a to_dataframe() method, which converts the Glue data back to a DataFrame. Note that the new cluster_id attribute is included in the output:
df = data.to_dataframe()
print df.columns
cuts = pd.cut(df.Robbery, 10)
df.groupby(cuts).Murder.mean()
Index([u'Assault', u'Burglary', u'Larceny', u'Murder', u'Pixel Axis 0', u'Population', u'Property', u'Rape', u'Robbery', u'State', u'Vehicular', u'Violent Crime rate', u'World 0', u'Year', u'cluster_id'], dtype='object')
Robbery
(0.267, 165.22]        5.288493
(165.22, 328.54]       8.553755
(328.54, 491.86]      11.371795
(491.86, 655.18]      16.909091
(655.18, 818.5]       31.235714
(818.5, 981.82]       37.100000
(981.82, 1145.14]     41.614286
(1145.14, 1308.46]    64.050000
(1308.46, 1471.78]    31.100000
(1471.78, 1635.1]     34.350000
Name: Murder, dtype: float64
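The `pd.cut` plus `groupby` step can be hard to read at first glance, so here is the same pattern on a tiny table. The numbers in `df_toy` are made up purely to show the mechanics, and I pass `observed=True` explicitly, which assumes a reasonably recent pandas.

```python
import pandas as pd

# made-up numbers, only to show the mechanics
df_toy = pd.DataFrame({
    'Robbery': [10.0, 20.0, 30.0, 110.0, 120.0],
    'Murder': [1.0, 2.0, 3.0, 10.0, 12.0],
})

# split the Robbery range into 2 equal-width bins
toy_cuts = pd.cut(df_toy['Robbery'], 2)

# average Murder within each Robbery bin
binned = df_toy.groupby(toy_cuts, observed=True)['Murder'].mean()
print(binned)
```

The three low-Robbery rows land in the first bin and the two high-Robbery rows in the second, so the result has one mean Murder value per bin, just like the ten-bin output above.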
Subsets can also be extracted from Glue, and converted into boolean masks. This is useful for investigating selections defined by hand:
outliers = data.subsets[0].to_mask()
print "Selected %i rows" % outliers.sum()
print outliers
df[outliers].head()
Selected 189 rows
[False False False ..., False False False]
| | Assault | Burglary | Larceny | Murder | Pixel Axis 0 | Population | Property | Rape | Robbery | State | Vehicular | Violent Crime rate | World 0 | Year | cluster_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | 45.1 | 332.1 | 970.5 | 10.2 | 53 | 226167 | 1544.9 | 20.8 | 28.3 | Alaska | 242.3 | 104.3 | 53 | 1960 | 1 |
| 55 | 54.5 | 351.6 | 985.4 | 4.5 | 55 | 246000 | 1564.6 | 18.7 | 13.8 | Alaska | 227.6 | 91.5 | 55 | 1962 | 1 |
| 56 | 66.1 | 381.5 | 1213.7 | 6.5 | 56 | 248000 | 1952.8 | 14.9 | 22.2 | Alaska | 357.7 | 109.7 | 56 | 1963 | 1 |
| 104 | 466.1 | 394.0 | 2052.1 | 4.1 | 104 | 723860 | 2637.8 | 60.2 | 79.6 | Alaska | 191.7 | 610.1 | 104 | 2011 | 1 |
| 105 | 433.2 | 403.3 | 2128.0 | 4.1 | 105 | 731449 | 2739.4 | 79.7 | 86.1 | Alaska | 208.1 | 603.2 | 105 | 2012 | 1 |
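Once a selection is a boolean mask, it can also be inverted to compare the selected rows against everything else. A minimal sketch, using a made-up four-row table `toy` and a hand-written `mask` as a stand-in for the array returned by `to_mask()`:

```python
import numpy as np
import pandas as pd

# made-up table and mask; in the session above these would be df and outliers
toy = pd.DataFrame({'State': ['A', 'A', 'B', 'B'],
                    'Murder': [4.0, 15.0, 3.0, 20.0]})
mask = np.array([False, True, False, True])

selected = toy[mask]    # rows inside the selection
rest = toy[~mask]       # rows outside the selection

print("Selected %i rows" % mask.sum())
print("mean Murder inside: %.1f, outside: %.1f"
      % (selected['Murder'].mean(), rest['Murder'].mean()))
```

The mask has to have one entry per row of the DataFrame it indexes, which holds for a DataFrame built from the same Glue dataset.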