For a new OU course on all matters data that we're working on at the moment, we're exploring the use of IPython notebooks and the pandas
library.
A few weeks ago, I came across another Python library called ipythonblocks [demo] that's been developed as a teaching tool around Python programming. The library provides an HTML grid containing squares whose colours are controlled by the programmer.
When I first caught sight of this library, I thought it looked ideal as a tool for helping visualise the state of pandas dataframes and the various transformations we might apply to them. So here are a few proof of concept doodles... Comments much appreciated via the partner blogpost to this gist/notebook - Visualising Pandas DataFrames With IPythonBlocks – Proof of Concept.
I'm using Anaconda as my Python/ipynb/matplotlib distro, which is a real pain to install things into. The easiest way to get the ipythonblocks library running is to download the .py file from github and install it locally...
from ipythonblocks import BlockGrid
#Here's a simple grid
grid = BlockGrid(10, 10, fill=(123, 234, 123))
grid
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b'],
                   'day': ['Mon','Tues','Fri','Thurs','Sat','Sun','Weds','Fri','Sun','Thurs','Sat','Weds','Mon','Tues'],
                   'amount': [1,2,4,2,1,1,2,4,5,3,4,2,1,3],
                   'value': [2.1,np.nan,3,4.4,2.5,1,np.nan,np.nan,8.2,3.2,4.1,np.nan,1.2,1.4]})
df
| | amount | day | group | value |
---|---|---|---|---|
0 | 1 | Mon | a | 2.1 |
1 | 2 | Tues | a | NaN |
2 | 4 | Fri | a | 3.0 |
3 | 2 | Thurs | a | 4.4 |
4 | 1 | Sat | a | 2.5 |
5 | 1 | Sun | a | 1.0 |
6 | 2 | Weds | a | NaN |
7 | 4 | Fri | b | NaN |
8 | 5 | Sun | b | 8.2 |
9 | 3 | Thurs | b | 3.2 |
10 | 4 | Sat | b | 4.1 |
11 | 2 | Weds | b | NaN |
12 | 1 | Mon | b | 1.2 |
13 | 3 | Tues | b | 1.4 |
14 rows × 4 columns
We can get the shape of a dataframe easily enough:
df.shape
(14, 4)
This gives us (rows,columns). So let's visualise that:
def pBlockGrid(df):
    (y, x) = df.shape
    return BlockGrid(x, y)
pBlockGrid(df)
We can add in some colour:
def pBlockGrid2(df, fill=(123, 234, 123), *args, **kwargs):
    (y, x) = df.shape
    b = BlockGrid(x, y, fill=fill, **kwargs)
    return b
There's also an opportunity to tweak the blocksize...
pb=pBlockGrid2(df)
pb.block_size = 4
pb
We can check the type of each column in the pandas
dataframe:
df.dtypes
amount      int64
day        object
group      object
value     float64
dtype: object
Let's start to tweak the function so we can display the columns by data type:
def pBlockGrid3(df, fill=(123, 234, 123), blocksize=10, *args, **kwargs):
    b = pBlockGrid2(df, fill=fill)
    type_colours = {'int64': (255, 0, 0), 'float64': (0, 0, 255), 'object': (128, 128, 0)}
    for i, j in enumerate(df.dtypes):
        b[:, i] = type_colours[j.name]
    b.block_size = blocksize
    return b
pBlockGrid3(df)
Some of the cells are NaN
or unknown values - let's detect them and colour them black:
def pBlock_colourNA(df, b, color_NA=(0, 0, 0), *args, **kwargs):
    (y, x) = df.shape
    for i in range(x):
        for j in range(y):
            if pd.isnull(df.iloc[j, i]):
                b[j, i] = color_NA
    return b

def pBlockGrid4(df, fill=(123, 234, 123), blocksize=10, *args, **kwargs):
    b = pBlockGrid3(df, fill=fill, blocksize=blocksize)
    b = pBlock_colourNA(df, b, *args, **kwargs)
    return b
pBlockGrid4(df,blocksize=5)
If we transpose the dataframe, we should be able to see the change in shape:
df.transpose()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
amount | 1 | 2 | 4 | 2 | 1 | 1 | 2 | 4 | 5 | 3 | 4 | 2 | 1 | 3 |
day | Mon | Tues | Fri | Thurs | Sat | Sun | Weds | Fri | Sun | Thurs | Sat | Weds | Mon | Tues |
group | a | a | a | a | a | a | a | b | b | b | b | b | b | b |
value | 2.1 | NaN | 3 | 4.4 | 2.5 | 1 | NaN | NaN | 8.2 | 3.2 | 4.1 | NaN | 1.2 | 1.4 |
4 rows × 14 columns
pBlockGrid4(df.transpose(),blocksize=5)
Let's create a couple of dataframes that we can try appending together:
df1=pd.DataFrame({"colA":['A','B','C','D'],"colB":np.arange(4)})
df1
| | colA | colB |
---|---|---|
0 | A | 0 |
1 | B | 1 |
2 | C | 2 |
3 | D | 3 |
4 rows × 2 columns
df2=pd.DataFrame({"colA":['D','E','F'],"colB":np.arange(14,17)})
df2
| | colA | colB |
---|---|---|
0 | D | 14 |
1 | E | 15 |
2 | F | 16 |
3 rows × 2 columns
This is probably a bit clunky as far as "proper" Python hackers are concerned, but this is still proof of concept...
What happens if we create a function that does a pandas .concat()
and returns the resulting dataframe along with its block representation:
def pBlockGrid5(df1, df2, join='outer', fill=(123, 234, 123), *args, **kwargs):
    df = pd.concat([df1, df2], join=join)
    b = pBlockGrid2(df, fill=fill)
    for block in b:
        if block.row < len(df1):
            block.set_colors(255, 0, 0)
        else:
            block.set_colors(0, 0, 255)
    b = pBlock_colourNA(df, b, *args, **kwargs)
    return (df, b)
The idea behind the colouring is to show which dataframe provided which rows. The function should probably be generalised to accept a list of dataframes to merge, and then colour them all differently (which will require setting up a colour ramp).
dfd,bb=pBlockGrid5(df1,df2)
bb
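As a sketch of that generalisation (the helper names here are mine, not from ipythonblocks), we could build a simple linear colour ramp and assign one colour per source dataframe; the resulting per-row colour list could then be painted onto a BlockGrid:

```python
def colour_ramp(n, start=(255, 0, 0), end=(0, 0, 255)):
    """Return n RGB tuples linearly interpolated from start to end."""
    if n == 1:
        return [start]
    return [tuple(int(s + (e - s) * i / (n - 1)) for s, e in zip(start, end))
            for i in range(n)]

def row_colours(frames):
    """One colour per source frame, repeated once per row of that frame."""
    colours = []
    for frame, c in zip(frames, colour_ramp(len(frames))):
        colours.extend([c] * len(frame))
    return colours
```

For two frames of four and three rows this yields four red tuples followed by three blue ones; with more frames, the intermediate colours fall along the ramp.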
I'm wondering now: could the pandas DataFrame() object and associated operators (such as merge(), concat(), etc.) be extended to support ipythonblocks style visualisations?!
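As a rough experiment in that direction (pure monkey-patching, definitely not a proper extension; the method name block_colours is invented here), we can bolt a method onto DataFrame that returns the grid of colours a BlockGrid would be filled with:

```python
import pandas as pd
import numpy as np

def _block_colours(self, fill=(123, 234, 123), na=(0, 0, 0)):
    """Return a rows x cols grid of RGB tuples, with NaN cells coloured black.
    A BlockGrid of the same shape could be painted directly from this."""
    return [[na if pd.isnull(v) else fill for v in row] for row in self.values]

# Monkey-patch: every DataFrame now exposes df.block_colours()
pd.DataFrame.block_colours = _block_colours
```

A proper patch library would presumably subclass or register an accessor rather than patching the class directly, but this is enough to play with.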
Speaking of merge, let's see if we can get something going there so we can inner, outer, left and right join dataframes - and visualise what happens as a result...
df3=pd.DataFrame({"colA":['F','G'],"colB":np.arange(16,18),"colC":np.arange(26,28)})
df3
| | colA | colB | colC |
---|---|---|---|
0 | F | 16 | 26 |
1 | G | 17 | 27 |
2 rows × 3 columns
dfd,bb=pBlockGrid5(df1,df3)
bb
This visualisation tells us that the top four rows include data from the first dataframe, and the rightmost column for those rows is filled with NaNs. The bottom two rows are complete and come from the second dataframe; maybe the second dataframe introduced an additional column?
How about we just look at missing data?
b=pBlockGrid2(dfd,fill=(123, 234, 123))
pBlock_colourNA(dfd,b)
The merge joins are akin to SQL joins - inner, outer, left and right.
I'm going to try to colour things so we can see which dataframe (left or right) the columns came in from...
def pBlockGrid6(df1, df2, fill=(123, 234, 123), *args, **kwargs):
    df = pd.merge(df1, df2, **kwargs)
    b = pBlockGrid2(df, fill=fill)
    for block in b:
        label = df.columns.tolist()[block.col]
        if 'on' in kwargs and label in kwargs['on']:
            block.set_colors(0, 255, 0)
        # The following assumes the left N cols in the merged table come from the left table
        elif block.col < len(df1.columns):
            block.set_colors(255, 0, 0)
        else:
            block.set_colors(0, 0, 255)
    b = pBlock_colourNA(df, b)
    return (df, b)
dfd,bb=pBlockGrid6(df2,df3,how='inner',on=['colA','colB'])
print(dfd)
bb
  colA  colB  colC
0    F    16    26

[1 rows x 3 columns]
So we read green as common join columns, blue as right-dataframe-only cells, and red as left-dataframe-only cells.
dfd,bb=pBlockGrid6(df2,df3,how='left',on=['colA'])
print(dfd)
bb
  colA  colB_x  colB_y  colC
0    D      14     NaN   NaN
1    E      15     NaN   NaN
2    F      16      16    26

[3 rows x 4 columns]
dfd,bb=pBlockGrid6(df2,df3,how='right',on=['colA'])
print(dfd)
bb
  colA  colB_x  colB_y  colC
0    F      16      16    26
1    G     NaN      17    27

[2 rows x 4 columns]
Getting the logic for colouring the outer join right defeats me for now... :-( I'm not totally convinced it'll always work. Maybe it would help my thinking if we reordered the columns so that columns contributed by the right table sit on the right of the joined dataframe; then we could reliably colour the columns we know came from the right table.
Anyway, let's give it a go...
dfd,bb=pBlockGrid6(df2,df3,how='outer',on=['colB'])
print(dfd)
bb
  colA_x  colB colA_y  colC
0      D    14    NaN   NaN
1      E    15    NaN   NaN
2      F    16      F    26
3    NaN    17      G    27

[4 rows x 4 columns]
Seems to work... erm, maybe... But I suspect things will start breaking if we allow left_on and right_on join column declarations... :-(
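That column-reordering idea could be tried with something like the sketch below. It's a sketch only, and it leans on pandas' default _x/_y column suffixes, so left_on/right_on or custom suffixes would indeed break it:

```python
import pandas as pd

def reorder_merged(df_left, merged, on):
    """Reorder a merged frame: join keys first, then columns from the left
    table, then everything else (i.e. right-table columns on the right)."""
    left_cols = [c for c in merged.columns
                 if c not in on and (c in df_left.columns or c.endswith('_x'))]
    right_cols = [c for c in merged.columns
                  if c not in on and c not in left_cols]
    return merged[list(on) + left_cols + right_cols]
```

With that ordering in place, colouring everything to the right of the left-table columns blue should be safe for the outer join case too.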
But the principle is there...
And once again I wonder - could this sort of visualisation be built into pandas as a useful teaching tool? How easy would it be to create a patch library to extend pandas with this sort of functionality?
For example, here are some thoughts on a possible pandas interface:
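Purely as a strawman (none of these names exist in pandas; they are invented here for illustration), the interface might look something like the comments below, with a stub dispatcher showing the kind of plumbing involved:

```python
import pandas as pd

# Hypothetical accessor - invented names, not a real pandas API:
#   df.blocks()                   -> a BlockGrid sized to df.shape
#   df.blocks(colour_by='dtype')  -> columns coloured by dtype
#   df.blocks(colour_by='na')     -> NaN cells highlighted
def blocks(df, colour_by=None):
    """Stub dispatcher standing in for a hypothetical df.blocks() accessor."""
    if colour_by == 'dtype':
        return [str(t) for t in df.dtypes]  # stand-in for a dtype-coloured grid
    return df.shape                         # stand-in for a plain grid
```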