Visualization of the correlation network

Mirador is a software tool for exploratory analysis of complex datasets. It has been developed as a collaboration between Fathom Information Design and the Sabeti Lab at Harvard University.

Mirador lets us inspect different kinds of plots (scatter plots, histograms, eikosograms) for any pair of variables in the dataset, and rank variables according to their correlation score with a variable of interest. However, it does not offer an option to calculate the correlations between all the (selected) variables. That calculation is needed to build a visual representation of the correlation matrix of the system, which can give an overall picture of the dependency structure in the data. To do this, we can export the data for the variables we are interested in, compute the correlation matrix in a Python script using Miralib (the library that provides all the statistical calculations in Mirador), and then open the correlation matrix in Gephi or any other software for visualizing network data.

Exporting data from Mirador

We will work with the Diabetes 1999-2008 dataset included in the built-in Mirador examples. If we search for the row variable "time_in_hospital", and sort the columns by their correlation with this variable (by clicking on the variable name), we should get the following result in Mirador:

If we open the profile view, we can select the variables we want to include in our correlation matrix by dragging the tooltips at the bottom of the plot:

Now we can export these variables and all the corresponding data by clicking the "Export selection" button in the top right corner of the profile window.

Calculating the correlation matrix

Once we export the profile data, we can run any further calculations on it with other statistical analysis tools. In our case, we will start with a Python script that uses the same correlation calculation as Mirador to generate the correlation matrix: an N×N matrix whose (i, j) element contains the correlation score between variables i and j in our data. To run this calculation, we use Miralib to load the exported profile dataset, iterate over all variable pairs, and calculate the correlation score of each pair. Since Miralib is written in Java, the best way to access it from Python right now is through Jython.

The scripts repository contains all you need to run your own scripts through Jython. You can do so without installing Jython, since all you need are the jythonlib.jar, miralib.jar and commons-math3-3.2.jar packages. The Miralib API is being documented on this page.

The following Python code will load the exported profile data in ./diabetes/export, and save the resulting correlation matrix to the ./diabetes/network/network.csv file:

import codecs
from miralib.utils import Log
from miralib.utils import Preferences
from miralib.utils import Project
from miralib.data import DataRanges
from miralib.data import DataSet
from miralib.shannon import Similarity

Log.init()

inputFile = "./diabetes/export/profile-config.mira"
outputFile = "./diabetes/network/network.csv"

# Load the exported profile as a Miralib project and dataset.
preferences = Preferences()
project = Project(inputFile, preferences)
dataset = DataSet(project)
ranges = DataRanges()

count = dataset.getVariableCount()

print("Calculating correlation matrix:")
# The diagonal stays at 0: a variable is not scored against itself.
scores = [[0 for _ in range(count)] for _ in range(count)]
for i in range(count):
    print("  Row " + str(i + 1) + "/" + str(count) + "...")
    vi = dataset.getVariable(i)
    # The matrix is symmetric, so only the upper triangle is computed.
    for j in range(i + 1, count):
        vj = dataset.getVariable(j)
        dslice = dataset.getSlice(vi, vj, ranges)
        score = 0
        # Pairs with too many missing values keep a score of 0.
        if dslice.missing < project.missingThreshold():
            score = Similarity.calculate(dslice, project.pvalue(), project)
        scores[i][j] = scores[j][i] = score
print("Done.")

def quoted(name):
    # Swap double quotes for single quotes so the name can be wrapped in quotes.
    return '"' + name.replace('"', "'") + '"'

names = [quoted(dataset.getVariable(i).getAlias()) for i in range(count)]

# Header row: an empty first cell followed by the variable names.
output = [";" + ";".join(names)]
# One row per variable: its name followed by its scores.
for i in range(count):
    output.append(names[i] + ";" + ";".join([str(s) for s in scores[i]]))

out = codecs.open(outputFile, "w", "utf-8")
try:
    for line in output:
        out.write(line + "\n")
finally:
    out.close()

print("Saved to " + outputFile)

To run this Python script with the stand-alone Jython package we provide, you would run the following command from the terminal:

java -jar jythonlib.jar network.py

assuming that the Python code is saved in a file named network.py. Once we run this script, we are ready to import the data into Gephi.
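
Before importing, it can be useful to check that the file has the expected layout. The following is a minimal sketch in plain Python 3 (it does not need Jython or Miralib) that parses a semicolon-delimited matrix with an empty first header cell, as written by the script above, and checks that the matrix is symmetric. The function names and the sample values are made up for illustration; to check the real file, we would pass an open handle to ./diabetes/network/network.csv instead of the sample.

```python
import csv
import io

def read_matrix(handle):
    """Parse a semicolon-delimited matrix with a header row and a label column."""
    rows = [row for row in csv.reader(handle, delimiter=";", quotechar='"') if row]
    names = rows[0][1:]  # the first header cell is empty
    matrix = [[float(cell) for cell in row[1:]] for row in rows[1:]]
    return names, matrix

def is_symmetric(matrix, tol=1e-9):
    """True if the matrix is square and matrix[i][j] matches matrix[j][i]."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))

# Made-up sample with the same layout as network.csv.
sample = io.StringIO(
    ';"a";"b";"c"\n'
    '"a";0;0.5;0.1\n'
    '"b";0.5;0;0.3\n'
    '"c";0.1;0.3;0\n'
)
names, matrix = read_matrix(sample)
symmetric = is_symmetric(matrix)
```

A symmetric matrix is what we expect here, since the script assigns each score to both the (i, j) and (j, i) elements.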

Importing data into Gephi

Gephi is a network visualization and analysis tool. It offers many layouts to represent network data (force atlas, circular, etc.), as well as clustering and other analysis options. Our script saves the correlation matrix in CSV format, which can be imported into Gephi as an undirected, weighted graph:

Once we have imported the correlation matrix into Gephi, we can use its interface to calculate the network modularity, the node centrality, and filter out edges with low weight:
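
The edge filtering can also be sketched outside Gephi, which helps when we want to experiment with different thresholds before importing. The following plain Python 3 sketch turns a correlation matrix into an undirected edge list, keeps only the edges at or above a minimum weight, and sums the weights of the edges at each node (the weighted degree, a very simple centrality measure that is not necessarily what Gephi reports). The function names, the threshold of 0.2 and the sample values are assumptions made for illustration, and the Source, Target and Weight column names follow the edge table layout that we assume Gephi's spreadsheet import recognizes.

```python
import csv
import io

def matrix_to_edges(names, matrix, min_weight):
    """Return (source, target, weight) for each pair at or above min_weight."""
    edges = []
    for i in range(len(names)):
        # Upper triangle only: the graph is undirected.
        for j in range(i + 1, len(names)):
            weight = matrix[i][j]
            if weight >= min_weight:
                edges.append((names[i], names[j], weight))
    return edges

def weighted_degree(names, edges):
    """Sum the weights of the edges that touch each node."""
    totals = dict((name, 0.0) for name in names)
    for source, target, weight in edges:
        totals[source] += weight
        totals[target] += weight
    return totals

# Made-up sample matrix, not taken from the diabetes dataset.
names = ["a", "b", "c"]
matrix = [[0, 0.5, 0.1],
          [0.5, 0, 0.3],
          [0.1, 0.3, 0]]

edges = matrix_to_edges(names, matrix, min_weight=0.2)
degrees = weighted_degree(names, edges)

# Write the filtered edges as a comma-separated edge table.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
writer.writerows(edges)
edge_csv = buffer.getvalue()
```

Raising min_weight removes more of the weak edges, which usually makes the group structure easier to see in the layout, at the cost of possibly disconnecting some variables from the rest of the network.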

For our correlation matrix data, we can generate a force-directed layout that shows two major groups of variables. The first group corresponds to the diagnosis procedures, with admission source ID as the central node. The second contains the actual laboratory variables, which do not connect directly to time in hospital, but to readmission instead.

Many other representations are also possible, for example the circular layout, which in this case does not seem to be as informative as the force-directed layout: