Cluster morphology using networkx¶

This short example demonstrates network analysis with DICES data. I'm using NetworkX to build the network models and Matplotlib's pyplot to visualize them.

I'm by no means an expert in network tools. If you have more complex case studies you'd like to share, please get in touch!

In [ ]:

# import statements
import pickle
import pandas as pd
import os
from dicesapi import DicesAPI
from dicesapi.jupyter import NotebookPBar
from collections import Counter
from matplotlib import pyplot as plt
import networkx as nx

# initialize connection to database
api = DicesAPI(
    progress_class = NotebookPBar,
    logfile = 'dices.log',
)

Define what we want to study¶

In this case, I'd like to organize every conversation in the corpus according to which parties talk to which other parties.

Nodes in my network will be character instances, and edges will be speaker-addressee relationships. I'm not going to consider how many times they speak throughout the conversation, simply whether person A ever speaks to person B.

I'll assign numbers to the participants in the order in which they appear.

The function below produces a dictionary with three components:

key: a shorthand representation of who speaks to whom
turns: a table of all the speeches in the cluster
graph: a networkx graph representing speaker-addressee relationships

In [ ]:

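def convo_graph(cluster):
    # map each character name to a number, in order of first appearance
    persons = dict()
    
    def get_id(inst):
        name = inst.name if inst is not None else 'N/A'
        
        return persons.setdefault(name, len(persons) + 1)

    # one row per speech; source and target hold lists of participant numbers
    turns = pd.DataFrame(dict(
        id = cluster.id,
        source = [get_id(inst) for inst in (s.spkr or [None])],
        target = [get_id(inst) for inst in (s.addr or [None])],
    ) for s in cluster.getSpeeches())
    
    # expand multi-party turns into one row per speaker-addressee pair
    all_edges = turns.explode('source').explode('target')
    
    # count how often each pair occurs
    flat_with_weights = all_edges.groupby(['source','target']
                                ).size(
                                ).reset_index(name='weight'
                                ).sort_values(['source', 'target'])
    
    # build a directed graph; the weights are not passed on to the edges
    graph = nx.from_pandas_edgelist(flat_with_weights, create_using=nx.DiGraph,
                            source='source', target='target')
    
    # the key is the sorted tuple of distinct (source, target) pairs
    key = tuple((e.source, e.target) for i, e in flat_with_weights.iterrows())
    
    return dict(key=key, graph=graph, turns=turns)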
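To make the numbering scheme concrete, here is a minimal plain-Python sketch of the same idea, without pandas or NetworkX. The function name `morphology_key` and the toy exchange are my own illustrations, not part of the DICES API; it assumes each turn is just a pair of name lists.

```python
from collections import Counter

def morphology_key(turns):
    """Reduce a conversation to a sorted tuple of (speaker_id, addressee_id) pairs.

    `turns` is a list of (speakers, addressees) pairs, each a list of names.
    IDs are assigned in order of first appearance, speakers before
    addressees within a turn. (Hypothetical helper for illustration.)
    """
    ids = {}

    def get_id(name):
        # setdefault returns the existing ID or stores the next free one
        return ids.setdefault(name, len(ids) + 1)

    weights = Counter()
    for speakers, addressees in turns:
        # an empty speaker or addressee list becomes the placeholder 'N/A'
        spkr_ids = [get_id(n) for n in (speakers or ['N/A'])]
        addr_ids = [get_id(n) for n in (addressees or ['N/A'])]
        for s in spkr_ids:
            for a in addr_ids:
                weights[(s, a)] += 1

    # the key keeps only which pairs occur, not how often
    return tuple(sorted(weights)), weights

# hypothetical three-turn exchange
toy = [
    (['Achilles'], ['Agamemnon']),
    (['Agamemnon'], ['Achilles']),
    (['Achilles'], ['Agamemnon']),
]
key, weights = morphology_key(toy)
print(key)      # ((1, 2), (2, 1))
print(weights)  # pair (1, 2) occurs twice, pair (2, 1) once
```

Any two-person exchange in which both parties speak produces this same key, whoever the characters are, which is what lets us group conversations by shape.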

Download all the speech clusters in the Iliad¶

In [ ]:

clusters = api.getClusters(work_title='Iliad')
print(len(clusters), 'clusters')

Test out our model¶

Let's try building a couple of graphs to see what they're like. I'm starting with item 10, the eleventh cluster. Try picking other numbers to compare the results.

In [ ]:

cl = clusters[10]
print(cl)

pd.DataFrame(dict(
    cluster = cl.id,
    speech = s.id,
    work = f'{s.author.name} {s.work.title}',
    first = s.l_fi,
    last = s.l_la,
    spkr = s.getSpkrString(),
    addr = s.getAddrString(),
) for s in cl.getSpeeches())

Run our custom function to produce key, turns, and graph as a dict.

In [ ]:

bundle = convo_graph(cl)

Let's start with the turns, since that's the easiest for us to interpret. The speeches are still in order, but the names have been replaced by numbers.

In [ ]:

bundle['turns']

The key is a flattened form of this table: turns with the same speaker-addressee relation are combined into a single pair.

In [ ]:

bundle['key']
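The flattening step can be sketched in plain Python. This is only an illustration under my own assumptions: the `turns` list below is hypothetical data shaped like the source and target columns of the table, and `itertools.product` stands in for the two `explode` calls.

```python
from itertools import product

# hypothetical turns table: each row has list-valued source and target
turns = [
    {'source': [1], 'target': [2, 3]},   # one speaker addresses two people
    {'source': [2], 'target': [1]},
    {'source': [1], 'target': [2, 3]},   # repeats the first relation
]

# expand every row into all (source, target) pairs,
# then drop duplicates and sort
edges = sorted({pair for row in turns
                for pair in product(row['source'], row['target'])})
key = tuple(edges)
print(key)  # ((1, 2), (1, 3), (2, 1))
```

Three turns collapse to three distinct pairs here because the first and third turns describe the same relation.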

Build graphs for each cluster¶

In [ ]:

pbar = NotebookPBar(max=len(clusters))
graphs = []

for i, cl in enumerate(clusters):
    graphs.append(convo_graph(cl))
    pbar.update(i)

Organize the cluster graphs according to key¶

Here we create two dictionaries. One stores all the graphs based on key, the flat representation of the map. The other stores all the turn-taking tables in the same way.

In [ ]:

graph_index = {}
turns_index = {}

for graph in graphs:
    k = graph['key']
    g = graph['graph']
    m = graph['turns']
    
    if k not in graph_index:
        graph_index[k] = []
    graph_index[k].append(g)        
        
    if k not in turns_index:
        turns_index[k] = []
    turns_index[k].append(m)
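The same grouping can be written a little more compactly with `collections.defaultdict`, which creates the empty list on first access. This is a sketch on hypothetical bundles (the strings stand in for real graphs and tables), not a change to the cell above.

```python
from collections import defaultdict

# hypothetical bundles shaped like the output of convo_graph()
bundles = [
    {'key': ((1, 2),), 'graph': 'g0', 'turns': 't0'},
    {'key': ((1, 2), (2, 1)), 'graph': 'g1', 'turns': 't1'},
    {'key': ((1, 2),), 'graph': 'g2', 'turns': 't2'},
]

graph_index = defaultdict(list)
turns_index = defaultdict(list)

for b in bundles:
    # a missing key is initialized to [] automatically
    graph_index[b['key']].append(b['graph'])
    turns_index[b['key']].append(b['turns'])

print(dict(graph_index))
```

One thing to keep in mind with this variant: looking up a key that isn't present silently adds an empty list to the index, so use `in` or `.get()` when searching.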

Count conversations according to key¶

Make a quick counter of how many graphs are organized under each key, so we can see which morphologies are most common.

In [ ]:

key_count = Counter([g['key'] for g in graphs])

In [ ]:

key_count.most_common()
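Raw counts are easier to compare as shares of the whole. Here is a small sketch on made-up keys; the `keys` list is hypothetical, and in the notebook you would use the `key_count` built above instead.

```python
from collections import Counter

# hypothetical keys, one per conversation
keys = [
    ((1, 2),),
    ((1, 2), (2, 1)),
    ((1, 2),),
    ((1, 2), (2, 1)),
    ((1, 2), (2, 1)),
    ((1, 2), (1, 3)),
]

key_count = Counter(keys)
total = sum(key_count.values())

# share of the conversations covered by each morphology, most common first
shares = [(k, n, n / total) for k, n in key_count.most_common()]
for k, n, share in shares:
    print(f'{n:3d}  {share:5.1%}  {k}')
```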

Plot the most common morphologies¶

We use the counter to take each successive map in order, from most common down. Then we check the graph_index for an example of the graph representing that morphology and plot it. The final line below also saves a copy of the image.

In [ ]:

fig, ax = plt.subplots(3, 4, figsize=(22,12))
plt.subplots_adjust(wspace=1, hspace=.5)

for i, rec in enumerate(key_count.most_common(12)):
    key, count = rec
    # fill the 3x4 grid row by row
    row = i // 4
    col = i % 4
    
    plt.sca(ax[row, col])
    g = graph_index[key][0]
    nx.draw_spring(g, node_color='pink', width=4, with_labels=True)
    ax[row, col].set_title(f'n={count}', fontsize=18)

plt.savefig('foo.pdf')
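The panels fill the grid row by row, which comes down to a little index arithmetic. A minimal sketch using the built-in `divmod`; the names `n_rows` and `n_cols` are ones I've introduced for illustration.

```python
n_rows, n_cols = 3, 4

# divmod(i, n_cols) returns (i // n_cols, i % n_cols), i.e. (row, column)
positions = [divmod(i, n_cols) for i in range(n_rows * n_cols)]

for i, (row, col) in enumerate(positions):
    print(i, row, col)
```

The fifth panel (index 4) starts the second row, and the twelfth (index 11) lands in the last cell of the grid.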

Search for speeches by morphology¶

We can also go the other direction: specify a key and look for examples of it in the corpus by using the indices we built.

Define the relationship we're looking for¶

In [ ]:

key = ((1, 2), (3, 1))
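When writing a key by hand, keep in mind that parentheses around a single number do not make a tuple in Python; only the comma does. The quick check below shows that the heavily parenthesized spelling and the plain one are the same key, and how a one-pair key has to be written.

```python
# (1) is just the integer 1, so both spellings build the same key
nested = (((1), (2)), ((3), (1)))
plain = ((1, 2), (3, 1))
print(nested == plain)        # True
print(type((1)).__name__)     # int

# a key with a single pair needs a trailing comma to stay a tuple of pairs
one_pair = ((1, 2),)
not_a_key = ((1, 2))          # this is just the pair (1, 2)
print(len(one_pair), len(not_a_key))  # 1 2
```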

Visualize it¶

In [ ]:

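# look at the first graph stored under this key
graph = graph_index[key][0]

fig, ax = plt.subplots(figsize=(8,6))
nx.draw(graph, ax=ax, node_color='pink', width=2, with_labels=True)
# n is the number of conversations that share this morphology
ax.set_title(f'n={len(graph_index[key])}')
fig.savefig('chain.pdf')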

List all matching conversations¶

In [ ]:

cl_ids = [turns.loc[0,'id'] for turns in turns_index[key]]

for cl in clusters.filterIDs(cl_ids):
    display(
        pd.DataFrame(dict(
            author = s.author.name,
            work = s.work.title,
            lines = s.l_range,
            speaker = ', '.join([i.name for i in s.spkr]),
            addressee = ', '.join([i.name for i in s.addr]),
        ) for s in cl.getSpeeches())
    )