# C-sets for data analysis: graphs, relational data, and conjunctive queries¶

Evan Patterson
Topos Institute

Category Theory and Applications Meetup
January 20, 2021

## Introduction¶

Some common classes of data processing tools:

1. Data frames, aka data tables
• data.frame in R (originally S)
• pandas in Python
• DataFrames.jl and others in Julia
2. Graphs for graph analytics
3. Relational databases
• Best known through SQL databases
• Also in logic programming, e.g. Prolog

These classes are often considered to be disjoint:

• Different data structures
• Different environments: embedded/in-memory vs standalone systems

But category theory suggests a unifying framework:

• Mathematics is now pretty well understood (Spivak, Wisnesky, and others)
• In this talk, an implementation in the Julia package Catlab.jl

## Graphs and C-sets¶

The unifying framework is $\mathcal{C}$-sets, aka copresheaves or set-valued functors.

Definition. Given a small category $\mathcal{C}$, a $\mathcal{C}$-set is a functor $X: \mathcal{C} \to \mathbf{Set}$.

Interpretation (Spivak 2012, Spivak 2014):

• category $\mathcal{C}$ is a schema
• functor $\mathcal{C} \to \mathbf{Set}$ is instance data, i.e., a database on that schema

Example: The schema for graphs is

$$\mathcal{C} = \{E \rightrightarrows V\}$$
In [1]:
using Catlab, Catlab.CategoricalAlgebra

@present SchemaGraphs(FreeSchema) begin
V::Ob
E::Ob
src::Hom(E,V)
tgt::Hom(E,V)
end;


This object is a presentation of a category, which keeps track of

• generators (in this case, two object generators and two morphism generators)
• relations, i.e. equations (in this case, none)
In [2]:
generators(SchemaGraphs, :Hom)

Out[2]:
2-element Array{Catlab.Theories.FreeSchema.Hom{:generator},1}:
src
tgt

To translate the presentation into a Julia type for a data structure, use CSetType:

const Graph = CSetType(SchemaGraphs, index=[:src,:tgt])


Note: We won't run this line because it is a copy-paste from Catlab.Graphs.

In [3]:
using Catlab.Graphs

g = Graph()
g

Out[3]:
CSet with elements V = 1:4, E = 1:3
E src tgt
1 1 2
2 2 3
3 2 4
In [4]:
using Catlab.Graphics

to_graphviz(g)

Out[4]:
In [5]:
to_graphviz(g, node_labels=true, edge_labels=true)

Out[5]:

Catlab includes helper functions for graphs, but they just wrap the C-sets API.

In [6]:
h = Graph()
g == h

Out[6]:
true

### C-sets API¶

The core API for C-sets is simple:

• add_part(s)! and rem_part(s)!: add and remove parts
• subpart and set_subpart(s)! (also indexing notation): get/set subparts
• incident: inverse images of subparts, using indices if available

Illustrated below in context of Graphs API:

In [7]:
subpart(g, 1, :tgt) == tgt(g, 1) # Target vertex of edge 1.

Out[7]:
true
In [8]:
incident(g, 2, :src) # Edges having vertex 2 as source.

Out[8]:
2-element Array{Int64,1}:
2
3

Inverse images needed for efficient graph traversals:

In [9]:
outneighbors(g, 2) == [ tgt(g, e) for e in incident(g, 2, :src) ]

Out[9]:
true

### C-sets implementation¶

In Catlab, a $\mathcal{C}$-set consists of two things, tables and indices.

In [10]:
keys(tables(g))

Out[10]:
(:V, :E)

In a graph, $V$-table is trivial (zero columns) and $E$ has two columns:

In [11]:
tables(g)[:E]

Out[11]:
Table with 2 columns and 3 rows:
src  tgt
┌─────────
1 │ 1    2
2 │ 2    3
3 │ 2    4

Indices are private and are controlled by index keyword to CSetType:

const Graph = CSetType(SchemaGraphs, index=[:src,:tgt])

In [12]:
incident(g, :, :src)

Out[12]:
4-element view(::Array{Array{Int64,1},1}, :) with eltype Array{Int64,1}:
[1]
[2, 3]
[]
[]

Such indices generalize the adjacency list representation of a graph.

### Graphs and beyond¶

Following are $\mathcal{C}$-sets for suitable $\mathcal{C}$:

• "undirected" graphs: symmetric graphs and half-edge graphs
• reflexive graphs
• bipartite graphs
• wiring diagrams, directed and undirected, via Catlab.WiringDiagrams
• whole-grain Petri nets (Kock 2020), via AlgebraicPetri.jl
• coming soon: simplicial sets, via CombinatorialSpaces.jl

## Data frames and attributed C-sets¶

A data frame (or data table) with $k$ columns and $n$ rows can be regarded as a multispan in $\mathbf{Set}$ with $k$ legs and apex $[n] := \{1,2,\dots,n\}$.

In [13]:
using TikzPictures

# XXX: Hack to increase size by scaling.