This notebooks records the time and memory (both peak and long-term) required to construct a StellarGraph object for several datasets.
This notebook is aimed at helping contributors to the StellarGraph library itself understand how their changes affect the resource usage of the StellarGraph
object.
Various measures of resource usage for several "real world" graphs of various sizes are recorded: the memory usage of the StellarGraph object itself, and the time and peak memory required for StellarGraph construction (both absolute, and additional compared to the raw input data). These are recorded both with explicit nodes (and node features if they exist), and implicit/inferred nodes.
The memory usage is recorded end-to-end. That is, the recording starts from data on disk and continues until the StellarGraph
object has been constructed and other data has been cleaned up. This is important for accurately recording the total memory usage, as NumPy arrays can often share data with existing arrays in memory and so retroactive or partial (starting from data in memory) analysis can miss significant amounts of data. The parsing code in stellargraph.datasets
doesn't allow determining the memory usage of the intermediate nodes and edges input to the StellarGraph
constructor, and so cannot be used here.
# install StellarGraph if running on Google Colab.
# NOTE: `%pip` is IPython magic and only runs inside a notebook; the
# `google.colab` module is only importable within a Colab runtime.
import sys
if 'google.colab' in sys.modules:
%pip install -q stellargraph[demos]==1.1.0b
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg

try:
    sg.utils.validate_notebook_version("1.1.0b")
except AttributeError:
    # older StellarGraph releases don't have validate_notebook_version at all,
    # so an AttributeError here means the installed version predates 1.1;
    # re-raise as a clear, actionable error (and drop the confusing context)
    raise ValueError(
        f"This notebook requires StellarGraph version 1.1.0b, but a different version {sg.__version__} is installed. Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None
import stellargraph as sg
import pandas as pd
import numpy as np
import gc
import json
import os
import timeit
import tempfile
import tracemalloc
The original GraphSAGE paper evaluated on a reddit dataset, available at http://snap.stanford.edu/graphsage/#datasets. This dataset is large (1.3GB compressed) and so there is no automatic download support for it. The following reddit_path variable controls whether and how the reddit dataset is included: set it to None to skip the reddit benchmarks, or to the path of a directory containing the manually downloaded dataset files.
reddit_path = os.path.expanduser("~/data/reddit")
cora = sg.datasets.Cora()
cora.download()
# paths to the raw Cora files: the citation edge list and the node content
cora_cites_path = os.path.join(cora.data_directory, "cora.cites")
cora_content_path = os.path.join(cora.data_directory, "cora.content")
# column 0 is the integer node ID; columns 1..1433 are word features, read as
# float32 to halve the memory of the default float64
cora_dtypes = {0: int, **{i: np.float32 for i in range(1, 1433 + 1)}}
def cora_parts(include_nodes):
    """Load the Cora data as a (nodes, edges, extra_kwargs) triple.

    When ``include_nodes`` is falsy, ``nodes`` is ``None`` so the StellarGraph
    constructor will infer the node set from the edge list instead.
    """
    node_df = None
    if include_nodes:
        # column 0 (the node ID) becomes the index; usecols keeps only
        # columns 0..1433, excluding anything past the feature columns
        node_df = pd.read_csv(
            cora_content_path,
            header=None,
            sep="\t",
            index_col=0,
            usecols=range(0, 1433 + 1),
            dtype=cora_dtypes,
            na_filter=False,
        )

    edge_df = pd.read_csv(
        cora_cites_path,
        header=None,
        sep="\t",
        names=["source", "target"],
        dtype=int,
        na_filter=False,
    )
    return node_df, edge_df, {}
blogcatalog3 = sg.datasets.BlogCatalog3()
blogcatalog3.download()
# paths to the raw CSVs: user-user friendships, user-group memberships,
# the list of group IDs and the list of user IDs
blogcatalog3_edges = os.path.join(blogcatalog3.data_directory, "edges.csv")
blogcatalog3_group_edges = os.path.join(blogcatalog3.data_directory, "group-edges.csv")
blogcatalog3_groups = os.path.join(blogcatalog3.data_directory, "groups.csv")
blogcatalog3_nodes = os.path.join(blogcatalog3.data_directory, "nodes.csv")
def blogcatalog3_parts(include_nodes):
    """Load BlogCatalog3 as a (nodes, edges, extra_kwargs) triple.

    Group IDs are negated so they can never collide with user IDs, and the two
    kinds of edges are returned as separate typed DataFrames.
    """
    if include_nodes:
        user_ids = pd.read_csv(blogcatalog3_nodes, header=None)[0]
        group_ids = pd.read_csv(blogcatalog3_groups, header=None)[0]
        nodes = {
            "user": pd.DataFrame(index=user_ids),
            # negated IDs keep group nodes disjoint from user nodes
            "group": pd.DataFrame(index=-group_ids),
        }
    else:
        nodes = None

    friendships = pd.read_csv(
        blogcatalog3_edges, header=None, names=["source", "target"]
    )
    memberships = pd.read_csv(
        blogcatalog3_group_edges, header=None, names=["source", "target"]
    )
    # targets must match the negated group-node IDs above
    memberships["target"] *= -1
    # give membership edges IDs that continue on from the friendship edges
    offset = len(friendships)
    memberships.index = range(offset, offset + len(memberships))

    return nodes, {"friend": friendships, "belongs": memberships}, {}
fb15k = sg.datasets.FB15k()
fb15k.download()
# FB15k ships as three tab-separated triple files: train, test and valid
fb15k_files = [
    os.path.join(fb15k.data_directory, f"freebase_mtr100_mte100-{x}.txt")
    for x in ["train", "test", "valid"]
]
def fb15k_parts(include_nodes, usecols=None):
    """Load the FB15k triples as a (nodes, edges, extra_kwargs) triple.

    ``usecols`` is forwarded to ``pd.read_csv`` so callers can drop the
    relation ("label") column. The returned kwargs name the edge-type column.
    """
    frames = []
    for path in fb15k_files:
        frames.append(
            pd.read_csv(
                path,
                header=None,
                names=["source", "label", "target"],
                sep="\t",
                dtype=str,
                na_filter=False,
                usecols=usecols,
            )
        )
    edges = pd.concat(frames, ignore_index=True)

    nodes = None
    if include_nodes:
        # infer the set of nodes manually, in a memory-minimal way
        node_ids = set(edges.source)
        node_ids.update(edges.target)
        nodes = pd.DataFrame(index=node_ids)

    return nodes, edges, {"edge_type_column": "label"}
def fb15k_no_edge_types_parts(include_nodes):
    """FB15k variant that drops the relation column, leaving untyped edges."""
    nodes, edges, _unused = fb15k_parts(include_nodes, usecols=["source", "target"])
    return nodes, edges, {}
As discussed above, the reddit dataset is large and optional. It is also slow to parse, as the graph structure is a huge JSON file. Thus, we prepare the dataset by converting that JSON file into a NumPy edge list array of shape (num_edges, 2). This is significantly faster to load from disk.
%%time
# if requested, prepare the reddit dataset by saving the slow-to-read JSON to a temporary .npy file
if reddit_path is not None:
    reddit_graph_path = os.path.join(reddit_path, "reddit-G.json")
    reddit_feats_path = os.path.join(reddit_path, "reddit-feats.npy")
    with open(reddit_graph_path) as f:
        reddit_g = json.load(f)
    # flatten the JSON adjacency ("links") into a (num_edges, 2) array
    reddit_numpy_edges = np.array([[x["source"], x["target"]] for x in reddit_g["links"]])
    # keep the array in a temporary .npy file so reddit_parts can reload it
    # quickly (and repeatedly) from disk
    reddit_edges_file = tempfile.NamedTemporaryFile(suffix=".npy")
    np.save(reddit_edges_file, reddit_numpy_edges)
CPU times: user 16.6 s, sys: 1.75 s, total: 18.4 s Wall time: 18.4 s
def reddit_parts(include_nodes):
    """Load the prepared reddit data as a (nodes, edges, extra_kwargs) triple."""
    feature_df = None
    if include_nodes:
        feature_df = pd.DataFrame(np.load(reddit_feats_path))

    # reload the edge array written by the preparation cell above
    edge_array = np.load(reddit_edges_file.name)
    edge_df = pd.DataFrame(edge_array, columns=["source", "target"])
    return feature_df, edge_df, {}
# map each dataset name to the function that loads its (nodes, edges, kwargs)
datasets = {
    "Cora": cora_parts,
    "BlogCatalog3": blogcatalog3_parts,
    "FB15k (no edge types)": fb15k_no_edge_types_parts,
    "FB15k": fb15k_parts,
}
# reddit is optional; only benchmark it when a local copy was configured
if reddit_path is not None:
    datasets["reddit"] = reddit_parts
def mem_snapshot_diff(after, before):
    """Total memory difference between two tracemalloc.snapshot objects"""
    per_line_stats = after.compare_to(before, "lineno")
    total = 0
    for stat in per_line_stats:
        total += stat.size_diff
    return total
# names of columns computed by the measurement code
def measurement_columns(title):
    """Return the (title, metric) column pairs for one measurement group."""
    metric_names = (
        "time",
        "memory (graph)",
        "memory (graph, not shared with data)",
        "peak memory (graph)",
        "peak memory (graph, ignoring data)",
        "memory (data)",
        "peak memory (data)",
    )
    return [(title, metric) for metric in metric_names]
# full column index for the results table: basic graph statistics plus one
# group of measurement columns per node-loading strategy
columns = pd.MultiIndex.from_tuples(
    [
        ("graph", "nodes"),
        ("graph", "node feat size"),
        ("graph", "edges"),
        *measurement_columns("explicit nodes"),
        *measurement_columns("inferred nodes (no features)"),
    ]
)
def measure_time(f, include_nodes):
    """Seconds taken to build a StellarGraph from the parts produced by ``f``.

    Only the constructor call is timed; parsing the data via ``f`` is not.
    """
    nodes, edges, args = f(include_nodes)
    began = timeit.default_timer()
    sg.StellarGraph(nodes, edges, **args)
    return timeit.default_timer() - began
def measure_memory(f, include_nodes):
    """
    Measure exactly what it takes to load the data.
    - the size of the original edge data (as a baseline)
    - the size of the final graph
    - the peak memory use of both
    This uses a similar technique to the 'allocation_benchmark' fixture in tests/test_utils/alloc.py.

    Returns a 7-tuple: (graph, graph memory, graph memory not shared with the
    input data, peak memory overall, peak memory minus the input data size,
    input data memory, peak memory while loading the input data).
    """
    gc.collect()
    # ensure we're measuring the worst-case peak, when no GC happens
    gc.disable()
    tracemalloc.start()
    snapshot_start = tracemalloc.take_snapshot()
    nodes, edges, args = f(include_nodes)
    gc.collect()
    # get_traced_memory returns (current, peak); peak is since tracemalloc.start()
    _, data_memory_peak = tracemalloc.get_traced_memory()
    snapshot_data = tracemalloc.take_snapshot()
    if include_nodes:
        # the assertion message is the loader itself, to identify which
        # dataset function returned an unexpected nodes value
        assert nodes is not None, f
        sg_g = sg.StellarGraph(nodes, edges, **args)
    else:
        assert nodes is None, f
        sg_g = sg.StellarGraph(edges=edges, **args)
    gc.collect()
    snapshot_graph = tracemalloc.take_snapshot()
    # clean up the input data and anything else leftover, so that the snapshot
    # includes only the long-lasting data: the StellarGraph.
    del edges
    del nodes
    del args
    gc.collect()
    # peak since start(), i.e. over both the data-loading and graph-construction
    # phases (no reset_peak is done in between)
    _, graph_memory_peak = tracemalloc.get_traced_memory()
    snapshot_end = tracemalloc.take_snapshot()
    tracemalloc.stop()
    gc.enable()
    # long-lived sizes, from pairwise snapshot diffs
    data_memory = mem_snapshot_diff(snapshot_data, snapshot_start)
    graph_memory = mem_snapshot_diff(snapshot_end, snapshot_start)
    graph_over_data_memory = mem_snapshot_diff(snapshot_graph, snapshot_data)
    return (
        sg_g,
        graph_memory,
        graph_over_data_memory,
        graph_memory_peak,
        graph_memory_peak - data_memory,
        data_memory,
        data_memory_peak,
    )
def measure(f):
    """Run the full timing and memory benchmark for one dataset loader ``f``.

    Returns one row of the results table, matching ``columns`` above.
    """
    explicit_time = measure_time(f, include_nodes=True)
    inferred_time = measure_time(f, include_nodes=False)
    graph, *explicit_mem = measure_memory(f, include_nodes=True)
    _, *inferred_mem = measure_memory(f, include_nodes=False)

    # for a graph with a single node type, collapse the feature-size mapping
    # to one number; unique_node_type raises ValueError for heterogeneous
    # graphs, which keep the per-type dict
    features = graph.node_feature_sizes()
    try:
        features = features[graph.unique_node_type()]
    except ValueError:
        pass

    return [
        graph.number_of_nodes(),
        features,
        graph.number_of_edges(),
        explicit_time,
        *explicit_mem,
        inferred_time,
        *inferred_mem,
    ]
%%time
# run every benchmark; this dominates the notebook's runtime
recorded = [measure(f) for f in datasets.values()]
CPU times: user 24.1 s, sys: 4.75 s, total: 28.8 s Wall time: 29 s
# assemble the raw measurements (seconds and bytes) into a labelled table
raw = pd.DataFrame(recorded, columns=columns, index=datasets.keys())
raw
graph | explicit nodes | inferred nodes (no features) | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
nodes | node feat size | edges | time | memory (graph) | memory (graph, not shared with data) | peak memory (graph) | peak memory (graph, ignoring data) | memory (data) | peak memory (data) | time | memory (graph) | memory (graph, not shared with data) | peak memory (graph) | peak memory (graph, ignoring data) | memory (data) | peak memory (data) | |
Cora | 2708 | 1433 | 5429 | 0.027746 | 15607738 | 15587033 | 46764081 | 31079432 | 15684649 | 31995281 | 0.003061 | 94074 | 96441 | 374259 | 284126 | 90133 | 197529 |
BlogCatalog3 | 10351 | {'user': 0, 'group': 0} | 348459 | 0.037114 | 6069823 | 8864526 | 21775036 | 16105955 | 5669081 | 10805733 | 0.029837 | 6068735 | 8861638 | 21689836 | 16106851 | 5582985 | 10711633 |
FB15k (no edge types) | 14951 | 0 | 592213 | 0.117117 | 6398526 | 5412944 | 33209916 | 17529813 | 15680103 | 25831842 | 0.187143 | 6398542 | 5535640 | 34645128 | 19090289 | 15554839 | 25050795 |
FB15k | 14951 | 0 | 592213 | 0.653126 | 12071938 | 15677213 | 60081827 | 39179000 | 20902827 | 35792614 | 0.730613 | 12071938 | 15799909 | 60082291 | 39304744 | 20777547 | 35011567 |
reddit | 232965 | 602 | 11606919 | 4.354689 | 712112941 | 712119784 | 3551635605 | 2243949792 | 1307685813 | 1307694920 | 0.688833 | 153908015 | 153909618 | 622398852 | 436684187 | 185714665 | 185723196 |
This shows the results in a prettier way, such as memory in MB instead of bytes.
# select every column whose metric name mentions "memory", and rescale just
# those measurements from bytes to megabytes, rounded to 3 decimal places
bytes_per_mb = 10 ** 6
is_memory = ["memory" in metric for _, metric in raw.columns]
mem_columns = raw.columns[is_memory]
memory_mb = raw.copy()
memory_mb[mem_columns] = (memory_mb[mem_columns] / bytes_per_mb).round(3)
memory_mb
graph | explicit nodes | inferred nodes (no features) | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
nodes | node feat size | edges | time | memory (graph) | memory (graph, not shared with data) | peak memory (graph) | peak memory (graph, ignoring data) | memory (data) | peak memory (data) | time | memory (graph) | memory (graph, not shared with data) | peak memory (graph) | peak memory (graph, ignoring data) | memory (data) | peak memory (data) | |
Cora | 2708 | 1433 | 5429 | 0.027746 | 15.608 | 15.587 | 46.764 | 31.079 | 15.685 | 31.995 | 0.003061 | 0.094 | 0.096 | 0.374 | 0.284 | 0.090 | 0.198 |
BlogCatalog3 | 10351 | {'user': 0, 'group': 0} | 348459 | 0.037114 | 6.070 | 8.865 | 21.775 | 16.106 | 5.669 | 10.806 | 0.029837 | 6.069 | 8.862 | 21.690 | 16.107 | 5.583 | 10.712 |
FB15k (no edge types) | 14951 | 0 | 592213 | 0.117117 | 6.399 | 5.413 | 33.210 | 17.530 | 15.680 | 25.832 | 0.187143 | 6.399 | 5.536 | 34.645 | 19.090 | 15.555 | 25.051 |
FB15k | 14951 | 0 | 592213 | 0.653126 | 12.072 | 15.677 | 60.082 | 39.179 | 20.903 | 35.793 | 0.730613 | 12.072 | 15.800 | 60.082 | 39.305 | 20.778 | 35.012 |
reddit | 232965 | 602 | 11606919 | 4.354689 | 712.113 | 712.120 | 3551.636 | 2243.950 | 1307.686 | 1307.695 | 0.688833 | 153.908 | 153.910 | 622.399 | 436.684 | 185.715 | 185.723 |