Loading Graphs

In addition to the NetworkX-compatible APIs, GraphScope provides a set of Python APIs designed for loading, analyzing, and querying very large graphs.

GraphScope models graph data as property graphs, in which vertices and edges are labeled and may carry properties. In this tutorial, we show how GraphScope loads graphs, including:

  • How to define the schema of a property graph;
  • Simplified forms to load a graph;
  • Loading graphs from various locations;
  • Serializing/Deserializing a graph to/from disk.

Defining the Schema

First, we launch a session and import necessary packages.

In [ ]:
import os
import graphscope
from graphscope.framework.graph import Graph
from graphscope.framework.loader import Loader
import vineyard

k8s_volumes = {
    "data": {
        "type": "hostPath",
        "field": {
          "path": "/testingdata",  # Path in host
          "type": "Directory"
        },
        "mounts": {
          "mountPath": "/home/jovyan/datasets",  # Path in pods
          "readOnly": True
        }
    }
}

graphscope.set_option(show_log=True)  # enable logging
graphscope.set_option(initializing_interactive_engine=False)
sess = graphscope.session(k8s_volumes=k8s_volumes, k8s_etcd_mem='512Mi')  # create a session

We use the class Graph to load a graph. During construction, Graph acts like a builder, letting users build the graph iteratively: add some vertices, then some edges, and so on.

These are the methods we will use to build a graph.

def add_vertices(self, vertices, label="_", properties=None, vid_field=0):
    pass

def add_edges(self, edges, label="_", properties=None, src_label=None, dst_label=None, src_field=0, dst_field=1):
    pass

Next, we will walk through those methods and demonstrate the usage.

Create a graph instance

We use the method g() defined in Session to create a graph.

In [ ]:
graph = sess.g()

Add vertices to the graph

We first add a kind of vertices to the graph.

The parameters are:

  • A Loader for the data source, which can be a file location, a numpy ndarray, etc.
  • The label name of the vertices.
  • A list of properties; the names should be consistent with the header row of the data source file or the DataFrame. This list is optional; by default, all columns except the vertex ID column are added as properties.
  • The column used as the vertex ID. The values in this column of the data source will be used for src/dst when loading edges.

Let's see an example.

In [ ]:
graph = sess.g()
graph = graph.add_vertices(
    # source file for vertices labeled as person;    
    Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    "person",
    # columns loaded as property
    ["firstName", "lastName"],
    # The column used as vertex ID
    "id"
)

Here the Loader is an object that describes how to load the data, including its location (e.g., HDFS, local file system, Amazon S3, or Aliyun OSS), the column delimiter, and some other metadata. In this case, the Loader points to a file location in the mounted volume.
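To make the mapping from file columns to the graph schema concrete, here is a small runnable sketch. The pipe-delimited header row and the sample rows below are illustrative stand-ins for person_0_0.csv, not the full LDBC schema:

```python
import io
import pandas as pd

# A tiny stand-in for person_0_0.csv: pipe-delimited with a header row.
# Column names and values here are illustrative.
csv_text = "id|firstName|lastName\n933|Mahinda|Perera\n1129|Carmen|Lepland\n"
df = pd.read_csv(io.StringIO(csv_text), sep="|")

# vid_field="id" picks this column as the vertex ID ...
vertex_ids = df["id"].tolist()
# ... and the remaining columns become the vertex properties.
property_columns = [c for c in df.columns if c != "id"]
print(vertex_ids, property_columns)
```

This mirrors what add_vertices does with vid_field="id" and the default properties: the id column supplies vertex IDs, and firstName/lastName become properties.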

We can also omit certain configurations for vertices.

  • If the Loader contains only a URL, we can omit the Loader class and just pass the URL.

  • properties can be an empty list, meaning all columns are selected as properties; this is the default.

  • vid_field can be given as a column index; the default is 0 (the first column).

In [ ]:
graph = sess.g()
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"), "person")

  • The label can be omitted if there is only one vertex label.

In the simplest case, the configuration contains only a loader. Then the first column is used as the vertex ID, the remaining columns are used as properties, and the label name defaults to '_'.

In [ ]:
graph = sess.g()
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"))

Add edges to the graph

Then we add a kind of edges to the graph.

The parameters are:

  • A Loader for the data source; it tells GraphScope where to find the data for this label. It can be a file location, a numpy ndarray, etc.
  • The label name of the edges.
  • A list of properties; the names should be consistent with the header row of the data source file or the DataFrame. This list is optional; when omitted or empty, all columns except the src/dst columns are added as properties.
  • The label name of the source vertices.
  • The label name of the destination vertices.
  • The column used as the source vertex ID.
  • The column used as the destination vertex ID.

In [ ]:
# a kind of edge with label "knows"
graph = sess.g()
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"), label="person")
graph = graph.add_edges(
    # the data source, in this case, is a file location.
    Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
    # Label name
    label="knows",
    # selected column names to be loaded as properties
    properties=["creationDate"],
    # Label name of the source vertex
    src_label="person",
    # Label name of the destination vertex
    dst_label="person",
    # Column name, which is the source vertex ID
    src_field="Person.id",
    # Column name, which is the destination vertex ID
    dst_field="Person.id.1"    
)

Some fields can be omitted for edges.

  • If the Loader contains only a URL, we can omit the Loader class and just pass the URL, i.e., use the default values for the delimiter and header row.
  • properties can be empty, which means all columns are selected.

In [ ]:
graph = sess.g()
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"), label="person")
graph = graph.add_edges(
    Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
    "knows",
    src_label="person",
    dst_label="person",
    src_field="Person.id",
    dst_field="Person.id.1"
)

Alternatively, columns can be referenced by index. For src/dst, an index of 0 means the first column (Person.id) and 1 means the second column (Person.id.1). Since the default values of src_field and dst_field are exactly 0 and 1, we can simply use the defaults.

In [ ]:
graph = sess.g()
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"), label="person")
graph = graph.add_edges(
    Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
    "knows",
    src_label="person",
    dst_label="person"
)

Also, the edge source and destination can be omitted if the graph has only one vertex label, since every edge relation can then involve only that vertex label, making the specification unambiguous.

In [ ]:
graph = sess.g()
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"), label="person")
graph = graph.add_edges(
    Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
    "knows"
)

Moreover, the vertices can be omitted entirely. GraphScope will extract vertex IDs from the edges, and the default label _ will be assigned to all vertices in this case.

In the simplest case, the configuration contains only a loader with a path. By default, the first column is used as Person.id, the second as Person.id.1, and all remaining columns in the file are parsed as properties.

In [ ]:
graph = sess.g()
graph = graph.add_edges(
    Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
    "knows"
)

In some cases, one edge label may connect several kinds of vertices. For example, in the LDBC graph, two kinds of edges are labeled likes but represent two relations: in a forum, people can give a like to both posts and comments. These relations can be abstracted as person likes post and person likes comment. In this case, we call add_edges once for each relation, with the same label likes.

In [ ]:
graph = sess.g()
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"), label="person")
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/post_0_0.csv", delimiter="|"), label="post")
graph = graph.add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"), label="comment")
graph = graph.add_edges(
    Loader("/home/jovyan/datasets/ldbc_sample/person_likes_comment_0_0.csv", delimiter="|"),
    "likes",
    ["creationDate"],
    src_label="person",
    dst_label="comment"
)
graph = graph.add_edges(
    Loader("/home/jovyan/datasets/ldbc_sample/person_likes_post_0_0.csv", delimiter="|"),
    "likes",
    ["creationDate"],
    src_label="person",
    dst_label="post"
)

Now we have a graph loaded in GraphScope, with three kinds of vertices labeled person, post, and comment, and one kind of edges labeled likes with two relations. Let's check the graph schema.

In [ ]:
print(graph.schema)

The graph also has some meta parameters, listed as follows:

  • oid_type, can be int64_t or string. Defaults to int64_t since it is faster and uses less memory.
  • directed, bool, defaults to True. Controls whether to load a directed or undirected graph.
  • generate_eid, bool, defaults to True. Whether to automatically generate a unique ID for all edges.
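As a sketch of how these defaults can be overridden (we assume sess.g() accepts them as keyword arguments, as described above):

```python
# The documented defaults, collected as keyword arguments.
default_graph_options = {
    "oid_type": "int64_t",  # or "string"; int64_t is faster and smaller
    "directed": True,        # load a directed graph
    "generate_eid": True,    # auto-generate a unique ID per edge
}

# e.g., requesting an undirected graph with string vertex IDs:
overrides = {**default_graph_options, "oid_type": "string", "directed": False}
# graph = sess.g(**overrides)  # requires a live GraphScope session
print(overrides)
```

The actual sess.g() call is commented out since it needs a running session; the dict merge simply shows which options change from the defaults.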

Let's use these techniques to load a more complex graph.

In [ ]:
graph = (
    sess.g()
    .add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"), "person")
    .add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"), "comment")
    .add_vertices(Loader("/home/jovyan/datasets/ldbc_sample/post_0_0.csv", delimiter="|"), "post")
    .add_edges(Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
              "knows", src_label="person", dst_label="person")
    .add_edges(Loader("/home/jovyan/datasets/ldbc_sample/person_likes_comment_0_0.csv", delimiter="|"),
              "likes", src_label="person", dst_label="comment")
    .add_edges(Loader("/home/jovyan/datasets/ldbc_sample/person_likes_post_0_0.csv", delimiter="|"),
              "likes", src_label="person", dst_label="post")

)

print(graph.schema)

Serialization and Deserialization

When the graph is huge, loading it may take a large amount of time (e.g., hours). GraphScope provides serialization and deserialization of graph data, which dumps the constructed graph to disk as binary data and loads it back. These functions save a lot of time and make our lives easier.

Serialization

graph.save_to takes a path argument, indicating the location to store the binary data.

In [ ]:
graph.save_to('/tmp/seri')

Deserialization

graph.load_from is a classmethod, and its signature looks like that of graph.save_to. However, its path argument must be exactly the same as the path passed to graph.save_to, as it relies on naming to find the binary files. Please note that during serialization, each worker dumps its own data to files suffixed with its index. Thus, the number of workers for deserialization must be exactly the same as that for serialization.

In addition, graph.load_from needs an extra sess parameter, specifying which session the graph would be deserialized in.

In [ ]:
deserialized_graph = Graph.load_from('/tmp/seri', sess)
In [ ]:
print(deserialized_graph.schema)

Loading From Various Locations

A Loader defines how to load data, including its location, metadata, and other configurations. GraphScope supports specifying the location as a str, which follows the URI standard. Upon receiving a request from a Loader, GraphScope parses the URI string and invokes the corresponding driver in vineyard according to the parsed scheme. Currently, the supported locations include the local file system, Amazon S3, Aliyun OSS, HDFS, and URLs on the web.

In addition, pandas dataframes or numpy ndarrays in specified format are also supported.

Data loading is managed by vineyard, which takes advantage of fsspec to resolve schemes and formats. Any additional configurations can be passed as kwargs to Loader, for example, the host and port for HDFS, or the access-id and secret-access-key for Aliyun OSS or Amazon S3.
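To illustrate the dispatch described above, the sketch below just decomposes the URIs used later in this section into their schemes with the standard library; the scheme is what selects the driver (illustrative only, not GraphScope's actual parsing code):

```python
from urllib.parse import urlparse

# Example locations in the URI forms used below.
locations = [
    "file:///var/datafiles/edgefile.e",
    "s3://bucket/datafiles/edgefile.e",
    "hdfs:///var/datafiles/edgefile.e",
]

# Each URI's scheme determines which vineyard driver handles it.
schemes = [urlparse(uri).scheme for uri in locations]
print(schemes)
```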

Graphs from Location

When a loader wraps a location, it may contain only a str. The string follows the URI standard.

In [ ]:
ds1 = Loader("file:///var/datafiles/edgefile.e")

To load data from S3, users need to provide the key and the secret. Additional arguments can be passed via client_kwargs, e.g., the region_name of the bucket.

In [ ]:
ds2 = Loader("s3://bucket/datafiles/edgefile.e", key='access-id', secret='secret-access-key', client_kwargs={'region_name': 'us-east-1'})

To load data from Aliyun OSS, users need to provide key, secret, and endpoint of the bucket.

In [ ]:
ds3 = Loader("oss://bucket/datafiles/edgefile.e", key='access-id', secret='secret-access-key', endpoint='oss-cn-hangzhou.aliyuncs.com')

To load data from HDFS, users need to provide the host and port; extra configurations can be specified via extra_conf.

In [ ]:
ds4 = Loader("hdfs:///var/datafiles/edgefile.e", host='localhost', port='9000', extra_conf={'conf1': 'value1'})

Let's see how to load a graph from Amazon S3 as a real example.

In [ ]:
graph = sess.g()
graph = graph.add_vertices(
    Loader("s3://datasets/ldbc_sample/person_0_0.csv", delimiter="|", key='testing', secret='testing', client_kwargs={
                    "endpoint_url": "http://192.168.0.222:5000"
                }),
    "person"
)
graph = graph.add_edges(
    Loader("s3://datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|", key='testing', secret='testing', client_kwargs={
                    "endpoint_url": "http://192.168.0.222:5000"
                }),
    "knows"
)

print(graph)
print(graph.schema)

Load Graphs from Numpy and Pandas

For pandas, the DataFrame's format is like that of CSV files. Note that we currently only support integer or double data types.

In [ ]:
import numpy as np
import pandas as pd
In [ ]:
leader_id = np.array([0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8])
member_id = np.array([2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9])
group_size = np.array([4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2])
e_data = np.transpose(np.vstack([leader_id, member_id, group_size]))
df_group = pd.DataFrame(e_data, columns=['leader_id', 'member_id', 'group_size'])
In [ ]:
student_id = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
avg_score = np.array([490.33, 164.5 , 190.25, 762. , 434.2, 513. , 569. ,  25. , 308. ,  87. ])
v_data = np.transpose(np.vstack([student_id, avg_score]))
df_student = pd.DataFrame(v_data, columns=['student_id', 'avg_score']).astype({'student_id': np.int64})
In [ ]:
# use a dataframe as datasource, properties omitted, col_0/col_1 will be used as src/dst by default.
# (for vertices, col_0 will be used as vertex_id by default)
graph = sess.g().add_vertices(df_student).add_edges(df_group)

For numpy, loading from ndarrays requires the data to be organized in COO format.

In [ ]:
array_group = [df_group[col].values for col in ['leader_id', 'member_id', 'group_size']]
array_student = [df_student[col].values for col in ['student_id', 'avg_score']]

graph = sess.g().add_vertices(array_student).add_edges(array_group)

Finally, close the session to release all resources.

In [ ]:
sess.close()