Loading Graphs

GraphScope models graph data as property graphs, in which vertices and edges are labeled and can have multiple properties. This tutorial shows how GraphScope loads graphs, including:

  • How to define the schema of a property graph;
  • Simplified forms to load a graph;
  • Loading graphs from various locations;
  • Serializing/Deserializing a graph to/from disk.

Defining the Schema

First, we launch a session and import necessary packages.

In [ ]:
import os
import graphscope
from graphscope.framework.graph import Graph
from graphscope.framework.loader import Loader
import vineyard

k8s_volumes = {
    "data": {
        "type": "hostPath",
        "field": {
          "path": "/testingdata",  # Path in host
          "type": "Directory"
        },
        "mounts": {
          "mountPath": "/home/jovyan/datasets",  # Path in pods
          "readOnly": True
        }
    }
}

graphscope.set_option(show_log=True)  # enable logging
sess = graphscope.session(k8s_volumes=k8s_volumes, k8s_etcd_mem='512Mi')  # create a session

We use the function load_from to load a graph. This function will

  1. resolve the configurations of vertices and edges
  2. validate the configurations
  3. load data into memory and construct a graphscope.Graph object for subsequent usage.

The basic form of load_from looks like this:

load_from(edges, vertices=None, directed=True, oid_type="int64_t", generate_eid=True)

Next, we introduce each parameter.

edges

Required.

edges is a Dict. Each item in the dict defines one edge label: the key is the label name, and the value is a configuration Tuple or List, which contains:

  • a Loader object for the data source. It tells GraphScope where to find the data for this label, and can wrap a file location, a numpy array, etc.

  • a list of properties, whose names should be consistent with the header_row of the data source file. This list is optional. When it is omitted or empty, all columns except the src/dst columns are added as properties.

  • a pair of str for the edge source, in the format of (column_name_for_src, label_of_src);

  • a pair of str for the edge destination, in the format of (column_name_for_dst, label_of_dst);

Let's see an example:

In [ ]:
edges={
    # a kind of edge with label "knows"
    "knows": (
        # the data source, in this case, is a file location.
        Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
        # selected column names that will be loaded as properties
        ["creationDate"],
        # use 'Person.id' column as source id, the src label should be 'person'
        ("Person.id", "person"),
        # use 'Person.id.1' column as destination id, the dst label is 'person'
        ("Person.id.1", "person")
    )
}

The edge definition uses person as a vertex label name; we defer its explanation to the next subsection.

Here the Loader is an object that wraps how to load the data, including its location (e.g., HDFS, local fs, Amazon S3 or Aliyun OSS), column delimiter and other metadata. In this case, the Loader points to a file location in the mounted volume.
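The requirement that property and src/dst column names match the header row can be checked before loading. Below is a minimal sketch using only the Python standard library. The helper check_edge_columns and the sample rows are made up for illustration and are not part of GraphScope.

```python
import csv
import io

# Made-up sample in the same '|'-delimited layout as the edge file above.
SAMPLE = (
    "Person.id|Person.id.1|creationDate\n"
    "933|4139|2010-03-13\n"
    "933|6597|2010-09-20\n"
)


def check_edge_columns(text, properties, src_col, dst_col, delimiter="|"):
    """Return the configured column names that are missing from the header row."""
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    wanted = list(properties) + [src_col, dst_col]
    return [name for name in wanted if name not in header]


missing = check_edge_columns(SAMPLE, ["creationDate"], "Person.id", "Person.id.1")
print(missing)  # an empty list means every configured name appears in the header
```

A non-empty result names the columns that would need to be corrected in the configuration.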

vertices

Optional, defaults to None. It can be None only when the graph has a single vertex label and no vertex properties are required. In this case, the vertex IDs are deduced from both ends of the edges.

Similar to edges, a vertex Dict contains a key as the label, and a set of configurations for the label. The configurations contain:

  • a Loader for the data source, which can be a file location, a numpy array, etc. See the Loader object for more details.

  • a list of properties to load, whose names should be consistent with the header_row of the data source file. This list is optional. When it is omitted, all columns except the vertex_id column are added as properties.

  • the column used as vertex_id. The value in this column of the data source will be used for src/dst when loading edges.

Here is an example for vertices:

In [ ]:
vertices={
    "person": (
        # source file for vertices labeled as person;
        Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
        # columns loaded as property
        ["firstName", "lastName"],
        # the column used as vertex_id
        "id"
    )
}
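Because the values in the vertex_id column are what the src/dst columns of the edges refer to, it can be useful to check that the two agree before loading. The following is a small sketch in plain Python. The helper dangling_endpoints and the ID values are hypothetical, and how GraphScope itself treats unmatched endpoints is not covered here.

```python
def dangling_endpoints(vertex_ids, edges):
    """Return edge endpoints that do not appear among the vertex IDs."""
    known = set(vertex_ids)
    missing = set()
    for src, dst in edges:
        if src not in known:
            missing.add(src)
        if dst not in known:
            missing.add(dst)
    return sorted(missing)


# Made-up values standing in for the "id" column and the two edge ID columns.
person_ids = [933, 1129, 4194]
knows_edges = [(933, 1129), (1129, 4194), (933, 8333)]

print(dangling_endpoints(person_ids, knows_edges))  # [8333]
```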

directed

Optional, default to True.

The parameter directed indicates whether to load the graph as a directed or an undirected graph.

In [ ]:
directed = True

oid_type

Optional, default to int64_t.

The parameter oid_type indicates the data type of the original IDs in the graph. It can be string or int64_t. We recommend using int64_t if possible, as it saves a lot of memory compared to string and also improves performance.

In [ ]:
oid_type = 'int64_t'
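To get a rough feeling for why fixed-width integer IDs are cheaper than string IDs, the sketch below compares the two at the level of plain Python objects. This is only an illustration under that assumption; it does not measure GraphScope's internal storage.

```python
import sys
from array import array

ids = list(range(100_000))

# Fixed-width storage: 8 bytes per ID, like a 64-bit integer column.
int_bytes = array("q", ids).itemsize * len(ids)

# The same IDs as Python strings; every str object carries its own header.
str_bytes = sum(sys.getsizeof(str(i)) for i in ids)

print(int_bytes, str_bytes)
```

The exact string size depends on the Python version, but it is several times larger than the integer representation.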

generate_eid

Optional, default to True.

Some components, such as the Graph Interactive Engine, require every edge to have an eid. Setting generate_eid to True generates eids for the edges. In short, set this field to True if you want to use the interactive engine, and to False otherwise. The default is True.

In this tutorial we just set it to False.

In [ ]:
generate_eid = False

Next, we put these together in a call to load_from.

In [ ]:
graph = sess.load_from(edges, vertices, directed, oid_type, generate_eid)

Now we have a graph loaded in GraphScope, with one kind of vertex labeled person and one kind of edge labeled knows. Let's check the graph schema.

In [ ]:
print(graph.schema)

Serialization and Deserialization

When the graph is huge, loading it can take a long time (possibly hours). GraphScope provides serialization and deserialization for graph data, which dump constructed graphs to disk as binary data and load them back. These functions save a lot of time and make our lives easier.

Serialization

graph.serialize takes a path argument, indicating the location to store the binary data.

In [ ]:
graph.serialize('/tmp/seri')

Deserialization

Graph.deserialize is a classmethod whose signature resembles that of graph.serialize. Its path argument must be exactly the same as the path passed to graph.serialize, because it relies on naming to find the binary files. Please note that during serialization, each worker dumps its own data to files with its index as the suffix. Thus the number of workers for deserialization must be exactly the same as for serialization.

In addition, Graph.deserialize needs an extra sess parameter, specifying the session in which the graph is deserialized.

In [ ]:
deserialized_graph = Graph.deserialize('/tmp/seri', sess)
In [ ]:
print(deserialized_graph.schema)
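The constraint on the number of workers follows from the one-file-per-worker layout. The sketch below imitates that layout with JSON files in a temporary directory. The helpers dump_parts and load_parts, the prefix_index naming, and the JSON format are all assumptions made for illustration; they are not GraphScope's actual file format.

```python
import json
import os
import tempfile


def dump_parts(prefix, parts):
    """Write one file per worker, using the worker index as the suffix."""
    for index, part in enumerate(parts):
        with open(f"{prefix}_{index}", "w") as f:
            json.dump(part, f)


def load_parts(prefix, num_workers):
    """Read one file per worker; refuse to load if the worker count differs."""
    directory, base = os.path.split(prefix)
    on_disk = len([n for n in os.listdir(directory) if n.startswith(base + "_")])
    if on_disk != num_workers:
        raise ValueError(f"{on_disk} part files on disk, {num_workers} workers requested")
    parts = []
    for index in range(num_workers):
        with open(f"{prefix}_{index}") as f:
            parts.append(json.load(f))
    return parts


with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "seri")
    dump_parts(prefix, [[0, 1], [2, 3]])  # "serialized" by two workers
    restored = load_parts(prefix, 2)      # same worker count: succeeds
    try:
        load_parts(prefix, 3)             # different worker count: rejected
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(restored, mismatch_detected)  # [[0, 1], [2, 3]] True
```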

Various Forms to Define a Graph

Recall the definition of edges in the previous section, which uses a tuple to specify the configurations.

Alternatively, they can be defined as a Dict. The reserved keys of the Dict are loader, properties, source and destination. The second configuration below is exactly equivalent to the first.

In [ ]:
edges={
    "knows": (
        Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
        ["creationDate"],
        ("Person.id", "person"),
        ("Person.id.1", "person")
    )
}
In [ ]:
edges = {
    "knows": {
        "loader": Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
        "properties": ["creationDate"],
        "source": ("Person.id", "person"),
        "destination": ("Person.id.1", "person"),
    },
}
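The correspondence between the two forms is positional: each tuple item maps to the reserved key at the same position. Here is a minimal sketch of that mapping. The helper to_dict_form is hypothetical, and a plain path string stands in for the Loader object so that the snippet runs without a session.

```python
EDGE_KEYS = ("loader", "properties", "source", "destination")


def to_dict_form(config):
    """Turn a tuple/list edge configuration into the equivalent dict form."""
    if isinstance(config, dict):
        unknown = set(config) - set(EDGE_KEYS)
        if unknown:
            raise KeyError(f"unexpected keys: {sorted(unknown)}")
        return dict(config)
    # Positional form: pair each item with the reserved key at the same position.
    return dict(zip(EDGE_KEYS, config))


# A plain path string stands in for the Loader object here.
path = "/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv"
as_tuple = (path, ["creationDate"], ("Person.id", "person"), ("Person.id.1", "person"))
as_dict = {
    "loader": path,
    "properties": ["creationDate"],
    "source": ("Person.id", "person"),
    "destination": ("Person.id.1", "person"),
}

print(to_dict_form(as_tuple) == as_dict)  # True
```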

In some cases, an edge label may connect several kinds of vertices. For example, in the LDBC graph, two kinds of edges are labeled likes but represent two different relations: in a forum, people can like both posts and comments. These relations can be abstracted as person likes post and person likes comment. In this case, the likes key maps to a list of configurations.

In [ ]:
edges={
    # a kind of edge with label "likes"
    "likes": [
        (
            Loader("/home/jovyan/datasets/ldbc_sample/person_likes_comment_0_0.csv", delimiter="|"),
            ["creationDate"],
            ("Person.id", "person"),
            ("Comment.id", "comment")
        ),
        (
            Loader("/home/jovyan/datasets/ldbc_sample/person_likes_post_0_0.csv", delimiter="|"),
            ["creationDate"],
            ("Person.id", "person"),
            ("Post.id", "post")
        )
    ]
}
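One way to read such a configuration is as a set of (source label, edge label, destination label) relations. The sketch below extracts them from a structure shaped like the one above. The helper relations is hypothetical, and file names stand in for Loader objects.

```python
def relations(edges):
    """List (source label, edge label, destination label) for every configuration."""
    triples = []
    for label, configs in edges.items():
        # A single tuple is treated as a list with one configuration.
        if isinstance(configs, tuple):
            configs = [configs]
        for _loader, _props, (_, src_label), (_, dst_label) in configs:
            triples.append((src_label, label, dst_label))
    return triples


# File names stand in for Loader objects.
edges_sketch = {
    "likes": [
        ("person_likes_comment_0_0.csv", ["creationDate"], ("Person.id", "person"), ("Comment.id", "comment")),
        ("person_likes_post_0_0.csv", ["creationDate"], ("Person.id", "person"), ("Post.id", "post")),
    ]
}

print(relations(edges_sketch))
# [('person', 'likes', 'comment'), ('person', 'likes', 'post')]
```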

Some fields can be omitted for edges.

  • If the Loader contains only a url, we can omit the class and just give the url, i.e., use the default values for delimiter and header_row.
  • properties can be empty, which means that all columns are selected.
In [ ]:
edges={
    "knows": (
        "/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv",
        [],
        ("Person.id", "person"),
        ("Person.id.1", "person")
    )
}

Alternatively, any column name can be replaced by its index. For example, the numbers in the src/dst pairs below indicate that the first column is used as Person.id and the second column as Person.id.1:

In [ ]:
edges={
    "knows": (
        Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
        ["creationDate"],
        # 0 represents the first column.
        (0, "person"),
        # second column used as dst.
        (1, "person"),
    )
}
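Resolving a column given either by name or by index can be pictured as a lookup against the header row. The following sketch shows the idea. The helper resolve_column and the header list are assumptions for illustration.

```python
def resolve_column(header, column):
    """Return the column name for either a name or a zero-based index."""
    if isinstance(column, int):
        return header[column]
    if column not in header:
        raise KeyError(column)
    return column


# Assumed header row of the edge file.
header = ["Person.id", "Person.id.1", "creationDate"]

print(resolve_column(header, 0), resolve_column(header, 1))  # Person.id Person.id.1
print(resolve_column(header, "creationDate"))                # creationDate
```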

Also, the edge source and destination can be omitted if the graph has only one vertex label, because every edge must then connect vertices of that label. Omitting the source and destination specification is therefore unambiguous.

In [ ]:
edges={
    "group": (
        Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
        ["creationDate"]
    )
}

In the simplest case, the configuration can be just a loader with a path. By default, the first column is used as the source ID (Person.id) and the second column as the destination ID (Person.id.1). All remaining columns in the file are parsed as properties.

In [ ]:
edges={
    "knows": Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|")
}
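The positional defaults can be illustrated on a header row directly. Below is a short sketch with the standard csv module. The helper split_default_columns and the sample row are made up for illustration.

```python
import csv
import io

# Made-up sample in the same '|'-delimited layout as the edge file.
SAMPLE = "Person.id|Person.id.1|creationDate\n933|4139|2010-03-13\n"


def split_default_columns(text, delimiter="|"):
    """Split the header into (src, dst, properties) using the positional defaults."""
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    return header[0], header[1], header[2:]


src, dst, props = split_default_columns(SAMPLE)
print(src, dst, props)  # Person.id Person.id.1 ['creationDate']
```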

As with edges, the configuration for vertices can also be a Dict, in which the keys are “loader”, “properties” and “vid”.

In [ ]:
vertices={
    "person": {
        "loader": Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
        "properties": ["firstName", "lastName"],
        "vid": "id",
    },
}

We can also omit certain configurations for vertices.

  • If the Loader contains only a url, we can omit the class and just give the url;

  • properties can be empty, which means that all columns are selected as properties;

  • vid can be given as a column index.

In the simplest case, the configuration can contain only a loader. In this case, the first column is used as vid, and the remaining columns are used as properties.

In [ ]:
vertices={
    "person": Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|")
}

Moreover, the vertices can be omitted entirely. GraphScope will extract vertex IDs from the edges, and a default label _ will be assigned to all vertices in this case.

In [ ]:
g = sess.load_from(
    edges={
        "knows": Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|")
        }
    )
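Conceptually, extracting vertices from edges means collecting the distinct IDs that occur at either end. The sketch below shows this in plain Python. The helper deduce_vertices and the edge pairs are hypothetical, and only the default label _ is taken from the text above.

```python
def deduce_vertices(edge_rows, default_label="_"):
    """Collect the distinct IDs from both ends of the edges under one label."""
    ids = set()
    for src, dst in edge_rows:
        ids.add(src)
        ids.add(dst)
    return {default_label: sorted(ids)}


# Made-up (src, dst) pairs.
rows = [(933, 4139), (933, 6597), (4139, 6597)]
print(deduce_vertices(rows))  # {'_': [933, 4139, 6597]}
```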

Let's use these techniques to load a more complex graph.

In [ ]:
g = sess.load_from(
    edges={
        "knows": (
            Loader("/home/jovyan/datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|"),
            ["creationDate"],
            ("Person.id", "person"),
            ("Person.id.1", "person")
        ),
        "likes": [
            (
                Loader("/home/jovyan/datasets/ldbc_sample/person_likes_comment_0_0.csv", delimiter="|"),
                ["creationDate"],
                ("Person.id", "person"),
                ("Comment.id", "comment")
            ),
            (
                Loader("/home/jovyan/datasets/ldbc_sample/person_likes_post_0_0.csv", delimiter="|"),
                ["creationDate"],
                ("Person.id", "person"),
                ("Post.id", "post")
            )
        ]
    },
    vertices={
        "person": (
            Loader("/home/jovyan/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
            ["firstName", "lastName"],
            "id",
        ),
        "comment": (
            Loader("/home/jovyan/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
            ["creationDate"],
            "id",
        ),
        "post": (
            Loader("/home/jovyan/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
            ["creationDate"],
            "id",
        )
    },
)

Loading From Various Locations

A Loader defines how to load data, including its location, metadata, and other configurations. GraphScope supports specifying the location as a str that follows the URI standard. Upon receiving a request from a Loader, GraphScope parses the URI string and invokes the corresponding driver in vineyard according to the parsed scheme. Currently, the supported locations are the local file system, Amazon S3, Aliyun OSS, HDFS and URLs on the web.

In addition, pandas dataframes or numpy ndarrays in specified format are also supported.

The data loading is managed by vineyard. vineyard takes advantage of fsspec to resolve schemes and formats. Any additional configurations can be passed in kwargs to Loader, for example, the host and port to HDFS, or access-id, secret-access-key to AliyunOSS or Amazon S3.

Graphs from Location

When a loader wraps a location, it may contain only a str. The string follows the URI standard.

In [ ]:
ds1 = Loader("file:///var/datafiles/edgefile.e")

To load data from S3, users need to provide the key and the secret. Besides, additional arguments can be passed by client_kwargs, e.g., region_name of bucket.

In [ ]:
ds2 = Loader("s3://bucket/datafiles/edgefile.e", key='access-id', secret='secret-access-key', client_kwargs={'region_name': 'us-east-1'})

To load data from Aliyun OSS, users need to provide key, secret, and endpoint of the bucket.

In [ ]:
ds3 = Loader("oss://bucket/datafiles/edgefile.e", key='access-id', secret='secret-access-key', endpoint='oss-cn-hangzhou.aliyuncs.com')

To load data from HDFS, users need to provide the host and port. Extra configurations can be specified with extra_conf.

In [ ]:
ds4 = Loader("hdfs:///var/datafiles/edgefile.e", host='localhost', port='9000', extra_conf={'conf1': 'value1'})
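The four locations above differ mainly in their URI scheme, which is what selects the storage backend. As an illustration, the standard library's urllib.parse can split them into scheme, bucket (or host) and path. This only shows how such URIs decompose and is not the parsing code used by GraphScope or vineyard.

```python
from urllib.parse import urlparse

locations = [
    "file:///var/datafiles/edgefile.e",
    "s3://bucket/datafiles/edgefile.e",
    "oss://bucket/datafiles/edgefile.e",
    "hdfs:///var/datafiles/edgefile.e",
]

for location in locations:
    parts = urlparse(location)
    # scheme names the backend; netloc is the bucket (empty for file and hdfs here)
    print(parts.scheme, repr(parts.netloc), parts.path)

schemes = [urlparse(location).scheme for location in locations]
```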

Let's see how to load a graph from Amazon S3 as a real example.

In [ ]:
graph = sess.load_from(
    edges={
        "knows": (
                Loader("s3://datasets/ldbc_sample/person_knows_person_0_0.csv", delimiter="|", key='testing', secret='testing', client_kwargs={
                    "endpoint_url": "http://192.168.0.222:5000"
                }),
            )
    },
    vertices={
        "person": (
            Loader("s3://datasets/ldbc_sample/person_0_0.csv", delimiter="|", key='testing', secret='testing', client_kwargs={
                    "endpoint_url": "http://192.168.0.222:5000"
                }),
        ),
    },
)
print(graph.schema)

Load Graphs from Numpy and Pandas

For pandas, the dataframe is laid out like a CSV file. Note that only integer and double data types are currently supported.

In [ ]:
import numpy as np
import pandas as pd
In [ ]:
leader_id = np.array([0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8])
member_id = np.array([2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9])
group_size = np.array([4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2])
e_data = np.transpose(np.vstack([leader_id, member_id, group_size]))
df_group = pd.DataFrame(e_data, columns=['leader_id', 'member_id', 'group_size'])
In [ ]:
student_id = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
avg_score = np.array([490.33, 164.5 , 190.25, 762. , 434.2, 513. , 569. ,  25. , 308. ,  87. ])
v_data = np.transpose(np.vstack([student_id, avg_score]))
df_student = pd.DataFrame(v_data, columns=['student_id', 'avg_score']).astype({'student_id': np.int64})
In [ ]:
# use a dataframe as datasource, properties omitted, col_0/col_1 will be used as src/dst by default.
# (for vertices, col_0 will be used as vertex_id by default)
g = sess.load_from(edges=df_group, vertices=df_student)

For numpy, loading from ndarrays requires the data to be organized in COO format.

In [ ]:
array_group = [df_group[col].values for col in ['leader_id', 'member_id', 'group_size']]
array_student = [df_student[col].values for col in ['student_id', 'avg_score']]

g = sess.load_from(edges=array_group, vertices=array_student)
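In COO (coordinate) layout, as used above, each column is a separate array and the i-th entries of all arrays together describe the i-th edge. The sketch below converts row-oriented records into that layout with plain Python lists. The three sample rows are made up, with column names borrowed from the example above.

```python
# Row-oriented edge records: (leader_id, member_id, group_size).
rows = [(0, 2, 4), (0, 3, 4), (1, 5, 3)]

# COO layout: one sequence per column, aligned by position.
src, dst, group_size = (list(column) for column in zip(*rows))

print(src)         # [0, 0, 1]
print(dst)         # [2, 3, 5]
print(group_size)  # [4, 4, 3]
```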

Finally, close the session to release all resources.

In [ ]:
sess.close()