Notebook

Building The Star Wars Graph¶

Thousands of Star Wars fans have contributed to Star Wars wikis like the Wookieepedia Wiki - the open Star Wars encyclopedia that anyone can edit. This wiki collects information about all Star Wars films, including characters, planets, starships and much more. A subset of the information from Wookieepedia is available via REST API at SWAPI.co. SWAPI provides an API for fetching information about which characters appear in which films, the characters' home planets, and what starships they’ve piloted. This information is inherently a graph! So thanks to Wookieepedia and SWAPI.co we’re able to build the Star Wars Graph!

In order to build this graph we'll need to fetch data from SWAPI.co, an open REST API for Star Wars data. We'll use the Python requests package to make requests from the API (which will return JSON), we'll then use py2neo to execute Cypher queries against a Neo4j database using the JSON returned from the API as parameters for our queries. Because this is a REST API, many of the resources are returned as only URLs which we'll need to fetch later to fully populate the data in our graph. We'll use Neo4j as a type of queuing mechanism, creating relationships and nodes using the URL as a placeholder for a resource that needs to be fully hydrated later.

This notebook assumes some basic knowledge of Python and some working knowledge of Neo4j. For a more general overview of Neo4j and graph databases, check out some of the other posts on my blog.

In [1]:

# import dependency packages
from py2neo import Graph    # install with `pip install py2neo`
import requests             # `pip install requests`

In [2]:

# Exploring the API
# what endpoints are available?
r = requests.get("http://swapi.co/api/")
r.json()

Out[2]:

{'people': 'https://swapi.co/api/people/',
 'planets': 'https://swapi.co/api/planets/',
 'films': 'https://swapi.co/api/films/',
 'species': 'https://swapi.co/api/species/',
 'vehicles': 'https://swapi.co/api/vehicles/',
 'starships': 'https://swapi.co/api/starships/'}

The datamodel¶

Based on the entities and the data we have available, this is the property graph data model that we'll be using:

Connect to Neo4j and add constraints¶

We'll be using the py2neo Python driver for Neo4j. We'll first connect to a running Neo4j instance and add constraints based on the data model we've defined above

In [3]:

# Connect to Neo4j instance
graph = Graph("bolt://localhost:7687", auth=("neo4j", "<NEO4J_PASSWORD>"))

# create uniqueness constraints based on the datamodel
graph.run("CREATE CONSTRAINT ON (f:Film) ASSERT f.url IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (p:Person) ASSERT p.url IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (v:Vehicle) ASSERT v.url IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (s:Starship) ASSERT s.url IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (p:Planet) ASSERT p.url IS UNIQUE")

Out[3]:

<py2neo.database.Cursor at 0x7f04aaf90c88>

Inserting a single resource¶

Let's see how we can insert a single resource result from the SWAPI, a single Person object. First we'll look at the data returned for a Person entity, then we'll write a Cypher query that can take the JSON document returned by the API as a parameter object to insert it into our graph.

In [4]:

# Fetch a single person entity from the API
r = requests.get("http://swapi.co/api/people/1/")
params = r.json()
params

Out[4]:

{'name': 'Luke Skywalker',
 'height': '172',
 'mass': '77',
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue',
 'birth_year': '19BBY',
 'gender': 'male',
 'homeworld': 'https://swapi.co/api/planets/1/',
 'films': ['https://swapi.co/api/films/2/',
  'https://swapi.co/api/films/6/',
  'https://swapi.co/api/films/3/',
  'https://swapi.co/api/films/1/',
  'https://swapi.co/api/films/7/'],
 'species': ['https://swapi.co/api/species/1/'],
 'vehicles': ['https://swapi.co/api/vehicles/14/',
  'https://swapi.co/api/vehicles/30/'],
 'starships': ['https://swapi.co/api/starships/12/',
  'https://swapi.co/api/starships/22/'],
 'created': '2014-12-09T13:50:51.644000Z',
 'edited': '2014-12-20T21:17:56.891000Z',
 'url': 'https://swapi.co/api/people/1/'}

In [5]:

# Define a parameterized Cypher query to insert a Person entity into the graph
# For resources referenced in the Person entity (like homeworld and starships) we create a relationship and node
# containing only the url. This new node acts as a placeholder that we'll need to fill in later

CREATE_PERSON_QUERY = '''
MERGE (p:Person {url: {url}})
SET p.birth_year = {birth_year},
    p.created = {created},
    p.edited = {edited},
    p.eye_color = {eye_color},
    p.gender = {gender},
    p.hair_color = {hair_color},
    p.height = {height},
    p.mass = {mass},
    p.name = {name},
    p.skin_color = {skin_color}
REMOVE p:Placeholder
WITH p
MERGE (home:Planet {url: {homeworld}})
ON CREATE SET home:Placeholder
CREATE UNIQUE (home)<-[:IS_FROM]-(p)
WITH p
UNWIND {species} AS specie
MERGE (s:Species {url: specie})
ON CREATE SET s:Placeholder
CREATE UNIQUE (p)-[:IS_SPECIES]->(s)
WITH DISTINCT p
UNWIND {starships} AS starship
MERGE (s:Starship {url: starship})
ON CREATE SET s:Placeholder
CREATE UNIQUE (p)-[:PILOTS]->(s)
WITH DISTINCT p
UNWIND {vehicles} AS vehicle
MERGE (v:Vehicle {url: vehicle})
ON CREATE SET v:Placeholder
CREATE UNIQUE (p)-[:PILOTS]->(v)
'''

In [6]:

# we can execute this query using py2neo
graph.run(CREATE_PERSON_QUERY, params)

Out[6]:

<py2neo.database.Cursor at 0x7f04c0766cf8>

Executing this query creates a Person node for Luke Skywalker and sets properties on this node for properties returned from the API (birth_year, name, height, etc). For other entities (like home planet, species, and the starships he's piloted) the API returns only the url for these entities - we must make an additional API request to hydrate these later. Using the url for these resources as a unique id, we create nodes and relationships to these nodes.

We'll later query the graph for these incomplete nodes so that we can hydrate these entities with a request to the SWAPI. Note that when we query the API to hydrate these entities we may end up adding more nodes and relationships (in addition to just adding properties) - this allows us to build our graph by "crawling" the API, using Neo4j as a queue to store the resources to be crawled asynchronously.

Define Cypher queries¶

We will now look at the JSON format for each type of entity and define the Cypher query to handle updating the graph for that type of entity.

Film¶

The films endpoint returns information about a single film as well as arrays of characters, planets, species, starships, and vehicles that appear in the film. Note that these arrays contain only a url for the entity. We will use the Cypher UNWIND statement to iterate through the elements of these arrays, inserting a Placeholder node which we will fill-in later with an additional call to SWAPI.

In [7]:

# Fetch a single Film entity from the API
r = requests.get("http://swapi.co/api/films/1/")
params = r.json()
params

Out[7]:

{'title': 'A New Hope',
 'episode_id': 4,
 'opening_crawl': "It is a period of civil war.\r\nRebel spaceships, striking\r\nfrom a hidden base, have won\r\ntheir first victory against\r\nthe evil Galactic Empire.\r\n\r\nDuring the battle, Rebel\r\nspies managed to steal secret\r\nplans to the Empire's\r\nultimate weapon, the DEATH\r\nSTAR, an armored space\r\nstation with enough power\r\nto destroy an entire planet.\r\n\r\nPursued by the Empire's\r\nsinister agents, Princess\r\nLeia races home aboard her\r\nstarship, custodian of the\r\nstolen plans that can save her\r\npeople and restore\r\nfreedom to the galaxy....",
 'director': 'George Lucas',
 'producer': 'Gary Kurtz, Rick McCallum',
 'release_date': '1977-05-25',
 'characters': ['https://swapi.co/api/people/1/',
  'https://swapi.co/api/people/2/',
  'https://swapi.co/api/people/3/',
  'https://swapi.co/api/people/4/',
  'https://swapi.co/api/people/5/',
  'https://swapi.co/api/people/6/',
  'https://swapi.co/api/people/7/',
  'https://swapi.co/api/people/8/',
  'https://swapi.co/api/people/9/',
  'https://swapi.co/api/people/10/',
  'https://swapi.co/api/people/12/',
  'https://swapi.co/api/people/13/',
  'https://swapi.co/api/people/14/',
  'https://swapi.co/api/people/15/',
  'https://swapi.co/api/people/16/',
  'https://swapi.co/api/people/18/',
  'https://swapi.co/api/people/19/',
  'https://swapi.co/api/people/81/'],
 'planets': ['https://swapi.co/api/planets/2/',
  'https://swapi.co/api/planets/3/',
  'https://swapi.co/api/planets/1/'],
 'starships': ['https://swapi.co/api/starships/2/',
  'https://swapi.co/api/starships/3/',
  'https://swapi.co/api/starships/5/',
  'https://swapi.co/api/starships/9/',
  'https://swapi.co/api/starships/10/',
  'https://swapi.co/api/starships/11/',
  'https://swapi.co/api/starships/12/',
  'https://swapi.co/api/starships/13/'],
 'vehicles': ['https://swapi.co/api/vehicles/4/',
  'https://swapi.co/api/vehicles/6/',
  'https://swapi.co/api/vehicles/7/',
  'https://swapi.co/api/vehicles/8/'],
 'species': ['https://swapi.co/api/species/5/',
  'https://swapi.co/api/species/3/',
  'https://swapi.co/api/species/2/',
  'https://swapi.co/api/species/1/',
  'https://swapi.co/api/species/4/'],
 'created': '2014-12-10T14:23:31.880000Z',
 'edited': '2015-04-11T09:46:52.774897Z',
 'url': 'https://swapi.co/api/films/1/'}

In [8]:

# Insert a given Film into the graph, including Placeholder nodes for any new entities discovered
CREATE_MOVIE_QUERY = '''
MERGE (f:Film {url: {url}})
SET f.created = {created},
    f.edited = {edited},
    f.episode_id = toInt({episode_id}),
    f.opening_crawl = {opening_crawl},
    f.release_date = {release_date},
    f.title = {title}
WITH f
UNWIND split({director}, ",") AS director
MERGE (d:Director {name: director})
CREATE UNIQUE (f)-[:DIRECTED_BY]->(d)
WITH DISTINCT f
UNWIND split({producer}, ",") AS producer
MERGE (p:Producer {name: producer})
CREATE UNIQUE (f)-[:PRODUCED_BY]->(p)
WITH DISTINCT f
UNWIND {characters} AS character
MERGE (c:Person {url: character})
ON CREATE SET c:Placeholder
CREATE UNIQUE (c)-[:APPEARS_IN]->(f)
WITH DISTINCT f
UNWIND {planets} AS planet
MERGE (p:Planet {url: planet})
ON CREATE SET p:Placeholder
CREATE UNIQUE (f)-[:TAKES_PLACE_ON]->(p)
WITH DISTINCT f
UNWIND {species} AS specie
MERGE (s:Species {url: specie})
ON CREATE SET s:Placeholder
CREATE UNIQUE (s)-[:APPEARS_IN]->(f)
WITH DISTINCT f
UNWIND {starships} AS starship
MERGE (s:Starship {url: starship})
ON CREATE SET s:Placeholder
CREATE UNIQUE (s)-[:APPEARS_IN]->(f)
WITH DISTINCT f
UNWIND {vehicles} AS vehicle
MERGE (v:Vehicle {url: vehicle})
ON CREATE SET v:Placeholder
CREATE UNIQUE (v)-[:APPEARS_IN]->(f)
'''

Planet¶

Details about planets include the climate, residents, and terrain. Note that we are extracting climate and terrain into nodes. This will allow us to define queries that traverse the graph to answer questions like "Which planets are similar to each other?"

In [9]:

# Fetch a single Film entity from the API
r = requests.get("http://swapi.co/api/planets/1/")
params = r.json()
params

Out[9]:

{'name': 'Tatooine',
 'rotation_period': '23',
 'orbital_period': '304',
 'diameter': '10465',
 'climate': 'arid',
 'gravity': '1 standard',
 'terrain': 'desert',
 'surface_water': '1',
 'population': '200000',
 'residents': ['https://swapi.co/api/people/1/',
  'https://swapi.co/api/people/2/',
  'https://swapi.co/api/people/4/',
  'https://swapi.co/api/people/6/',
  'https://swapi.co/api/people/7/',
  'https://swapi.co/api/people/8/',
  'https://swapi.co/api/people/9/',
  'https://swapi.co/api/people/11/',
  'https://swapi.co/api/people/43/',
  'https://swapi.co/api/people/62/'],
 'films': ['https://swapi.co/api/films/5/',
  'https://swapi.co/api/films/4/',
  'https://swapi.co/api/films/6/',
  'https://swapi.co/api/films/3/',
  'https://swapi.co/api/films/1/'],
 'created': '2014-12-09T13:50:49.641000Z',
 'edited': '2014-12-21T20:48:04.175778Z',
 'url': 'https://swapi.co/api/planets/1/'}

In [10]:

# Update Planet entity in the graph
CREATE_PLANET_QUERY = '''
MERGE (p:Planet {url: {url}})
SET p.created = {created},
    p.diameter = {diameter},
    p.edited = {edited},
    p.gravity = {gravity},
    p.name = {name},
    p.orbital_period = {orbital_period},
    p.population = {population},
    p.rotation_period = {rotation_period},
    p.surface_water = {surface_water}
REMOVE p:Placeholder
WITH p
UNWIND split({climate}, ",") AS c
MERGE (cli:Climate {type: c})
CREATE UNIQUE (p)-[:HAS_CLIMATE]->(cli)
WITH DISTINCT p
UNWIND split({terrain}, ",") AS t
MERGE (ter:Terrain {type: t})
CREATE UNIQUE (p)-[:HAS_TERRAIN]->(ter)
'''

Species¶

In [11]:

# Fetch a single Film entity from the API
r = requests.get("http://swapi.co/api/species/2/")
params = r.json()
params

Out[11]:

{'name': 'Droid',
 'classification': 'artificial',
 'designation': 'sentient',
 'average_height': 'n/a',
 'skin_colors': 'n/a',
 'hair_colors': 'n/a',
 'eye_colors': 'n/a',
 'average_lifespan': 'indefinite',
 'homeworld': None,
 'language': 'n/a',
 'people': ['https://swapi.co/api/people/2/',
  'https://swapi.co/api/people/3/',
  'https://swapi.co/api/people/8/',
  'https://swapi.co/api/people/23/',
  'https://swapi.co/api/people/87/'],
 'films': ['https://swapi.co/api/films/2/',
  'https://swapi.co/api/films/7/',
  'https://swapi.co/api/films/5/',
  'https://swapi.co/api/films/4/',
  'https://swapi.co/api/films/6/',
  'https://swapi.co/api/films/3/',
  'https://swapi.co/api/films/1/'],
 'created': '2014-12-10T15:16:16.259000Z',
 'edited': '2015-04-17T06:59:43.869528Z',
 'url': 'https://swapi.co/api/species/2/'}

In [12]:

# Update Species entity in the graph
CREATE_SPECIES_QUERY = '''
MERGE (s:Species {url: {url}})
SET s.name = {name},
    s.language = {language},
    s.average_height = {average_height},
    s.average_lifespan = {average_lifespan},
    s.classification = {classification},
    s.created = {created},
    s.designation = {designation},
    s.eye_colors = {eye_colors},
    s.hair_colors = {hair_colors},
    s.skin_colors = {skin_colors}
REMOVE s:Placeholder
'''

Starships¶

A starship is defined as a vehicle that has hyperdrive capability.

In [13]:

# Fetch a single Film entity from the API
r = requests.get("http://swapi.co/api/starships/2/")
params = r.json()
params

Out[13]:

{'name': 'CR90 corvette',
 'model': 'CR90 corvette',
 'manufacturer': 'Corellian Engineering Corporation',
 'cost_in_credits': '3500000',
 'length': '150',
 'max_atmosphering_speed': '950',
 'crew': '165',
 'passengers': '600',
 'cargo_capacity': '3000000',
 'consumables': '1 year',
 'hyperdrive_rating': '2.0',
 'MGLT': '60',
 'starship_class': 'corvette',
 'pilots': [],
 'films': ['https://swapi.co/api/films/6/',
  'https://swapi.co/api/films/3/',
  'https://swapi.co/api/films/1/'],
 'created': '2014-12-10T14:20:33.369000Z',
 'edited': '2014-12-22T17:35:45.408368Z',
 'url': 'https://swapi.co/api/starships/2/'}

In [14]:

CREATE_STARSHIP_QUERY = '''
MERGE (s:Starship {url: {url}})
SET s.MGLT = {MGLT},
    s.consumables = {consumables},
    s.cost_in_credits = {cost_in_credits},
    s.created = {created},
    s.crew = {crew},
    s.edited = {edited},
    s.hyperdrive_rating = {hyperdrive_rating},
    s.length = {length},
    s.max_atmosphering_speed = {max_atmosphering_speed},
    s.model = {model},
    s.name = {name},
    s.passengers = {passengers}
REMOVE s:Placeholder
MERGE (m:Manufacturer {name: {manufacturer}})
CREATE UNIQUE (s)-[:MANUFACTURED_BY]->(m)
WITH s
MERGE (c:StarshipClass {type: {starship_class}})
CREATE UNIQUE (s)-[:IS_CLASS]->(c)
'''

Vehicles¶

Any vehicles that lack a hyperdrive are called simply, vehicles...

In [15]:

# Fetch a single Film entity from the API
r = requests.get("http://swapi.co/api/vehicles/4/")
params = r.json()
params

Out[15]:

{'name': 'Sand Crawler',
 'model': 'Digger Crawler',
 'manufacturer': 'Corellia Mining Corporation',
 'cost_in_credits': '150000',
 'length': '36.8',
 'max_atmosphering_speed': '30',
 'crew': '46',
 'passengers': '30',
 'cargo_capacity': '50000',
 'consumables': '2 months',
 'vehicle_class': 'wheeled',
 'pilots': [],
 'films': ['https://swapi.co/api/films/5/', 'https://swapi.co/api/films/1/'],
 'created': '2014-12-10T15:36:25.724000Z',
 'edited': '2014-12-22T18:21:15.523587Z',
 'url': 'https://swapi.co/api/vehicles/4/'}

In [16]:

CREATE_VEHICLE_QUERY = '''
MERGE (v:Vehicle {url: {url}})
SET v.cargo_capacity = {cargo_capacity},
    v.consumables = {consumables},
    v.cost_in_credits = {cost_in_credits},
    v.created = {created},
    v.crew = {crew},
    v.edited = {edited},
    v.length = {length},
    v.max_atmosphering_speed = {max_atmosphering_speed},
    v.model = {model},
    v.name = {name},
    v.passengers = {passengers}
REMOVE v:Placeholder
MERGE (m:Manufacturer {name: {manufacturer}})
CREATE UNIQUE (v)-[:MANUFACTURED_BY]->(m)
WITH v
MERGE (c:VehicleClass {type: {vehicle_class}})
CREATE UNIQUE (v)-[:IS_CLASS]->(c)
'''

Crawl the graph¶

Now that we've defined the Cypher queries to handle inserting data from SWAPI, we can start crawling the API by making HTTP requests to SWAPI and building our graph! But first we need a starting point.

Start with films¶

Since we know the films we want to insert into the graph (Episodes 1-6) we will start there. This loop starts with Episode I, fetches the film data from SWAPI then executes the CREATE_MOVIE_QUERY Cypher query using that data as a parameter, then loops through the remaining episodes.

In [17]:

# Fetch Movie entities and insert into graph 
for i in range(1,8):
    url = "http://swapi.co/api/films/" + str(i) + "/"
    r = requests.get(url)
    params = r.json()
    graph.run(CREATE_MOVIE_QUERY, params)
    print("Inserted film: " + str(url))

Inserted film: http://swapi.co/api/films/1/
Inserted film: http://swapi.co/api/films/2/
Inserted film: http://swapi.co/api/films/3/
Inserted film: http://swapi.co/api/films/4/
Inserted film: http://swapi.co/api/films/5/
Inserted film: http://swapi.co/api/films/6/
Inserted film: http://swapi.co/api/films/7/

We've now created Film nodes for each of the seven films, as well as created placeholder nodes for new entities that we've discovered while inserting the films.

In [18]:

# How many Placeholder nodes are in the graph now?
placeholder_count_query = '''
MATCH (p:Placeholder) WITH p
WITH collect(DISTINCT head(labels(p))) AS labels
UNWIND labels AS label
MATCH (p:Placeholder) WHERE head(labels(p))=label
RETURN label, count(*) AS num
'''

result = graph.run(placeholder_count_query)
result.to_table()

Out[18]:

label	num
Planet	21
Placeholder	37
Starship	37
Vehicle	39
Person	86

Fill in Placeholder nodes and crawl the graph¶

We can now continue to populate our graph by crawling the API.

First, we'll define a query to find a single Placeholder node in the graph and return the url for the placeholder. We will then use this url to make a request to SWAPI to populate the entity and its type (Vehicle, Starship, Person, etc).

In [19]:

# Find a single Placeholder entity and return its url and type
FIND_NEW_ENTITY_QUERY = '''
MATCH (p:Placeholder)
WITH rand() AS r, p ORDER BY r LIMIT 1
WITH p
RETURN p.url AS url, CASE WHEN head(labels(p))="Placeholder" THEN labels(p)[1] ELSE head(labels(p)) END AS type
'''

Then we need a function to map the type of the Placeholder entity returned to the Cypher query that inserts that type of entity:

In [20]:

# get Cypher query for label
def getQueryForLabel(label):
    if (label == 'Vehicle'):
        return CREATE_VEHICLE_QUERY
    elif (label == 'Species'):
        return CREATE_SPECIES_QUERY
    elif (label == 'Person'):
        return CREATE_PERSON_QUERY
    elif (label == 'Starship'):
        return CREATE_STARSHIP_QUERY
    elif (label == 'Planet'):
        return CREATE_PLANET_QUERY
    else:
        raise ValueError("Unknown label for entity: " + str(label))

Now we just need to define a loop to fetch a single Placeholder entity (any node with the label Placeholder), make a request to SWAPI for the JSON data for this resource and execute the Cypher query to insert that type of resource. Once that entity is populated in the graph we remove the Placeholder label from the node. Then we just loop until our graph no longer has any Placeholder nodes.

In [21]:

# Fetch a single Placeholder entity from the graph
# Get JSON for Placeholder entity from SWAPI
# Update entity in graph (removing Placeholder label)
# Loop until graph contains no more Placeholder nodes
result = graph.run(FIND_NEW_ENTITY_QUERY)
while result.forward():
    label = result.current["type"]
    url = result.current["url"]
    r = requests.get(url)
    params = r.json()
    graph.run(getQueryForLabel(label), params)
    result = graph.run(FIND_NEW_ENTITY_QUERY)   

And now we've built the Wookiepedia Graph by crawling SWAPI! We can now query our graph to make use of the data to learn more about the Star Wars universe:

Who pilots the same vehicles as Luke Skywalker?¶

What planets are most similar to Naboo?¶

In [22]:

planet_sim_query = '''
MATCH (p:Planet {name: 'Naboo'})-[:HAS_CLIMATE]->(c:Climate)<-[:HAS_CLIMATE]-(o:Planet)
MATCH (p)-[:HAS_TERRAIN]->(t:Terrain)<-[:HAS_TERRAIN]-(o)
WITH DISTINCT o, collect(DISTINCT c.type) AS climates, collect(DISTINCT t.type) AS terrains
RETURN o.name AS planet, climates, terrains, size(climates) + size(terrains) AS sim ORDER BY sim DESC LIMIT 5
'''
result = graph.run(planet_sim_query)
result.to_table()

Out[22]:

planet	climates	terrains	sim
Muunilinst	['temperate']	[' forests', ' mountains']	3
Corellia	['temperate']	[' forests']	2
Chandrila	['temperate']	[' forests']	2
Nal Hutta	['temperate']	[' swamps']	2
Cato Neimoidia	['temperate']	[' forests']	2

TODO¶

Error handling / retry - currently we just assume all requests will complete successfully. In reality we need to have at least some basic retry functionality for requests that do not complete as expected.

Important Copyright information

Star Wars and all associated names are copyright Lucasfilm ltd. All data comes from SWAPI.co

In [ ]: