#!/usr/bin/env python
# coding: utf-8

# # Building The Star Wars Graph
#
# Thousands of Star Wars fans have contributed to Star Wars wikis like the [Wookieepedia Wiki](http://starwars.wikia.com/wiki/Main_Page) - the open Star Wars encyclopedia that anyone can edit. This wiki collects information about all Star Wars films, including characters, planets, starships and much more. A subset of the information from Wookieepedia is available via REST API at [SWAPI.co](http://swapi.co). SWAPI provides an API for fetching information about which characters appear in which films, the characters' home planets, and what starships they’ve piloted. This information is inherently a graph! So thanks to Wookieepedia and SWAPI.co we’re able to build the Star Wars Graph!
#
# In order to build this graph we'll need to fetch data from [SWAPI.co](http://swapi.co), an open REST API for Star Wars data. We'll use the Python `requests` package to make requests to the API (which will return JSON). We'll then use [py2neo](http://py2neo.org/2.0/) to execute Cypher queries against a Neo4j database, using the JSON returned from the API as parameters for our queries. Because this is a REST API, many of the resources are returned only as URLs, which we'll need to fetch later to fully populate the data in our graph. We'll use Neo4j as a type of queuing mechanism, creating relationships and nodes using the URL as a placeholder for a resource that needs to be fully hydrated later.
#
# This notebook assumes some basic knowledge of Python and some working knowledge of [Neo4j](http://neo4j.com). For a more general overview of Neo4j and graph databases, check out some of the other posts on [my blog](http://lyonwj.com).

# In[1]:

# import dependency packages
from py2neo import Graph  # install with `pip install py2neo`
import requests  # `pip install requests`

# In[2]:

# Exploring the API
# what endpoints are available?
r = requests.get("http://swapi.co/api/")
r.json()

# ## The datamodel
#
# Based on the entities and the data we have available, this is the property graph data model that we'll be using:
#
# ![](https://dl.dropboxusercontent.com/u/67572426/Screenshot%202015-12-11%2011.59.31.png)

# ### Connect to Neo4j and add constraints
#
# We'll be using the [py2neo](https://py2neo.org/v4/) Python driver for Neo4j. We'll first connect to a running Neo4j instance and add constraints based on the data model we've defined above.

# In[3]:

# Connect to Neo4j instance
graph = Graph("bolt://localhost:7687", auth=("neo4j", ""))

# create uniqueness constraints based on the datamodel
graph.run("CREATE CONSTRAINT ON (f:Film) ASSERT f.url IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (p:Person) ASSERT p.url IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (v:Vehicle) ASSERT v.url IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (s:Starship) ASSERT s.url IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (p:Planet) ASSERT p.url IS UNIQUE")

# ### Inserting a single resource
#
# Let's see how we can insert a single resource result from the SWAPI, a single Person object. First we'll look at the data returned for a Person entity, then we'll write a Cypher query that can take the JSON document returned by the API as a parameter object to insert it into our graph.

# In[4]:

# Fetch a single person entity from the API
r = requests.get("http://swapi.co/api/people/1/")
params = r.json()
params

# In[5]:

# Define a parameterized Cypher query to insert a Person entity into the graph
# For resources referenced in the Person entity (like homeworld and starships) we create a relationship and node
# containing only the url.
# This new node acts as a placeholder that we'll need to fill in later
CREATE_PERSON_QUERY = '''
MERGE (p:Person {url: {url}})
SET p.birth_year = {birth_year},
    p.created = {created},
    p.edited = {edited},
    p.eye_color = {eye_color},
    p.gender = {gender},
    p.hair_color = {hair_color},
    p.height = {height},
    p.mass = {mass},
    p.name = {name},
    p.skin_color = {skin_color}
REMOVE p:Placeholder
WITH p
MERGE (home:Planet {url: {homeworld}})
ON CREATE SET home:Placeholder
CREATE UNIQUE (home)<-[:IS_FROM]-(p)
WITH p
UNWIND {species} AS specie
MERGE (s:Species {url: specie})
ON CREATE SET s:Placeholder
CREATE UNIQUE (p)-[:IS_SPECIES]->(s)
WITH DISTINCT p
UNWIND {starships} AS starship
MERGE (s:Starship {url: starship})
ON CREATE SET s:Placeholder
CREATE UNIQUE (p)-[:PILOTS]->(s)
WITH DISTINCT p
UNWIND {vehicles} AS vehicle
MERGE (v:Vehicle {url: vehicle})
ON CREATE SET v:Placeholder
CREATE UNIQUE (p)-[:PILOTS]->(v)
'''

# In[6]:

# we can execute this query using py2neo
graph.run(CREATE_PERSON_QUERY, params)

# Executing this query creates a `Person` node for Luke Skywalker and sets properties on this node from the values returned by the API (`birth_year`, `name`, `height`, etc). For other entities (like home planet, species, and the starships he's piloted) the API returns only the url for these entities, so we must make an additional API request to hydrate them later. Using the url for these resources as a unique id, we create placeholder nodes and relationships to these nodes.
#
# ![](https://dl.dropboxusercontent.com/u/67572426/Screenshot%202015-12-11%2016.18.21.png)
#
# We'll later query the graph for these incomplete nodes so that we can hydrate these entities with a request to the SWAPI. Note that when we query the API to hydrate these entities we may end up adding more nodes and relationships (in addition to just adding properties). This allows us to build our graph by "crawling" the API, using Neo4j as a queue to store the resources to be crawled asynchronously.
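# The placeholder-as-queue idea can be sketched without Neo4j or the network. The following is a minimal in-memory illustration, not the notebook's implementation: `crawl`, `fetch`, and the `links` key are hypothetical names, a plain set stands in for the `Placeholder` label, and a dict stands in for the hydrated nodes.

```python
def crawl(seed_urls, fetch):
    """Hydrate every resource reachable from seed_urls.

    `fetch` is any callable mapping a url to a dict; the dict's
    hypothetical "links" key lists urls of related resources.
    """
    hydrated = {}                    # url -> full document (hydrated nodes)
    placeholders = set(seed_urls)    # urls we know about but have not fetched

    while placeholders:
        url = placeholders.pop()     # pick any pending placeholder
        document = fetch(url)
        hydrated[url] = document     # the node is no longer a placeholder

        # Hydrating one resource may reveal new, unseen resources.
        for linked_url in document.get("links", []):
            if linked_url not in hydrated:
                placeholders.add(linked_url)

    return hydrated
```

# The loop ends once no placeholders remain, which mirrors the stopping condition used when crawling with Neo4j: every url discovered along the way is fetched exactly once.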
# ### Define Cypher queries
# We will now look at the JSON format for each type of entity and define the Cypher query to handle updating the graph for that type of entity.
#
# #### Film
#
# The `films` endpoint returns information about a single film as well as arrays of `characters`, `planets`, `species`, `starships`, and `vehicles` that appear in the film. Note that these arrays contain only a url for the entity. We will use the Cypher `UNWIND` clause to iterate through the elements of these arrays, inserting a Placeholder node which we will fill in later with an additional call to SWAPI.

# In[7]:

# Fetch a single Film entity from the API
r = requests.get("http://swapi.co/api/films/1/")
params = r.json()
params

# In[8]:

# Insert a given Film into the graph, including Placeholder nodes for any new entities discovered
# trim() removes the whitespace that follows each comma in the director and producer lists
CREATE_MOVIE_QUERY = '''
MERGE (f:Film {url: {url}})
SET f.created = {created},
    f.edited = {edited},
    f.episode_id = toInt({episode_id}),
    f.opening_crawl = {opening_crawl},
    f.release_date = {release_date},
    f.title = {title}
WITH f
UNWIND split({director}, ",") AS director
MERGE (d:Director {name: trim(director)})
CREATE UNIQUE (f)-[:DIRECTED_BY]->(d)
WITH DISTINCT f
UNWIND split({producer}, ",") AS producer
MERGE (p:Producer {name: trim(producer)})
CREATE UNIQUE (f)-[:PRODUCED_BY]->(p)
WITH DISTINCT f
UNWIND {characters} AS character
MERGE (c:Person {url: character})
ON CREATE SET c:Placeholder
CREATE UNIQUE (c)-[:APPEARS_IN]->(f)
WITH DISTINCT f
UNWIND {planets} AS planet
MERGE (p:Planet {url: planet})
ON CREATE SET p:Placeholder
CREATE UNIQUE (f)-[:TAKES_PLACE_ON]->(p)
WITH DISTINCT f
UNWIND {species} AS specie
MERGE (s:Species {url: specie})
ON CREATE SET s:Placeholder
CREATE UNIQUE (s)-[:APPEARS_IN]->(f)
WITH DISTINCT f
UNWIND {starships} AS starship
MERGE (s:Starship {url: starship})
ON CREATE SET s:Placeholder
CREATE UNIQUE (s)-[:APPEARS_IN]->(f)
WITH DISTINCT f
UNWIND {vehicles} AS vehicle
MERGE (v:Vehicle {url: vehicle})
ON CREATE SET v:Placeholder
CREATE UNIQUE (v)-[:APPEARS_IN]->(f)
'''

# #### Planet
#
# Details about planets include the climate, residents, and terrain. Note that we are extracting climate and terrain into nodes. This will allow us to define queries that traverse the graph to answer questions like "Which planets are similar to each other?"

# In[9]:

# Fetch a single Planet entity from the API
r = requests.get("http://swapi.co/api/planets/1/")
params = r.json()
params

# In[10]:

# Update Planet entity in the graph
# trim() removes the whitespace that follows each comma in the climate and terrain lists
CREATE_PLANET_QUERY = '''
MERGE (p:Planet {url: {url}})
SET p.created = {created},
    p.diameter = {diameter},
    p.edited = {edited},
    p.gravity = {gravity},
    p.name = {name},
    p.orbital_period = {orbital_period},
    p.population = {population},
    p.rotation_period = {rotation_period},
    p.surface_water = {surface_water}
REMOVE p:Placeholder
WITH p
UNWIND split({climate}, ",") AS c
MERGE (cli:Climate {type: trim(c)})
CREATE UNIQUE (p)-[:HAS_CLIMATE]->(cli)
WITH DISTINCT p
UNWIND split({terrain}, ",") AS t
MERGE (ter:Terrain {type: trim(t)})
CREATE UNIQUE (p)-[:HAS_TERRAIN]->(ter)
'''

# #### Species

# In[11]:

# Fetch a single Species entity from the API
r = requests.get("http://swapi.co/api/species/2/")
params = r.json()
params

# In[12]:

# Update Species entity in the graph
CREATE_SPECIES_QUERY = '''
MERGE (s:Species {url: {url}})
SET s.name = {name},
    s.language = {language},
    s.average_height = {average_height},
    s.average_lifespan = {average_lifespan},
    s.classification = {classification},
    s.created = {created},
    s.designation = {designation},
    s.eye_colors = {eye_colors},
    s.hair_colors = {hair_colors},
    s.skin_colors = {skin_colors}
REMOVE s:Placeholder
'''

# #### Starships
#
# A starship is defined as a vehicle that has hyperdrive capability.
# In[13]:

# Fetch a single Starship entity from the API
r = requests.get("http://swapi.co/api/starships/2/")
params = r.json()
params

# In[14]:

# Update Starship entity in the graph
CREATE_STARSHIP_QUERY = '''
MERGE (s:Starship {url: {url}})
SET s.MGLT = {MGLT},
    s.consumables = {consumables},
    s.cost_in_credits = {cost_in_credits},
    s.created = {created},
    s.crew = {crew},
    s.edited = {edited},
    s.hyperdrive_rating = {hyperdrive_rating},
    s.length = {length},
    s.max_atmosphering_speed = {max_atmosphering_speed},
    s.model = {model},
    s.name = {name},
    s.passengers = {passengers}
REMOVE s:Placeholder
MERGE (m:Manufacturer {name: {manufacturer}})
CREATE UNIQUE (s)-[:MANUFACTURED_BY]->(m)
WITH s
MERGE (c:StarshipClass {type: {starship_class}})
CREATE UNIQUE (s)-[:IS_CLASS]->(c)
'''

# #### Vehicles
#
# Any vehicles that lack a hyperdrive are simply called vehicles...

# In[15]:

# Fetch a single Vehicle entity from the API
r = requests.get("http://swapi.co/api/vehicles/4/")
params = r.json()
params

# In[16]:

# Update Vehicle entity in the graph
CREATE_VEHICLE_QUERY = '''
MERGE (v:Vehicle {url: {url}})
SET v.cargo_capacity = {cargo_capacity},
    v.consumables = {consumables},
    v.cost_in_credits = {cost_in_credits},
    v.created = {created},
    v.crew = {crew},
    v.edited = {edited},
    v.length = {length},
    v.max_atmosphering_speed = {max_atmosphering_speed},
    v.model = {model},
    v.name = {name},
    v.passengers = {passengers}
REMOVE v:Placeholder
MERGE (m:Manufacturer {name: {manufacturer}})
CREATE UNIQUE (v)-[:MANUFACTURED_BY]->(m)
WITH v
MERGE (c:VehicleClass {type: {vehicle_class}})
CREATE UNIQUE (v)-[:IS_CLASS]->(c)
'''

# ### Crawl the graph
#
# Now that we've defined the Cypher queries to handle inserting data from SWAPI, we can start crawling the API by making HTTP requests to SWAPI and building our graph! But first we need a starting point.
#
# #### Start with films
#
# Since we know the films we want to insert into the graph (the seven films with SWAPI ids 1-7) we will start there.
# This loop starts with film id 1 (SWAPI ids do not follow episode order), fetches the film data from SWAPI, executes the `CREATE_MOVIE_QUERY` Cypher query using that data as a parameter, then repeats for the remaining films.

# In[17]:

# Fetch Movie entities and insert into graph
for i in range(1, 8):
    url = "http://swapi.co/api/films/" + str(i) + "/"
    r = requests.get(url)
    params = r.json()
    graph.run(CREATE_MOVIE_QUERY, params)
    print("Inserted film: " + str(url))

# We've now created Film nodes for each of the seven films, as well as created placeholder nodes for new entities that we've discovered while inserting the films.

# In[18]:

# How many Placeholder nodes are in the graph now?
placeholder_count_query = '''
MATCH (p:Placeholder)
WITH p
WITH collect(DISTINCT head(labels(p))) AS labels
UNWIND labels AS label
MATCH (p:Placeholder) WHERE head(labels(p))=label
RETURN label, count(*) AS num
'''
result = graph.run(placeholder_count_query)
result.to_table()

# #### Fill in Placeholder nodes and crawl the graph
#
# We can now continue to populate our graph by crawling the API.
#
# First, we'll define a query to find a single Placeholder node in the graph and return its url and its type (Vehicle, Starship, Person, etc). We will then use this url to make a request to SWAPI to populate the entity.
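# As a cross-check on the type read from the node's labels, the entity type can also be inferred from the resource url itself. This is only a sketch under assumptions: `RESOURCE_LABELS` and `label_for_url` are hypothetical names that are not part of the notebook, and the mapping assumes urls shaped like the ones used above (`.../api/people/1/`).

```python
from urllib.parse import urlparse

# Hypothetical mapping from SWAPI url path segments to the node labels used in this graph.
RESOURCE_LABELS = {
    "films": "Film",
    "people": "Person",
    "planets": "Planet",
    "species": "Species",
    "starships": "Starship",
    "vehicles": "Vehicle",
}


def label_for_url(url):
    """Return the node label implied by a url such as http://swapi.co/api/people/1/."""
    # Drop empty segments caused by leading and trailing slashes.
    parts = [part for part in urlparse(url).path.split("/") if part]
    # The resource name is the segment just before the numeric id.
    if len(parts) < 2 or parts[-2] not in RESOURCE_LABELS:
        raise ValueError("Cannot infer label from url: " + str(url))
    return RESOURCE_LABELS[parts[-2]]
```

# Comparing `label_for_url(url)` with the type returned from the graph is one way to catch a placeholder that was merged under the wrong label.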
# In[19]:

# Find a single Placeholder entity and return its url and type
FIND_NEW_ENTITY_QUERY = '''
MATCH (p:Placeholder)
WITH rand() AS r, p ORDER BY r LIMIT 1
WITH p
RETURN p.url AS url,
       CASE WHEN head(labels(p))="Placeholder" THEN labels(p)[1] ELSE head(labels(p)) END AS type
'''

# Then we need a function to map the type of the Placeholder entity returned to the Cypher query that inserts that type of entity:

# In[20]:

# get Cypher query for label
def getQueryForLabel(label):
    if label == 'Vehicle':
        return CREATE_VEHICLE_QUERY
    elif label == 'Species':
        return CREATE_SPECIES_QUERY
    elif label == 'Person':
        return CREATE_PERSON_QUERY
    elif label == 'Starship':
        return CREATE_STARSHIP_QUERY
    elif label == 'Planet':
        return CREATE_PLANET_QUERY
    else:
        raise ValueError("Unknown label for entity: " + str(label))

# Now we just need to define a loop to fetch a single Placeholder entity (any node with the label `Placeholder`), make a request to SWAPI for the JSON data for this resource and execute the Cypher query to insert that type of resource. Each insert query removes the `Placeholder` label from the node once it is populated, so we simply loop until our graph no longer has any Placeholder nodes.

# In[21]:

# Fetch a single Placeholder entity from the graph
# Get JSON for Placeholder entity from SWAPI
# Update entity in graph (removing Placeholder label)
# Loop until graph contains no more Placeholder nodes
result = graph.run(FIND_NEW_ENTITY_QUERY)
while result.forward():
    label = result.current["type"]
    url = result.current["url"]
    r = requests.get(url)
    params = r.json()
    graph.run(getQueryForLabel(label), params)
    result = graph.run(FIND_NEW_ENTITY_QUERY)

# And now we've built the Wookieepedia Graph by crawling SWAPI! We can now query our graph to make use of the data to learn more about the Star Wars universe:

# #### Who pilots the same vehicles as Luke Skywalker?
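# One way to phrase this question in Cypher is sketched below. It is an illustration rather than the notebook's own query: it assumes the `PILOTS` relationships created by `CREATE_PERSON_QUERY`, matches both vehicles and starships, takes an already-connected py2neo `graph` as an argument, and uses the hypothetical names `COPILOT_QUERY` and `find_copilots`.

```python
# Hypothetical query and helper, not part of the original notebook.
# Uses the same {param} parameter syntax as the other queries here.
COPILOT_QUERY = '''
MATCH (p:Person {name: {name}})-[:PILOTS]->(craft)<-[:PILOTS]-(other:Person)
RETURN other.name AS pilot, collect(DISTINCT craft.name) AS shared_craft
ORDER BY pilot
'''


def find_copilots(graph, name="Luke Skywalker"):
    """Return (pilot, shared_craft) pairs for people who pilot the same craft as `name`."""
    cursor = graph.run(COPILOT_QUERY, {"name": name})
    # Each record supports lookup by the column names returned from the query.
    return [(record["pilot"], record["shared_craft"]) for record in cursor]
```

# Calling `find_copilots(graph)` with the graph built above would list each other pilot along with the craft they share with Luke.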
# ![](https://dl.dropboxusercontent.com/u/67572426/Screenshot%202015-12-14%2011.51.22.png)

# ![](https://dl.dropboxusercontent.com/u/67572426/Screenshot%202015-12-14%2011.48.40.png)

# #### What planets are most similar to Naboo?

# In[22]:

planet_sim_query = '''
MATCH (p:Planet {name: 'Naboo'})-[:HAS_CLIMATE]->(c:Climate)<-[:HAS_CLIMATE]-(o:Planet)
MATCH (p)-[:HAS_TERRAIN]->(t:Terrain)<-[:HAS_TERRAIN]-(o)
WITH DISTINCT o, collect(DISTINCT c.type) AS climates, collect(DISTINCT t.type) AS terrains
RETURN o.name AS planet, climates, terrains, size(climates) + size(terrains) AS sim
ORDER BY sim DESC LIMIT 5
'''
result = graph.run(planet_sim_query)
result.to_table()

# ### TODO
#
# * Error handling / retry - currently we just assume all requests will complete successfully. In reality we need to have at least some basic retry functionality for requests that do not complete as expected.

# Important Copyright information
#
# Star Wars and all associated names are copyright Lucasfilm Ltd.
# All data comes from SWAPI.co

# In[ ]:
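# The retry behaviour described in the TODO above could be sketched roughly as follows. This is a minimal illustration under assumptions, not a tested part of the notebook: `fetch_json` and its parameters are hypothetical names, the retry count and backoff values are arbitrary, and the `get` argument exists only so the function can be exercised without network access.

```python
import time


def fetch_json(url, get=None, retries=3, backoff=0.5, sleep=time.sleep):
    """Fetch `url` and return its decoded JSON, retrying on failure.

    Hypothetical helper: `get` defaults to `requests.get`.
    """
    if get is None:
        import requests  # imported lazily so a stand-in `get` can be injected
        get = requests.get

    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, timeout=10)
            response.raise_for_status()   # turn HTTP 4xx/5xx into an exception
            return response.json()
        except (OSError, ValueError) as err:
            # requests' exceptions derive from OSError; invalid JSON raises ValueError.
            last_error = err
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)   # exponential backoff between attempts
    raise RuntimeError("Giving up on " + str(url)) from last_error
```

# In the crawl loop, `params = fetch_json(url)` could then stand in for the `requests.get(url)` and `r.json()` pair, leaving the placeholder in the graph if the request ultimately fails.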