#!/usr/bin/env python
# coding: utf-8

# # Neo4j Twitter Trolls Tutorial
#
# **Goal**: This notebook shows how to use PyGraphistry to visualize data from [Neo4j](https://neo4j.com/developer/). It also shows how to run [graph algorithms in Neo4j](https://neo4j.com/developer/graph-algorithms/) and visualize their results with PyGraphistry.
#
# *Prerequisites:*
# * A Graphistry API key, which you can request [here](https://www.graphistry.com/api-request)
# * Neo4j. We'll be using [Neo4j Sandbox](https://neo4j.com/sandbox-v2/) (free hosted Neo4j instances pre-populated with data) for this tutorial, specifically the "Russian Twitter Trolls" sandbox. You can create a Neo4j Sandbox instance [here](https://neo4j.com/sandbox-v2/)
# * Python requirements:
#     * [`neo4j-driver`](https://github.com/neo4j/neo4j-python-driver) - `pip install neo4j-driver`
#     * [`pygraphistry`](https://github.com/graphistry/pygraphistry/) - `pip install "graphistry[all]"`
#
# ## Outline
#
# * Connecting to Neo4j
#     * using neo4j-driver Python client
#     * query with Cypher
# * Visualizing data in Graphistry from Neo4j
#     * User-User mentions from Twitter data
# * Graph algorithms
#     * Enhancing our visualization with PageRank

# In[1]:


# import required dependencies
from neo4j.v1 import GraphDatabase, basic_auth
from pandas import DataFrame
import graphistry


# In[2]:


# register Graphistry API key
# request an API key if you don't have one: https://www.graphistry.com/api-request
graphistry.register(key='YOUR API KEY HERE')


# ## Connect To Neo4j
#
# If you haven't already, create an instance of the Russian Twitter Trolls sandbox on [Neo4j Sandbox.](https://neo4j.com/sandbox-v2/) We'll use the [Python driver for Neo4j](https://github.com/neo4j/neo4j-python-driver) to fetch data from Neo4j. To do this we need to instantiate a `Driver` object, passing in the credentials for our Neo4j instance.
# If you are using Neo4j Sandbox, you can find the credentials for your instance in the "Details" tab. Specifically, we need the IP address, bolt port, username, and password. Bolt is the binary protocol used by the Neo4j drivers, so a typical database URL string takes the form `bolt://<ip_address>:<bolt_port>`
#
# ![](./img/sandbox.png)

# In[3]:


# instantiate Neo4j driver instance
# be sure to replace the connection string and password with your own
driver = GraphDatabase.driver("bolt://34.201.165.36:34532", auth=basic_auth("neo4j", "capitals-quality-loads"))


# Once we've instantiated our Driver, we can use `Session` objects to execute queries against Neo4j. Here we'll use `session.run()` to execute a [Cypher query](https://neo4j.com/developer/cypher-query-language/). Cypher is the query language for graphs that we use with Neo4j (you can think of Cypher as SQL for graphs).

# In[4]:


# neo4j-driver hello world
# execute a simple query to count the number of nodes in the database and print the result
with driver.session() as session:
    results = session.run("MATCH (a) RETURN COUNT(a) AS num")
    # iterate while the session is still open
    for record in results:
        print(record)


# If we inspect the data model in Neo4j, we can see that we have information about Tweets and, specifically, Users mentioned in tweets.
#
# ![](./img/datamodel.png)
#
# Let's use Graphistry to visualize User-User Tweet mention interactions. We'll do this by querying Neo4j for all tweets that mention users.

# ## Using Graphistry With Neo4j
#
# Currently, PyGraphistry can work with data as a pandas DataFrame, NetworkX graph or IGraph graph object. In this section we'll show how to load data from Neo4j into PyGraphistry by converting results from the Python Neo4j driver into a pandas DataFrame.
#
# Our goal is to visualize User-User Tweet mention interactions. We'll create two pandas DataFrames, one representing our nodes (Users) and a second representing the relationships in our graph (mentions).
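#
# As a rough sketch of the two tables described above, the snippet below uses plain lists of dictionaries, the shape that `results.data()` returns (one dictionary per record, keyed by the names in the `RETURN` clause). The rows and the names `sample_users`, `sample_mentions`, `node_keys`, and `dangling` are made up for illustration and do not come from the sandbox. It checks that every relationship endpoint matches a node key.

```python
# Hypothetical records in the shape results.data() returns:
# one dict per record, keyed by the names in the RETURN clause.
sample_users = [
    {"screen_name": "user_a", "troll": 5},
    {"screen_name": "user_b", "troll": 0},
    {"screen_name": "user_c", "troll": 0},
]
sample_mentions = [
    {"u1": "user_a", "u2": "user_b", "num": 3},
    {"u1": "user_c", "u2": "user_a", "num": 1},
]

# Collect the node keys, then look for edges whose endpoints are
# missing from the node table (none are expected in this toy data).
node_keys = {row["screen_name"] for row in sample_users}
dangling = [
    row for row in sample_mentions
    if row["u1"] not in node_keys or row["u2"] not in node_keys
]
print(len(sample_users), "nodes,", len(sample_mentions), "edges,", len(dangling), "dangling")
```

# pandas accepts a list of dictionaries like these directly, for example `DataFrame(sample_users)`, with the dictionary keys becoming column names.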
#
# Some users are known Troll accounts, so we include a flag variable, `troll`, to indicate when the user is a Troll. This will be used in our visualization to set the color of the known Troll accounts.

# In[7]:


# Create User DataFrame by querying Neo4j, converting the results into a pandas DataFrame
with driver.session() as session:
    results = session.run("""
        MATCH (u:User)
        WITH u.user_key AS screen_name, CASE WHEN "Troll" IN labels(u) THEN 5 ELSE 0 END AS troll
        RETURN screen_name, troll""")
    users = DataFrame(results.data())
# show the first 5 rows of the DataFrame
users[:5]


# Next, we need some relationships to visualize. In this case we are interested in user interactions, specifically where users have mentioned other users in Tweets.

# In[8]:


# Query for tweets mentioning a user and create a DataFrame adjacency list using screen_name
# where u1 posted one or more tweets that mention u2
# num is the number of times u1 mentioned u2 in the dataset
with driver.session() as session:
    results = session.run("""
        MATCH (u1:User)-[:POSTED]->(:Tweet)-[:MENTIONS]->(u2:User)
        RETURN u1.user_key AS u1, u2.user_key AS u2, COUNT(*) AS num
    """)
    mentions = DataFrame(results.data())
mentions[:5]


# Now we can visualize this mentions network using Graphistry. We'll specify the nodes and relationships for our graph. We'll also use the `troll` property to color the known Troll nodes red, setting them apart from other users in the graph.

# In[9]:


viz = graphistry.bind(source="u1", destination="u2", node="screen_name", point_color="troll").nodes(users).edges(mentions)
viz.plot()


# After running the above Python cell you should see an interactive Graphistry visualization like this:
#
# ![](./img/graphistry1.png)
#
# Known Troll user nodes are colored red and regular users are colored blue. By default, the size of each node is proportional to its degree (number of relationships).
# In the next section we'll see how to run graph algorithms such as PageRank and visualize their results in Graphistry.

# ## Graph Algorithms
#
# The above visualization shows us User-User Tweet mention interactions from the data. What if we wanted to answer the question "Who is the most important user in this network?" One way to answer that would be to look at the degree, or number of relationships, of each node. By default, PyGraphistry uses degree to set the size of each node, allowing us to judge the importance of nodes at a glance.
#
# We can also use [graph algorithms](https://github.com/neo4j-contrib/neo4j-graph-algorithms) such as PageRank to determine importance in the network. In this section we show how to [run graph algorithms in Neo4j](https://neo4j.com/developer/graph-algorithms/) and use the results of these algorithms in our Graphistry visualization.

# In[10]:


# run PageRank on the projected mentions graph and update nodes by adding a pagerank property score
with driver.session() as session:
    session.run("""
        CALL algo.pageRank("MATCH (t:User) RETURN id(t) AS id",
         "MATCH (u1:User)-[:POSTED]->(:Tweet)-[:MENTIONS]->(u2:User) RETURN id(u1) as source, id(u2) as target",
         {graph:'cypher', write: true})
    """)


# Now that we've calculated PageRank for each User node, we need to create a new pandas DataFrame for our user nodes by querying Neo4j:

# In[11]:


# create a new users DataFrame, now including PageRank score for each user
with driver.session() as session:
    results = session.run("""
        MATCH (u:User)
        WITH u.user_key AS screen_name, u.pagerank AS pagerank, CASE WHEN "Troll" IN labels(u) THEN 5 ELSE 0 END AS troll
        RETURN screen_name, pagerank, troll""")
    users = DataFrame(results.data())
users[:5]


# In[12]:


# render the Graphistry visualization, binding node size to PageRank score
viz = graphistry.bind(source="u1", destination="u2", node="screen_name", point_size="pagerank", point_color="troll").nodes(users).edges(mentions)
viz.plot()


# Now when we render the Graphistry visualization, node size is proportional to the node's PageRank score. This identifies a different set of nodes as most important.
#
# ![](./img/graphistry2.png)
#
# By binding node size to the results of graph algorithms, we can draw insight from the data at a glance and further explore the interactive visualization.
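
# To illustrate why degree and PageRank can single out different nodes, here is a minimal sketch on a made-up mention graph. It uses a plain power-iteration PageRank written for this example, not the implementation behind `algo.pageRank`, and it assumes every node has at least one outgoing edge. The edge list, the node names, and the `pagerank` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical mention edges (u1 mentions u2); not data from the sandbox
edges = [
    ("a", "hub"), ("b", "hub"), ("e", "hub"),
    ("c", "relay"), ("d", "relay"),
    ("hub", "target"), ("relay", "target"),
    ("target", "a"),
]
nodes = sorted({n for edge in edges for n in edge})

# Degree: number of relationships touching each node (incoming plus outgoing)
degree = Counter(n for edge in edges for n in edge)


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Plain power iteration; assumes every node has an outgoing edge."""
    out_degree = Counter(src for src, _ in edges)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same baseline share
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        # Each node passes its damped rank evenly along its outgoing edges
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


scores = pagerank(nodes, edges)
top_by_degree = max(nodes, key=lambda n: degree[n])
top_by_pagerank = max(nodes, key=lambda n: scores[n])
print("top by degree:", top_by_degree)
print("top by PageRank:", top_by_pagerank)
```

# In this toy graph `hub` has the most relationships, but `target` is mentioned by two accounts that pass on all of their own rank, so it ends up with the highest PageRank score.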