Neo4j Twitter Trolls Tutorial

Goal: This notebook shows how to use PyGraphistry to visualize data from Neo4j. It also shows how to run graph algorithms in Neo4j and visualize their results with PyGraphistry.

Prerequisites:

  • You'll need a Graphistry API key, which you can request here
  • Neo4j. We'll use Neo4j Sandbox (free hosted Neo4j instances pre-populated with data) for this tutorial, specifically the "Russian Twitter Trolls" sandbox. You can create a Neo4j Sandbox instance here
  • Python requirements: neo4j-driver, pandas, and graphistry (the packages imported in the first cell)

Outline

  • Connecting to Neo4j
    • using neo4j-driver Python client
    • query with Cypher
  • Visualizing data in Graphistry from Neo4j
    • User-User mentions from Twitter data
  • Graph algorithms
    • Enhancing our visualization with PageRank
In [1]:
# import required dependencies
from neo4j.v1 import GraphDatabase, basic_auth
from pandas import DataFrame
import graphistry
In [2]:
# register Graphistry API key
# request an API key if you don't have one: https://www.graphistry.com/api-request
graphistry.register(key='YOUR API KEY HERE')

Connect To Neo4j

If you haven't already, create an instance of the Russian Twitter Trolls sandbox on Neo4j Sandbox. We'll use the Python driver for Neo4j to fetch data, which means instantiating a Driver object with the credentials for our Neo4j instance. If you are using Neo4j Sandbox, you can find these credentials in the "Details" tab; we need the IP address, Bolt port, username, and password. Bolt is the binary protocol used by the Neo4j drivers, so a typical database URL string takes the form bolt://<IP_ADDRESS>:<BOLT_PORT>.
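As a small illustration of that URL format, the sketch below assembles a Bolt connection string from its two parts. The helper name bolt_uri and the address used are placeholders for illustration, not part of the Neo4j driver; substitute the values from your own sandbox.

```python
def bolt_uri(host, port):
    """Build a Bolt connection string from a host/IP address and a port."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"bolt://{host}:{port}"


# Placeholder address and port; use the ones shown in the "Details" tab.
uri = bolt_uri("192.0.2.10", 7687)
print(uri)
```

The resulting string is what gets passed as the first argument to GraphDatabase.driver() in the next cell.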

In [3]:
# instantiate Neo4j driver instance
# be sure to replace the connection string and password with your own
driver = GraphDatabase.driver("bolt://34.201.165.36:34532", auth=basic_auth("neo4j", "capitals-quality-loads"))

Once we've instantiated our Driver, we can use Session objects to execute queries against Neo4j. Here we'll use session.run() to execute a Cypher query. Cypher is the query language for graphs that we use with Neo4j (you can think of Cypher as SQL for graphs).

In [4]:
# neo4j-driver hello world
# execute a simple query to count the number of nodes in the database and print the result
with driver.session() as session:
    results = session.run("MATCH (a) RETURN COUNT(a) AS num")
    # iterate while the session is still open so the results remain available
    for record in results:
        print(record)
<Record num=281217>

If we inspect the data model in Neo4j, we can see that we have information about Tweets and, specifically, the Users mentioned in those tweets.

Let's use Graphistry to visualize User-User Tweet mention interactions. We'll do this by querying Neo4j for all tweets that mention users.

Using Graphistry With Neo4j

Currently, PyGraphistry can work with data as a pandas DataFrame, a NetworkX graph, or an igraph graph object. In this section we'll show how to load data from Neo4j into PyGraphistry by converting results from the Python Neo4j driver into a pandas DataFrame.

Our goal is to visualize User-User Tweet mention interactions. We'll create two pandas DataFrames, one representing our nodes (Users) and a second representing the relationships in our graph (mentions).
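The two-table layout can be sketched without a database. The rows below are made-up stand-ins shaped like the list of dictionaries that results.data() returns in the cells that follow. The names users_demo and mentions_demo are hypothetical, and the endpoint check at the end is just a sanity check one might run, not something the tutorial requires.

```python
from pandas import DataFrame

# Hypothetical rows shaped like the list of dicts returned by results.data()
user_rows = [
    {"screen_name": "alice", "troll": 5},
    {"screen_name": "bob", "troll": 0},
    {"screen_name": "carol", "troll": 0},
]
mention_rows = [
    {"u1": "alice", "u2": "bob", "num": 2},
    {"u1": "carol", "u2": "alice", "num": 1},
]

users_demo = DataFrame(user_rows)        # one row per node
mentions_demo = DataFrame(mention_rows)  # one row per (source, destination) pair

# Sanity check: collect edge endpoints that have no row in the node table
endpoints = set(mentions_demo["u1"]) | set(mentions_demo["u2"])
missing = endpoints - set(users_demo["screen_name"])
print(missing)
```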

Some users are known Troll accounts, so we include a flag variable, troll, to indicate when the user is a Troll. This will be used in our visualization to set the color of the known Troll accounts.
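For readers less familiar with Cypher, the CASE expression in the query that follows behaves roughly like this Python function. The name troll_flag is hypothetical and used only for illustration; the values 5 and 0 are the ones the query returns.

```python
def troll_flag(labels):
    """Return 5 for nodes labeled "Troll" and 0 otherwise, mirroring the CASE expression."""
    return 5 if "Troll" in labels else 0


print(troll_flag(["User", "Troll"]))
print(troll_flag(["User"]))
```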

In [7]:
# Create User DataFrame by querying Neo4j, converting the results into a pandas DataFrame
with driver.session() as session:
    results = session.run("""
    MATCH (u:User) 
    WITH u.user_key AS screen_name, CASE WHEN "Troll" IN labels(u) THEN 5 ELSE 0 END AS troll
    RETURN screen_name, troll""")
    users = DataFrame(results.data())
# show the first 5 rows of the DataFrame
users[:5]
Out[7]:
     screen_name  troll
0  robbydelaware      5
1    scottgohard      5
2    beckster319      5
3  skatewake1994      5
4  kadirovrussia      5

Next, we need some relationships to visualize. Here we are interested in user interactions, specifically cases where one user has mentioned another in a Tweet.

In [8]:
# Query for tweets mentioning a user and create a DataFrame adjacency list using screen_name
# where u1 posted a tweet(s) that mentions u2
# num is the number of times u1 mentioned u2 in the dataset
with driver.session() as session:
    results = session.run("""
        MATCH (u1:User)-[:POSTED]->(:Tweet)-[:MENTIONS]->(u2:User)
        RETURN u1.user_key AS u1, u2.user_key AS u2, COUNT(*) AS num
    """)
    mentions = DataFrame(results.data())
mentions[:5]
Out[8]:
   num               u1             u2
0    1     dorothiebell      dwstweets
1    1  happkendrahappy   nineworthies
2    2        aiden7757      theclobra
3    1    ameliebaldwin  dcclothesline
4    9    ameliebaldwin    jturnershow
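
The num column comes from COUNT(*), which groups the matched paths by each (u1, u2) pair. A rough pure-Python equivalent, using made-up pairs rather than the sandbox data, might look like this:

```python
from collections import Counter

# Hypothetical (poster, mentioned user) pairs, one per matched mention
raw_mentions = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("carol", "alice"),
]

# Group identical pairs and count them, like RETURN u1, u2, COUNT(*) AS num
pair_counts = Counter(raw_mentions)
edge_rows = [{"u1": u1, "u2": u2, "num": n} for (u1, u2), n in pair_counts.items()]
print(edge_rows)
```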

Now we can visualize this mentions network using Graphistry. We'll specify the nodes and relationships for our graph. We'll also use the troll property to color the known Troll nodes red, setting them apart from other users in the graph.

In [9]:
viz = graphistry.bind(source="u1", destination="u2", node="screen_name", point_color="troll").nodes(users).edges(mentions)
viz.plot()
Out[9]:

After running the cell above, you should see an interactive Graphistry visualization of the mentions network.

Known Troll user nodes are colored red, and regular users are colored blue. By default, the size of each node is proportional to its degree (number of relationships). In the next section we'll see how to run graph algorithms such as PageRank and visualize their results in Graphistry.

Graph Algorithms

The above visualization shows us User-User Tweet mention interactions from the data. What if we wanted to answer the question "Who is the most important user in this network?" One way to answer that would be to look at the degree, or number of relationships, of each node. By default, PyGraphistry uses degree to set the size of each node, allowing us to gauge the importance of nodes at a glance.
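
Degree is easy to compute by hand from an edge list. The sketch below uses a made-up edge list and treats each row as one relationship touching both endpoints; how PyGraphistry itself counts repeated or weighted edges is not covered here.

```python
from collections import Counter

# Hypothetical (source, destination) mention edges
edges = [("alice", "bob"), ("carol", "alice"), ("dave", "alice")]

# Each edge adds one to the degree of both of its endpoints
degree = Counter()
for source, target in edges:
    degree[source] += 1
    degree[target] += 1

most_connected = degree.most_common(1)[0]
print(most_connected)
```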

We can also use graph algorithms such as PageRank to determine importance in the network. In this section we show how to run graph algorithms in Neo4j and use the results of these algorithms in our Graphistry visualization.
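
To give a feel for what PageRank computes, here is a minimal pure-Python sketch on a made-up graph. It is not the Neo4j implementation. It assumes the unnormalized formulation with a damping factor of 0.85, in which a node with no incoming edges scores 0.15; that assumption is at least consistent with the 0.150000 values in the output further down.

```python
def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Unnormalized PageRank: PR(n) = (1 - d) + d * sum(PR(m) / out_degree(m))."""
    out_degree = {n: 0 for n in nodes}
    for source, _ in edges:
        out_degree[source] += 1

    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        # Each node passes its score, split evenly, along its outgoing edges
        incoming = {n: 0.0 for n in nodes}
        for source, target in edges:
            incoming[target] += scores[source] / out_degree[source]
        scores = {n: (1.0 - damping) + damping * incoming[n] for n in nodes}
    return scores


# Hypothetical mention graph: an edge (a, b) means a mentioned b
nodes = ["alice", "bob", "carol", "dave"]
edges = [("bob", "alice"), ("carol", "alice"), ("dave", "carol")]
scores = pagerank(nodes, edges)
print(max(scores, key=scores.get))
```

In this toy graph, alice ranks highest because she is mentioned by two users, one of whom is herself mentioned by someone else.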

In [10]:
# run PageRank on the projected mentions graph and update nodes by adding a pagerank property score
with driver.session() as session:
    session.run("""
        CALL algo.pageRank("MATCH (t:User) RETURN id(t) AS id",
         "MATCH (u1:User)-[:POSTED]->(:Tweet)-[:MENTIONS]->(u2:User) 
         RETURN id(u1) as source, id(u2) as target", {graph:'cypher', write: true})
     """)

Now that we've calculated PageRank for each User node we need to create a new pandas DataFrame for our user nodes by querying Neo4j:

In [11]:
# create a new users DataFrame, now including PageRank score for each user
with driver.session() as session:
    results = session.run("""
    MATCH (u:User) 
    WITH u.user_key AS screen_name, u.pagerank AS pagerank, CASE WHEN "Troll" IN labels(u) THEN 5 ELSE 0 END AS troll
    RETURN screen_name, pagerank, troll""")
    users = DataFrame(results.data())
users[:5]
Out[11]:
   pagerank    screen_name  troll
0  0.150000  robbydelaware      5
1  0.151547    scottgohard      5
2  0.150000    beckster319      5
3  0.150000  skatewake1994      5
4  0.150000  kadirovrussia      5
In [12]:
# render the Graphistry visualization, binding node size to PageRank score
viz = graphistry.bind(source="u1", destination="u2", node="screen_name", point_size="pagerank", point_color="troll").nodes(users).edges(mentions)
viz.plot()
Out[12]:

Now when we render the Graphistry visualization, node size is proportional to the node's PageRank score. This identifies a different set of nodes as the most important.

By binding node size to the results of graph algorithms, we can draw insight from the data at a glance and then explore the interactive visualization further.