In [1]:
import pandas
import networkx
import numpy
from matplotlib import pyplot
from collections import Counter

Today we compared two representations of a graph: as a set of edges and as a matrix. The edge-list representation is good for understanding features of a graph and for visualization. The matrix view is good for computations.

In [2]:
graph = networkx.Graph()
graph.add_nodes_from(range(6))
graph.add_edges_from([(0,2), (1,2), (2,3), (1,3), (3,4), (3,5), (4,5)])

There are four connections from node 3 to other nodes. I'm setting the node colors to a light purple to make it easier to see the numbers and because they look Easter-ey.

In [4]:
networkx.draw(graph, with_labels=True, node_color="#eeaaee")
pyplot.show()

This network can also be represented as an adjacency matrix, with one row and column for each node. The cell for row $a$ and column $b$ indicates how many edges exist between $a$ and $b$.

In [6]:
adjacency = networkx.convert_matrix.to_numpy_matrix(graph)
adjacency
Out[6]:
matrix([[0., 0., 1., 0., 0., 0.],
        [0., 0., 1., 1., 0., 0.],
        [1., 1., 0., 1., 0., 0.],
        [0., 1., 1., 0., 1., 1.],
        [0., 0., 0., 1., 0., 1.],
        [0., 0., 0., 1., 1., 0.]])
In [18]:
adjacency.sum(axis=1)
Out[18]:
matrix([[1.],
        [2.],
        [3.],
        [4.],
        [2.],
        [2.]])

Let's show that there are four connections from node 3 in a different way. Multiplying this matrix by a vector with a single 1 at index 3 (ie the fourth position) gives us a new vector. That vector has four 1s, each one at the index of a node that is adjacent to node 3.

In [14]:
x = numpy.zeros((6,1)) ## Why a 6x1 matrix instead of a vector? I want to make this the same form as the output
x[3,0] = 1
print("before")
print(x)
x = adjacency.dot(x)
print("after")
print(x)
before
[[0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]]
after
[[0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]]

Multiplying the matrix times this vector gives us the number of paths of length 1 from the nodes that are 1 away from node 3. This is the same as the number of paths of length 2 from node 3.

In [15]:
x = numpy.zeros((6,1))
x[3,0] = 1
#print(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
print(x)
[[1.]
 [1.]
 [1.]
 [4.]
 [1.]
 [1.]]

We can keep iterating like this to get the number of longer paths. Since the numbers get large I'm going to change this into a proportion of paths originating at node 3 and ending at each node.

In [16]:
x = numpy.zeros((6,1))
x[3,0] = 1
#print(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x /= numpy.sum(x)
print(x)
[[0.06916427]
 [0.16990874]
 [0.19320365]
 [0.24951969]
 [0.15910183]
 [0.15910183]]

Here's the cool part: as the path lengths get longer, it stops mattering where you started. If I start at node 2, the distribution of path ending spots is mostly the same as if I started at node 3!

In [17]:
x = numpy.zeros((6,1))
x[2,0] = 1 # <--- changing to start at node 2
#print(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x = adjacency.dot(x)
x /= numpy.sum(x)
print(x)
[[0.07569591]
 [0.16946118]
 [0.18362364]
 [0.26192414]
 [0.15464757]
 [0.15464757]]

Now let's look at a real network, and a matrix view of it. The nodes will be users and movies, and the edges will represent the fact that a user rated a movie.

We can't visualize this graph, because it's too large and too well connected. It will just look like a blob. It will be bipartite: there are two groups of nodes (movies, users) and all edges go between these groups. No edges go from user to user or from movie to movie.

In [20]:
ratings = pandas.read_csv("ratings.csv")
ratings.head(15)
Out[20]:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
5 1 70 3.0 964982400
6 1 101 5.0 964980868
7 1 110 4.0 964982176
8 1 151 5.0 964984041
9 1 157 5.0 964984100
10 1 163 5.0 964983650
11 1 216 5.0 964981208
12 1 223 3.0 964980985
13 1 231 5.0 964981179
14 1 235 4.0 964980908
In [21]:
movies = pandas.read_csv("movies.csv")
movies.head()
Out[21]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy

What is the distribution of ratings?

In [22]:
ratings.groupby("rating").rating.count()
Out[22]:
rating
0.5     1370
1.0     2811
1.5     1791
2.0     7551
2.5     5550
3.0    20047
3.5    13136
4.0    26818
4.5     8551
5.0    13211
Name: rating, dtype: int64
In [24]:
pyplot.hist(ratings.rating, bins=numpy.linspace(0,5.5,12), align="left")
pyplot.show()

Now we'll look for similar movies. One definition of "similar" is that a movie is rated by the same people. First, let's find an ID.

In [25]:
movies[ movies.title.str.contains("Avengers") ]
Out[25]:
movieId title genres
1611 2153 Avengers, The (1998) Action|Adventure
6148 44020 Ultimate Avengers (2006) Action|Animation|Children|Sci-Fi
7693 89745 Avengers, The (2012) Action|Adventure|Sci-Fi|IMAX
8551 115727 Crippled Avengers (Can que) (Return of the 5 D... Action|Adventure
8686 122892 Avengers: Age of Ultron (2015) Action|Adventure|Sci-Fi
8693 122912 Avengers: Infinity War - Part I (2018) Action|Adventure|Sci-Fi
9153 147657 Masked Avengers (1981) Action
9488 170297 Ultimate Avengers 2 (2006) Action|Animation|Sci-Fi
In [26]:
movies.query("movieId == 89745")
Out[26]:
movieId title genres
7693 89745 Avengers, The (2012) Action|Adventure|Sci-Fi|IMAX
In [28]:
ratings.query("movieId == 89745").head()
Out[28]:
userId movieId rating timestamp
1535 15 89745 2.0 1510572842
2182 18 89745 4.0 1455050365
3520 21 89745 4.0 1418063440
7891 52 89745 5.0 1468051351
9030 62 89745 4.0 1521488914

We can't immediately create a graph from the userId and movieId fields, because they both contain the id 1. So we'll add a string prefix to differentiate.

In [29]:
ratings["user_string"] = "u" + ratings.userId.astype(str)
In [30]:
ratings_net = networkx.convert_matrix.from_pandas_edgelist(ratings, "user_string", "movieId")

Now let's look at all the movies rated by people who rated The Avengers (2012). In other words, nodes that are reachable from the query node with paths of length 2.

In [31]:
movie_counts = Counter()

for user in ratings_net[89745].keys():
    for movie in ratings_net[user]:
        movie_counts[movie] += 1

movie_counts.most_common(10)
Out[31]:
[(89745, 69),
 (79132, 60),
 (2571, 57),
 (4993, 56),
 (58559, 55),
 (7153, 55),
 (59315, 53),
 (356, 52),
 (5952, 52),
 (260, 50)]
In [32]:
for movie_id, count in movie_counts.most_common(10):
    print(movies.query("movieId == {}".format(movie_id))["title"])
7693    Avengers, The (2012)
Name: title, dtype: object
7372    Inception (2010)
Name: title, dtype: object
1939    Matrix, The (1999)
Name: title, dtype: object
3638    Lord of the Rings: The Fellowship of the Ring,...
Name: title, dtype: object
6710    Dark Knight, The (2008)
Name: title, dtype: object
4800    Lord of the Rings: The Return of the King, The...
Name: title, dtype: object
6743    Iron Man (2008)
Name: title, dtype: object
314    Forrest Gump (1994)
Name: title, dtype: object
4137    Lord of the Rings: The Two Towers, The (2002)
Name: title, dtype: object
224    Star Wars: Episode IV - A New Hope (1977)
Name: title, dtype: object