Lab 5 (Option 1) - PageRank

Authors:

v1.0 (2014 Fall) Rishi Sharma *, Sahaana Suri *, Kangwook Lee **, Kannan Ramchandran **
v1.1 (2015 Fall) Kabir Chandrasekher *, Max Kanwal *, Kangwook Lee **, Kannan Ramchandran **
v1.2 (2016 Spring) Kabir Chandrasekher, Tony Duan, David Marn, Ashvin Nair, Kangwook Lee, Kannan Ramchandran

Introduction

From Wikipedia:

PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

There are four common frameworks by which academics view Google's PageRank algorithm. The first looks at the social impact, both positive and negative, of immediate access to previously unimaginable knowledge through one centralized terminal. The second, and most mathematical, sees PageRank as a computation of the Singular Value Decomposition (SVD) of the adjacency matrix of the graph formed by the internet, with particular emphasis paid to the first few singular vectors. The third, and most far-reaching, practical technical implication of Google's work is the implementation of algorithms and computation at enormous scale. Much of the computing infrastructure which operates at a global scale deployed today can trace its origins to Google's need to perform SVD on an object as enormous as the Internet. Finally, a more intuitive way to look at the PageRank algorithm is through the lens of a web crawler (or many web crawler) acting as an agent (or agents) in a Markov Chain the size of the web. We will investigate this viewpoint.

This crawler is searching for an approximate "invariant" distribution (why does a true invariant distribution almost certainly not exist?) and will rank pages based on their "probability" in this generated distribution. In order to do so, our crawler chooses to follow a link uniformly at random from the page it is on in order to arrive at a new page, keeping tally of how many times it has visited each page. If this crawler runs for a really, really long time, the fraction of time it has spent on each webpage will approximately be the probability of being on that page (assuming we account for pathologies in the Markov chain which we will discuss soon). We then rank pages in decreasing order of probability.

Alright, great! Let's do stuff. First, visit the following webpage, and see how many web pages can be reached by clicking the links on each page. http://www.eecs.berkeley.edu/~kw1jjang/ee126/1.html

There are total of $8$ pages, and they are connected as follows.

Since we choose a link at uniform from each page, the probability of going between pages $x$ and $y$ is $\Large \frac{\text{# of pages from x to y}}{\text{# of pages leaving x}}$

Thus the Markov chain generated by the web pages above is

and the transition matrix of the Markov chain is

$$ \left( \begin{array}{cccccccc} 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & 0 & \frac{1}{5} \\ \frac{1}{2} & 0 & 0 & 0 & \frac{1}{2} & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 & 0 & 0 & 0 \\ \frac{1}{4} & 0 & \frac{1}{4} & 0 & 0 & 0 & \frac{1}{4} & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{4} & 0 & 0 & \frac{1}{4} & \frac{1}{4} & 0 & 0 \\ \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & 0 & 0 & \frac{1}{5} & 0 \end{array} \right) $$

$\mathcal{Q}$1. Find the steady-state (invariant/stationary) distribution $\pi$ of the Markov chain above. How do you know that it exists?

The Markov matrix is copied in code below. This might make your computation easier, but you can solve this in any way you wish. (Note: don't forget about the difference between right and left eigenvectors)

In [ ]:
import numpy as np
from __future__ import division

P = np.matrix([[0, 1/5, 1/5, 1/5, 1/5, 0, 0, 1/5],
               [1/2, 0, 0, 0, 1/2, 0, 0, 0],
               [0, 1, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 1, 0, 0, 0],
               [0, 1/3, 1/3, 1/3, 0, 0, 0, 0],
               [1/4,0,1/4,0,0,0,1/4,1/4],
               [1/4, 1/4, 0, 0, 1/4, 1/4, 0, 0],
               [1/5, 1/5, 1/5, 1/5, 0, 0, 1/5, 0]])

P_transpose = P.T

print P
In [ ]:
# Your code here

Simulation time!

We now want to empirically test what we solved above by modeling a random user hopping along those webpages.

We will start the user at "1.html" and behave as per the Markov chain above. In the code below, we simulate this and keep track of the average amount of time a user spends in each state. We will expect that after enough iterations, the fraction of time spent in each state should approach the stationary distribution.

We use the parse_links() method to parse all hyperlinks in a page. We use the library Beautiful Soup in order to complete this portion of the lab in order to easiliy parse pages. Once you download the latest release, you must build and install setup.py. Alternatively, use pip or easy_install (help).

Note that this code relies on having Python 2 installed -- it will not work with Python 3.

In [ ]:
import re
import sys
import urllib
import urlparse
import random
from bs4 import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
   version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def domain(url):
    """
    Parse a url to give you the domain.
    """
    # urlparse breaks down the url passed it, and you split the hostname up 
    # ex: hostname="www.google.com" becomes ['www', 'google', 'com']
    hostname = urlparse.urlparse(url).hostname.split(".")
    hostname = ".".join(len(hostname[-2]) < 4 and hostname[-3:] or hostname[-2:])
    return hostname
    
def parse_links(url, url_start):
    """
    Return all the URLs on a page and return the start URL if there is an error or no URLS.
    """
    url_list = []
    myopener = MyOpener()
    try:
        # open, read, and parse the text using beautiful soup
        page = myopener.open(url)
        text = page.read()
        page.close()
        soup = BeautifulSoup(text, "html")

        # find all hyperlinks using beautiful soup
        for tag in soup.findAll('a', href=True):
            # concatenate the base url with the path from the hyperlink
            tmp = urlparse.urljoin(url, tag['href'])
            # we want to stay in the berkeley EECS domain (more relevant later)...
            if domain(tmp).endswith('berkeley.edu') and 'eecs' in tmp:
                url_list.append(tmp)
        if len(url_list) == 0:
            return [url_start]
        return url_list
    except:
        return [url_start]

$\mathcal{Q}$2. Simulating a Random Walk

In the following code block, use the above functions to surf the web pages described by the Markov chain above. This code block may take a while to run. If it is taking more than a couple of minutes, maybe try reducing num_of_visits in order to at least get results. Also, running your code while connected to AirBears may help if you have a slow internet connection at home.

In [ ]:
import random

# the url we want to begin with 
url_start = "http://www.eecs.berkeley.edu/~kw1jjang/ee126/1.html"
current_url = url_start

# parameter to set the number of transitions you make/different pages you visit
num_of_visits = 1000

# dictionary of pages visited so far
visit_history = {}

# initialize dictionary since we know exactly where we'll end up
for i in range(1, 9):
    page = "http://www.eecs.berkeley.edu/~kw1jjang/ee126/" + str(i) + ".html"
    visit_history[page] = 0

for i in range(num_of_visits):

    # Your code here

Print your results:

In [ ]:
for i in range(1, 9):
    page = "http://www.eecs.berkeley.edu/~kw1jjang/ee126/" + str(i) + ".html"
    print 'Fraction of time staying on page %d is %f' % (i, float(visit_history[page])/num_of_visits)

Does this approximately match the invariant distribution you expected?

Generalizing to the Web

The toy websites given above conveniently form an irreducible Markov chain (look up what this means if you do not remember from class), but most of the web will not look like this. There will be fringes of the internet containing only self-loops, or some web pages which do not link to others at all. In order to account for such pathologies in the web, we need to make a more intelligent surfer. The simplest idea would be to just jump back to the starting page if there are no links found on the page you are on, and to always return to a "good" starting point with probability $p$ on every page.

This is a very naive scheme, and there are many more intelligent methods by which you can sample from the distribution of the Internet, accounting for its pathologies and all.

Ranking Berkeley Professors

The following code is a (weak) attempt to rank the Berkeley faculty based on a crawler which begins on the EECS research homepage.

In [ ]:
url_start = "http://www.eecs.berkeley.edu/Research/"
current_url = url_start
num_of_visits = 200

#List of professors obtained from the EECS page
profs = ['Abbeel','Agrawala','Alon','Anantharam','Arcak','Arias','Asanović','Bachrach','Bajcsy','Bodik','Bokor','Boser','Brewer','Canny','Chang-Hasnain','Culler','Darrell','Demmel','Fearing','Fox','Franklin','Garcia','Goldberg','Hartmann','Harvey','Hellerstein','Javey','Joseph','Katz','Keutzer','Liu','Klein','Kubiatowicz','Lee','Lustig','Maharbiz','Malik','Nguyen','Niknejad','Nikolic',"O'Brien",'Parekh','Patterson','Paxson','Pisano','Rabaey','Ramchandran','Roychowdhury','Russell','Sahai','Salahuddin','Sanders','Sangiovanni-Vincentelli','Sastry','Sen','Seshia','Shenker','Song','Song','Spanos','Stoica','Stojanovic','Tomlin','Tygar','Walrand','Wawrzynek','Wu','Yablonovitch','Yelick','Zakhor']
#Bad URLs help take care of some pathologies that ruin our surfing
bad_urls = ['http://www.erso.berkeley.edu/','http://www.eecs.berkeley.edu/Rosters/roster.name.nostudentee.html','http://www.eecs.berkeley.edu/Resguide/admin.shtml#aliases','http://www.eecs.berkeley.edu/department/EECSbrochure/c1-s3.html']

#Creating a dictionary to keep track of how often we come across a professor
profdict = {}
for i in profs:
    profdict[i] = 0

for i in range(num_of_visits):
    print  i , ' Visiting... ', current_url
    if random.random() < 0.95: #follow a link!
        url_list = parse_links(current_url, url_start)
        updated = False
        while not updated:
            current_url = random.choice(url_list)
            updated = True
            if current_url in bad_urls or "iris" in current_url or "Deptonly" in current_url or "anchor" in current_url or "erso" in current_url: #dealing with more pathologies:
                updated = False
        myopener = MyOpener()
        page = myopener.open(current_url)
        text = page.read()
        page.close()
        #Figuring out which professor is mentioned on a page.
        for p in profs:
            profdict[p]+= 1 if " " + p + " " in text else 0 #can use regex re.findall(i,text), but it's overkill
    else: #click the "home" button!
        current_url = url_start
In [ ]:
prof_ranks = [pair[0] for pair in sorted(profdict.items(), key = lambda item: item[1], reverse=True)]
top_score = profdict[prof_ranks[0]]
for i in range(len(prof_ranks)):
    print "%d %f: %s" % (i+1,profdict[prof_ranks[i]]/top_score, prof_ranks[i])
In [ ]:
print 'Top score: ', top_score

$\mathcal{Q}$3. Try your hand at applying the above idea to a website you personally visit (somewhat) frequently! Do a simple crawl (similar to the above) and see if you can figure out something interesting. (Keep it simple.)

In [ ]:
# Your code here