The UK Web Archive offers a number of data downloads for analysis. One of them is a graph of links across the UK domain, and our goal here is to outline how to process and visualise this data for a particular domain.

The Dataset

The ~2.5 billion 200 OK responses in the JISC UK Web Domain Dataset (1996-2010) have been scanned for hyperlinks. For each link, we extract the host that the link targets, and use this to build up a picture of which hosts have linked to which other hosts, over time.

This host-level link graph summarises the number of links between hosts, in each year. The data format is slightly unusual, as you can see from this snippet:

1996|appserver.ed.ac.uk|portico.bl.uk   1
1996|art-www.acorn.co.uk|portico.bl.uk  1
1996|astra.ich.ucl.ac.uk|portico.bl.uk  1
1996|back.niss.ac.uk|portico.bl.uk  1
1996|beta.bids.ac.uk|portico.bl.uk  2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk    4
1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk   4

There are two tab-separated columns. The first contains three bar-separated fields: the crawl year, the source host, and the target host. The second contains the number of linking URLs. Therefore, the first line:

1996|appserver.ed.ac.uk|portico.bl.uk   1

represents an assertion that, from the data crawled in 1996, we found one URL on the 'appserver.ed.ac.uk' host that contained a hyperlink to a resource held on 'portico.bl.uk'.
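
As a quick sanity check, the records are easy to pull apart in a few lines of Python. A minimal sketch (the filename 'host-linkage.tsv' is just a placeholder for wherever you have unpacked the data):

# Parse and print the first few records of the host-level link graph.
with open('host-linkage.tsv', 'r') as f:
    for i, line in enumerate(f):
        key, count = line.rstrip('\n').split('\t')
        year, source_host, target_host = key.split('|')
        print(year, source_host, target_host, int(count))
        if i >= 4:
            break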

Scale

Visualising such a large dataset is very difficult. A cutting-edge algorithm was able to visualise the 1996 part of it; see http://britishlibrary.typepad.co.uk/webarchive/2013/07/using-open-data-to-visualise-the-early-web.html

Filtering By Domain

One approach to making the dataset more manageable is to focus on a particular domain of interest. Using the compression-aware version of the grep tool, one can quickly pick out the links involving hosts on a given domain. For example, this command extracts all the links relating to the British Library's domain:

% zgrep "bl.uk" host-linkage.tsv.gz | sort > bl-uk-linkage.tsv

Note that the final stage sorts the data so that lines from each year appear together.
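
The same filtering can be done directly from Python if that is more convenient. Here is a minimal sketch using the gzip module, with the same filenames as the zgrep command above; note that it holds the matching subset in memory, which is fine for a single domain:

import gzip

# Pull out every line mentioning bl.uk from the compressed link graph,
# then sort so that each year's records appear together.
matches = []
with gzip.open('host-linkage.tsv.gz', 'rt') as f:
    for line in f:
        if 'bl.uk' in line:
            matches.append(line)

with open('bl-uk-linkage.tsv', 'w') as out:
    out.writelines(sorted(matches))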

For chart-based approaches to visualising this kind of filtered link data, see Peter Webster's work on distant reading the web archive: http://peterwebster.me/2014/01/28/distant-reading-the-webarchive/

Dynamic Visualisation

One very common strand of work I've seen over the years is looking to the open source community for tools that can be reused.

There are some very sophisticated graph analysis tools, notably Gephi, although time-dependent graphs are less well supported.

However, another completely different tool may be of use here.

Visualisation using Gource

Gource was designed as a tool for visualising how software has changed over time.

However, it has a generic custom log format which we can re-use here. Each line carries a Unix timestamp, a username, an action (A for added, M for modified, D for deleted), a file path, and an optional hex colour:

1275543595|andrew|A|src/main.cpp|FF0000

If we can get from

1996|appserver.ed.ac.uk|portico.bl.uk   1

to something like

820454400|appserver.ed.ac.uk|A|uk/bl/portico/uk/ac/ed/appserver/1|FF0000

we should see some interesting patterns emerge.
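
The only slightly fiddly part of that transformation is turning a crawl year like 1996 into the Unix timestamp Gource expects. A minimal sketch of that step (the helper name is my own; note that the time.mktime approach used in the cells below interprets the year in the local timezone, whereas calendar.timegm pins it to UTC, and both give 820454400 when running in GMT):

import calendar
import time

def year_to_timestamp(year):
    # Interpret e.g. '1996' as 1996-01-01 00:00:00 UTC and return epoch seconds.
    return calendar.timegm(time.strptime(year, "%Y"))

print(year_to_timestamp("1996"))  # 820454400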

In [6]:
# Peek at the first line of the filtered data and check how it parses:
with open('./host-linkage/bl-uk-host-linkage.dat', 'r') as f:
    for line in f:
        print(line)
        row = line.rstrip().replace('|','\t').split("\t")
        print(row)
        break
1996|appserver.ed.ac.uk|portico.bl.uk	1

['1996', 'appserver.ed.ac.uk', 'portico.bl.uk', '1']
In [37]:
import time

# Open input and output files
with open('./host-linkage/bl-uk-host-linkage.dat', 'r') as fin:
    with open('./host-linkage/linkage.log', 'w') as fout:
        counter = 0
        for line in fin:
            # Reformat:
            new_line = to_gource(line)
            fout.write(new_line)
            fout.write('\n')
            
            # Also count:
            counter = counter + 1
            
            # Report progress:
            if( counter%10000 == 0 ):
                print(counter, line, new_line, '\n')
            
    # Report outcome:
    print("Wrote {} lines.".format(counter))
    
10000 2001|www.orchardhotel.demon.co.uk|portico.bl.uk	2
 978307200|www.orchardhotel.demon.co.uk|A|uk/co/demon/orchardhotel/www|FF0000 

20000 2003|portico.bl.uk|www.london-victorian-ring.com	1
 1041379200|www.london-victorian-ring.com|A|com/london-victorian-ring/www|0000FF 

30000 2004|portico.bl.uk|www.parlophone.co.uk	4
 1072915200|www.parlophone.co.uk|A|uk/co/parlophone/www|0000FF 

40000 2005|portico.bl.uk|www.biomedcentral.com	3
 1104537600|www.biomedcentral.com|A|com/biomedcentral/www|0000FF 

50000 2006|minos.bl.uk|www.theglobalsite.ac.uk	1
 1136073600|www.theglobalsite.ac.uk|A|uk/ac/theglobalsite/www|0000FF 

60000 2007|gopher.bl.uk|www.google.com	5
 1167609600|www.google.com|A|com/google/www|0000FF 

70000 2007|www.bl.uk|www.regione.emilia-romagna.it	13
 1167609600|www.regione.emilia-romagna.it|A|it/emilia-romagna/regione/www|0000FF 

80000 2009|www.bl.uk|www.bank.lv	2
 1230768000|www.bank.lv|A|lv/bank/www|0000FF 

Wrote 86281 lines.
In [2]:
import time

def to_gource(line):
    # Convert one link-graph line into the Gource custom log format.
    # Links into bl.uk are coloured red (FF0000); links out of bl.uk are blue (0000FF).
    row = line.rstrip().replace('|','\t').split("\t")
    timestamp = int(time.mktime(time.strptime(row[0], "%Y")))
    hostname = row[1]
    blhost = row[2]
    action = "A"
    colour = "FF0000"
    if( blhost.find("bl.uk") == -1 ):
        # The target is not on bl.uk, so this is an outward link:
        # swap so that 'hostname' is always the non-bl.uk end.
        hostname = row[2]
        blhost = row[1]
        colour = "0000FF"
    #path = '/'.join(reversed(blhost.split('.')))
    #path = path +'/' + '/'.join(reversed(hostname.split('.')))
    path = '/'.join(reversed(hostname.split('.')))
    return "{}|{}|{}|{}|{}".format(timestamp,hostname,action,path,colour)
    
print(to_gource("1996|appserver.ed.ac.uk|portico.bl.uk	1"))
820454400|appserver.ed.ac.uk|A|uk/ac/ed/appserver|FF0000

Passing this to Gource works OK:

gource host-linkage/linkage.log

...BUT it's rather unclear what's going on, as every old link turns up afresh each year. We really need to keep state from year to year, so we can add/delete links and tune the colour: YELLOW for in-links, BLUE for out-links, GREEN for both.

So, we change tack and store the whole graph first, so we can pick through it later.

In [3]:
import time

# Takes a line from the linkage dataset and converts it into the form
# (year, path, link_source, link_num)
# where 'link_source' is the host that created the link.
def transform_link(line):
    row = line.rstrip().replace('|','\t').split("\t")
    year = row[0]
    link_source = row[1]
    link_target = row[2]
    host = row[1]
    blhost = row[2]
    link_num = row[3]
    if( blhost.find("bl.uk") == -1 ):
        host = row[2]
        blhost = row[1]
    path = '/'.join(reversed(host.split('.')))
    return (year, path, link_source, link_num)

# Open input and output files
known = {}
years = set()
paths = set()
counter = 0
with open('./host-linkage/york-ac-uk-linkage.tsv', 'r') as fin:
    for line in fin:
        try:
            # Reformat:
            (year, path, link_source, link_num) = transform_link(line)
            if( link_source.find("york.ac.uk") == -1 ):
                link_type = "in"
            else:
                link_type = "out"
            key = "{}|{}|{}".format(year, path, link_type)
            known[key] = link_num
            years.add(year)
            paths.add(path)

        except Exception as e:
            # Report the problem line and re-raise to halt processing:
            print(e)
            print(line)
            raise

        # Also count:
        counter = counter + 1

        # Report progress:
        if( counter%10000 == 0 ):
            print(counter, line, key, '\n')

# Report outcome:
print("Processed {} lines.".format(counter))
10000 1997|www.york.ac.uk|www.nsls.bnl.gov	1
 1997|gov/bnl/nsls/www|out 

20000 1999|www.geo.ed.ac.uk|www-users.york.ac.uk	2
 1999|uk/ac/york/www-users|in 

30000 2000|www.together.creations.co.uk|neural13.cs.york.ac.uk	2
 2000|uk/ac/york/cs/neural13|in 

40000 2001|www.melcom.free-online.co.uk|www-users.york.ac.uk	3
 2001|uk/ac/york/www-users|in 

50000 2001|www-users.york.ac.uk|www.trafford.gov.uk	1
 2001|uk/gov/trafford/www|out 

60000 2002|missendenchurch.org.uk|www-users.york.ac.uk	1
 2002|uk/ac/york/www-users|in 

70000 2002|www-users.cs.york.ac.uk|sloan.stanford.edu	4
 2002|edu/stanford/sloan|out 

80000 2002|www-users.york.ac.uk|www.umr.edu	5
 2002|edu/umr/www|out 

90000 2003|ctiwebct.york.ac.uk|www.wfc.ac.uk	4
 2003|uk/ac/wfc/www|out 

100000 2003|www.physrev.york.ac.uk|www-sci.lib.uci.edu	2
 2003|edu/uci/lib/www-sci|out 

110000 2003|www-users.york.ac.uk|www.bmjpg.com	7
 2003|com/bmjpg/www|out 

120000 2003|www.york.ac.uk|www.etc.com.au	2
 2003|au/com/etc/www|out 

130000 2004|npg.york.ac.uk|nnsa.dl.ac.uk	7
 2004|uk/ac/dl/nnsa|out 

140000 2004|www-rr.york.ac.uk|www.thesaurus.com	1
 2004|com/thesaurus/www|out 

150000 2004|www-users.york.ac.uk|www.eee.bham.ac.uk	5
 2004|uk/ac/bham/eee/www|out 

160000 2004|www.york.ac.uk|www.htdig.org	25
 2004|org/htdig/www|out 

170000 2005|www.cpag.org.uk|www.york.ac.uk	3
 2005|uk/ac/york/www|in 

180000 2005|www-users.york.ac.uk|websitegarage.netscape.com	4
 2005|com/netscape/websitegarage|out 

190000 2005|www.york.ac.uk|www.mariestopes.org.uk	2
 2005|uk/org/mariestopes/www|out 

200000 2006|www.gwc.org.uk|www-users.york.ac.uk	8
 2006|uk/ac/york/www-users|in 

210000 2006|www.york.ac.uk|www.brentwood-council.gov.uk	1
 2006|uk/gov/brentwood-council/www|out 

220000 2007|www.aspies.co.uk|www.york.ac.uk	18
 2007|uk/ac/york/www|in 

230000 2007|www-users.york.ac.uk|www.pinegroup.com	15
 2007|com/pinegroup/www|out 

240000 2008|motility.york.ac.uk|www3.ncbi.nlm.nih.gov	3
 2008|gov/nih/nlm/ncbi/www3|out 

250000 2008|www-users.york.ac.uk|www.ntgateway.com	1
 2008|com/ntgateway/www|out 

260000 2009|www.bromhammillers.co.uk|www.york.ac.uk	1
 2009|uk/ac/york/www|in 

270000 2009|www.york.ac.uk|www.prowess.org.uk	1
 2009|uk/org/prowess/www|out 

280000 2010|www.york.ac.uk|www.colartz.com	1
 2010|com/colartz/www|out 

Processed 284247 lines.

So, now we can re-use the processed form and output it as a stream of changes:

In [4]:
def get_state(year,path):
    # Return the link state for a given year and path:
    # "in", "out", "both", or None if there were no links that year.
    key_in = "{}|{}|{}".format(year, path, "in")
    key_out = "{}|{}|{}".format(year, path, "out")
    state = None
    if( key_in in known ):
        state = "in"
    if( key_out in known ):
        state = "out"
    if( key_in in known and key_out in known ):
        state = "both"
    return state

# Now process and output in the form "1230768000|www.bank.lv|A|lv/bank/www|0000FF" :    
changes = 0
deletions = 0
with open('./host-linkage/york-linkage.log', 'w') as fout:

    # Loop over all known years and paths:
    for year in sorted(years):
        timestamp = int(time.mktime(time.strptime(year, "%Y")))
        for path in paths:
            # Determine state for current year:
            current_state = get_state(year,path)
            previous_state = get_state(int(year)-1,path)
            host = '.'.join(reversed(path.split('/')))
            if( current_state != None and current_state != previous_state ):
                changes += 1
                if( previous_state == None ):
                    action = "A"
                else:
                    action = "M"
                # And now the in/out state:
                if( current_state == "in" ):
                    colour = "FFFF00"
                elif( current_state == "out" ):
                    colour = "0000FF"
                elif( current_state == "both" ):
                    colour = "00FF00"
                agent = path
                fout.write("{}|{}|{}|{}/{}|{}".format(timestamp,agent,action,path,host,colour) )
                fout.write('\n')
            if( current_state == None and previous_state != None ):
                # This is the case of deleted links:
                deletions += 1
                action = "D"
                colour = "000000"
                agent = path
                fout.write("{}|{}|{}|{}/{}|{}".format(timestamp,agent,action,path,host,colour) )
                fout.write('\n')
                
    print("Output {} changes.".format(changes))
    print("Output {} deletions.".format(deletions))
    
Output 61007 changes.
Output 51613 deletions.

Then, running gource with these options seems to work well (having the labels on obscures things too much).

 gource  --max-file-lag 0.1 --title "Links to/from bl.uk, 1996-2010." --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage.log

Links from bl.uk are shown in blue, links to bl.uk in yellow, and hosts linked in both directions in green.

e.g. 1996

and 2010

Similarly, I was able to create a video suitable for YouTube:

gource -o gource.ppm  --max-file-lag 0.1 --title "Links to/from bl.uk, 1996-2010." --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage.log 
ffmpeg -y -r 60 -f image2pipe -vcodec ppm -i gource.ppm -vcodec libx264 -preset ultrafast -pix_fmt yuv420p -crf 1 -threads 0 -bf 0 gource.mp4

The resulting video is now here: https://www.youtube.com/watch?v=rX6Hix19_No

Post-Generation Filtering

For example, to look at just the ac.uk hosts:

grep "|uk/ac/" linkage.log > linkage-ac.uk.log

and then

 gource --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage-ac.uk.log