The UK Web Archive offers a number of datasets for download and analysis. One of them is a graph of links across the UK domain, and our goal here is to outline how to process and visualise this data for a particular domain.
The ~2.5 billion 200 OK responses in the [JISC UK Web Domain Dataset (1996-2010)]({{ site.baseurl }}/ukwa.ds.2/) dataset have been scanned for hyperlinks. For each link, we extract the host that the link targets, and use this to build up a picture of which hosts have linked to which other hosts, over time.
This host-level link graph summarises the number of links between hosts, in each year. The data format is slightly unusual, as you can see from this snippet:
```
1996|appserver.ed.ac.uk|portico.bl.uk	1
1996|art-www.acorn.co.uk|portico.bl.uk	1
1996|astra.ich.ucl.ac.uk|portico.bl.uk	1
1996|back.niss.ac.uk|portico.bl.uk	1
1996|beta.bids.ac.uk|portico.bl.uk	2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk	4
1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk	4
```
There are two tab-separated columns. The first contains three bar-separated fields: the crawl year, the source host, and the target host. The second contains the number of linking URLs. Therefore, the first line:
```
1996|appserver.ed.ac.uk|portico.bl.uk	1
```
represents an assertion that, from the data crawled in 1996, we found one URL on the 'appserver.ed.ac.uk' host that contained a hyperlink to a resource held on 'portico.bl.uk'.
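To make the format concrete, here is a minimal sketch of parsing one line into its four fields (the function and field names are my own, not part of the dataset's tooling):

```python
def parse_line(line):
    """Split a host-level link graph line into (year, source, target, count)."""
    fields, count = line.rstrip("\n").split("\t")
    year, source, target = fields.split("|")
    return int(year), source, target, int(count)

print(parse_line("1996|appserver.ed.ac.uk|portico.bl.uk\t1"))
# (1996, 'appserver.ed.ac.uk', 'portico.bl.uk', 1)
```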
Visualising such a large dataset is very difficult. A cutting-edge algorithm was able to visualise the 1996 part of the data: see http://britishlibrary.typepad.co.uk/webarchive/2013/07/using-open-data-to-visualise-the-early-web.html
One approach to making the dataset more manageable is to focus on a particular domain of interest. Using zgrep, the compression-aware version of the grep tool, one can quickly pick out the links between the hosts on a particular domain. For example, this command extracts all the links relating to the British Library's domain:
```
% zgrep "bl.uk" host-linkage.tsv.gz | sort > bl-uk-linkage.tsv
```
Note that the final stage sorts the data so that lines from each year appear together.
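Because the crawl year is the first field of every line, a plain lexicographic sort is enough to group the records year by year. A minimal illustration (the sample lines here, including www.example.com, are made up):

```python
# Lines from different years, deliberately out of order:
lines = [
    "2003|portico.bl.uk|www.example.com\t1",
    "1996|appserver.ed.ac.uk|portico.bl.uk\t1",
    "1996|blaiseweb.bl.uk|blaiseweb.bl.uk\t4",
]

# Lexicographic sort puts all the 1996 lines before the 2003 one:
for line in sorted(lines):
    print(line)
```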
For chart-based approaches to visualising this kind of data, see Peter Webster's work on distant reading the web archive: http://peterwebster.me/2014/01/28/distant-reading-the-webarchive/
One very common and productive strand of work is looking to the open source community for tools that can be reused. There are very sophisticated graph analysis tools, notably Gephi, although time-dependent graphs are less well supported. However, a completely different kind of tool may be of use here.
Gource was designed as a tool for visualising how software changes over time. However, it has a generic input log format that we can re-use here: each line holds a Unix timestamp, a username, an action (A for add, M for modify, D for delete), a file path, and an optional hex colour:

```
1275543595|andrew|A|src/main.cpp|FF0000
```
If we can get from
```
1996|appserver.ed.ac.uk|portico.bl.uk	1
```
to something like
```
820454400|appserver.ed.ac.uk|A|uk/bl/portico/uk/ac/ed/appserver/1|FF0000
```
we should find some interesting patterns emerge.
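The timestamp is seconds since the Unix epoch for 1st January of the crawl year. A minimal sketch of that conversion; note that `time.mktime` (used in the code below) interprets the parsed date in the local timezone, whereas `calendar.timegm` gives the UTC-based value, which is what the example timestamps here correspond to:

```python
import calendar
import time

def year_to_epoch(year):
    # Interpret 1st January of the given year as UTC, avoiding the
    # local-timezone dependence of time.mktime().
    return calendar.timegm(time.strptime(year, "%Y"))

print(year_to_epoch("1996"))
# 820454400
```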
```python
with open('./host-linkage/bl-uk-host-linkage.dat', 'r') as f:
    for line in f:
        print(line)
        row = line.rstrip().replace('|','\t').split("\t")
        print(row)
        break
```

```
1996|appserver.ed.ac.uk|portico.bl.uk	1

['1996', 'appserver.ed.ac.uk', 'portico.bl.uk', '1']
```
```python
import time

# Open input and output files:
with open('./host-linkage/bl-uk-host-linkage.dat', 'r') as fin:
    with open('./host-linkage/linkage.log', 'w') as fout:
        counter = 0
        for line in fin:
            # Reformat (to_gource is defined below):
            new_line = to_gource(line)
            fout.write(new_line)
            fout.write('\n')
            # Also count:
            counter = counter + 1
            # Report progress:
            if( counter % 10000 == 0 ):
                print(counter, line, new_line, '\n')

# Report outcome:
print("Wrote {} lines.".format(counter))
```
```
10000 2001|www.orchardhotel.demon.co.uk|portico.bl.uk 2 978307200|www.orchardhotel.demon.co.uk|A|uk/co/demon/orchardhotel/www|FF0000
20000 2003|portico.bl.uk|www.london-victorian-ring.com 1 1041379200|www.london-victorian-ring.com|A|com/london-victorian-ring/www|0000FF
30000 2004|portico.bl.uk|www.parlophone.co.uk 4 1072915200|www.parlophone.co.uk|A|uk/co/parlophone/www|0000FF
40000 2005|portico.bl.uk|www.biomedcentral.com 3 1104537600|www.biomedcentral.com|A|com/biomedcentral/www|0000FF
50000 2006|minos.bl.uk|www.theglobalsite.ac.uk 1 1136073600|www.theglobalsite.ac.uk|A|uk/ac/theglobalsite/www|0000FF
60000 2007|gopher.bl.uk|www.google.com 5 1167609600|www.google.com|A|com/google/www|0000FF
70000 2007|www.bl.uk|www.regione.emilia-romagna.it 13 1167609600|www.regione.emilia-romagna.it|A|it/emilia-romagna/regione/www|0000FF
80000 2009|www.bl.uk|www.bank.lv 2 1230768000|www.bank.lv|A|lv/bank/www|0000FF
Wrote 86281 lines.
```
```python
import time

def to_gource(line):
    row = line.rstrip().replace('|','\t').split("\t")
    timestamp = int(time.mktime(time.strptime(row[0], "%Y")))
    hostname = row[1]
    blhost = row[2]
    action = "A"
    colour = "FF0000"
    # If the target is not on bl.uk, then bl.uk is the source,
    # so show the external target host instead, in blue:
    if( blhost.find("bl.uk") == -1 ):
        hostname = row[2]
        blhost = row[1]
        colour = "0000FF"
    #path = '/'.join(reversed(blhost.split('.')))
    #path = path + '/' + '/'.join(reversed(hostname.split('.')))
    path = '/'.join(reversed(hostname.split('.')))
    return "{}|{}|{}|{}|{}".format(timestamp, hostname, action, path, colour)

print(to_gource("1996|appserver.ed.ac.uk|portico.bl.uk\t1"))
```

```
820454400|appserver.ed.ac.uk|A|uk/ac/ed/appserver|FF0000
```
Passing this log to Gource works:

```
gource host-linkage/linkage.log
```
...BUT it's rather unclear what's going on, as each old link turns up afresh every year. We really need to keep the state from year to year, so we can add and delete links and tune the colour: YELLOW for inlinks, BLUE for outlinks, GREEN for both.
So, we change tack and store the whole graph first, so we can pick through it later.
```python
import time

# Takes a line from the linkage dataset and converts it into the form
# (year, path, link_source, link_num)
# where 'link_source' is the host that created the link, and 'path' is a
# directory-style rendering of the other host involved:
def transform_link(line):
    row = line.rstrip().replace('|','\t').split("\t")
    year = row[0]
    link_source = row[1]
    link_num = row[3]
    host = row[1]
    blhost = row[2]
    # Prefer to render the non-bl.uk end of the link as the path:
    if( blhost.find("bl.uk") == -1 ):
        host = row[2]
        blhost = row[1]
    path = '/'.join(reversed(host.split('.')))
    return (year, path, link_source, link_num)
```
```python
# Open the input file and build up the link graph state:
known = {}
years = set()
paths = set()
counter = 0
with open('./host-linkage/york-ac-uk-linkage.tsv', 'r') as fin:
    for line in fin:
        try:
            # Reformat:
            (year, path, link_source, link_num) = transform_link(line)
            if( link_source.find("york.ac.uk") == -1 ):
                link_type = "in"
            else:
                link_type = "out"
            key = "{}|{}|{}".format(year, path, link_type)
            known[key] = link_num
            years.add(year)
            paths.add(path)
        except Exception as e:
            print(e)
            print(line)
            raise
        # Also count:
        counter = counter + 1
        # Report progress:
        if( counter % 10000 == 0 ):
            print(counter, line, key, '\n')

# Report outcome:
print("Processed {} lines.".format(counter))
```
```
10000 1997|www.york.ac.uk|www.nsls.bnl.gov 1 1997|gov/bnl/nsls/www|out
20000 1999|www.geo.ed.ac.uk|www-users.york.ac.uk 2 1999|uk/ac/york/www-users|in
30000 2000|www.together.creations.co.uk|neural13.cs.york.ac.uk 2 2000|uk/ac/york/cs/neural13|in
40000 2001|www.melcom.free-online.co.uk|www-users.york.ac.uk 3 2001|uk/ac/york/www-users|in
50000 2001|www-users.york.ac.uk|www.trafford.gov.uk 1 2001|uk/gov/trafford/www|out
60000 2002|missendenchurch.org.uk|www-users.york.ac.uk 1 2002|uk/ac/york/www-users|in
70000 2002|www-users.cs.york.ac.uk|sloan.stanford.edu 4 2002|edu/stanford/sloan|out
80000 2002|www-users.york.ac.uk|www.umr.edu 5 2002|edu/umr/www|out
90000 2003|ctiwebct.york.ac.uk|www.wfc.ac.uk 4 2003|uk/ac/wfc/www|out
100000 2003|www.physrev.york.ac.uk|www-sci.lib.uci.edu 2 2003|edu/uci/lib/www-sci|out
110000 2003|www-users.york.ac.uk|www.bmjpg.com 7 2003|com/bmjpg/www|out
120000 2003|www.york.ac.uk|www.etc.com.au 2 2003|au/com/etc/www|out
130000 2004|npg.york.ac.uk|nnsa.dl.ac.uk 7 2004|uk/ac/dl/nnsa|out
140000 2004|www-rr.york.ac.uk|www.thesaurus.com 1 2004|com/thesaurus/www|out
150000 2004|www-users.york.ac.uk|www.eee.bham.ac.uk 5 2004|uk/ac/bham/eee/www|out
160000 2004|www.york.ac.uk|www.htdig.org 25 2004|org/htdig/www|out
170000 2005|www.cpag.org.uk|www.york.ac.uk 3 2005|uk/ac/york/www|in
180000 2005|www-users.york.ac.uk|websitegarage.netscape.com 4 2005|com/netscape/websitegarage|out
190000 2005|www.york.ac.uk|www.mariestopes.org.uk 2 2005|uk/org/mariestopes/www|out
200000 2006|www.gwc.org.uk|www-users.york.ac.uk 8 2006|uk/ac/york/www-users|in
210000 2006|www.york.ac.uk|www.brentwood-council.gov.uk 1 2006|uk/gov/brentwood-council/www|out
220000 2007|www.aspies.co.uk|www.york.ac.uk 18 2007|uk/ac/york/www|in
230000 2007|www-users.york.ac.uk|www.pinegroup.com 15 2007|com/pinegroup/www|out
240000 2008|motility.york.ac.uk|www3.ncbi.nlm.nih.gov 3 2008|gov/nih/nlm/ncbi/www3|out
250000 2008|www-users.york.ac.uk|www.ntgateway.com 1 2008|com/ntgateway/www|out
260000 2009|www.bromhammillers.co.uk|www.york.ac.uk 1 2009|uk/ac/york/www|in
270000 2009|www.york.ac.uk|www.prowess.org.uk 1 2009|uk/org/prowess/www|out
280000 2010|www.york.ac.uk|www.colartz.com 1 2010|com/colartz/www|out
Processed 284247 lines.
```
So, now we can re-use the processed form and output as changes:
```python
def get_state(year, path):
    key_in = "{}|{}|{}".format(year, path, "in")
    key_out = "{}|{}|{}".format(year, path, "out")
    state = None
    if( key_in in known ):
        state = "in"
    if( key_out in known ):
        state = "out"
    if( key_in in known and key_out in known ):
        state = "both"
    return state
```
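The year-over-year diff logic can be checked on a toy graph first. This is a self-contained sketch of the same idea, using a made-up host and a hand-built `known` dict with the same `"year|path|direction"` keys:

```python
# Toy link graph state: the host appears as an inlink in 1998,
# gains an outlink in 1999, and disappears in 2000.
known = {
    "1998|uk/ac/example/www|in": "1",
    "1999|uk/ac/example/www|in": "2",
    "1999|uk/ac/example/www|out": "1",
}

def get_state(year, path):
    has_in = "{}|{}|in".format(year, path) in known
    has_out = "{}|{}|out".format(year, path) in known
    if has_in and has_out:
        return "both"
    if has_in:
        return "in"
    if has_out:
        return "out"
    return None

path = "uk/ac/example/www"
for year in range(1997, 2001):
    current = get_state(year, path)
    previous = get_state(year - 1, path)
    if current is not None and current != previous:
        action = "A" if previous is None else "M"
        print(year, action, current)
    elif current is None and previous is not None:
        print(year, "D")
# 1998 A in
# 1999 M both
# 2000 D
```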
```python
# Now process and output in the form "1230768000|www.bank.lv|A|lv/bank/www|0000FF":
changes = 0
deletions = 0
with open('./host-linkage/york-linkage.log', 'w') as fout:
    # Loop over all known years and paths:
    for year in sorted(years):
        timestamp = int(time.mktime(time.strptime(year, "%Y")))
        for path in paths:
            # Determine state for the current and previous year:
            current_state = get_state(year, path)
            previous_state = get_state(int(year)-1, path)
            host = '.'.join(reversed(path.split('/')))
            if( current_state != None and current_state != previous_state ):
                changes += 1
                if( previous_state == None ):
                    action = "A"
                else:
                    action = "M"
                # And now the in/out state:
                if( current_state == "in" ):
                    colour = "FFFF00"
                elif( current_state == "out" ):
                    colour = "0000FF"
                elif( current_state == "both" ):
                    colour = "00FF00"
                agent = path
                fout.write("{}|{}|{}|{}/{}|{}".format(timestamp, agent, action, path, host, colour))
                fout.write('\n')
            if( current_state == None and previous_state != None ):
                # This is the case of deleted links:
                deletions += 1
                action = "D"
                colour = "000000"
                agent = path
                fout.write("{}|{}|{}|{}/{}|{}".format(timestamp, agent, action, path, host, colour))
                fout.write('\n')

print("Output {} changes.".format(changes))
print("Output {} deletions.".format(deletions))
```
```
Output 61007 changes.
Output 51613 deletions.
```
Then, running gource with these options seems to work well (having the labels on obscures things too much):

```
gource --max-file-lag 0.1 --title "Links to/from bl.uk, 1996-2010." --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage.log
```
Links from bl.uk are shown in blue, links to bl.uk in yellow, and links both ways in green.
e.g. 1996
and 2010
Similarly, I was able to create a video suitable for YouTube:
```
gource -o gource.ppm --max-file-lag 0.1 --title "Links to/from bl.uk, 1996-2010." --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage.log
ffmpeg -y -r 60 -f image2pipe -vcodec ppm -i gource.ppm -vcodec libx264 -preset ultrafast -pix_fmt yuv420p -crf 1 -threads 0 -bf 0 gource.mp4
```
For example, to look at just the ac.uk part:

```
grep "|uk/ac/" linkage.log > linkage-ac.uk.log
```
and then
```
gource --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage-ac.uk.log
```