The Listiness of Wikipedia

Although it was only an aside, one answer to the question "What is a reference work?" caught my attention at Michael Buckland's Friday Afternoon Seminar at the UC Berkeley iSchool on March 21st. One possible answer suggested was: works that are over 80% list.

That definition, although it may seem rather brief, was a serious suggestion published by Marcia Bates in 1986. [Bates, Marcia J. "What Is a Reference Book: A Theoretical and Empirical Analysis." RQ 26 (Fall 1986): 37-57] In my opinion this is an elegant way to define reference works because, although heuristic, it is entirely quantitative. What it still needs, though, is a definition of "list". According to Bates, every book is a certain percentage list. Consider a classical monograph: it probably has a table of contents or an index, each of which is a list structure.

At this point in my reading, I realised that it would be simple to identify which parts of Wikipedia articles are list. And so we could determine the percentage list, or the "listiness", of each Wikipedia article.

Method

Analysing a May 2014 copy of English Wikipedia, we look at the listiness of all articles in the main namespace. To do this I used the xml_dump library from the excellent mediawiki-utilities by @halfak.

In wikitext, list items are identified by a line beginning with the character * (unordered list) or # (numbered list). Additionally there are infoboxes, whose parameter lines begin with | (pipe character), and tables, whose rows begin with |- (pipe dash). The percentage of lines beginning with any of these characters therefore determines the list share of an article, or the "listiness" as I am now coining it.

We exclude redirect pages, which we identify as pages that are one line long and whose single line starts with a '#' character. Pages which are not redirects we term 'canonical pages'. We exclude "talk" pages as well.
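As a rough illustration of the counting rule and the redirect filter described above, here is a minimal Python 3 sketch. It is not the extraction script used for this analysis: the function names `count_line_starts`, `listiness` and `is_redirect` and the sample text are made up for illustration, and it assumes that a line starting with |- also counts as starting with |.

```python
# Prefixes treated as list-like. Each prefix is tested independently,
# so a "|-" line is ALSO counted under "|" (an assumption of this sketch).
PREFIXES = ("*", "#", "|-", "|")


def count_line_starts(wikitext):
    """Count lines starting with each prefix, plus the total number of lines."""
    counts = dict.fromkeys(PREFIXES, 0)
    lines = wikitext.splitlines()
    for line in lines:
        for prefix in PREFIXES:
            if line.startswith(prefix):
                counts[prefix] += 1
    counts["total"] = len(lines)
    return counts


def listiness(counts, include_tables=False):
    """Share of lines that are list-like, as a number between 0 and 1."""
    if counts["total"] == 0:
        return 0.0
    listy = counts["*"] + counts["#"]
    if include_tables:
        listy += counts["|"]
    return float(listy) / counts["total"]


def is_redirect(counts):
    """Redirect rule from the text: exactly one line, and it starts with '#'."""
    return counts["total"] == 1 and counts["#"] == 1


# Made-up sample article, one wikitext line per list entry
sample = "\n".join([
    "'''Example''' is a made-up article.",
    "* first item",
    "* second item",
    "# step one",
    '{| class="wikitable"',
    "|-",
    "| cell",
    "|}",
])

counts = count_line_starts(sample)
print(listiness(counts))                       # lists only
print(listiness(counts, include_tables=True))  # lists, tables and infoboxes
print(is_redirect(count_line_starts("#REDIRECT [[Example]]")))
```

For the sample, 3 of 8 lines are list items (0.375), and 6 of 8 are list-like once pipe lines are included (0.75). Note that the table opener {| is not counted under this rule.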

So, for instance, we can look at the statistics for each of these line-starting characters. Below are the mean and standard deviation of the number of these line starts per page.

Results

In [3]:

canon_description = canonical.describe()
canon_description.loc[['mean','std']]

Out[3]:

	*	#	\|-	\|	total
mean	6.660130	0.457347	3.392566	29.755385	2284.055053
std	33.667805	7.173435	24.983288	124.254805	4120.072758

2 rows × 5 columns

We read this as follows: the mean number of unordered list items per page is about 6.7, the mean number of ordered list items per page is about 0.46, and so on. These numbers seem reasonable, and commensurate with casual browsing experience. However, we also see that the mean number of lines per article is 2284, which seems very high. But as you can see, the standard deviation for total lines is very large too. That means there are some extremely long articles out there, especially considering that an article I wrote recently is only 24 lines long! Although, to be sure, not all lines are created equal.
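To see why a large mean together with an even larger standard deviation points to a few very long pages, here is a small Python 3 sketch on made-up line counts (the numbers are invented, not taken from the dump). Comparing the mean with the median is one common way to expose that kind of skew.

```python
import statistics

# Made-up line counts: most pages are short, two are extremely long.
line_counts = [12, 18, 24, 30, 45, 60, 90, 150, 9000, 15000]

mean = statistics.mean(line_counts)
median = statistics.median(line_counts)
std = statistics.stdev(line_counts)

print("mean:   %.1f" % mean)
print("median: %.1f" % median)
print("std:    %.1f" % std)
```

Here the median is 52.5 lines, while the mean is pulled above 2400 by the two longest pages and the standard deviation is larger still.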

Now we are in a perfect place to start looking at listiness distributions. Let's visualise a histogram of the listinesses of all the pages. This first histogram just considers the classical ordered and unordered lists.

In [5]:

show_list_plot()

Quite clearly we can see a heavily skewed distribution that looks like a power law. Let's recall that 80% listiness is Bates' threshold for considering a book a reference book. In our case, let's see how many articles are at least 80% listy.

In [6]:

canonical[canonical['*and#per'] >= 0.8].shape[0] / float(canonical.shape[0])

Out[6]:

0.0032001019048792404

It's 0.32%: about three-tenths of one percent of articles are 80% or more listy when considering only ordered and unordered lists.

Now let's allow for tables and infoboxes in our analysis. This should reduce confusion, because a page like List of Feminists is actually mainly a table! In our terminology it might be called "Table of Feminists", although the "See Also" section at the end is a proper list.

So we now let table rows, table cells and infobox lines count towards our listiness calculations. This produces a new picture.

In [8]:

show_list_and_table_plot()

Right away a noticeable change has occurred in the distribution. One can make out a roughly Gaussian bump centered at about 50% listiness. Also the scale has changed, with about half of those articles which were previously 0% listy becoming listy. Just visually, we see that tables and infoboxes contribute an appreciable share of lines. In fact, if we find the percentage of articles that exceed the 80% listiness threshold under the table and infobox definition, we now have:

In [9]:

canonical[canonical['allper'] >= 0.8].shape[0] / float(canonical.shape[0])

Out[9]:

0.03052258792303092

... about 3% of all articles. This is roughly a 9.5-fold increase, or about an order of magnitude.
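The two shares above can be compared directly. The following Python 3 sketch, with a hypothetical helper `share_at_least` and a made-up toy list, shows the same threshold calculation without pandas, and then divides the two values reported in Out[6] and Out[9].

```python
def share_at_least(values, threshold=0.8):
    """Fraction of values that are greater than or equal to the threshold."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if v >= threshold) / float(len(values))


# Toy listiness values for five imaginary pages (not real data)
toy = [0.0, 0.1, 0.5, 0.85, 1.0]
print(share_at_least(toy))  # 2 of 5 pages reach the threshold

# Shares reported in Out[6] and Out[9]
lists_only = 0.0032001019048792404
lists_and_tables = 0.03052258792303092
fold = lists_and_tables / lists_only
print(round(fold, 1))
```

The ratio comes out at about 9.5, which is where the "order of magnitude" reading comes from.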

Now that we have such a figure, so what? By itself, I'm not sure we can draw any conclusions about whether Wikipedia is very listy. It would be instructive to compare this to other encyclopedias or libraries. You could measure whatever part of Google Books is OCR'd, if you had access to it. If anyone knows of any comparable statistics, please get in touch.

More curiosities

To understand more about listy Wikipedia articles there were a few more avenues of inquiry to take. A first stop was to investigate if the list and table types correlated with each other.

In [2]:

canonical.corr()

Out[2]:

	*	#	\|-	\|	total
*	1.000000	0.016075	0.049516	0.045971	-0.093864
#	0.016075	1.000000	0.019311	0.023440	-0.031146
\|-	0.049516	0.019311	1.000000	0.764544	-0.045862
\|	0.045971	0.023440	0.764544	1.000000	-0.093092
total	-0.093864	-0.031146	-0.045862	-0.093092	1.000000

5 rows × 5 columns

None of these correlations is very strong, except for one. The correlation between the number of lines starting with |- and the number starting with | is 0.76 over English Wikipedia. To those who know wikitext this will be obvious, because table rows start with |- and each cell in that row begins with |. What makes this difficult to disentangle, however, is that the number of columns may vary, so a 4-column table would have 4 times as many | as |-. And of course templates use the | character to separate parameters, and those are not related to tables at all.
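The column-count effect can be illustrated with a small Python 3 sketch. It builds made-up wikitext tables with one cell per line and counts row markers and cell lines separately; `make_table` and `row_and_cell_lines` are hypothetical helpers, not part of the analysis code. Real tables can also put several cells on one line separated by ||, which would change the ratio again.

```python
def make_table(n_rows, n_cols):
    """Build wikitext for a simple table, writing one cell per line."""
    lines = ['{| class="wikitable"']
    for r in range(n_rows):
        lines.append("|-")  # row marker
        for c in range(n_cols):
            lines.append("| r%dc%d" % (r, c))  # one cell
    lines.append("|}")
    return "\n".join(lines)


def row_and_cell_lines(wikitext):
    """Return (row marker lines, cell lines); the table closer |} is ignored."""
    rows = 0
    cells = 0
    for line in wikitext.splitlines():
        if line.startswith("|-"):
            rows += 1
        elif line.startswith("|") and not line.startswith("|}"):
            cells += 1
    return rows, cells


for n_cols in (2, 4):
    rows, cells = row_and_cell_lines(make_table(3, n_cols))
    print(n_cols, rows, cells, cells // rows)
```

With three rows, the 2-column table has 6 cell lines and the 4-column table has 12, so the number of | lines per |- line simply tracks the column count.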

With our correlation matrix being rather uninformative, I wanted to see which were the most frequently occurring words in the titles of our listiest pages. The all column is the total number of times the word occurs in all page titles. The listiest column is the number of times that word occurs in the titles of articles that are at least 80% listy by the "list and table" definition. Lastly, the ratio column is listiest divided by all. Compare each displayed ratio to 0.03, which is the share of listy articles among all articles. The table is sorted by the listiest column.

In [74]:

lexemes_combined[lexemes_combined["listiest"] > 150].sort_values(by="listiest", ascending=False).head(20)

Out[74]:

	all	listiest	ratio (listiest/all)
of	640540	60163	0.093925
list	92978	34307	0.368980
in	488559	22937	0.046948
the	355866	17071	0.047970
–	31977	10836	0.338869
for	415190	9120	0.021966
mens	23931	5837	0.243910
singles	7986	5391	0.675056
wikipediaarticles	295972	5358	0.018103
season	41053	5323	0.129662
and	154046	5133	0.033321
championships	17045	5121	0.300440
at	47756	5066	0.106081
world	34675	4981	0.143648
district	51854	4780	0.092182
by	105340	4673	0.044361
team	36328	4256	0.117155
county	104867	4255	0.040575
football	53007	3853	0.072689
wikipediawikiproject	183711	3665	0.019950

20 rows × 3 columns

As one might expect, the word "list" is the second most frequent word in titles of the listiest articles. These articles represent about 37% of all the articles that have "list" in their title, which is far higher than the 3% we'd expect with no other information. Even "in" and "by" occur somewhat more often than expected. This should make sense if you've ever browsed a Wikipedia article with a title something like "List of [x] by [y]" or "List of [x] in [z]".
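One way to make "more than expected" concrete is to divide each word's ratio by the overall share of listy articles (about 0.0305, from Out[9]). The Python 3 sketch below does this for a few rows copied from the table above; the name `lift` is just a label chosen here for that quotient, not something computed in the original analysis.

```python
# Overall share of articles that are at least 80% listy (Out[9])
BASELINE = 0.03052258792303092

# (all, listiest) title counts copied from the table above
rows = {
    "list": (92978, 34307),
    "in": (488559, 22937),
    "by": (105340, 4673),
    "for": (415190, 9120),
}

# Ratio of listiest/all for each word, relative to the baseline share
lift = {}
for word, (n_all, n_listiest) in rows.items():
    ratio = float(n_listiest) / n_all
    lift[word] = ratio / BASELINE

# Print words from the most to the least over-represented
for word in sorted(lift, key=lift.get, reverse=True):
    print("%-5s %.1f" % (word, lift[word]))
```

On these numbers "list" appears in listy titles about 12 times more often than the baseline would suggest, "in" and "by" about 1.5 times, and "for" somewhat less often than the baseline.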

The remainder of this Top 20 just goes to show how sports fans have chronicled results with great zeal. And the inexplicability of their zeal is analogous to that of my curiosity for investigating this topic.

Contact me with more ideas or questions on Twitter @notconfusing‽‽‽

Find all the code for this analysis at https://github.com/notconfusing/listiness .

Start of Supporting Code

In [4]:

import pandas as pd
import re
import decimal
import string

%pylab inline
linestarts = pd.read_table('linestarts.txt')
# For some reason the text file has an extra, unnamed column; drop it
del linestarts[u'Unnamed: 6']
# A redirect is a page with exactly one line, and that line begins with "#"
redir_cond = (linestarts['total'] == 1) & (linestarts['#'] == 1)
# Canonical pages are all pages that are not redirects
canon_cond = ~redir_cond
redirs = linestarts[redir_cond]
# Copy so that adding columns later does not modify a slice of linestarts
canonical = linestarts[canon_cond].copy()

Populating the interactive namespace from numpy and matplotlib

In [6]:

canonical['*and#per'] = ( canonical['*'] + canonical['#'] ) / canonical['total'] 
canonical['allper'] = ( canonical['*'] + canonical['#'] + canonical['|'] ) / canonical['total']

def show_list_plot():
    fig, axes = plt.subplots(figsize=(8,4.5))
    histplot = canonical['*and#per'].hist(ax=axes, bins=30, color='blue')
    NOPLACES = decimal.Decimal(10) ** 0
    plt.xticks(arange(0,1.1,0.1), [decimal.Decimal(x * 100).quantize(NOPLACES) for x in arange(0,1.1,0.1)])
    plt.xlabel('Listiness - ordered and unordered lists only')
    plt.ylabel('Article frequency')
    plt.title('Frequency of Article List-Percentages of English Wikipedia')

def show_list_and_table_plot():
    fig, axes = plt.subplots(figsize=(8,4.5))
    histplot = canonical['allper'].hist(ax=axes, bins=30, color='green')
    NOPLACES = decimal.Decimal(10) ** 0
    plt.xticks(arange(0,1.1,0.1), [decimal.Decimal(x * 100).quantize(NOPLACES) for x in arange(0,1.1,0.1)])
    plt.xlabel('Listiness - ordered and unordered lists, tables, and infoboxes')
    plt.ylabel('Article frequency')
    plt.title('Frequency of Article List-Percentages of English Wikipedia')

In [67]:

from collections import defaultdict
lexeme_freq_all = defaultdict(int)
lexeme_freq_list_only = defaultdict(int)

exclude = set(string.punctuation)
exclude.add(u'–')
def strip_punct(s):
    ls = s.lower()
    ls = ''.join(ch for ch in ls if ch not in exclude)
    return ls

for title in canonical['page+title']:
    title=str(title)
    for lexeme in title.split():
        cleaned = strip_punct(lexeme)
        #print lexeme, cleaned 
        lexeme_freq_all[cleaned] += 1
for title in canonical[canonical['allper'] >= 0.8]['page+title']:
    title=str(title)
    for lexeme in title.split():
        cleaned = strip_punct(lexeme)
        lexeme_freq_list_only[cleaned] += 1 

lexemes_all = pd.DataFrame.from_dict(data=lexeme_freq_all, orient='index')
lexemes_list_only = pd.DataFrame.from_dict(data=lexeme_freq_list_only, orient='index')
lexemes_combined = lexemes_all.join(lexemes_list_only, how='inner', lsuffix='_all', rsuffix='_listiest')
lexemes_combined.columns = [u'all', u'listiest']
lexemes_combined['ratio (listiest/all)'] = lexemes_combined['listiest'] / lexemes_combined['all'] 

In [70]:

lexemes_combined[lexemes_combined["listiest"] > 150].sort_values(by="ratio (listiest/all)", ascending=False).head(16)

Out[70]:

	all	listiest	ratio (listiest/all)
pri	728	664	0.912088
vrh	235	210	0.893617
divisional	410	325	0.792683
vas	365	279	0.764384
gornji	229	167	0.729258
filmography	526	371	0.705323
secretariat	463	325	0.701944
numberone	2394	1675	0.699666
singles	7986	5391	0.675056
stakes	1149	727	0.632724
handicap	342	213	0.622807
billboard	740	454	0.613514
fia	369	207	0.560976
iaaf	718	354	0.493036
grade	895	421	0.470391
listings	2959	1376	0.465022

16 rows × 3 columns

Research questions:

What percentage of articles are above 80% list?
1. How does that answer change if we allow tables and infoboxes as well?
2. What do the different distributions mean?
What are some top-occurring words in the titles of the listiest articles?
1. How does inclusion of infoboxes and tables affect this as well?
Are there any strong correlations between the different list types?