Wiki-Class Set-up Guide and Exploration

Wiki-Class is a Python package that uses machine learning to determine the quality of a Wikipedia page. It is the open-sourcing of the Random Forest algorithm used by SuggestBot. SuggestBot is an opt-in recommender for Wikipedia editors, offering pages that need work and that look like pages they've worked on before. Similarly, with this package, you get a function that accepts a string of wikitext and returns a Wikipedia class ('Stub', 'C-Class', 'Featured Article', etc.). Wiki-Class is currently in alpha according to its packager and developer @halfak, and although I had to make a few patches to get some examples to work, it's ready to start classifying your wikitext.


  1. Setting it up on Ubuntu.
  2. Testing the batteries-included model.
  3. Using the output by introducing a closeness measure.
  4. Testing making our own model.


At first you may be frustrated to learn that Wiki-Class is Python 3 only. You won't be able to mix it with pywikibot, which is Python 2.7 only, and it may also mean upgrading some of your other tools. However, try to recall these update gripes the next time you encounter a UnicodeError in Python 2.x, and then be thankful to Halfak for making us give Python 3 a try. I outline getting the environment running on Ubuntu 14.04 here.

Firstly, if you want to use the IPython notebook with Python 3, you can install it with apt-get. While we're at it, for convenience we'll also install a version of pip for Python 3.

In [95]:
!sudo apt-get install ipython3-notebook python3-pip
[sudo] password for notconfusing: 

Some requirements of Wiki-Class, including sklearn and nltk, are a pain with Python 3 since they haven't been properly packaged for it yet. So these you'll have to get from source:

In [1]:
!pip3 install git+
!pip3 install git+

Making some random pages for a test dataset

We'll need some wikitext, with associated classifications, to start testing. I elected to make a random dataset in pywikibot, which, as already stated, is Python 2.7 only and thus needs to be in a separate notebook; you can still view it on the nbviewer. Its output is a file, test_class_data.json (github link of the bzip), which is just a dictionary associating qualities with lists of page texts.

Warning: this dataset has some examples that can cause a ZeroDivisionError, because some of these pages have zero non-markup text. I wrote this patch which fixes this issue.
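
If you would rather not patch the package, one option is to catch the error around each call. The sketch below is not the patch mentioned above: `safe_classify` and `toy_classify` are hypothetical names, and `toy_classify` only stands in for `model.classify` so that the example runs on its own.

```python
def toy_classify(text):
    """Hypothetical stand-in for model.classify: fails on empty text."""
    words = text.split()
    # Dividing by the word count mimics a feature that breaks on empty pages.
    avg_word_len = sum(len(w) for w in words) / len(words)
    label = 'Stub' if avg_word_len < 5 else 'Start'
    return label, {label: 1.0}


def safe_classify(classify, text):
    """Return classify(text), or None when the text has no usable content."""
    try:
        return classify(text)
    except ZeroDivisionError:
        return None


result_ok = safe_classify(toy_classify, "a short page")
result_empty = safe_classify(toy_classify, "")
print(result_ok)
print(result_empty)
```

Returning None keeps the decision about skipped pages with the caller, who can count them or drop them.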

Testing the Pre-built Model

In [3]:
import json
import pandas as pd
from wikiclass.models import RFTextModel
/usr/local/lib/python3.4/dist-packages/pandas/io/ UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.
  .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))

Each model is stored in a .model file. A default one is included in the github repo.

In [35]:
!mv enwiki.rf_text.model\?raw\=true enwiki.rf_text.model

Now we load the model.

In [4]:
model = RFTextModel.from_file(open("enwiki.rf_text.model",'rb'))
In [5]:
classed_items = json.load(open('test_class_data.json','r'))
print(sum([len(l) for l in classed_items.values()]))

The Wiki-Class-provided model only deals with the 'Stub', 'Start', 'C', 'B', 'Good Article', and 'Featured Article' classifications. It does not include 'List', 'Featured List', or 'Disambig' class pages. So we have to pick the standard classes out of our 38,000 test articles.

In [6]:
standards = {actual: text for actual, text in classed_items.items() if actual in ['Stub', 'Start', 'C', 'B', 'GA', 'FA'] }
In [5]:
print(sum([len(l) for l in standards.values()]))

Now we iterate over our 36,000 standard-class pages, and put their Wiki-Class assessments into a DataFrame.

In [6]:
accuracy_df = pd.DataFrame(index=classed_items.keys(), columns=['actual','correct', 'model_prob', 'actual_prob'])
for actual, text_list in standards.items():
    #see if actual is even here, otherwise no fair comparison
    for text in text_list:
        try:
            assessment, probabilities = model.classify(text)
        except ZeroDivisionError:
            # Pages with no non-markup text cannot be classified; skip them.
            #print(actual, text)
            continue
        accuracy_df = accuracy_df.append({'actual': actual,
                                          'correct': int(assessment == actual),
                                          'model_prob': probabilities[assessment],
                                          'actual_prob': probabilities[actual]}, ignore_index=True)

What you see here is that the output of an assessment is really two things: first the 'assessment', which is simply the class the algorithm considers most likely, and second a dictionary of probabilities giving how likely the text is to belong to each class.

In our DataFrame we record four values: the 'actual' class as Wikipedia classes it; whether the actual class matches the model prediction; the probability (read: "confidence") of the model prediction; and lastly the probability the model gave to the actual class. Note that in the "correct" case model_prob and actual_prob are the same.
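
To make the two-part output concrete, here is a small sketch using a made-up probabilities dictionary (the numbers are illustrative, not real model output). It assumes that the assessment is simply the class with the highest probability.

```python
# Hypothetical probabilities, shaped like the dictionary described above.
probabilities = {'Stub': 0.1, 'Start': 0.5, 'C': 0.2,
                 'B': 0.1, 'GA': 0.1, 'FA': 0.0}
actual = 'C'  # the class Wikipedia editors assigned

# Assumption: the assessment is the class with the highest probability.
assessment = max(probabilities, key=probabilities.get)

# One record with the same four fields the DataFrame stores.
row = {'actual': actual,
       'correct': int(assessment == actual),
       'model_prob': probabilities[assessment],
       'actual_prob': probabilities[actual]}
print(row)
```

Here the prediction is wrong, so model_prob and actual_prob differ; had the model predicted 'C', both would hold the same number.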

In [7]:
df  = accuracy_df.dropna(how='all')
   actual  correct  model_prob  actual_prob
18  Start        0         0.4          0.0
19  Start        1         0.8          0.8
20  Start        0         0.4          0.0
21  Start        0         1.0          0.0
22  Start        1         0.7          0.7

If we look at the mean of 'correct' for each class, we should hopefully see something above 1/6, which would be the expected performance of guessing uniformly at random among the six classes. And we do.

In [8]:
groups = df.groupby(by='actual')
B         0.247391
C         0.278138
FA        0.854167
GA        0.444444
Start     0.387334
Stub      0.698394
Name: correct, dtype: float64
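
As a rough sketch of what that grouping computes, the per-class share of correct predictions can be worked out by hand and compared with the 1/6 guessing baseline. The records below are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Invented (actual, correct) records, in the same shape as the DataFrame rows.
records = [('Stub', 1), ('Stub', 1), ('Stub', 0), ('Start', 1),
           ('Start', 0), ('Start', 0), ('Start', 0), ('FA', 1)]

totals = defaultdict(int)  # pages seen per class
hits = defaultdict(int)    # correct predictions per class
for actual, correct in records:
    totals[actual] += 1
    hits[actual] += correct

baseline = 1 / 6  # chance of a uniform random guess among six classes
accuracy = {cls: hits[cls] / totals[cls] for cls in totals}
for cls, acc in accuracy.items():
    verdict = 'beats guessing' if acc > baseline else 'no better than guessing'
    print(cls, round(acc, 3), verdict)
```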

See how "close" predictions are if they are not correct.

Now we hack on the output. The Random Forest is really just binning text into different classes; it doesn't know that some of the classes are closer to each other than others. Therefore we define a distance metric on the standard wiki classes. I call this order the "Classic Order". To get an intuition, consider this example. If an article is a Good Article and the model prediction is also Good Article, then it is off by 0; if the model prediction is Featured Article, it is off by 1; if the model prediction is Start, then it is off by 3.
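
The worked example can be checked with a few lines. This is a minimal sketch of the distance as described in the paragraph above; the helper name `off_by` and the constant names are mine, chosen for illustration.

```python
CLASSIC_ORDER = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
# Map each class name to its position in the order.
RANK = {name: position for position, name in enumerate(CLASSIC_ORDER)}


def off_by(actual, predicted):
    """Number of steps between two classes along the Classic Order."""
    return abs(RANK[predicted] - RANK[actual])


print(off_by('GA', 'GA'))
print(off_by('GA', 'FA'))
print(off_by('GA', 'Start'))
```

Because the distance is an absolute difference, overshooting and undershooting by the same number of classes count equally.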

In [7]:
classic_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
enum_classic = enumerate(classic_order)

for enum, classic in dict(enum_classic).items():
    print(enum, classic)
0 Stub
1 Start
2 C
3 B
4 GA
5 FA

Now we are going to iterate over the same dataset as above, but instead of recording "correctness", we record the closeness in a DataFrame.

In [8]:
classic_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
classic_dict = dict(zip(classic_order, range(len(classic_order))))

off_by_df = pd.DataFrame(index=classed_items.keys(), columns=['actual','off_by'])

for classic in classic_order:
    for text in standards[classic]:
        try:
            assessment, probabilities = model.classify(text)
        except ZeroDivisionError:
            # Pages with no non-markup text cannot be classified; skip them.
            #print(classic, text)
            continue
        off_by_df = off_by_df.append({'actual': classic,
                                      'off_by': abs(classic_dict[assessment] - classic_dict[classic])}, ignore_index=True)

So it should look something like this as a table:

In [9]:
off_by  = off_by_df.dropna(how='all')
   actual  off_by
18   Stub       2
19   Stub       1
20   Stub       0
21   Stub       0
22   Stub       0

And as a chart.

In [10]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['text']
`%pylab --no-import-all` prevents importing * from pylab and numpy

We can see that the middle classes are harder to predict, whereas the ends are easier. This matches our expectations: since the quality spectrum extends past these rather arbitrary cut-off points, more of the spectrum lies in the two end intervals, and so it's easier to bin articles into them.

In [11]:
ax = off_by.groupby(by='actual',sort=False).mean().plot(title='Prediction Closeness by Quality Class', kind='bar', legend=False)
ax.set_ylabel('''Prediction Closeness (lower is more accurate)''')
ax.set_xlabel('''Quality Class''')
<matplotlib.text.Text at 0x7fc089810550>

Making a model

Now we test the model-making feature. We will use our dataset of 'standards' from above, using a random 80% for training and 20% for testing.

In [27]:
from wikiclass.models import RFTextModel
from wikiclass import assessments

Divvying up our data into two lists.

In [28]:
import random

train_set = list()
test_set = list()
for actual, text_list in standards.items():
    for text in text_list:
        # randint(0, 9) is 8 or 9 in 2 cases out of 10, so about 20% go to the test set.
        if random.randint(0,9) >= 8:
            test_set.append( (text, actual) )
        else:
            train_set.append( (text, actual) )
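
For a repeatable split, one option is to seed the random number generator and shuffle once. This is a sketch of an alternative, not the method used in the cell above; `split_train_test`, its parameters, and the seed value are illustrative, and the toy pairs stand in for the real (text, actual) tuples.

```python
import random


def split_train_test(pairs, test_fraction=0.2, seed=42):
    """Shuffle (text, label) pairs reproducibly and split them in two."""
    shuffled = list(pairs)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data standing in for the (text, actual) pairs built from `standards`.
toy_pairs = [('text %d' % i, 'Stub' if i % 2 else 'Start') for i in range(10)]
toy_train, toy_test = split_train_test(toy_pairs)
print(len(toy_train), len(toy_test))
```

Unlike the per-item coin flip, this always yields exactly the requested test fraction (rounded down), and the same seed gives the same split on every run.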


The next step is quite simple: we train a model by supplying our train_set list, and test it by supplying our test_set list. The package also conveniently supplies a saving function, so we can store our model for later use.

In [29]:
# Train a model
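model = RFTextModel.train(
    train_set,
    # NOTE: the rest of this call was cut off in the original cell; any further
    # arguments (possibly the class list from `assessments`) are not reproduced here.
)
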
# Run the test set & print the results
results = model.test(test_set)

# Write the model to disk for reuse.
model.to_file(open("36K_random_enwiki.rf_text.model", "wb"))
pred assessment    B    C  FA  GA  Start  Stub
real assessment                               
B                130   29   1   5    105    40
C                 34  112   0   2    151    33
FA                 7    3   4   0      1     0
GA                 8    8   0  11      9     1
Start             80   87   0   2   1420   525
Stub              40   32   0   0    547  3973

Now to look at accuracy, we norm the DataFrame row-wise.

In [30]:
norm_results = results.apply(lambda row: row / row.sum(), axis=1)  # axis=1 passes each row to the lambda
pred assessment         B         C        FA        GA     Start      Stub
real assessment
B                0.419355  0.093548  0.003226  0.016129  0.338710  0.129032
C                0.102410  0.337349  0.000000  0.006024  0.454819  0.099398
FA               0.466667  0.200000  0.266667  0.000000  0.066667  0.000000
GA               0.216216  0.216216  0.000000  0.297297  0.243243  0.027027
Start            0.037843  0.041154  0.000000  0.000946  0.671712  0.248344
Stub             0.008711  0.006969  0.000000  0.000000  0.119120  0.865200
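
The row-wise normalisation can also be sketched without pandas, which makes it easier to see what the diagonal means. The counts below are copied from the results table above; the variable names are mine. Dividing a diagonal count by its row total gives the share of pages really in a class that were predicted correctly (recall), while dividing by the column total gives the share of predictions for a class that were right (precision).

```python
classes = ['B', 'C', 'FA', 'GA', 'Start', 'Stub']
# Rows are real assessments, columns are predicted ones, in the order of `classes`.
counts = {
    'B':     [130,  29, 1,  5,  105,   40],
    'C':     [ 34, 112, 0,  2,  151,   33],
    'FA':    [  7,   3, 4,  0,    1,    0],
    'GA':    [  8,   8, 0, 11,    9,    1],
    'Start': [ 80,  87, 0,  2, 1420,  525],
    'Stub':  [ 40,  32, 0,  0,  547, 3973],
}

recall = {}
precision = {}
for i, cls in enumerate(classes):
    row_total = sum(counts[cls])                    # all pages really in cls
    col_total = sum(counts[r][i] for r in classes)  # all pages predicted as cls
    recall[cls] = counts[cls][i] / row_total
    precision[cls] = counts[cls][i] / col_total
    print(cls, round(recall[cls], 3), round(precision[cls], 3))
```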

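And finally we can view the performance by class. Intriguingly, it is better than what we got with the batteries-included model for the four lower classes (Stub, Start, C, and B), though worse for GA and FA, which have only a few dozen test pages between them.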

In [35]:
for c in classic_order:
    print(c, norm_results.loc[c][c])
Stub 0.865200348432
Start 0.671712393567
C 0.33734939759
B 0.41935483871
GA 0.297297297297
FA 0.266666666667

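We can see that having a large number of stubs to train on gives us a high hit rate in classifying them. Strictly speaking, this row-normalized figure is recall (the share of real stubs that were predicted as stubs) rather than precision.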

So there you have it: a brief play with Wiki-Class, an easy way to get rough quality estimates out of your data. If you build any further examples using this package, I'd be intrigued to see and collaborate on them.

