Stylometry in The Federalist Papers

NOTE: This project was originally carried out as part of a graduate seminar held at McGill University; the text that follows is therefore in English.

For this project, I have decided to work with the Federalist Papers, a collection of 85 seminal political theory articles published by James Madison, Alexander Hamilton and John Jay between October 1787 and May 1788. These papers, published as the debate over the ratification of the Constitution of the United States was raging, presented the case for the system of government that the U.S. ultimately adopted.

Corpus description

The Federalist Papers were published anonymously, under the shared pseudonym "Publius", and several of them have remained of uncertain authorship to this day.

According to a famous two-part 1944 article by Douglass Adair (1), neither Madison nor Hamilton wanted the true authorship of some of the papers to be known during their lifetimes because they had come to bitterly regret some of the positions they had espoused in the Federalist after the Constitution came into effect. However, as Hamilton prepared for his 1804 duel with then-sitting Vice-President of the United States Aaron Burr, in which Hamilton was killed, he left a handwritten note claiming authorship of 63 of the 85 papers; probably, according to Adair, to make sure that posterity would view him as the senior author and Madison and Jay as, at most, his lieutenants. Madison later refuted some of Hamilton's claims, stating that he (and not Hamilton) had written papers 49 through 58, 62 and 63, and that he had been the sole (or at least principal) author of papers 18, 19 and 20 for which Hamilton had claimed equal credit.

Historians, computer scientists and linguistics specialists have debated the issue ever since, the problem being compounded by the fact that, in the words of David Holmes and Richard Forsyth, the two men's writing styles are "unusually similar" (2). Most recent studies agree that Madison wrote the disputed papers (3); however, Collins et al. have cast new doubt on the issue by looking for traces of collaborative authorship and classifying four of the disputed papers as primarily Hamilton's work (4).

Furthermore, there was also briefly some uncertainty as to whether Hamilton or Jay had written Federalist 64. In his list, Hamilton credited Jay with #54 and himself with #64; however, a draft of #64 was later found in Jay's personal papers (1, p. 239) and the historical consensus is that Hamilton (who was understandably distracted by his impending duel) simply made a transcription mistake and swapped the two numbers.

Interpretive questions

This project will implement some of the techniques described in the authorship attribution literature and apply them to the Federalist papers.

I will seek out answers to two primary questions:

  • What hints about the true authorship of the disputed Federalist papers can we find with these relatively simple techniques?
  • How applicable are these techniques in a context like this one, where the corpus is relatively small and the number of candidate authors is limited to two, or at most three?

1.0 Corpus acquisition and manipulation

1.1 Getting the source file from Gutenberg

The Project Gutenberg archives contain two versions of the Federalist Papers. I have decided to use this one: http://www.gutenberg.org/cache/epub/1404/pg1404.txt

The reason? It is the most internally consistent in terms of format. For example, each paper begins with a string of the form "FEDERALIST No. XX", which makes it relatively easy to split the book-length file into individual papers; the other version occasionally inserts a period between "FEDERALIST" and "No.", which makes manipulation awkward.

As usual, since Gutenberg is suspicious of multiple downloads of the same file, we will grab a copy and store it locally.

In [26]:
import urllib.request
federalistURL = "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
federalistString = urllib.request.urlopen( federalistURL ).read().decode()

import os
directory = "data"
if not os.path.exists( directory ):
    os.makedirs( directory )
    
fic = open( "data/federalist.txt", "w" )
fic.write( federalistString )
fic.close()

1.2 Extracting the Federalist Papers from the source file

OK. Now that we have a local copy of the collection, let's split it into a separate file for each of the 85 papers. First, load the data:

In [27]:
fic = open( "data/federalist.txt", "r" )
federalistString = fic.read( )
fic.close()

print( federalistString[ :200 ] )
The Project Gutenberg EBook of The Federalist Papers, by
Alexander Hamilton, John Jay, and James Madison

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whats

A quick look at the file's contents shows us that everything up to the first occurrence of "FEDERALIST No." is extraneous material that can be ignored. The same goes for everything that follows "End of the Project Gutenberg EBook". Let's strip this material away.

In [28]:
# Strip the file header and footer
startIndex = federalistString.find( "FEDERALIST No." ) 
endIndex = federalistString.find( "End of the Project Gutenberg EBook of The Federalist Papers" )
federalistStringNoHeaderFooter = federalistString[ startIndex : endIndex ]

We now have a string that contains the 85 papers and nothing else. Let's split it into individual paper-sized strings, using the string class's split() method.

Most of the time, this method is used to split a string into words by looking for white space to use as a separator. But we will be twisting its purpose by splitting over the "FEDERALIST No." tag and treating entire Federalist Papers as "words". Note that the method's behaviour can sometimes be annoying: in our case, for example, it yields 86 strings instead of 85, the first of which is an empty string representing the (non-existent) material before the first separator.
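A quick toy example (a made-up string, unrelated to the corpus itself) shows this behaviour:

```python
# str.split() yields an empty first element when the string starts
# with the separator: two separators produce three pieces.
parts = "XXfirstXXsecond".split( "XX" )
print( parts )  # ['', 'first', 'second']
```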

In [29]:
# Divide into 85 separate files
papersList = federalistStringNoHeaderFooter.split( "FEDERALIST No.", 85 )

# Since split() removes the separator, let's return it to each paper by hand in case we end up using it sometime.
papersList = [ "FEDERALIST No." + paper for paper in papersList ]

# And now, save the files. Remember that the first entry in papersList is a dummy that we need to 
# ignore, thus the slice in the for loop 
currentPaper = 1
for paper in papersList[ 1: ]:
    currentPaperFileName = "data/federalist_{0}.txt".format( currentPaper )
    fic = open( currentPaperFileName, "w" )
    fic.write( paper )
    fic.close()
    currentPaper += 1
    

Good. We now have 85 files with names like "federalist_72.txt" in our data archive. From now on, we'll be able to play around with whatever subset of the Papers that we want.

1.3 Constructing sub-corpora

To study each of the three authors' styles, we will construct sub-corpora containing all of the papers which they unquestionably wrote. We will also build a collection of the papers whose authorship is disputed between Hamilton and Madison to see whether they, as a group, resemble the writings of one man rather than the other. Finally, we will create a fifth sub-corpus consisting solely of Federalist 64.

To do so, we will rebuild a single string out of the files associated with the papers written by each author. A function will come in handy for this purpose:

In [30]:
# A function that concatenates a list of text files into a single string
def read_files_into_string( fileList ):
    theString = ""
    for eachFile in fileList:
        fic = open( "data/federalist_{0}.txt".format( eachFile ), "r" )
        theString += fic.read()
        fic.close()
    return theString
        
In [31]:
# Define the lists of papers in the sub-corpora
madisonPapersList = [ 10, 14, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 62, 63]
hamiltonPapersList = [ 1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21, 22, 23, 24, 25, 26, 27, 28, 29, \
                      30, 31, 32, 33, 34, 35, 36, 59, 60, 61, 65, 66, 67, 68, 69, 70, 71, 72, 73, \
                      74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85 ]
jayPapersList = [ 2, 3, 4, 5 ]
disputedPapersList = [ 18, 19, 20, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58 ]
testCaseList = [ 64 ]

# Make a dictionary out of the sub-corpora
federalistByAuthor = dict()
federalistByAuthor[ "Madison" ] = read_files_into_string( madisonPapersList )
federalistByAuthor[ "Hamilton" ] = read_files_into_string( hamiltonPapersList )
federalistByAuthor[ "Jay" ] = read_files_into_string( jayPapersList )
federalistByAuthor[ "Disputed" ] = read_files_into_string( disputedPapersList )
federalistByAuthor[ "TestCase" ] = read_files_into_string( testCaseList )

2.0 Hamilton vs Madison

We will now apply a number of techniques to compare the Madison and Hamilton sub-corpora with each other and to see whether the disputed ones, as a group, resemble one more than the other.

2.1 Mendenhall's Characteristic Curves of Composition (1887)

Let's start by going old school and looking at what T. C. Mendenhall (5) identifies as a signature of authorship: the frequency at which an author uses words of different lengths.

To do this, we will take each individual sub-corpus, extract the lengths of the words in it, and then plot the frequency distribution of word lengths in a graph. For this purpose, we will retain all of the word tokens, as small function words are at least as likely to carry an author's stylistic signature as longer "meaningful" words.

In [32]:
# Setup procedure
import nltk
%matplotlib inline

# Tokenize the sub-corpora. We tokenize Jay's texts right away, even though we don't consider
# them at this point, because they'll be useful later on.
federalistByAuthorTokens = dict()
federalistByAuthorLengthDistributions = dict()
for subcorpus in [ "Hamilton", "Madison", "Disputed", "Jay" ]:
    tokens = nltk.word_tokenize( federalistByAuthor[ subcorpus ] )
    
    # Filter out punctuation
    federalistByAuthorTokens[ subcorpus ] = [ token.lower() for token in tokens \
                                             if any (c.isalpha() for c in token) ]
   
    # Get a distribution of token lengths
    tokenLengths = [ len( token ) for token in federalistByAuthorTokens[ subcorpus ] ]
    federalistByAuthorLengthDistributions[ subcorpus ] = nltk.FreqDist( tokenLengths )
    federalistByAuthorLengthDistributions[ subcorpus ].plot( 15, title = subcorpus )
    

Interesting: for 2- to 6-letter words, the shape of the characteristic curve for the disputed papers bears a striking resemblance to the one computed from Madison's papers and they are both quite clearly different from the distribution in Hamilton's. In the tail end of the distribution, the differences are far less clear, but the first piece of evidence points Madison's way.

2.2 Common words and n-grams

Let us now take a look at the authors' favourite words and expressions to see if we can identify patterns. First, let's examine the 10 most frequent words used in each of the sub-corpora:

In [33]:
federalistByAuthorTokenDistributions = dict()
for subcorpus in [ "Hamilton", "Madison", "Disputed" ]:
    federalistByAuthorTokenDistributions[ subcorpus ] = nltk.FreqDist( federalistByAuthorTokens[ subcorpus ] )
    print( "Favourite words for:", subcorpus, ":", \
          federalistByAuthorTokenDistributions[ subcorpus ].most_common( 10 ), "\n" )
Favourite words for: Hamilton : [('the', 10598), ('of', 7370), ('to', 4614), ('in', 2833), ('and', 2730), ('a', 2507), ('be', 2300), ('that', 1717), ('it', 1549), ('is', 1330)] 

Favourite words for: Madison : [('the', 4435), ('of', 2668), ('to', 1435), ('and', 1306), ('in', 926), ('a', 904), ('be', 876), ('that', 627), ('it', 568), ('is', 554)] 

Favourite words for: Disputed : [('the', 2454), ('of', 1488), ('to', 758), ('and', 671), ('in', 538), ('be', 491), ('a', 488), ('that', 299), ('it', 295), ('which', 275)] 

Another point for Madison, albeit a less convincing one. Both authors' top-10 lists share 9 of the 10 words used most often in the disputed papers; however, for Madison, 8 of those 9 words occupy exactly the same positions in the frequency distribution as in the disputed papers, whereas for Hamilton only 5 of the 9 do.

Now, let's look at bigrams:

In [34]:
federalistByAuthorText = dict()
for subcorpus in [ "Hamilton", "Madison", "Disputed" ]:
    federalistByAuthorText[ subcorpus ] = nltk.Text( federalistByAuthorTokens[ subcorpus ] )
    print( "Favourite bigrams for", subcorpus, ":\n" )
    federalistByAuthorText[ subcorpus ].collocations( 20 )
    print( "\n" )
Favourite bigrams for Hamilton :

new york; united states; supreme court; national government; great
britain; state governments; independent journal; federal government;
two thirds; proposed constitution; york packet; chief magistrate;
state legislatures; legislative body; publius federalist; standing
armies; military establishments; national legislature; packet tuesday;
journal wednesday


Favourite bigrams for Madison :

united states; new york; state governments; federal government;
publius federalist; january madison; legislative department; state
legislatures; several states; judiciary departments; independent
journal; public good; judiciary department; great britain; legislative
executive; executive department; general government; proposed
constitution; general welfare; york packet


Favourite bigrams for Disputed :

new york; february madison; united states; publius federalist;
biennial elections; rhode island; independent journal; republican
government; york packet; south carolina; packet tuesday; federal
legislature; executive magistrate; journal saturday; thirty thousand;
great britain; federal constitution; three years; would probably;
different states


Not much of value to see here, especially since some of the frequent bigrams are not content words at all but formatting artifacts (PUBLIUS FEDERALIST) or information about the newspaper in which the Federalist was published. Let's try trigrams:

In [35]:
federalistByAuthorTrigrams = dict()
for subcorpus in [ "Hamilton", "Madison", "Disputed" ]:
    federalistByAuthorTrigrams[ subcorpus ] = \
        list( nltk.ngrams( federalistByAuthorTokens[ subcorpus ], 3 ) )
    print( "Favourite trigrams for", subcorpus, ":\n" )
    trigramDist = nltk.FreqDist( federalistByAuthorTrigrams[ subcorpus ] )
    print( trigramDist.most_common( 10 ), "\n\n" )
    
    
Favourite trigrams for Hamilton :

[(('of', 'the', 'state'), 135), (('of', 'the', 'union'), 132), (('the', 'united', 'states'), 125), (('of', 'the', 'people'), 104), (('the', 'power', 'of'), 103), (('of', 'new', 'york'), 83), (('the', 'people', 'of'), 77), (('of', 'the', 'united'), 75), (('of', 'the', 'national'), 71), (('to', 'the', 'people'), 67)] 


Favourite trigrams for Madison :

[(('of', 'the', 'people'), 71), (('of', 'the', 'state'), 54), (('the', 'united', 'states'), 49), (('the', 'people', 'of'), 42), (('of', 'the', 'states'), 39), (('of', 'the', 'union'), 38), (('the', 'federal', 'government'), 36), (('members', 'of', 'the'), 34), (('of', 'the', 'federal'), 33), (('the', 'state', 'governments'), 32)] 


Favourite trigrams for Disputed :

[(('of', 'the', 'people'), 37), (('the', 'house', 'of'), 31), (('of', 'the', 'state'), 29), (('house', 'of', 'representatives'), 28), (('to', 'the', 'people'), 25), (('the', 'people', 'of'), 25), (('the', 'number', 'of'), 23), (('of', 'the', 'government'), 23), (('ought', 'to', 'be'), 20), (('people', 'of', 'the'), 17)] 


Four out of the top 10 trigrams in Hamilton's papers also appear in the top 10 for the disputed papers, none of them in the "right" rank. For Madison, we get 3 out of 10 matches, but the highest-ranked trigram is the same in both sets. Not much valuable evidence either way.

Maybe we'll have more luck looking at content words only: if the disputed papers discuss the same kinds of topics as those written by Madison or Hamilton, that might be valuable information. For this purpose, we will filter out stopwords and lemmatize the rest.

In [36]:
# Create the data structures
federalistByAuthorContentTokens = dict()
federalistByAuthorContentFreqDist = dict()

# Setup for filtering and lemmatizing
stopwords = nltk.corpus.stopwords.words( "english" )
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

# Build lists of content-word lemmas and plot their distributions
for subcorpus in [ "Hamilton", "Madison", "Disputed" ]:
    federalistByAuthorContentTokens[ subcorpus ] = [ wnl.lemmatize( token ) \
                                                    for token in federalistByAuthorTokens[ subcorpus ] \
                                                    if not token in stopwords ]
    federalistByAuthorContentFreqDist[ subcorpus ] = \
        nltk.FreqDist( federalistByAuthorContentTokens[ subcorpus ] )
    federalistByAuthorContentFreqDist[ subcorpus ].plot( 20, title = subcorpus )

As expected in such a corpus, words such as "state", "government", "power", "constitution" and "federal" appear quite often no matter who the author of a particular paper happens to be. At first glance, the oddest thing in these distributions is the surprisingly frequent use of "would" in Hamilton's papers compared with the others, which may signal that he did not write the disputed papers. However, the fact that "power" appears in third place in both the Hamiltonian and Madisonian corpora and only in 13th position in the disputed papers also seems very strange and might suggest that either the topics covered by the disputed papers differ significantly from the others or that they were written by someone else entirely.

Let's take a deeper look at the top 50 words in the frequency distributions, computationally:

In [37]:
# How many of the 50 most frequent words in the disputed papers are also among the top 50 in
# Hamilton's own? And in Madison's?
hamiltonTop50 = [ word for (word, freq) \
                 in federalistByAuthorContentFreqDist[ "Hamilton" ].most_common( 50 ) ]
madisonTop50 = [ word for (word, freq) \
                 in federalistByAuthorContentFreqDist[ "Madison" ].most_common( 50 ) ]
disputedTop50 = [ word for (word, freq) \
                 in federalistByAuthorContentFreqDist[ "Disputed" ].most_common( 50 ) ]
hamiltonHowMany = len( [ word for word in hamiltonTop50 if word in disputedTop50 ] )
madisonHowMany = len( [ word for word in madisonTop50 if word in disputedTop50 ] )
print( "Of Hamilton's top 50, {0} appear in the disputed papers.".format( hamiltonHowMany ) )
print( "Of Madison's top 50, {0} appear in the disputed papers.".format( madisonHowMany ) )
Of Hamilton's top 50, 31 appear in the disputed papers.
Of Madison's top 50, 33 appear in the disputed papers.

Not very significant. Perhaps if, instead of looking at the mere presence of the words, we looked at their relative positions in the lists? For this purpose, we can use the list's index() function, which returns the position of the first instance of an object in the list; since our lists never contain more than one instance of a word because they have been built out of frequency distributions, this should work nicely.

In [38]:
# A little helper function to calculate the distances between the positions of words in List1
# and the positions of the same words in List2; if the words aren't in List2 at all, assign a 
# large distance
def calc_distances_between_lists( list1, list2 ):
    dist = 0
    for word in list1:
        if word in list2:
            dist += abs( list1.index( word ) - list2.index( word ) )
        else:
            dist += 50  # If the words don't match, they are far, far away
    return dist

print( "Hamilton's distances:", calc_distances_between_lists( hamiltonTop50, disputedTop50 ) )
print( "Madison's distances:", calc_distances_between_lists( madisonTop50, disputedTop50 ) )
Hamilton's distances: 1233
Madison's distances: 1169

Again, not a huge difference, but one that points towards Madison possibly being responsible for more of the content of the disputed papers.

Overall, simple word and n-gram counts don't yield very convincing evidence. Let's go on to an entirely different metric.

2.3 Kilgarriff's Chi-Squared Statistic (2001)

In a 2001 paper, Adam Kilgarriff recommends using the chi-squared statistic to compare two corpora (6). Chi-squared's most common application is in testing two variables for statistical independence, which is not what we're after in this case. In Kilgarriff's words: "... the statistic is not in general appropriate for hypothesis-testing in corpus linguistics: a corpus is never a random sample of words, so the null hypothesis [of independence] is of no interest. But once divested of the hypothesis-testing link, [chi-squared] is suitable."

The way to apply the statistic is the following:

  • For each of the N most common words in the union of the two corpora, calculate the number of occurrences in each corpus that would be expected if both corpora were random samples of words drawn from the same population. (Basically, a weighted mean over the two corpora.)
  • The chi-squared statistic is computed by summing, over the N most common words, the squares of the differences between the observed frequencies and the expected frequencies, divided by the expected frequencies. (The usual formula for chi-squared.)

The statistic gives a measure of the difference between two corpora, for example Hamilton and Disputed. Repeating the procedure for Madison vs Disputed, we can see which of Hamilton or Madison is statistically "closer" to Disputed.
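In symbols (my notation, not Kilgarriff's): for each word $w$ among the $N$ most common, let $O_{c,w}$ and $O_{d,w}$ be the observed counts in the candidate and disputed corpora, and let $n_c$ and $n_d$ be the corpora's total token counts. The expected counts and the statistic are then:

```latex
E_{c,w} = \frac{n_c}{n_c + n_d}\,(O_{c,w} + O_{d,w}), \qquad
E_{d,w} = \frac{n_d}{n_c + n_d}\,(O_{c,w} + O_{d,w})

\chi^2 = \sum_{w=1}^{N} \left[ \frac{(O_{c,w} - E_{c,w})^2}{E_{c,w}}
       + \frac{(O_{d,w} - E_{d,w})^2}{E_{d,w}} \right]
```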

Kilgarriff uses raw word tokens instead of lemmas and achieves his best results with 320 to 640 "most common words". I ran the test with 50, 100, 200 and 500 words; in all cases, the Madison corpus was much closer to the Disputed corpus than Hamilton's. Here are the results for 500 common words:

In [39]:
for candidate in [ "Hamilton", "Madison" ]:
    # First, build a joint corpus and identify the most frequent words in it
    # We'll keep the stopwords since they are commonly used in authorship attribution studies
    jointCorpus = federalistByAuthorTokens[ candidate ] + federalistByAuthorTokens[ "Disputed" ]
    jointFreqDist = nltk.FreqDist( jointCorpus )
    mostCommonInJointCorpus = list( jointFreqDist.most_common( 500 ) )

    # What proportion of the joint corpus is made up of the candidate corpus' tokens?
    candidateShareInJointCorpus = len( federalistByAuthorTokens[ candidate ] ) / len( jointCorpus )
    
    # Now, let's look at these 500 words in the candidate author's corpus and compare the number of
    # times each can be observed to what would be expected if the candidate corpus and the Disputed
    # corpus were both random samples from the same distribution.
    chisquared = 0
    for word, jointCount in mostCommonInJointCorpus:
        
        # How often do we really see it?
        candidateCount = federalistByAuthorTokens[ candidate ].count( word )
        disputedCount = federalistByAuthorTokens[ "Disputed" ].count( word )
        
        # How often should we see it?
        expCandidateCount = jointCount * candidateShareInJointCorpus
        expDisputedCount = jointCount * ( 1 - candidateShareInJointCorpus )
        
        # Add the word's contribution to the chi-squared statistic
        chisquared += ( candidateCount - expCandidateCount ) * \
                    ( candidateCount - expCandidateCount ) / expCandidateCount
                    
        chisquared += ( disputedCount - expDisputedCount ) * \
                    ( disputedCount - expDisputedCount ) / expDisputedCount
        
    print( "The Chi-squared statistic for candidate", candidate, "is", chisquared )
The Chi-squared statistic for candidate Hamilton is 2997.851461213525
The Chi-squared statistic for candidate Madison is 1533.5696224509918

So another point for Madison, and the most quantitatively impressive so far.

2.4 Conclusions

To recap:

  • A visual assessment of the characteristic curves points towards Madison.
  • Word and ngram counts give a slight edge to Madison.
  • The chi-squared test bends towards Madison in a big way.

This is getting pretty convincing: it looks like Madison probably wrote a majority of the Disputed papers.

3.0 The Strange Case of Federalist 64

Federalist 64 is the only paper for which John Jay is part of the discussion, if only because of Hamilton's carelessness.

Let's make things even more interesting. For the sake of argument, let's assume that we have four candidates instead of two or three: Hamilton, Jay, Madison, and some unknown author who would have penned the other Disputed papers. There is no reason in the historical record to believe that some fourth person got involved in the creation of the Federalist papers, although Hamilton may have (unsuccessfully) tried to convince his future Assistant Secretary of the Treasury William Duer to contribute to the project. However, positing a fourth author will allow us to study whether #64 is closest to Hamilton's known work, to Madison's, to Jay's, or to the other disputed papers.

Since the Burrows Delta method (7) is quite effective at picking the most likely author of a text from a set of candidates of arbitrary size, at least up to 25, we will apply this method to Federalist 64.

3.1 The Burrows Delta (2002): a description

The method is designed to measure the differences between an individual author's style and the "average" style of a set of authors. The method works like this:

  • Find the N most frequent word tokens in the corpus as a whole, to use as features. It is recommended to apply parts-of-speech tagging to the tokens beforehand, so that the same token used as two different parts of speech may count as two features.

  • Divide the corpus into M subcorpora: one for each author.

  • For each of the N features, calculate frequencies of occurrence in each of the M authorial subcorpora, as a percentage of the total number of POS-tagged word tokens in this particular subcorpus. Then calculate the mean and the standard deviation of these M values and use them as the official mean and standard deviation for this feature over the whole corpus. (We use a "mean of means" instead of calculating a single frequency for the entire corpus to avoid a larger subcorpus, like Hamilton's in our case, over-influencing the results in its favor and defining the "norm" in such a way that everything would be expected to look like it.)

  • For each of the N features and M subcorpora, calculate a z-score describing how far away from the "corpus norm" the usage of this particular feature in this subcorpus happens to be. To do this, subtract the corpus average for the feature from the feature's frequency in the subcorpus and divide the result by the feature's standard deviation.

  • Also calculate z-scores for each feature in the test case (i.e., Federalist 64).

  • Calculate a delta score comparing the test case with each candidate's subcorpus. To do so, take the average of the absolute values of the differences of the z-scores for each feature in the test case and in the candidate's subcorpus. This gives equal weight to each feature, no matter their respective frequencies of occurrence in the texts; otherwise, Zipf's Law would ensure that the top 3 or 4 features would overwhelm everything else.

  • The "winning" candidate is the one for whom the delta score is the lowest.

In his article, Burrows uses a grand corpus formed by assembling poems by 25 different writers. In our case, there are only 4 "authors" to deal with; this may be a problem since calculating standard deviations from a sample size of 4 is not ideal but we'll see what happens.
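The delta score in the last two steps boils down to a mean of absolute z-score differences. Here is a minimal sketch, using invented feature names and z-scores rather than the actual corpus values:

```python
# Minimal sketch of the Burrows Delta score: the mean of the absolute
# differences between the test case's z-scores and a candidate's,
# taken over the shared feature set. All values below are invented.
def burrows_delta( testZScores, candidateZScores ):
    features = list( testZScores.keys() )
    total = sum( abs( testZScores[ f ] - candidateZScores[ f ] ) for f in features )
    return total / len( features )

testCase = { "the": 0.5, "of": -1.0 }
candidateA = { "the": 0.4, "of": -0.8 }    # close to the test case
candidateB = { "the": -1.2, "of": 1.5 }    # far from the test case
print( burrows_delta( testCase, candidateA ) )  # smallest delta (about 0.15): A "wins"
print( burrows_delta( testCase, candidateB ) )  # much larger delta (about 2.1)
```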

3.2 The most frequent words in the entire corpus

Since we already have the subcorpora for our 4 "candidates", we will POS-tag them first before merging them into a single corpus for the purpose of finding features. This way, we will avoid having to perform the same tagging twice.

In [40]:
candidateList = [ "Hamilton", "Madison", "Jay", "Disputed" ]
federalistByAuthorPOS = dict()
for candidate in candidateList:
    federalistByAuthorPOS[ candidate ] = nltk.pos_tag( federalistByAuthorTokens[ candidate ] )
    print( federalistByAuthorPOS[ candidate ][ :10 ] )
    
[('federalist', 'NN'), ('no', 'DT'), ('general', 'JJ'), ('introduction', 'NN'), ('for', 'IN'), ('the', 'DT'), ('independent', 'JJ'), ('journal', 'NN'), ('saturday', 'NN'), ('october', 'NN')]
[('federalist', 'NN'), ('no', 'DT'), ('the', 'DT'), ('same', 'JJ'), ('subject', 'NN'), ('continued', 'VBD'), ('the', 'DT'), ('union', 'NN'), ('as', 'IN'), ('a', 'DT')]
[('federalist', 'NN'), ('no', 'DT'), ('concerning', 'NN'), ('dangers', 'NNS'), ('from', 'IN'), ('foreign', 'JJ'), ('force', 'NN'), ('and', 'CC'), ('influence', 'NN'), ('for', 'IN')]
[('federalist', 'NN'), ('no', 'DT'), ('the', 'DT'), ('same', 'JJ'), ('subject', 'NN'), ('continued', 'VBD'), ('the', 'DT'), ('insufficiency', 'NN'), ('of', 'IN'), ('the', 'DT')]

This took more than 4 minutes so doing the work once is going to be enough! Now, let's combine the candidates' subcorpora into a single corpus and find the Top 30 most frequent (word, pos) pairs.

In [41]:
# Combine into a single corpus
wholeCorpusPOS = []
for candidate in candidateList:
    wholeCorpusPOS += federalistByAuthorPOS[ candidate ]
    
# Get a frequency distribution
wholeCorpusPOSFreqsTop30 = list( nltk.FreqDist( wholeCorpusPOS ).most_common( 30 ) )
wholeCorpusPOSFreqsTop30[ :10 ]
Out[41]:
[(('the', 'DT'), 17846),
 (('of', 'IN'), 11796),
 (('to', 'TO'), 7012),
 (('and', 'CC'), 5016),
 (('in', 'IN'), 4385),
 (('a', 'DT'), 3967),
 (('be', 'VB'), 3752),
 (('it', 'PRP'), 2520),
 (('is', 'VBZ'), 2178),
 (('which', 'WDT'), 2053)]

OK, now all that we truly want out of this is the (word, pos) pairs that we will be looking for in the subcorpora. Thus:

In [42]:
featuresList = [ wordpospair for ( wordpospair, freq ) in wholeCorpusPOSFreqsTop30 ]
featuresList[ :10 ]
Out[42]:
[('the', 'DT'),
 ('of', 'IN'),
 ('to', 'TO'),
 ('and', 'CC'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('be', 'VB'),
 ('it', 'PRP'),
 ('is', 'VBZ'),
 ('which', 'WDT')]

3.3 Calculating features for each subcorpus

Let's look at the frequencies of each feature in each candidate's subcorpus, as a proportion of the total number of tokens in the subcorpus. We'll calculate these values and store them in a dictionary of dictionaries, the most convenient way I know to build a two-dimensional array in Python.

In [43]:
# The main data structure
featureFrequencies = dict()

for candidate in candidateList:
    # A dictionary for each candidate's features
    featureFrequencies[ candidate ] = dict()  
    
    # A helper value containing the number of (token, pos) pairs in the subcorpus
    overall = len( federalistByAuthorPOS[ candidate] )
    
    # Calculate each feature's presence in the subcorpus
    for feature in featuresList:
        presence = federalistByAuthorPOS[ candidate ].count( feature )
        featureFrequencies[ candidate ][ feature ] = presence / overall

3.4 The corpus averages and standard deviations

Given the feature frequencies for all four subcorpora, we can find a "mean of means" and a standard deviation for each feature. We'll store these values in a 2D array called corpusFeatures.

In [44]:
import math

# The data structure into which we will be storing the "corpus standard" statistics
corpusFeatures = dict()

# For each feature...
for feature in featuresList:
    # Create a sub-dictionary that will contain the feature's mean and standard deviation
    corpusFeatures[ feature ] = dict()
    
    # Calculate the mean of the frequencies expressed in the subcorpora
    featureAverage = 0
    for candidate in candidateList:
        featureAverage += featureFrequencies[ candidate ][ feature ]
    featureAverage /= len( candidateList )
    corpusFeatures[ feature ][ "Mean" ] = featureAverage
    
    # Calculate the standard deviation using the basic formula for a sample
    featureStdDev = 0
    for candidate in candidateList:
        diff = featureFrequencies[ candidate ][ feature ] - corpusFeatures[ feature ][ "Mean" ]
        featureStdDev += ( diff * diff )
    featureStdDev /= ( len( candidateList ) - 1 )
    featureStdDev = math.sqrt( featureStdDev )
    corpusFeatures[ feature ][ "StdDev" ] = featureStdDev

3.5 Calculating z-scores for the 4 candidates

Next, we transform the observed feature frequencies in the four candidates' subcorpora into z-scores describing how far away from the "corpus norm" these observations are. Nothing fancy here: I merely apply the definition of the z-score and store the results into yet another 2D array.

In [45]:
featureZScores = dict()
for candidate in candidateList:
    featureZScores[ candidate ] = dict()
    for feature in featuresList:
        
        # Z-score definition = (value - mean) / stddev
        # We use intermediate variables to make the code easier to read
        featureVal = featureFrequencies[ candidate ][ feature ]
        featureMean = corpusFeatures[ feature ][ "Mean" ]
        featureStdDev = corpusFeatures[ feature ][ "StdDev" ]
        featureZScores[ candidate ][ feature ] = ( featureVal - featureMean ) / featureStdDev

3.6 The test case: Federalist 64

OK, now we need to tokenize the test case, extract features and their frequencies, and calculate z-scores. This duplicates some of the preceding code; we could have folded the test case into the earlier loops, but that would have made the code harder to read.

In [46]:
# Tokenize the test case
testCaseTokens = nltk.word_tokenize( federalistByAuthor[ "TestCase" ] )
    
# Filter out punctuation
testCaseTokens = [ token.lower() for token in testCaseTokens
                                 if any( c.isalpha() for c in token ) ]
 
# Tag the test case for parts of speech
testCaseTokensPOS = nltk.pos_tag( testCaseTokens )

# Calculate the test case's features
overall = len( testCaseTokensPOS )
testCaseFeatureFrequencies = dict()
for feature in featuresList:
    presence = testCaseTokensPOS.count( feature )
    testCaseFeatureFrequencies[ feature ] = presence / overall
    
# Calculate the test case's feature z-scores
testCaseZScores = dict()
for feature in featuresList:
    featureVal = testCaseFeatureFrequencies[ feature ]
    featureMean = corpusFeatures[ feature ][ "Mean" ]
    featureStdDev = corpusFeatures[ feature ][ "StdDev" ]
    testCaseZScores[ feature ] = ( featureVal - featureMean ) / featureStdDev
    print( "Test case z-score for feature", feature, "is", testCaseZScores[ feature ] )
    
Test case z-score for feature ('the', 'DT') is -0.5715580470247853
Test case z-score for feature ('of', 'IN') is -1.5305700230155077
Test case z-score for feature ('to', 'TO') is 0.8614793246103539
Test case z-score for feature ('and', 'CC') is 0.9923420641528881
Test case z-score for feature ('in', 'IN') is 0.4020915007213734
Test case z-score for feature ('a', 'DT') is -0.9117144930479929
Test case z-score for feature ('be', 'VB') is 3.211624164136822
Test case z-score for feature ('it', 'PRP') is -0.4225688536389689
Test case z-score for feature ('is', 'VBZ') is -1.0961185455958937
Test case z-score for feature ('which', 'WDT') is -1.8617352051613247
Test case z-score for feature ('that', 'IN') is 3.4609374852197483
Test case z-score for feature ('by', 'IN') is 1.4958416606713247
Test case z-score for feature ('as', 'IN') is 9.90787889195006
Test case z-score for feature ('this', 'DT') is -1.0795620198226084
Test case z-score for feature ('not', 'RB') is 1.636646994434489
Test case z-score for feature ('would', 'MD') is -1.2087806882807304
Test case z-score for feature ('for', 'IN') is -1.9510710252068257
Test case z-score for feature ('will', 'MD') is 4.10229975683253
Test case z-score for feature ('or', 'CC') is -0.4629523405451141
Test case z-score for feature ('from', 'IN') is -0.6428230279149659
Test case z-score for feature ('their', 'PRP$') is 0.9313818415056075
Test case z-score for feature ('with', 'IN') is 0.009860834778102507
Test case z-score for feature ('are', 'VBP') is 6.406235972339704
Test case z-score for feature ('on', 'IN') is -0.1254483939660962
Test case z-score for feature ('an', 'DT') is -0.7038566628061753
Test case z-score for feature ('they', 'PRP') is 3.751454855609827
Test case z-score for feature ('government', 'NN') is -3.002488881246878
Test case z-score for feature ('states', 'NNS') is -1.6770915879061807
Test case z-score for feature ('may', 'MD') is 1.9500329999248902
Test case z-score for feature ('been', 'VBN') is -1.5205160129401523

3.7 Calculating Delta values

Now, all that remains is to calculate a delta value for the test case compared to each candidate subcorpus and to figure out the best match. Reminder: the delta score is the average of the absolute values of the differences between the test case's z-scores and the candidate corpus' z-scores for all of the features.

In [47]:
for candidate in candidateList:
    delta = 0
    for feature in featuresList:
        delta += math.fabs( testCaseZScores[ feature ] - featureZScores[ candidate ][ feature ] )
    delta /= len( featuresList )
    print( "Delta score for candidate", candidate, "is", delta )
Delta score for candidate Hamilton is 2.2728145489103126
Delta score for candidate Madison is 2.101912545169431
Delta score for candidate Jay is 1.9528815383895344
Delta score for candidate Disputed is 1.9810301574357099

3.8 Conclusions

And after all of that work, we end up with the expected result: while all three "real" candidates are quite close in delta values, John Jay is identified as the most likely author of Federalist 64 -- but by the slimmest of margins when compared with the non-existent author of the Disputed corpus. That does not necessarily mean that Jay helped write the other Disputed papers: the delta scores are based on absolute values of differences, and the underlying z-scores may lie on opposite sides of the mean.
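A tiny sketch, with made-up z-scores rather than values from the experiment above, shows why absolute differences can mask opposite stylistic tendencies:

```python
import math

# Hypothetical z-scores for a single feature (illustrative only):
# the test case sits exactly at the corpus mean, one candidate
# overuses the feature, and the other underuses it.
testCaseZ = 0.0
candidateA = 1.5    # uses the feature more than the corpus norm
candidateB = -1.5   # uses it less than the corpus norm

# Both candidates contribute the same amount to the delta score,
# even though their styles point in opposite directions.
contributionA = math.fabs( testCaseZ - candidateA )
contributionB = math.fabs( testCaseZ - candidateB )
print( contributionA, contributionB )  # both 1.5
```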

3.9 Notes about the Delta experiment

All of the Delta scores above are quite high: what they mean is that, on average, the frequency of each feature in the test case is about 2 standard deviations away from the corpus average.

The reason for this, I believe, is that the test case is quite short compared to the subcorpora. As a result, some features that appear a "few too many times" in the test case or that do not appear at all yield truly aberrant z-scores: for example, the feature ('as', 'IN') has a z-score of 9.9, which represents a frequency of occurrence in the test case 9.9 standard deviations above the corpus average. (As a comparison: an event a mere 7 standard deviations away from the mean is supposed to happen only once in 390 billion tries!)
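To see how little it takes, here is a back-of-the-envelope calculation with illustrative numbers (not the actual corpus statistics): in a short test case, a handful of "extra" occurrences of a feature is enough to produce an aberrant z-score.

```python
# Illustrative numbers only -- not taken from the actual corpus statistics.
corpusMean = 0.004     # a feature making up 0.4% of tokens on average
corpusStdDev = 0.0005  # small spread across the large subcorpora

testCaseLength = 2000  # a short test case
occurrences = 18       # a handful of "extra" occurrences...
frequency = occurrences / testCaseLength

# ...is enough to push the z-score into aberrant territory
zScore = ( frequency - corpusMean ) / corpusStdDev
print( zScore )  # 10.0
```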

This value, and a handful of others like it, grossly inflate the deltas. However, this is not a problem, for two reasons: first, because the aberrant z-scores inflate every candidate's delta by roughly the same amount; second, because the delta test is designed only to rank candidates -- the individual delta scores themselves are more or less meaningless.
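A quick demonstration with hypothetical z-scores (again, not values from the experiment) shows that an aberrant feature inflates every delta without disturbing the ranking:

```python
# Hypothetical z-scores for three features; "f3" is the aberrant one.
testZ = { "f1": 0.5, "f2": -1.0, "f3": 9.9 }
candidateZ = {
    "A": { "f1": 0.4, "f2": -0.8, "f3": 0.2 },
    "B": { "f1": -1.2, "f2": 1.1, "f3": -0.3 },
}

def delta( features ):
    """Average absolute difference between test-case and candidate z-scores."""
    return { name: sum( abs( testZ[ f ] - zs[ f ] ) for f in features ) / len( features )
             for name, zs in candidateZ.items() }

withAberrant = delta( [ "f1", "f2", "f3" ] )
withoutAberrant = delta( [ "f1", "f2" ] )

print( withAberrant )     # A ~ 3.33, B ~ 4.67
print( withoutAberrant )  # A ~ 0.15, B ~ 1.9

# Both deltas are inflated, but the ranking -- all that matters -- is unchanged
assert min( withAberrant, key=withAberrant.get ) == min( withoutAberrant, key=withoutAberrant.get )
```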

Note that an experiment with 100 features instead of 30 yielded a different result, with Hamilton in first place and Jay a close second. Two observations about this:

  • Burrows recommends treating any result in which the true author ends up in first or second place as a success. Therefore, the test with 100 features would also count as a "win" -- albeit one that would be more impressive if we had to pick between dozens of candidates instead of a mere four.
  • Given the relatively small size of the test case and of Jay's subcorpus, I feel that the test with 30 features is probably more reliable than the one with 100 features. Since many of the corpus' least prominent features appear in these small data sets once or not at all, the number of aberrant z-scores in the 100-feature test was truly remarkable, with one value reaching 14.3 standard deviations above average!

That being said, it seems that a larger test case and larger/more numerous candidate subcorpora might support a test using a larger feature set, thus making the test more "robust" and potentially more interesting.

4.0 Future Work

There are several other techniques that have been applied to authorship problems. Some, like the ones used by Holmes et al. to determine whether a set of letters purportedly written by Confederate general George Pickett were in fact his or forgeries written by his widow (8), involve multivariate statistics and the extraction of principal components. I am unfamiliar with these methods, but they have doubtless been implemented in various statistics libraries and packages; given enough time for research, it would be relatively straightforward to learn and apply them.

It would also be interesting to compare the Federalist with other texts penned by Hamilton, Madison and Jay, to see whether the papers differ in significant ways from the styles the same authors displayed before and after -- which might give us a clue as to the level of hidden co-authorship in these papers.

Appendix: References

(1) Douglass Adair, "The Authorship of the Disputed Federalist Papers", The William and Mary Quarterly, vol. 1, no. 2 (April 1944), pp. 97-122; and Adair, "The Authorship of the Disputed Federalist Papers: Part II", The William and Mary Quarterly, vol. 1, no. 3 (July 1944), pp. 235-264.

(2) David I. Holmes and Richard S. Forsyth, "The Federalist Revisited: New Directions in Authorship Attribution", Literary and Linguistic Computing, vol. 10, no. 2 (1995), pp. 111-127.

(3) Glenn Fung, "The disputed Federalist papers: SVM feature selection via concave minimization", TAPIA '03: Proceedings of the 2003 conference on Diversity in Computing, pp. 42-46.

(4) Jeff Collins, David Kaufer, Pantelis Vlachos, Brian Butler and Suguru Ishizaki, "Detecting Collaborations in Text: Comparing the Authors' Rhetorical Language Choices in The Federalist Papers", Computers and the Humanities 38 (2004), pp. 15-36.

(5) T. C. Mendenhall, "The Characteristic Curves of Composition", Science, vol. 9, no. 214 (Mar. 11, 1887), pp. 237-249.

(6) Adam Kilgarriff, "Comparing Corpora", International Journal of Corpus Linguistics, vol. 6, no. 1 (2001), pp. 97-133. Quote on page 254.

(7) John Burrows, "'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship", Literary and Linguistic Computing, vol. 17, no. 3 (2002), pp. 267-287.

(8) David I. Holmes, Lesley J. Gordon and Christine Wilson, "A Widow and her Soldier: Stylometry and the American Civil War", Literary and Linguistic Computing, vol. 16, no. 4 (2001), pp. 403-420.