NOTE: Since this project was originally carried out as part of a graduate seminar held at McGill University, the text that follows is in English.
For this project, I have decided to work with the Federalist Papers, a collection of 85 seminal political theory articles published by James Madison, Alexander Hamilton and John Jay between October 1787 and May 1788. These papers, published as the debate over the ratification of the Constitution of the United States was raging, presented the case for the system of government that the U.S. ultimately adopted.
The Federalist Papers were published anonymously, under the shared pseudonym "Publius", and several of them have remained of uncertain authorship to this day.
According to a famous two-part 1944 article by Douglass Adair (1), neither Madison nor Hamilton wanted the true authorship of some of the papers to be known during their lifetimes because they had come to bitterly regret some of the positions they had espoused in the Federalist after the Constitution came into effect. However, as Hamilton prepared for his 1804 duel with then-sitting Vice-President of the United States Aaron Burr, in which Hamilton was killed, he left a handwritten note claiming authorship of 63 of the 85 papers; probably, according to Adair, to make sure that posterity would view him as the senior author and Madison and Jay as, at most, his lieutenants. Madison later disputed some of Hamilton's claims, stating that he (and not Hamilton) had written papers 49 through 58, 62 and 63, and that he had been the sole (or at least principal) author of papers 18, 19 and 20 for which Hamilton had claimed equal credit.
Historians, computer scientists and linguistics specialists have debated the issue ever since, the problem being compounded by the fact that, in the words of David Holmes and Richard Forsyth, the two men's writing styles are "unusually similar" (2). Most recent studies now agree that Madison wrote the disputed papers (3); however, Collins et al have cast new doubt on the issue by looking for traces of collaborative authorship and classifying four of the disputed papers as primarily Hamilton's work (4).
Furthermore, there was also briefly some uncertainty as to whether Hamilton or Jay had written Federalist 64. In his list, Hamilton credited Jay with #54 and himself with #64; however, a draft of #64 was later found in Jay's personal papers (1, p. 239) and the historical consensus is that Hamilton (who was understandably distracted by his impending duel) simply made a transcription mistake.
This project will implement some of the techniques described in the authorship attribution literature and apply them to the Federalist papers.
I will seek out answers to two primary questions:
Do the papers whose authorship is disputed between Hamilton and Madison, taken as a group, resemble the known writings of one man more than the other?
Does Federalist 64, the one paper whose attribution hesitates between Jay and Hamilton, look like Jay's work?
The Project Gutenberg archives contain two versions of the Federalist Papers. I have decided to use this one: http://www.gutenberg.org/cache/epub/1404/pg1404.txt
The reason? Of the two, it is the more internally consistent in terms of format. For example, each paper begins with a string of the form "FEDERALIST No. XX", which makes it relatively easy to split the book-length file into individual papers; the other version occasionally inserts a period between "FEDERALIST" and "No.", which makes manipulation awkward.
As usual, since Gutenberg is suspicious of multiple downloads of the same file, we will grab a copy and store it locally.
import urllib.request
federalistURL = "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
federalistString = urllib.request.urlopen( federalistURL ).read().decode()
import os
directory = "data"
if not os.path.exists( directory ):
    os.makedirs( directory )
fic = open( "data/federalist.txt", "w" )
fic.write( federalistString )
fic.close()
OK. Now that we have a local copy of the collection, let's split it into a separate file for each of the 85 papers. First, load the data:
fic = open( "data/federalist.txt", "r" )
federalistString = fic.read( )
fic.close()
print( federalistString[ :200 ] )
The Project Gutenberg EBook of The Federalist Papers, by Alexander Hamilton, John Jay, and James Madison This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whats
A quick look at the file's contents shows us that everything up to the first occurrence of "FEDERALIST No." is extraneous material that can be ignored. The same goes for everything that follows "End of the Project Gutenberg EBook". Let's strip this material away.
# Strip the file header and footer
startIndex = federalistString.find( "FEDERALIST No." )
endIndex = federalistString.find( "End of the Project Gutenberg EBook of The Federalist Papers" )
federalistStringNoHeaderFooter = federalistString[ startIndex : endIndex ]
We now have a string that contains the 85 papers and nothing else. Let's split it into individual paper-sized strings, using the string class's split() method (documented in the Python standard library).
Most of the time, this method is used to split a string into words by looking for white space to use as a separator. But we will be twisting its purpose by separating over the "FEDERALIST No." tag and using entire Federalist Papers as "words". Note that the method's behaviour can sometimes be annoying: in our case, for example, it yields 86 strings instead of 85, with the first one an empty string containing the (non-existent) material before the first separator.
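A tiny toy example (my own, not part of the corpus processing) shows this behaviour:
# split() drops the separators; a string that starts with the separator yields an empty first element
parts = "No.1 first paper No.2 second paper No.3 third paper".split( "No." )
print( parts )          # ['', '1 first paper ', '2 second paper ', '3 third paper']
print( len( parts ) )   # 4 pieces for 3 separators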
# Divide into 85 separate files
papersList = federalistStringNoHeaderFooter.split( "FEDERALIST No.", 85 )
# Since split() removes the separator, let's return it to each paper by hand in case we end up using it sometime.
papersList = [ "FEDERALIST No." + paper for paper in papersList ]
# And now, save the files. Remember that the first entry in papersList is a dummy that we need to
# ignore, thus the slice in the for loop
currentPaper = 1
for paper in papersList[ 1: ]:
    currentPaperFileName = "data/federalist_{0}.txt".format( currentPaper )
    fic = open( currentPaperFileName, "w" )
    fic.write( paper )
    fic.close()
    currentPaper += 1
Good. We now have 85 files with names like "federalist_72.txt" in our data directory. From now on, we'll be able to play around with whatever subset of the Papers we want.
To study each of the three authors' styles, we will construct sub-corpora containing all of the papers which they unquestionably wrote. We will also build a collection of the papers whose authorship is disputed between Hamilton and Madison to see whether they, as a group, resemble the writings of one man rather than the other. Finally, we will create a fifth sub-corpus consisting solely of Federalist 64.
To do so, we will rebuild a single string out of the files associated with the papers written by each author. A function will come in handy for this purpose:
# A function that concatenates a list of text files into a single string
def read_files_into_string( fileList ):
    theString = ""
    for eachFile in fileList:
        fic = open( "data/federalist_{0}.txt".format( eachFile ), "r" )
        theString += fic.read()
        fic.close()
    return theString
# Define the lists of papers in the sub-corpora
madisonPapersList = [ 10, 14, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 62, 63]
hamiltonPapersList = [ 1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21, 22, 23, 24, 25, 26, 27, 28, 29, \
30, 31, 32, 33, 34, 35, 36, 59, 60, 61, 65, 66, 67, 68, 69, 70, 71, 72, 73, \
74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85 ]
jayPapersList = [ 2, 3, 4, 5 ]
disputedPapersList = [ 18, 19, 20, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58 ]
testCaseList = [ 64 ]
# Make a dictionary out of the sub-corpora
federalistByAuthor = dict()
federalistByAuthor[ "Madison" ] = read_files_into_string( madisonPapersList )
federalistByAuthor[ "Hamilton" ] = read_files_into_string( hamiltonPapersList )
federalistByAuthor[ "Jay" ] = read_files_into_string( jayPapersList )
federalistByAuthor[ "Disputed" ] = read_files_into_string( disputedPapersList )
federalistByAuthor[ "TestCase" ] = read_files_into_string( testCaseList )
We will now apply a number of techniques to compare the Madison and Hamilton sub-corpora with each other and to see whether the disputed ones, as a group, resemble one more than the other.
Let's start by going old school and looking at what T. C. Mendenhall (5) identifies as a signature of authorship: the frequency at which an author uses words of different lengths.
To do this, we will take each individual sub-corpus, extract the lengths of the words in it, and then plot the frequency distribution of word lengths in a graph. For this purpose, we will retain all of the word tokens, as small function words are at least as likely to carry an author's stylistic signature as longer "meaningful" words.
# Setup procedure
import nltk
# Note: the NLTK data packages used in this notebook ( "punkt" for tokenization, "stopwords",
# "wordnet" for lemmatization, "averaged_perceptron_tagger" for POS tagging ) may need to be
# downloaded once via nltk.download() before the code will run.
%matplotlib inline
# Tokenize the sub-corpora. We tokenize Jay's texts right away, even though we don't consider
# them at this point, because they'll be useful later on.
federalistByAuthorTokens = dict()
federalistByAuthorLengthDistributions = dict()
for subcorpus in [ "Hamilton", "Madison", "Disputed", "Jay" ]:
tokens = nltk.word_tokenize( federalistByAuthor[ subcorpus ] )
# Filter out punctuation
federalistByAuthorTokens[ subcorpus ] = [ token.lower() for token in tokens \
if any (c.isalpha() for c in token) ]
# Get a distribution of token lengths
tokenLengths = [ len( token ) for token in federalistByAuthorTokens[ subcorpus ] ]
federalistByAuthorLengthDistributions[ subcorpus ] = nltk.FreqDist( tokenLengths )
federalistByAuthorLengthDistributions[ subcorpus ].plot( 15, title = subcorpus )
Interesting: for 2- to 6-letter words, the shape of the characteristic curve for the disputed papers bears a striking resemblance to the one computed from Madison's papers and they are both quite clearly different from the distribution in Hamilton's. In the tail end of the distribution, the differences are far less clear, but the first piece of evidence points Madison's way.
Let us now take a look at the authors' favourite words and expressions to see if we can identify patterns. First, let's examine the 10 most frequent words used in each of the sub-corpora:
federalistByAuthorTokenDistributions = dict()
for subcorpus in [ "Hamilton", "Madison", "Disputed" ]:
    federalistByAuthorTokenDistributions[ subcorpus ] = nltk.FreqDist( federalistByAuthorTokens[ subcorpus ] )
    print( "Favourite words for:", subcorpus, ":", \
           federalistByAuthorTokenDistributions[ subcorpus ].most_common( 10 ), "\n" )
Favourite words for: Hamilton : [('the', 10598), ('of', 7370), ('to', 4614), ('in', 2833), ('and', 2730), ('a', 2507), ('be', 2300), ('that', 1717), ('it', 1549), ('is', 1330)] Favourite words for: Madison : [('the', 4435), ('of', 2668), ('to', 1435), ('and', 1306), ('in', 926), ('a', 904), ('be', 876), ('that', 627), ('it', 568), ('is', 554)] Favourite words for: Disputed : [('the', 2454), ('of', 1488), ('to', 758), ('and', 671), ('in', 538), ('be', 491), ('a', 488), ('that', 299), ('it', 295), ('which', 275)]
Another point for Madison, albeit a less convincing one. While both authors' corpora "match" 9 out of the 10 words used most often in the disputed papers, 7 of those 9 words occupy exactly the same positions in the Madison and disputed frequency distributions, whereas Hamilton matches the disputed papers' positions for only 5 of the 9 words.
Now, let's look at bigrams:
federalistByAuthorText = dict()
for subcorpus in [ "Hamilton", "Madison", "Disputed" ]:
    federalistByAuthorText[ subcorpus ] = nltk.Text( federalistByAuthorTokens[ subcorpus ] )
    print( "Favourite bigrams for", subcorpus, ":\n" )
    federalistByAuthorText[ subcorpus ].collocations( 20 )
    print( "\n" )
Favourite bigrams for Hamilton : new york; united states; supreme court; national government; great britain; state governments; independent journal; federal government; two thirds; proposed constitution; york packet; chief magistrate; state legislatures; legislative body; publius federalist; standing armies; military establishments; national legislature; packet tuesday; journal wednesday Favourite bigrams for Madison : united states; new york; state governments; federal government; publius federalist; january madison; legislative department; state legislatures; several states; judiciary departments; independent journal; public good; judiciary department; great britain; legislative executive; executive department; general government; proposed constitution; general welfare; york packet Favourite bigrams for Disputed : new york; february madison; united states; publius federalist; biennial elections; rhode island; independent journal; republican government; york packet; south carolina; packet tuesday; federal legislature; executive magistrate; journal saturday; thirty thousand; great britain; federal constitution; three years; would probably; different states
Not much of value to see here, especially since some of the frequent bigrams are not content at all but formatting artifacts (PUBLIUS FEDERALIST) or information about the newspaper in which the Federalist was published. Let's try trigrams:
federalistByAuthorTrigrams = dict()
for subcorpus in [ "Hamilton", "Madison", "Disputed" ]:
    federalistByAuthorTrigrams[ subcorpus ] = \
        list( nltk.ngrams( federalistByAuthorTokens[ subcorpus ], 3 ) )
    print( "Favourite trigrams for", subcorpus, ":\n" )
    trigramDist = nltk.FreqDist( federalistByAuthorTrigrams[ subcorpus ] )
    print( trigramDist.most_common( 10 ), "\n\n" )
Favourite trigrams for Hamilton : [(('of', 'the', 'state'), 135), (('of', 'the', 'union'), 132), (('the', 'united', 'states'), 125), (('of', 'the', 'people'), 104), (('the', 'power', 'of'), 103), (('of', 'new', 'york'), 83), (('the', 'people', 'of'), 77), (('of', 'the', 'united'), 75), (('of', 'the', 'national'), 71), (('to', 'the', 'people'), 67)] Favourite trigrams for Madison : [(('of', 'the', 'people'), 71), (('of', 'the', 'state'), 54), (('the', 'united', 'states'), 49), (('the', 'people', 'of'), 42), (('of', 'the', 'states'), 39), (('of', 'the', 'union'), 38), (('the', 'federal', 'government'), 36), (('members', 'of', 'the'), 34), (('of', 'the', 'federal'), 33), (('the', 'state', 'governments'), 32)] Favourite trigrams for Disputed : [(('of', 'the', 'people'), 37), (('the', 'house', 'of'), 31), (('of', 'the', 'state'), 29), (('house', 'of', 'representatives'), 28), (('to', 'the', 'people'), 25), (('the', 'people', 'of'), 25), (('the', 'number', 'of'), 23), (('of', 'the', 'government'), 23), (('ought', 'to', 'be'), 20), (('people', 'of', 'the'), 17)]
Four out of the top 10 trigrams in Hamilton's papers also appear in the top 10 for the disputed papers, none of them in the "right" rank. For Madison, we get 3 out of 10 matches, but the highest-ranked trigram is the same in both sets. Not much valuable evidence either way.
Maybe we'll have more luck looking at content words only: if the disputed papers discuss the same kinds of topics as those written by Madison or Hamilton, that might be valuable information. For this purpose, we will filter out stopwords and lemmatize the rest.
# Create the data structures
federalistByAuthorContentTokens = dict()
federalistByAuthorContentFreqDist = dict()
# Setup for filtering and lemmatizing
stopwords = nltk.corpus.stopwords.words( "english" )
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
# Build lists of content-word lemmas and plot their distributions
for subcorpus in [ "Hamilton", "Madison", "Disputed" ]:
    federalistByAuthorContentTokens[ subcorpus ] = [ wnl.lemmatize( token ) \
                                                     for token in federalistByAuthorTokens[ subcorpus ] \
                                                     if token not in stopwords ]
    federalistByAuthorContentFreqDist[ subcorpus ] = \
        nltk.FreqDist( federalistByAuthorContentTokens[ subcorpus ] )
    federalistByAuthorContentFreqDist[ subcorpus ].plot( 20, title = subcorpus )
As expected in such a corpus, words such as "state", "government", "power", "constitution" and "federal" appear quite often no matter who the author of a particular paper happens to be. At first glance, the oddest thing in these distributions is the surprisingly frequent use of "would" in Hamilton's papers compared with the others, which may signal that he did not write the disputed papers. However, the fact that "power" appears in third place in both the Hamiltonian and Madisonian corpora but only in 13th position in the disputed papers also seems very strange; it might suggest either that the topics covered by the disputed papers differ significantly from the others, or that they were written by someone else entirely.
Let's take a deeper look at the top 50 words in the frequency distributions, computationally:
# How many of the 50 most frequent words in the disputed papers are also among the top 50 in
# Hamilton's own? And in Madison's?
hamiltonTop50 = [ word for (word, freq) \
in federalistByAuthorContentFreqDist[ "Hamilton" ].most_common( 50 ) ]
madisonTop50 = [ word for (word, freq) \
in federalistByAuthorContentFreqDist[ "Madison" ].most_common( 50 ) ]
disputedTop50 = [ word for (word, freq) \
in federalistByAuthorContentFreqDist[ "Disputed" ].most_common( 50 ) ]
hamiltonHowMany = len( [ word for word in hamiltonTop50 if word in disputedTop50 ] )
madisonHowMany = len( [ word for word in madisonTop50 if word in disputedTop50 ] )
print( "Of Hamilton's top 50, {0} appear in the disputed papers.".format( hamiltonHowMany ) )
print( "Of Madison's top 50, {0} appear in the disputed papers.".format( madisonHowMany ) )
Of Hamilton's top 50, 31 appear in the disputed papers. Of Madison's top 50, 33 appear in the disputed papers.
Not very significant. Perhaps if, instead of looking at the mere presence of the words, we looked at their relative positions in the lists? For this purpose, we can use the list's index() function, which returns the position of the first instance of an object in the list; since our lists never contain more than one instance of a word because they have been built out of frequency distributions, this should work nicely.
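A toy illustration (my own) of how index() will be used here:
# index() returns the 0-based position of the first occurrence of an element
listA = [ "the", "of", "and", "to" ]
listB = [ "the", "and", "of", "to" ]
print( listA.index( "and" ), listB.index( "and" ) )           # 2 1
print( abs( listA.index( "and" ) - listB.index( "and" ) ) )   # positional distance of 1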
# A little helper function to calculate the distances between the positions of words in List1
# and the positions of the same words in List2; if the words aren't in List2 at all, assign a
# large distance
def calc_distances_between_lists( list1, list2 ):
    dist = 0
    for word in list1:
        if word in list2:
            dist += abs( list1.index( word ) - list2.index( word ) )
        else:
            dist += 50 # If the words don't match, they are far, far away
    return dist
print( "Hamilton's distances:", calc_distances_between_lists( hamiltonTop50, disputedTop50 ) )
print( "Madison's distances:", calc_distances_between_lists( madisonTop50, disputedTop50 ) )
Hamilton's distances: 1233 Madison's distances: 1169
Again, not a huge difference, but one that points towards Madison possibly being responsible for more of the content of the disputed papers.
Overall, simple word and n-gram counts don't yield very convincing evidence. Let's go on to an entirely different metric.
In a 2001 paper, Adam Kilgarriff recommends using the chi-squared statistic to compare two corpora (6). Chi-squared's most common application is in testing two variables for statistical independence, which is not what we're after in this case. In Kilgarriff's words: "... the statistic is not in general appropriate for hypothesis-testing in corpus linguistics: a corpus is never a random sample of words, so the null hypothesis [of independence] is of no interest. But once divested of the hypothesis-testing link, [chi-squared] is suitable."
The way to apply the statistic is the following:
Merge the two corpora we want to compare (say, Hamilton's known papers and the Disputed papers) into a single joint corpus.
Find the N most frequent words in the joint corpus.
For each of these words, calculate the number of occurrences we would expect in each of the two corpora if both were random samples from the same population; this is simply the word's count in the joint corpus multiplied by each corpus' share of the joint corpus.
Sum, over all N words and both corpora, the quantity (observed count - expected count)² / expected count.
The statistic gives a measure of the difference between two corpora, for example Hamilton and Disputed. Repeating the procedure for Madison vs Disputed, we can see which of Hamilton or Madison is statistically "closer" to Disputed.
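In other words, writing O for observed counts and E for expected counts (my notation, not Kilgarriff's), the quantity computed below is:

$$\chi^2 = \sum_{w=1}^{N} \left( \frac{(O_{w,\mathrm{cand}} - E_{w,\mathrm{cand}})^2}{E_{w,\mathrm{cand}}} + \frac{(O_{w,\mathrm{disp}} - E_{w,\mathrm{disp}})^2}{E_{w,\mathrm{disp}}} \right)$$

where the sum runs over the N most frequent words of the joint corpus formed by merging the candidate's papers with the Disputed papers.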
Kilgarriff uses raw word tokens instead of lemmas and achieves his best results with 320 to 640 "most common words". I ran the test with 50, 100, 200 and 500 words; in all cases, the Madison corpus was much closer to the Disputed corpus than Hamilton's. Here are the results for 500 common words:
for candidate in [ "Hamilton", "Madison" ]:
# First, build a joint corpus and identify the most frequent words in it
# We'll keep the stopwords since they are commonly used in authorship attribution studies
jointCorpus = federalistByAuthorTokens[ candidate ] + federalistByAuthorTokens[ "Disputed" ]
jointFreqDist = nltk.FreqDist( jointCorpus )
mostCommonInJointCorpus = list( jointFreqDist.most_common( 500 ) )
# What proportion of the joint corpus is made up of the candidate corpus' tokens?
candidateShareInJointCorpus = len( federalistByAuthorTokens[ candidate ] ) / len( jointCorpus )
# Now, let's look at these 50 words in the candidate author's corpus and compare the number of
# times it can be observed to what would be expected if the candidate corpus and the Disputed
# corpus were both random samples from the same distribution.
chisquared = 0
for word, jointCount in mostCommonInJointCorpus:
# How often do we really see it?
candidateCount = federalistByAuthorTokens[ candidate ].count( word )
disputedCount = federalistByAuthorTokens[ "Disputed" ].count( word )
# How often should we see it?
expCandidateCount = jointCount * candidateShareInJointCorpus
expDisputedCount = jointCount * ( 1 - candidateShareInJointCorpus )
# Add the word's contribution to the chi-squared statistic
chisquared += ( candidateCount - expCandidateCount ) * \
( candidateCount - expCandidateCount ) / expCandidateCount
chisquared += ( disputedCount - expDisputedCount ) * \
( disputedCount - expDisputedCount ) / expDisputedCount
print( "The Chi-squared statistic for candidate", candidate, "is", chisquared )
The Chi-squared statistic for candidate Hamilton is 2997.851461213525 The Chi-squared statistic for candidate Madison is 1533.5696224509918
So another point for Madison, and the most quantitatively impressive so far.
To recap:
The word-length distributions of the disputed papers closely resemble Madison's and differ from Hamilton's.
The most frequent words in the disputed papers match Madison's favourites somewhat better than Hamilton's, both in presence and in rank.
The crude rank-distance measure over the top 50 content words leans slightly toward Madison.
The chi-squared statistic places the disputed papers much closer to Madison's corpus than to Hamilton's.
This is getting pretty convincing: it looks like Madison probably wrote a majority of the Disputed papers.
Federalist 64 is the only paper for which John Jay is part of the discussion, if only because of Hamilton's carelessness.
Let's make things even more interesting. For the sake of argument, let's assume that we have four candidates instead of two or three: Hamilton, Jay, Madison, and some unknown author who would have penned the other Disputed papers. There is no reason in the historical record to believe that some fourth person was involved in the creation of the Federalist papers, although Hamilton may have (unsuccessfully) tried to convince his future Assistant Secretary of the Treasury, William Duer, to contribute to the project. However, positing a fourth author will allow us to study whether #64 is closest to Hamilton's known work, to Madison's, to Jay's, or to the other disputed papers.
Since the Burrows Delta method (7) is quite effective at picking the most likely author of a text from a set of candidates of arbitrary size, at least up to 25, we will apply this method to Federalist 64.
The method is designed to measure the differences between an individual author's style and the "average" style of a set of authors. The method works like this:
Find the N most frequent word tokens in the corpus as a whole, to use as features. It is recommended to apply parts-of-speech tagging to the tokens beforehand, so that the same token used as two different parts of speech may count as two features.
Divide the corpus into M subcorpora: one for each author.
For each of the N features, calculate frequencies of occurrence in each of the M authorial subcorpora, as a percentage of the total number of POS-tagged word tokens in this particular subcorpus. Then calculate the mean and the standard deviation of these M values and use them as the official mean and standard deviation for this feature over the whole corpus. (We use a "mean of means" instead of calculating a single frequency for the entire corpus to avoid a larger subcorpus, like Hamilton's in our case, over-influencing the results in its favor and defining the "norm" in such a way that everything would be expected to look like it.)
For each of the N features and M subcorpora, calculate a z-score describing how far away from the "corpus norm" the usage of this particular feature in this subcorpus happens to be. To do this, subtract the corpus average for the feature from the feature's frequency in the subcorpus and divide the result by the feature's standard deviation.
Also calculate z-scores for each feature in the test case (i.e., Federalist 64).
Calculate a delta score comparing the test case with each candidate's subcorpus. To do so, take the average of the absolute values of the differences of the z-scores for each feature in the test case and in the candidate's subcorpus (see the formula after this list). This gives equal weight to each feature, no matter their respective frequencies of occurrence in the texts; otherwise, Zipf's Law would ensure that the top 3 or 4 features would overwhelm everything else.
The "winning" candidate is the one for whom the delta score is the lowest.
In his article, Burrows uses a grand corpus formed by assembling poems by 25 different writers. In our case, there are only 4 "authors" to deal with; this may be a problem since calculating standard deviations from a sample size of 4 is not ideal but we'll see what happens.
Since we already have the subcorpora for our 4 "candidates", we will POS-tag them first before merging them into a single corpus for the purpose of finding features. This way, we will avoid having to perform the same tagging twice.
candidateList = [ "Hamilton", "Madison", "Jay", "Disputed" ]
federalistByAuthorPOS = dict()
for candidate in candidateList:
    federalistByAuthorPOS[ candidate ] = nltk.pos_tag( federalistByAuthorTokens[ candidate ] )
    print( federalistByAuthorPOS[ candidate ][ :10 ] )
[('federalist', 'NN'), ('no', 'DT'), ('general', 'JJ'), ('introduction', 'NN'), ('for', 'IN'), ('the', 'DT'), ('independent', 'JJ'), ('journal', 'NN'), ('saturday', 'NN'), ('october', 'NN')] [('federalist', 'NN'), ('no', 'DT'), ('the', 'DT'), ('same', 'JJ'), ('subject', 'NN'), ('continued', 'VBD'), ('the', 'DT'), ('union', 'NN'), ('as', 'IN'), ('a', 'DT')] [('federalist', 'NN'), ('no', 'DT'), ('concerning', 'NN'), ('dangers', 'NNS'), ('from', 'IN'), ('foreign', 'JJ'), ('force', 'NN'), ('and', 'CC'), ('influence', 'NN'), ('for', 'IN')] [('federalist', 'NN'), ('no', 'DT'), ('the', 'DT'), ('same', 'JJ'), ('subject', 'NN'), ('continued', 'VBD'), ('the', 'DT'), ('insufficiency', 'NN'), ('of', 'IN'), ('the', 'DT')]
This took more than 4 minutes so doing the work once is going to be enough! Now, let's combine the candidates' subcorpora into a single corpus and find the Top 30 most frequent (word, pos) pairs.
# Combine into a single corpus
wholeCorpusPOS = []
for candidate in candidateList:
    wholeCorpusPOS += federalistByAuthorPOS[ candidate ]
# Get a frequency distribution
wholeCorpusPOSFreqsTop30 = list( nltk.FreqDist( wholeCorpusPOS ).most_common( 30 ) )
wholeCorpusPOSFreqsTop30[ :10 ]
[(('the', 'DT'), 17846), (('of', 'IN'), 11796), (('to', 'TO'), 7012), (('and', 'CC'), 5016), (('in', 'IN'), 4385), (('a', 'DT'), 3967), (('be', 'VB'), 3752), (('it', 'PRP'), 2520), (('is', 'VBZ'), 2178), (('which', 'WDT'), 2053)]
OK, now all that we truly want out of this is the (word, pos) pairs that we will be looking for in the subcorpora. Thus:
featuresList = [ wordpospair for ( wordpospair, freq ) in wholeCorpusPOSFreqsTop30 ]
featuresList[ :10 ]
[('the', 'DT'), ('of', 'IN'), ('to', 'TO'), ('and', 'CC'), ('in', 'IN'), ('a', 'DT'), ('be', 'VB'), ('it', 'PRP'), ('is', 'VBZ'), ('which', 'WDT')]
Let's look at the frequencies of each feature in each candidate's subcorpus, as a proportion of the total number of tokens in the subcorpus. We'll calculate these values and store them in a dictionary of dictionaries, the most convenient way I know to build a two-dimensional array in Python.
# The main data structure
featureFrequencies = dict()
for candidate in candidateList:
    # A dictionary for each candidate's features
    featureFrequencies[ candidate ] = dict()
    # A helper value containing the number of (token, pos) pairs in the subcorpus
    overall = len( federalistByAuthorPOS[ candidate ] )
    # Calculate each feature's presence in the subcorpus
    for feature in featuresList:
        presence = federalistByAuthorPOS[ candidate ].count( feature )
        featureFrequencies[ candidate ][ feature ] = presence / overall
Given the feature frequencies for all four subcorpora, we can find a "mean of means" and a standard deviation for each feature. We'll store these values in a 2D array called corpusFeatures.
import math
# The data structure into which we will be storing the "corpus standard" statistics
corpusFeatures = dict()
# For each feature...
for feature in featuresList:
    # Create a sub-dictionary that will contain the feature's mean and standard deviation
    corpusFeatures[ feature ] = dict()
    # Calculate the mean of the frequencies expressed in the subcorpora
    featureAverage = 0
    for candidate in candidateList:
        featureAverage += featureFrequencies[ candidate ][ feature ]
    featureAverage /= len( candidateList )
    corpusFeatures[ feature ][ "Mean" ] = featureAverage
    # Calculate the standard deviation using the basic formula for a sample
    featureStdDev = 0
    for candidate in candidateList:
        diff = featureFrequencies[ candidate ][ feature ] - corpusFeatures[ feature ][ "Mean" ]
        featureStdDev += ( diff * diff )
    featureStdDev /= ( len( candidateList ) - 1 )
    featureStdDev = math.sqrt( featureStdDev )
    corpusFeatures[ feature ][ "StdDev" ] = featureStdDev
Next, we transform the observed feature frequencies in the four candidates' subcorpora into z-scores describing how far away from the "corpus norm" these observations are. Nothing fancy here: I merely apply the definition of the z-score and store the results into yet another 2D array.
featureZScores = dict()
for candidate in candidateList:
    featureZScores[ candidate ] = dict()
    for feature in featuresList:
        # Z-score definition = (value - mean) / stddev
        # We use intermediate variables to make the code easier to read
        featureVal = featureFrequencies[ candidate ][ feature ]
        featureMean = corpusFeatures[ feature ][ "Mean" ]
        featureStdDev = corpusFeatures[ feature ][ "StdDev" ]
        featureZScores[ candidate ][ feature ] = ( featureVal - featureMean ) / featureStdDev
OK, now we need to tokenize the test case, extract features and their frequencies, and calculate z-scores. This duplicates some of the preceding code; we could have inserted the test case into the preceding loops in some cases, but that would have made the code harder to read.
# Tokenize the test case
testCaseTokens = nltk.word_tokenize( federalistByAuthor[ "TestCase" ] )
# Filter out punctuation
testCaseTokens = [ token.lower() for token in testCaseTokens \
if any (c.isalpha() for c in token) ]
# Tag the test case for parts of speech
testCaseTokensPOS = nltk.pos_tag( testCaseTokens )
# Calculate the test case's features
overall = len( testCaseTokensPOS )
testCaseFeatureFrequencies = dict()
for feature in featuresList:
    presence = testCaseTokensPOS.count( feature )
    testCaseFeatureFrequencies[ feature ] = presence / overall
# Calculate the test case's feature z-scores
testCaseZScores = dict()
for feature in featuresList:
    featureVal = testCaseFeatureFrequencies[ feature ]
    featureMean = corpusFeatures[ feature ][ "Mean" ]
    featureStdDev = corpusFeatures[ feature ][ "StdDev" ]
    testCaseZScores[ feature ] = ( featureVal - featureMean ) / featureStdDev
    print( "Test case z-score for feature", feature, "is", testCaseZScores[ feature ] )
Test case z-score for feature ('the', 'DT') is -0.5715580470247853 Test case z-score for feature ('of', 'IN') is -1.5305700230155077 Test case z-score for feature ('to', 'TO') is 0.8614793246103539 Test case z-score for feature ('and', 'CC') is 0.9923420641528881 Test case z-score for feature ('in', 'IN') is 0.4020915007213734 Test case z-score for feature ('a', 'DT') is -0.9117144930479929 Test case z-score for feature ('be', 'VB') is 3.211624164136822 Test case z-score for feature ('it', 'PRP') is -0.4225688536389689 Test case z-score for feature ('is', 'VBZ') is -1.0961185455958937 Test case z-score for feature ('which', 'WDT') is -1.8617352051613247 Test case z-score for feature ('that', 'IN') is 3.4609374852197483 Test case z-score for feature ('by', 'IN') is 1.4958416606713247 Test case z-score for feature ('as', 'IN') is 9.90787889195006 Test case z-score for feature ('this', 'DT') is -1.0795620198226084 Test case z-score for feature ('not', 'RB') is 1.636646994434489 Test case z-score for feature ('would', 'MD') is -1.2087806882807304 Test case z-score for feature ('for', 'IN') is -1.9510710252068257 Test case z-score for feature ('will', 'MD') is 4.10229975683253 Test case z-score for feature ('or', 'CC') is -0.4629523405451141 Test case z-score for feature ('from', 'IN') is -0.6428230279149659 Test case z-score for feature ('their', 'PRP$') is 0.9313818415056075 Test case z-score for feature ('with', 'IN') is 0.009860834778102507 Test case z-score for feature ('are', 'VBP') is 6.406235972339704 Test case z-score for feature ('on', 'IN') is -0.1254483939660962 Test case z-score for feature ('an', 'DT') is -0.7038566628061753 Test case z-score for feature ('they', 'PRP') is 3.751454855609827 Test case z-score for feature ('government', 'NN') is -3.002488881246878 Test case z-score for feature ('states', 'NNS') is -1.6770915879061807 Test case z-score for feature ('may', 'MD') is 1.9500329999248902 Test case z-score for feature ('been', 'VBN') is -1.5205160129401523
Now, all that remains is to calculate a delta value for the test case compared to each candidate subcorpus and to figure out the best match. Reminder: the delta score is the average of the absolute values of the differences between the test case's z-scores and the candidate corpus' z-scores for all of the features.
for candidate in candidateList:
    delta = 0
    for feature in featuresList:
        delta += math.fabs( testCaseZScores[ feature ] - featureZScores[ candidate ][ feature ] )
    delta /= len( featuresList )
    print( "Delta score for candidate", candidate, "is", delta )
Delta score for candidate Hamilton is 2.2728145489103126 Delta score for candidate Madison is 2.101912545169431 Delta score for candidate Jay is 1.9528815383895344 Delta score for candidate Disputed is 1.9810301574357099
And after all of that work, we end up with the expected result: while all three "real" candidates are quite close in delta values, John Jay is identified as the most likely author of Federalist 64 -- but by the slimmest of margins when compared with the non-existent author of the Disputed corpus. That does not necessarily mean that Jay helped write the other Disputed papers: the delta scores are based on absolute values of differences, and the underlying z-scores may lie on opposite sides of the mean.
All of the Delta scores above are quite high: what they mean is that, on average, the frequency of each feature in the test case is about 2 standard deviations away from the corpus average.
The reason for this, I believe, is that the test case is quite short compared to the subcorpora. As a result, some features that appear a "few too many times" in the test case or that do not appear at all yield truly aberrant z-scores: for example, the feature ('as', 'IN') has a z-score of 9.9, which represents a frequency of occurrence in the test case 9.9 standard deviations above the corpus average. (As a comparison: an event a mere 7 standard deviations away from the mean is supposed to happen only once in 390 billion tries!)
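As a quick sanity check of that last figure (my own aside, assuming a normal distribution and that SciPy is available):
from scipy.stats import norm
# Two-sided probability of landing at least 7 standard deviations from the mean
p = 2 * norm.sf( 7 )
print( p, "- about one chance in", round( 1 / p / 1e9 ), "billion" )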
This value, and a handful of others like it, grossly inflate the deltas. However, this is not a problem, for two reasons: first, because the aberrant z-scores inflate every delta equally; second, because the delta test is designed for ranking candidates only; the individual delta scores themselves are more or less meaningless.
Note that an experiment with 100 features instead of 30 yielded a different result, with Hamilton in first place and Jay a close second; with only four small, unevenly sized subcorpora, the ranking is clearly sensitive to the size of the feature set.
That being said, it seems that a larger test case and larger/more numerous candidate subcorpora might support a test using a larger feature set, thus making the test more "robust" and potentially more interesting.
There are several other techniques that have been applied to authorship problems. Some, like the ones used by Holmes et al. in determining whether a set of letters purportedly written by Confederate general George Pickett were in fact his or forgeries written by his widow (8), involve multivariate statistics and the extraction of principal components, which I am unfamiliar with but which have doubtless been implemented in various statistics software libraries and packages; given enough time for research, it would be relatively straightforward to learn and apply them.
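As a rough illustration of what the starting point of such an approach might look like, here is a minimal sketch of my own (not taken from Holmes et al.), assuming numpy and scikit-learn are installed and reusing the nltk import and the federalistByAuthorTokens dictionary built above:
# A rough, illustrative sketch: represent each sub-corpus by the relative frequencies of the
# most common words, then project onto the first two principal components.
import numpy as np
from sklearn.decomposition import PCA
subcorpora = [ "Hamilton", "Madison", "Jay", "Disputed" ]
# Feature words: the 100 most frequent tokens across the four sub-corpora
allTokens = []
for name in subcorpora:
    allTokens += federalistByAuthorTokens[ name ]
featureWords = [ word for word, freq in nltk.FreqDist( allTokens ).most_common( 100 ) ]
# One row of relative frequencies per sub-corpus
rows = []
for name in subcorpora:
    dist = nltk.FreqDist( federalistByAuthorTokens[ name ] )
    total = len( federalistByAuthorTokens[ name ] )
    rows.append( [ dist[ word ] / total for word in featureWords ] )
frequencyMatrix = np.array( rows )
# Extract the first two principal components and print each sub-corpus' coordinates
pca = PCA( n_components = 2 )
coordinates = pca.fit_transform( frequencyMatrix )
for name, coords in zip( subcorpora, coordinates ):
    print( name, coords )
In a real study, each individual paper (rather than four pooled sub-corpora) would be a data point, so that the resulting scatter plot would show whether papers cluster by author.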
It would also be interesting to compare the Federalist with other texts penned by Hamilton, Madison and Jay to see whether they differ in significant ways from the styles shown by the authors before and after -- which might give us a clue as to the level of hidden co-authorship in these papers.
(1) Douglass Adair, "The Authorship of the Disputed Federalist Papers", The William and Mary Quarterly, vol. 1, no. 2 (April 1944), pp. 97-122; and Adair, "The Authorship of the Disputed Federalist Papers: Part II", The William and Mary Quarterly, vol. 1, no. 3 (July 1944), pp. 235-264.
(2) David I. Holmes and Richard S. Forsyth, "The Federalist Revisited: New Directions in Authorship Attribution", Literary and Linguistic Computing, vol. 10, no. 2 (1995), pp. 111-127.
(3) Glenn Fung, "The disputed Federalist papers: SVM feature selection via concave minimization", TAPIA '03: Proceedings of the 2003 conference on Diversity in Computing, pp. 42-46.
(4) Jeff Collins, David Kaufer, Pantelis Vlachos, Brian Butler and Suguru Ishizaki, "Detecting Collaborations in Text: Comparing the Authors' Rhetorical Language Choices in The Federalist Papers", Computers and the Humanities 38 (2004), pp. 15-36.
(5) T. C. Mendenhall, "The Characteristic Curves of Composition", Science, vol. 9, no. 214 (Mar. 11, 1887), pp. 237-249.
(6) Adam Kilgarriff, "Comparing Corpora", International Journal of Corpus Linguistics, vol. 6, no. 1 (2001), pp. 97-133. Quote on page 254.
(7) John Burrows, "'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship", Literary and Linguistic Computing, vol. 17, no. 3 (2002), pp. 267-287.
(8) David I. Holmes, Lesley J. Gordon and Christine Wilson, "A Widow and her Soldier: Stylometry and the American Civil War", Literary and Linguistic Computing, vol. 16, no. 4 (2001), pp. 403-420.