#!/usr/bin/env python
# coding: utf-8

# # Punch Card Operations
# 
# Stéfan Sinclair & Geoffrey Rockwell
# 
# We're exploring here some of the operations and algorithms used by early computing humanists, including Father Roberto Busa. For this experiment we're considering in particular IBM engineer Paul Tasman's description "[Literary Data Processing](http://ieeexplore.ieee.org/document/5392686/?reload=true&arnumber=5392686)."

# ## Data Entry of Phrases
# 
# *The Automatic Punch, controlled by a keyboard similar to that of an ordinary typewriter, «wrote» by holes or perforations, one for each card, all the lines; a total of 136 cards. This is the sole work done by human eyes and fingers directly and responsibly; if at this point oversights occur, the error will be repeated from stage to stage; but if no mistakes were made, or were eliminated, there is no fear of fresh errors; human work from now onwards is reduced to mere supervision on the proper functioning of the various machines.*
# 
# For the purposes of this replication we won't bother initially with the mechanics of a punch card; we'll simply create an array of 136 strings, which we'll then endeavour to treat as if they were punch cards.
# 
# Rather than use a local data file for Dante's *Divine Comedy*, we'll read from a [plain text URL](https://www.clear.rice.edu/comp200/resources/texts/Dante%20Divine%20Comedy.txt) (and assume that it will remain available), and then extract Canto III.
# In[11]:


import urllib.request

# grab the entire Divine Comedy
inferno_url = 'https://raw.githubusercontent.com/zioproto/hadoop-swift-tutorial/master/dcu.txt'
with urllib.request.urlopen(inferno_url) as response:
    inferno_text = response.read().decode()


# In[12]:


import re

# extract Canto III
canto3 = inferno_text[inferno_text.find("Inferno: Canto III") : inferno_text.find("Inferno: Canto IV")].strip()

# split the string into lines (removing the title line)
cards = re.split('\n+', canto3)[1:]

print("We have", len(cards), '"cards" (or lines).')


# ## Checking the Data Entry
# 
# *The collator can also be used to verify and correct the cards which have been manually punched at the beginning, and thus guarantee the accuracy of the transcription, an indispensable condition for philological works, particularly in the light of their peculiar function. Two separate typists punch the same text, each on his own; the collator compares the two series of cards, perceiving the discrepancies; of the cards not coinciding, at least one is wrong. This control allows only the following case to pass unobserved, namely two typists make the same error in the same place. This case is very improbable and so much the less probable in as much as the qualities and circumstances of typing and typist are different.*
# 
# We don't need to check transcription, although there could be some discussion about possible typos and editorial decisions in this edition.
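# Still, the collator's double-keying check described above is easy to sketch in code:
# two independently keyed decks are compared card by card, and any position where they
# disagree is flagged for a human to inspect. This is only an illustrative sketch (the
# sample decks and the `collate` helper are our own invention, not part of Tasman's
# workflow):

```python
def collate(deck_a, deck_b):
    """Return the indices of cards that differ between two decks."""
    return [i for i, (a, b) in enumerate(zip(deck_a, deck_b)) if a != b]

# two typists punch the same two lines independently
typist_1 = ["PER ME SI VA NE LA CITTA DOLENTE",
            "PER ME SI VA NE L ETTERNO DOLORE"]
typist_2 = ["PER ME SI VA NE LA CITTA DOLENTE",
            "PER ME SI VA NE L ETERNO DOLORE"]   # a keying slip: ETERNO

# of the cards not coinciding, at least one is wrong
print(collate(typist_1, typist_2))  # → [1]
```

# As Tasman notes, the only error this misses is both typists making the same
# mistake in the same place.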
# ## Another Way of Checking
# 
# *This method of verifying, although substantially the same, offers perhaps some advantages over the other, usually employed by IBM in the intent of not doubling the number, and consequently the cost, of the cards purposely, whereas in our case this is no hindrance, since each card already has to be multiplied as many times as the words it contains; the punched cards are put through the Verifier on the keys of which a typist repeats the same text; the machine signals him when his punching does not concord with the existing holes; one of the two is wrong.*
# 
# Again, this doesn't seem especially relevant.
# 
# ## Transcribing
# 
# *The contents of each card can be made legible either on the punch itself which, if required, can simultaneously write in letters on the upper edge of the card what is «written» in holes on the various lines of columns thereon; or else on a second machine, the so-called Interpreter, which transcribes in letters the holes it encounters on the cards (previously punched). This offers not only a more accurate transcription in virtue of the better type and greater spacing of the characters, but a transcription which can be effected on any desired portion of the card.*
# 
# For now, since we're not reproducing the punch card object visually, this can be skipped.
# 
# ## Tokenization
# 
# *The 136 cards thus punched were then processed through a third machine, the Reproducer: this automatically copied them on another 136 cards, but adding, sideways of the lines and their quotations, the first of the words contained in each. Subsequently it makes a second copy, adding on the side the second word, then a third copy adding the third, and so forth.
# There were finally 943 cards, as many as were the words of the third canto of Dante's Inferno; thus each word in that canto had its card, accompanied by the text (or rather, here, by the line) and by the quotation.*
# 
# *This is equivalent to state that each line was multiplied as many times as words it contained. I must confess that in actual practice this was not so simple as I endeavoured to make it in the description; the second and the successive words did not actually commence in the same column on all cards. In fact, it was this lack of determined fields which constituted the greatest hindrance in transposing the system from the commercial and statistical uses to the sorting of words from a literary text. The result was attained by exploring the cards, column by column, in order to identify by the non-punched columns the end of the previous word and the commencement of the following one; thus, operating with the sorter and reproducer together, were produced only those words commencing and finishing in the same columns.*
# 
# So, before we begin this, an important step is needed to transform the text from each card into something more plausible. At the moment each line has been transcribed faithfully, but that's not what happened (even if it's not described). In particular, we need at least the following operations:
# 
# * convert words to uppercase
# * filter out anything that's not part of a word (quotes, punctuation; based on the total number of cards, hyphens and apostrophes internal to words – like *l’etterno* – were also stripped)
# * strip out accented characters
# * normalize spacing (no leading or trailing spaces, single space between words)
# 
# Presumably these operations would have been performed during the initial data entry, but we'll do it now since it wasn't made clear earlier.
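# Applied one step at a time to a single sample line (the opening line of Canto
# III), the operations above look like this:

```python
import re
import unicodedata

line = 'Per me si va ne la città dolente,'

# 1. convert to uppercase
line = line.upper()

# 2. keep word characters only (drops the comma) and normalize spacing
line = " ".join(re.findall(r'\b\w+\b', line.strip()))

# 3. strip accents: decompose CITTÀ into CITTA + combining accent, then drop it
line = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore').decode()

print(line)  # PER ME SI VA NE LA CITTA DOLENTE
```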
# In[13]:


import unicodedata

# convert to upper case
clean_cards = [line.upper() for line in cards]

# keep word characters only
clean_cards = [" ".join(re.findall(r'\b\w+\b', line.strip())) for line in clean_cards]

# remove accents
clean_cards = [unicodedata.normalize('NFKD', line).encode("ascii", 'ignore').decode() for line in clean_cards]

clean_cards[0:10]


# The next step is to tokenize. The algorithm described above by Tasman may not seem especially efficient, but it does have the merit of being algorithmically simple: you consider every column where a word starts (the first column, or any column preceded by a space, i.e. no punches), then, one column at a time, you look for cards where words end.

# In[22]:


wordcards = []  # our new stack of cards with each word

# look for columns with a space (starting with -1, or before the first word)
for start_column in range(-1, 50):

    # make a temporary stack of cards for words that start in this column
    start_cards = [card for card in clean_cards
                   if len(card) > start_column and (start_column == -1 or card[start_column] == " ")]

    # now look at each column after the start column (skip the first letter)
    for end_column in range(start_column + 1, 50):

        # look at each remaining card
        for index, card in enumerate(list(start_cards)):

            # if the end column is a space or if it's the last character in the line
            if len(card) >= end_column and (len(card) == end_column or card[end_column] == " "):

                # add it to our word cards and remove it from our temporary stack
                wordcards.append((index, start_column + 1, end_column,
                                  card[start_column + 1:end_column], card))
                start_cards.remove(card)

wordcards[0:5]


# In[24]:


len(wordcards)


# **GR**: Tasman talks about 943 cards, one per word, but try as I might I can't see how he gets to that number... Also, I'm not sure how to deal with diacritics, you?
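# One way to sanity-check the column-scanning tokenizer is to compare its word
# count against a plain whitespace split: on cleaned cards (word characters only,
# single spaces) the two totals should agree, so any discrepancy with Tasman's 943
# would lie in the cleaning or the edition, not the scan itself. A minimal sketch,
# with the scan wrapped in a hypothetical `column_scan` helper and tried on the
# first tercet of Canto III:

```python
def column_scan(cards, width=50):
    """Tokenize by scanning columns for word starts and ends, as above."""
    words = []
    for start in range(-1, width):
        # cards whose word starts right after this column
        pool = [c for c in cards if len(c) > start and (start == -1 or c[start] == " ")]
        for end in range(start + 1, width):
            for card in list(pool):
                # the word ends at a space, or at the end of the line
                if len(card) >= end and (len(card) == end or card[end] == " "):
                    words.append(card[start + 1:end])
                    pool.remove(card)
    return words

sample_cards = ["PER ME SI VA NE LA CITTA DOLENTE",
                "PER ME SI VA NE L ETTERNO DOLORE",
                "PER ME SI VA TRA LA PERDUTA GENTE"]

split_count = sum(len(card.split()) for card in sample_cards)
scan_count = len(column_scan(sample_cards))
print(split_count, scan_count)  # → 24 24: the two counts agree
```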
# ## Sorting
# 
# *Having reached this point, it is a trifle to put the words into alphabetical order; the Sorter, proceeding backwards, from the last letter, sorts and groups gradually, column by column, all the identical letters; in a few minutes the words are aligned and the card file, in alphabetical order, is already compiled.*

# In[25]:


# we'll do this properly as per the description, but here's a simpler version
sorted(wordcards, key=lambda tup: tup[3])


# ## Lemmatization
# 
# *The philologist, however, must group or sort further on what the machine has not been able to «feel»; thus have, had are different forms of the same verb; thus, in Italian, andiamocene, diamogliene are several words joined into one, and for the Latin mortuus est is a single word form which means died, but could also mean the dead man is and then they would be two items; and so on for the whole wide range of homonyms.*
# 
# ## Output
# 
# *When the order has thus been properly modified and attains its final form, the cards are ready to be processed in the Alphanumerical Accounting Machine, or Tabulator.
# The tabulator retranscribes on a sheet of paper, in letters and numbers (no longer in holes), line after line, the contents represented by the holes in the cards, at the rate of 4,800 cards per hour; and this is a page of the concordance or index in its final arrangement.*
# 
# ## Headings
# 
# *The concordance which I am presenting as an example is precisely an off-set reproduction of tabulated sheets turned out by the accounting machine. The tabulator's performance is extremely useful when, to use the current technical phrase, it is running in tab.
# When another machine called the Summary Punch is connected to the accounting machine running in tab, while the latter is turning out the long tabulated list of different words, the former, electrically controlled by the accounting machine, simultaneously punches a new card for each of these words, thus providing ready headings to be placed before the single groups of lines or quotations. If necessary, these can be inserted in their proper place among all the others automatically by the collator.*
# 
# ## Phrases
# 
# *This Collator, which searches simultaneously two separate groups of cards at the rate of 20,000 per hour, and can insert, substitute and change cards from one group with the cards from the other group, also offers some initial solutions to the problem of finding phrases or compound expressions. Taking, for example, the expression according to: the group of cards containing according and that containing to are processed in the machine; on the basis of the identical quotation, the machine will extract all those cards on which both appear. It is true that they may be separated by other words, but one thing is certain, namely that all the cards bearing according to will be among those extracted; the eye and the hand must do the rest. It is still easier to obtain the same result when a card bearing the phrase sought for can be used as a pilot-card.*

# ---
# 
# CC-BY Stéfan Sinclair & Geoffrey Rockwell
# 
# Last updated September 14, 2016.