#!/usr/bin/env python
# coding: utf-8

# # Theory of collation
# ## Gothenburg model
# 
# The *Gothenburg model* emerged from a meeting of the developers of [CollateX](http://collatex.net) and [Juxta](http://www.juxtasoftware.org/) in Gothenburg, Sweden in 2009 at a joint workshop of the EU-funded research projects [Open Scholarly Communities on the Web (COST Action 32)](http://www.cost-a32.eu/index.html) and [Interedition]( http://www.interedition.eu) (http://wiki.tei-c.org/index.php/Textual_Variance). The Gothenburg model conceptualizes collation as a pipeline that involves the following independent but coordinated modules, so that one or another may be modified without requiring revision of the rest of the system. The stages are:
# 
# 1. Tokenization
# 1. Normalization/regularization
# 1. Alignment
# 1. Analysis
# 1. Visualization/output
# 
# The Gothenburg model is mostly about separation of concerns: it tries to find a division of the work processes involved with collation that results in steps that could reasonably be seen as self contained units or tasks. The aim is to minimize the dependencies or couplings between these steps—ideally one step's output is the next step's input only, and there are no other strong relations or dependencies between the steps. This separation of concerns ensures that we can see each task as a self-contained challenge to be solved. How these tasks should be solved is not defined by the Gothenburg model. The implementation is free and up to any one who wants (to try) to provide a solution. The developers of CollateX wanted to try for a maximal computational solution (that is: one that if possible would computationally fully automate the task of collation). 
# 
# The description of the five stages of collation below presupposes that the texts to be collated have already been transcribed and digitized, which is to say that transcription and digitization function as an implicit Stage 0 in the pipeline.

# ## The different phases
# ### 1. Tokenization
# *Tokenization* is the division of a continuous text into units to be aligned (called *tokens*). Most commonly, tokens are whitespace-delimited words, but tokenization can be performed at any level of granularity, e.g., “syllables, words, lines, phrases, verses, paragraphs, or text nodes” (http://wiki.tei-c.org/index.php/Textual_Variance). Challenges that arise during tokenization include the following:
# 
# * **Ambiguity.** In texts written without spaces between works (*scriptio continua*), the division into words may be ambiguous, that is, it may be possible to divide the same continuous writing in two ways, either of which would be linguistically correct.
# * **Punctuation.** Punctuation is commonly tokenized by itself, so that, for example, “cat” and “cat,” (without and with a trailing comma) will be recognized as instances of the same word. The situation is less clear with non-final punctuation, though, such as hyphenated words.
# * **Contractions** like English “doesn’t” or “can’t” raise questions about whether they should be treated as one word or two for the purpose of collation.
# 
# The preceding issues affect tokenization on an intellectual level in that they involve decisions by the researcher that would arise with or without a computational environment. Machine-assisted algorithmic tokenization, though, raises additional challenges, including the following: 
# 
# * **Word-internal punctuation** means that it cannot safely be assumed that a word is a continuous sequence of alphabetic characters, and that a punctuation character indicates the beginning of a new token. In addition to the ambiguities mentioned above that involve hyphenation and English negative contractions, consider:
#     * **Lexical contraction.** The contraction of “Amsterdam” as “A’dam”, which is lexically specific.
#     * **Punctuation before bound morphemes.** The English “-’s” possessive is different from English negative contractions in “-n’t”. The “-n’t” portion might be understood as a variant spelling of “not” (insofar as “doesn’t” may be replaced by “does not”, etc., without violating English language norms), but the “-’s” possessive particle does not have a free-standing lexical counterpart in modern English.
# * **Superscription.** In some scribal practice, such as Church Slavonic manuscripts, it is common to write some letters as superscripts, and the base and superscript letters may belong to different words. For example, “ona že” (‘and she’) may be written as “ona<sup>ž</sup>”. The visual form that appears in the manuscript isn’t easily reproduced here in a web interface, but the superscript “ž” is not merely raised, but also centered over the “a” that ends the preceding word. There is no graphic difference in this type of writing between a combination of base plus superscript letter when they are part of the same word and when they belong to different words.
# * **Markup**, such as XML, may be intermingled with textual content, and XML element tags may surround an entire word or part of a word, or they may begin inside one word and end inside another, which poses special challenges for tokenizing in a way that does not contradict XML well-formedness. Even where well-formedness is not an issue, researchers may differ in their preferences for taking markup into account when collating a set of witnesses.
# 
# CollateX comes with a default tokenizer with the following properties:
# 
# * It divides the text into tokens at white space.
# * Punctuation is tokenized separately from alphanumeric characters.
# 
# This means that, for example, the default tokenizer would split the input “Peter’s cat.” into five tokens: “Peter”, “’”, “s ”, “cat”, and “.”. The tokenizer keeps trailing white space with the preceding token, so the third token contains two characters, the possessive “s” and the space that follows it. By default, CollateX ignores trailing white space when it compares tokens for alignment (Stage 3 of the Gothenburg model), so “cat” (three alphabetic characters) and “cat ” (three alphabetic characters followed by a space) would be treated as the same for collation purposes, but the trailing white space in the input in the second example would remain available should the user need to retrieve it later. 
# 
# In conformity with the modular structure of the Gothenburg model, if the default tokenizer built into CollateX does not meet your research needs, you can replace it with your own without having to write your own code to modify the behavior of later parts of the Gothenburg processing pipeline. We discuss how to do this later in the workshop.

# ### 2. Normalization/regularization
# 
# Some degree of normalization is inevitable during transcription because the editor has to transform handwriting, which is analog, into a digital representation. Every instance of a handwritten letter is likely to differ at least minutely from every other instance of the same letter, and the editors will nonetheless transcribe them the same way if they do not consider the differences meaningful. In diplomatic editions editors aim for maximal fidelity to the original, but even there some degree of conflation of variants is inevitable. The normalization process in the Gothenburg model does not refer to normalization during transcription. Normalization in the Gothenburg model refers to situations where the editor has transcribed two different written forms differently because the difference is important in some way, but it is not important for alignment. For that reason, the normalization stage of the Gothenburg model refers to the process of telling the alignment stage, which follows, to treat some graphically different forms as if they were the same for the purpose of aligning them.
# 
# One common guideline in digital editing is to ignore *non-substantive* variation when performing comparisons for the purpose of alignment, although the understanding of non-substantive is both project-specific and subjective. Many (not all) editors may choose to normalize away differences in punctuation and in upper vs. lower case. Some editors may further elect to normalize orthographic variation (the use of what humans would regard as different spellings, with different letters), variant letterforms (what humans would regard as the same letters, but with different shapes), abbreviation, superscription, etc. If an editor considers these types of variation completely unimportant, they might be normalized during transcription, but if the variants need to be available for the output but should not be allowed to influence the alignment, then can be normalized between the tokenization and the alignment stages of the Gothenburg model. This type of alignment creates a normalized *shadow copy* of the literal token, so that the normalized form can be used to determine what should be considered the same for alignment purposes, but the eventual output can nonetheless preserve and return the original spelling.
# 
# The types of textual variation that occur during the copying of texts may be categorized as the *insertion* of new textual material, the *deletion* of textual material present in the source, the *substitution* of new text in place of old, and the *transposition* of old material to a new location. Transposition can be encoded as deletion of material from one location and its insertion in another, but if the editor believes that the two acts are related, that is, that the material has been moved and the deletion and insertion are not independent of each other, the eventual edition should record that fact.
# 
# Editors understand variation as *substantive* or *non-substantive*, but that distinction depends on the writing system, the goals of the edition, and even the purpose of the textual comparison. For example, for the purpose of alignment we might prefer to ignore grammatical differences and align only on lemmata, while for philological analysis and interpretation we might wish to distinguish which witnesses use which inflected forms of the same lexeme. Most commonly, though, substantive variation includes *equipollent* readings (that is, different but equivalent or synonymous textual content), *linguistic* variation, and *scribal error*. Non-substantive variation commonly includes *graphic variants* (such as variant ways of writing the same letter), *punctuation*, upper vs lower *case,* and *orthographic variation* (different spellings of the same word form).
# 
# *Variant letterforms* (different shapes of what is meant to represent the same basic letter), and *abbreviation*. Which of these types of variation count as substantive or non-substantive depends on the nature of the writing system and the types of roles the edition is intended to fulfill. One project may reasonably regard (and process) as substantive a type of variation that another project would regularize as non-substantive.
# 
# *Markup normalization* requires attention at both the tokenization and the normalization stage. The markup is part of the stream of characters, so when the text is divided into words or other tokens, the tokenizer must know what to do with any markup it encounters. For example, the characters “text” in an XML tag `<text>` should not be treated the same way as the word “text” in the content of an element. This means that either the markup must be treated as something other than textual content during tokenization or it must be handled specially during normalization.
# 
# The output of tokenization (Stage 1) serves as the input to normalization (Stage 2), which provides, in turn, the input for alignment (Stage 3). In CollateX, the output of normalization works by encoding each token as a complex object that contains both a textual (*t*) property and a normalized (*n*) property. The *t* property remains available throughout all subsequent stages of the Gothenburg model, and is therefore available for rendering in the eventual output visualization. The *n* property is used in the alignment stage to tell the collation engine which different *t* values should be treated as the same for alignment purposes. We will discuss how CollateX represents and uses these properties later in the workshop.
# 
# By default, the only normalization that CollateX performs is that it ignores trailing white space at the ends of tokens. This means that, for example, “Peter’s&nbsp;cat” and “Peter’s&nbsp;&nbsp;cat”, which differ (this may be difficult to see in the HTML rendering) only in that there are two spaces between the words in the second version, would be aligned as if they contained exactly the same tokens. Some projects are likely to need to perform more aggressive normalization, which can range from the algorithmic, such as case folding (treat upper- and lowercase letters as if they were the same) or lemmatization (align according to the base lexeme, ignoring inflection information), to the idiosyncratic, such as a synonym lookup dictionary that might recognize that “3” and “three” should be treated as equivalent for alignment purposes.
# 
# What counts as substantive variation (not to be normalized) and non-substantive variation (to be normalized) may differ not only among collation projects, but also among steps within a project. For example, for alignment purposes we might want to treat all inflected forms of the same lexeme as equivalent, but at a later analytic stage we might want to differentiate among inflected forms of the same lexeme. The normalization process described here refers to normalization for alignment purposes, and because CollateX retains information about the original (non-normalized) token in the *t* property, it remains available should it be needed at a later analytic stage. 
# 

# ### 3. Alignment
# The alignment stage of the Gothenburg model compares the normalized forms of the tokens for the witnesses and tries to create the best alignment possible. There are alignment ambiguities that are difficult even for a human to resolve unambiguously, such as aligning “Peter’s very blue cat” with “Peter’s very very blue cat”, where it is not clear which instance of “very” in the second witness is the appropriate alignment point for the single instance of “very” in the first. Where CollateX cannot make a principled decision, it makes an arbitrary one; in this case it would align “very” in the first witness arbitrarily with the first instances of “very” in the second.
# 
# To instantiate the alignment CollateX builds a *variant graph*, a structure that can be represented through the following visualization:
# 
# <img src="variant_graph.png"/>
# 
# The graph is *directed* from an empty *start* node (the square at the left numbered 5) to an empty *end* node (the square on the right numbered 3). The specific numbers in the start and end nodes are arbitrary, and are not important for understanding the graph. The graph also contains a lot of diagnostic information that is introduced by CollateX and that is useful for developers. The way to read the graph is as follows:
# 
# * There are three witnesses, which we number 1, 2, and 3. Their textual readings are as follows:
#     * Witness 1: a b c d
#     * Witness 2: a c d b
#     * Witness 3: b c d
# * The text of each witness is represented as a path through the graph. All paths start at the start node on the left and follow the numbered arrows until they reach the end node on the right. This means that you can, for example, trace the four tokens of Witness 1 by following the arrows numbered 1 from the start to the end. 
# * The squares that represent nodes other than start and end all contain tables that consist of two-column rows. The top row presents a normalized reading in the left column and the heading label “Sigla” in the right. This header row is followed by one row for each different actual reading that is grouped within the normalization, and in those rows the left cell provides the literal text of the reading and the right cell lists the witnesses in which it appears. In this example, the leftmost node after the start node has a normalized reading of “a” (left cell in the top row), which appears in witnesses 1 and 2, and the actual (non-normalized) reading that appears in those witnesses is also “a” (the reading is in the left cell of the second row and the witness numbers in the right cell of the same row). The node that represents the token “d” looks odd because it has two rows in which the literal reading looks identical in the web rendering, but they’re actually different. Both are normalized to “d”, but in Witness 2 the “d” has a trailing white space because it’s followed by another token (recall that with the default tokenizer, the white space that separates tokens is kept with a preceding token in the textual reading, but ignored in the derived normalized version), while in Witnesses 1 and 3 it has no trailing space because it’s at the end of the witness text, and there was no space character there. If the visualization showed white space as a visible character, we would see that the reading for Witnesses 1 and 3 is one character long (just “d”), while the one for Witness 2 is two characters long (“d” plus the trailing space character). What the collation engine sees, though, is that they have the same normalized version (just “d”), so it groups them together in the variant graph.
# 
# CollateX can output the variant graph, and we sometimes do that for development purposes, but it is not a traditional way of representing variation to philologists. The variant graph is nonetheless used as an output format, intended for human interaction, in some modern applications, such as the [Stemmaweb](https://stemmaweb.net/stemmaweb/) collection of tools for analysis of collated texts. We discuss visualization below, under Stage 5 of the Gothenburg model, but we introduce the variant graph here because of the insights it provides into Stage 3, the alignment module.
# 
# <img align="right" src="Collation_Aligner.png"/>So how does the alignment process build the variant graph? In the example to the right, imagine that each column represents a witness and each square with a letter in it (but not the ones with dashes) in the column represents a token. Witness 1 has four tokens (“a”, “b”, “c”, “d”), Witness 2 has the same four tokens, but in a different order (token “b” has been moved to the end), and Witness 3 has three tokens (“b”, “c”, “d”). At the alignment stage of the Gothenburg model, the alignment engine will recognize that all three witnesses have tokens “c” and “d” in order and connect them (represented by horizontal lines in the image here), which translates into grouping them into the same node in the variant graph above. It will also recognize that token “a” appears in only two witnesses, and it will align those and recognize that there is nothing in Witness 3 in the place where this token appears in the other two witnesses. The situation with token “b”, though, is complicated because it occurs in all three witnesses, but in different positions. At this stage the alignment process is not capable of changing the order of the tokens in the witnesses, so instead of recognizing the instance of “b” in Witness 2 as a counterpart of token “b” in the other two witnesses, it regards Witness 2 as having nothing corresponding to token “b” where it appears early in Witnesses 1 and 3, and then as having its own token “b” at the end that has no counterpart in the other two witnesses. For this reason, at this stage the alignment has not analyzed all of the variation that a human philologist would consider relevant to understanding the transmission of the text. In particular, this alignment (whether represented as the variant graph above or as the table to the right) treats the transposition of token “b” in Witness 2 as two mutually independent events, a deletion or omission and then an addition or insertion, while a human would regard those as part of a single transposition event.
# 
# Considerations that arise during the alignment stage of the Gothenburg model include the following:
# 
# * **Repetition.** As we saw with our example of “Peter’s very blue cat” and “Peter’s very very blue cat”, the alignment engine may have no principled way to decide how to align a token in one witness when there is more than one corresponding token in another. This affects not only immediate repetition, as in this example, but also repetition at a distance. For example, in an English-language text in which the word “the” is frequent, an alignment engine needs to decide which instances of this word in one witness to align with which instances in another. If the witnesses are very similar, that task may be easy, but if one witness has several more instances of “the” than the other, determining which have correspondences and which don’t becomes harder.
# * **Transposition.** In the example above, the alignment engine failed to recognize that “b” had been transposed, and not independently deleted in one place and then inserted in another. Recognizing transposition is not too difficult in this minimal example, but in a long text where, say “the” is missing in one place in one witness but present in another, it can be difficult to determine whether it has been transposed, or whether there really are independent deletions and insertions. And where “the” is missing in several places but present in an additional one that we suspect represents transposition, it may be difficult to determine which instance of the missing “the” is the source of the transposition. We use the word “the” here because it’s so common, but repetition and transposition can affect any word, whether common or rare or in between.
# * **Order effects.** Some alignment algorithms align multiple witnesses by first aligning two of them, then aligning a third against the output of aligning the first two, etc., adding one new witness each time. This is called *pairwise alignment*, and it breaks down a difficult problem (aligning multiple witnesses simultaneously) into a sequence of simpler ones (aligning just two things at a time). The cost of that simplification, though, is that we can get different output depending on the order in which we add witnesses to the alignment. These order effects are not informational: the optimal alignment of multiple witnesses should be independent of the order in which we happen to look at them. Some alignment algorithms seek to avoid order effects by aligning all witnesses at once, while others add one at a time but try to identify the optimal order for including them.
# * **Depth vs breadth.** A set of several witnesses will typically have areas where they all agree, areas where many of them agree (with different dissenters), and areas with other patterns of agreement. We might heuristically approach collation by assuming that the longest common sequence of tokens where all witness agree is probably an optimal moment of alignment, and we can apply the same heuristic to successively shorter sequences with almost all witnesses in agreement, but at some point we’ll run out of strong agreement points and have to decide whether to favor longer sequences of tokens with fewer witnesses in agreement (breadth) or shorter sequences with more witnesses in agreement (depth). It is not always clear how to make optimal decisions as the remaining sequences grow shorter and shallower.
# * **Computational complexity.** The alignment of variants is a computationally complex problem. A naive approach to collation would be to generate all possible alignments of all tokens in all witnesses and evaluate them to identify the best one (according to some definition of “best”), but the computational complexity of this sort of brute-force approach is so great that it could not be completed for any but the very smallest (that is, unnaturally and unrealistically small) collation sets. For that reason, all computational alignment algorithms must find ways to reduce the number of operations to fewer than what would be required by this brute-force approach.
# * **Exact vs near (fuzzy) matching.** Computers can identify exact matching quickly and efficiently, so we can tell that “The gray koala” and “The grey koala” match in their first and last word tokens and differ in their middle ones. What’s harder is to recognize that the middle tokens almost match, so if we have ten witnesses with ten different colors and we want to find the closest match, we need to calculate how close each one is, and that’s a more expensive operation than determining whether there is or is not an exact match. In this case we could tokenize on individual characters, rather than whitespace-delimited words, but in a real collation problem that would introduce other complications because the problems of repetition and transposition are much greater at the character level than the word level. By default CollateX recognizes only exact matching, but it has a near matching component that can be activated explicitly during the collation process. This component is engineered to minimize the complexity by staying out of the way most of the time, and operating only in situations where there are ambiguities in alignment that could potentially be resolved through near matching.

# ### 4. Analysis<img align="right" src="Collation_Analyzer.png"/>
# 
# Stage 3 of the Gothenburg model, *alignment*, operates by aligning tokens that have the same normalized *shadow copies* (the *n* properties discussed in Stage 2, *normalization/regularization*). The alignment process relies on exact matching because it is computationally inexpensive (that is, fast) to determine whether two tokens are or are not identical (*string equal*), but because *alignment* looks only at exact matches, situations with no exact match often cannot be resolved at that stage. For example, the strings “The gray koala” and “The big grey koala” (note the different spellings of “gray” ~ grey”) match in their first and last tokens, but the middle token “gray” in the first witness is not an exact match to either “big” or “grey” in the second. Because the alignment stage looks only at exact string matching, it cannot make this decision; what CollateX does in such situations (unless the user has turned on near matching) is arbitrarily look to the left, so it would align “gray” with “big”, rather than with “grey”, which a human would recognize easily as incorrect. If the user turns on near matching, though, CollateX will recognize that there is no exact match, and instead of defaulting automatically to the left candidate, it will look for the closest match and rearrange the alignment if necessary.
# 
# The *analysis* stage refers in this case to the *computational* analysis of the output of the *alignment* operation, with the goal of fine-tuning it to make decisions that cannot be made on the basis of exact string matching. That includes near matching, that is, finding the closest match in situations that lack an exact match, such as the one described above. It would too computationally expensive to run a near match calculation on every group of tokens, but reserving it for the post-alignment *analysis* means that it will be run only in situations where there is no exact alignment. The identification of transposition also happens in the *analysis* stage, because the *alignment* stage is not capable of rearranging the tokens in a witness, as in the example to the right.
# 
# As said, the Gothenburg model was mostly geared towards separation of concerns. It defines collation as a set of steps that could reasonably be seen as isolated tasks, making solving and implementing solutions easier than if those steps are intricately intertwined and connected. The Gothenburg model does not prescribe how these individual tasks should be solved. This decoupling also ensures that if somebody for one of the steps implements a better or quicker solution, or a solution that is very specific and adequate for some particular project, then that step can be easily replaced in workflows with the new implementation/solution. A perceptive reader will note however that the separation of concern, at least between the alignment and analysis steps, may not be perfect. In some cases collation workflows create a feedback loop between the stages of analysis and alignment. Some textual genesis phenomena (especially for instance transposition) are very hard to solve for a computer, and may indeed not be decisively computable at all. In such cases human intervention and judgement is needed to argue what happened during the genesis of the text and witnesses. In such cases this additional analysis information may yield valuable information on the basis of which the collating algorithm can refine its alignment. In such cases post processing information resulting from the analysis phase can be used to inform the rerun of alignment stage of the process. 
# 

# Transposition is the phenomenon where an author or copyist has moved a piece of text from one place in a witness to a different place in another witness. From a theoretical perspective one can regard a transposition as a deletion of certain tokens in one witness combined with an addition of the exact same tokens in another. Detecting a transposition from a computational perspective is a very hard problem. A computer can never determine with certainty an author’s intention. So instead of a fixed rule that always holds true, we use an heuristic: a measurement that takes a number of variables into account, like the number of tokens between the addition and the deletion, the length of the transposed piece of text and the frequency of the tokens involved in a transposition in the text. An example of a good heuristic is one that balances the length of the piece of text that is a candidate for a transposition against the moved distance. This means that a longer piece is text is allowed to move over a greater distance and still counts as a transposition, while a short piece of text is allowed to move only over a small distance. Another thing that complicates the detection of transposition is that tokens can be repeated in a text. If a piece of text is deleted twice in one witness and added once in another witness, transposition detection gets even harder, since the question then involves determining which of the deletions is part of the transposition. Usually the piece that is closest to its counterpart is the correct one.
# 
# ### 5. Visualization/output
# 
# The variant graph can be output as *SVG* (scalable vector graphics), which we’ve used for illustration above, but this should be understood as a *visualization* or *serialization* of the graph, and not as the graph itself. In CollateX the graph can be visualized directly as SVG or in GraphML, an XML vocabulary for representing graph information. The SVG is intended for visualization and the GraphML for subsequent processing. Neither of these is the sort of visualization that will be familiar to philologists, but because the graph is the most direct representation of the results of the alignment process, it may nonetheless provide insights into the structure of the collation that are more difficult to perceive with less immediate visualizations. The graph in principle is capable of encoding transpositions (and, for example, a visualization of the graph could draw a special kind of connecting line between the two nodes with the reading “b”), but at present CollateX does not record transposition information in the graph, which means that it cannot be represented in a visualization.
# 
# CollateX also supports the output of an *alignment table*, similar to the images above, where empty cells (represented in those illustrations with hyphens) are inserted to align the tokens visually in either rows or columns. CollateX can generate visualizations of the alignment table as plain text or as HTML, and the HTML visualization can present the witnesses in rows and the tokens in columns or vice versa, as follows:

# In[1]:


from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The quick brown fox jumps over the dog.")
collation.add_plain_witness("B", "The brown fox jumps over the lazy dog.")
alignment_table = collate(collation, output="html")


# In[2]:


alignment_table = collate(collation, output="html", layout="vertical")


# CollateX also supports output as JSON (intended for subsequent post-processing), generic XML (intended for subsequent postprocessing with XSLT), and TEI XML parallel segmentation notation. 

# <hr/>
# ###### **Acknowledgements**
# 
# <span style="font-size: 80%;">Some images and information in this tutorial were derived from Haentjens Dekker, R. et al., 2015. Computer supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project. *Literary and Linguistic Computing*, 30(3), pp.452–470; and from http://wiki.tei-c.org/index.php/Textual_Variance. Some additional textual content on this page was generously contributed by Helena Bermúdez Sabel.</span>