# Collate outside the notebook

## Python files, input files, output file

• Set up a PyCharm project
• Create a Python file
• Run the script
  • In PyCharm
  • In the terminal
• Input files
• Output file
• Exercise

Here is another way to run the scripts you produced in the previous tutorials (note: even though they technically mean different things, we will use the words code, script and program interchangeably). This tutorial assumes that you have already gone through the tutorials on Collate plain texts (1 and 2) and on the different Collation outputs. Everything that we do here is also possible in a Jupyter notebook, and certain sections, such as Input files, recap something already seen in the previous tutorials.

In the Command line tutorial, we briefly saw how to run a Python program. In the terminal, type

python myfile.py



replacing “myfile.py” with the name of your Python program.

### Again on file system hygiene: directory 'Scripts'

In this tutorial, we will create Python programs. Where should you save the files that you create? Remember that we created a directory for this workshop, called 'Workshop'. Now let's create a sub-directory, called 'Scripts', to store all our Python programs.

## Set up a PyCharm project

If you are using PyCharm for these exercises, it is worth setting up a project that will automatically save the files you create to the 'Scripts' directory you just created (see above). To do this, open PyCharm and select New Project from the File menu. In the dialogue box that appears, navigate to the 'Scripts' directory you made for this workshop by clicking the button with '...' on it, to the right of the location box. Then click Create. This creates a new project that saves all of its files to the folder you have selected.

## Create a Python file

Let's do this step by step. First of all, create a Python file.

• Open PyCharm, if you downloaded it before, or another text editor: Notepad++ for Windows or TextWrangler for Mac OS X.
• Create a new file, then copy and paste the code we used before:
In [1]:
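from collatex import *
# Create an empty collation and add one witness per version of the sentence
collation = Collation()
collation.add_plain_witness( "A", "The quick brown fox jumped over the lazy dog." )
collation.add_plain_witness( "B", "The brown fox jumped over the dog." )
collation.add_plain_witness( "C", "The bad fox jumped over the lazy dog." )
# Align the witnesses and print the alignment table
table = collate(collation)
print(table)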

+---+-----+-------+-------+---------------------+------+------+
| A | The | quick | brown | fox jumped over the | lazy | dog. |
| B | The | -     | brown | fox jumped over the | -    | dog. |
| C | The | bad   | -     | fox jumped over the | lazy | dog. |
+---+-----+-------+-------+---------------------+------+------+

• Now save the file as 'collate.py' inside the directory 'Scripts' (see above). If you set up a project in PyCharm, the file should automatically be saved in the correct place.

## Run the script

### In PyCharm

• In PyCharm you can run the script using the Run button, or by choosing Run from the menu.
• The result will appear in a panel at the bottom of the PyCharm window.

### In the terminal

• Open the terminal and navigate to the folder where your script is, using the 'cd' command (again, refer to the Command line tutorial, if you don't know what this means). Then type

  python collate.py



If you are not in the directory where your script is, you need to specify the path to that file. If you are in the Home directory, for example, the command would look like

  python Workshop/Scripts/collate.py
• The result will appear below in the terminal.
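One consequence of starting a script from another directory is worth keeping in mind: relative paths written inside the script are interpreted from the directory you are in when you type the command, not from the directory that holds the script. The sketch below shows one possible way to anchor a path to the script's own folder instead. The helper name `beside_script` is hypothetical and not part of any library.

```python
from pathlib import Path


def beside_script(script_file, relative_path):
    """Return 'relative_path' interpreted from the folder holding the script."""
    # resolve() turns the script location into an absolute path,
    # and parent is the directory that contains it.
    script_dir = Path(script_file).resolve().parent
    return script_dir / relative_path


# Inside a script, Python stores the script's own location in __file__,
# so a call could look like: beside_script(__file__, "outfile.xml")
print(beside_script("Workshop/Scripts/collate.py", "outfile.xml"))
```

With a helper like this, the script would find its files whether you start it from 'Scripts' or from the Home directory.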

## Input files

In the first tutorial, we saw how to use texts stored in files as witnesses for the collation. We used the open function to open each text file and assign its contents to a variable with an appropriately chosen name; and don't forget the encoding="utf-8" bit!
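As a reminder of that pattern, here is a minimal sketch of reading one witness file. It uses Python's `with` statement, a common variant that closes the file automatically once the text has been read. The function name `read_witness` is a hypothetical choice for illustration.

```python
def read_witness(path):
    """Return the full text of one witness file, decoded as UTF-8."""
    # The 'with' block closes the file automatically, even if reading fails.
    with open(path, encoding="utf-8") as infile:
        return infile.read()


# Hypothetical usage with one of the Darwin files used in this tutorial:
# witness_1859 = read_witness("../fixtures/Darwin/txt/darwin1859_par1.txt")
```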

Let's try to do the same in our script 'collate.py', using the data in fixtures/Darwin/txt (only the first paragraph: _par1) and producing an output in XML/TEI. The code will look like this:

In [2]:
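from collatex import *
collation = Collation()
# Read the text of each witness file into a variable
witness_1859 = open( "../fixtures/Darwin/txt/darwin1859_par1.txt", encoding='utf-8' ).read()
witness_1860 = open( "../fixtures/Darwin/txt/darwin1860_par1.txt", encoding='utf-8' ).read()
witness_1861 = open( "../fixtures/Darwin/txt/darwin1861_par1.txt", encoding='utf-8' ).read()
witness_1866 = open( "../fixtures/Darwin/txt/darwin1866_par1.txt", encoding='utf-8' ).read()
witness_1869 = open( "../fixtures/Darwin/txt/darwin1869_par1.txt", encoding='utf-8' ).read()
witness_1872 = open( "../fixtures/Darwin/txt/darwin1872_par1.txt", encoding='utf-8' ).read()
# Add each witness to the collation, using the year as its siglum
collation.add_plain_witness( "1859", witness_1859 )
collation.add_plain_witness( "1860", witness_1860 )
collation.add_plain_witness( "1861", witness_1861 )
collation.add_plain_witness( "1866", witness_1866 )
collation.add_plain_witness( "1869", witness_1869 )
collation.add_plain_witness( "1872", witness_1872 )
table = collate(collation, output='tei')
print(table)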

<p><app><rdg wit="#1866 #1869 #1872">Causes of Variability. </rdg></app>WHEN we <app><rdg wit="#1859 #1860 #1861 #1866">look to </rdg><rdg wit="#1869 #1872">compare </rdg></app>the individuals of the same variety or sub-variety of our older cultivated plants and animals, one of the first points which strikes us<app><rdg wit="#1859 #1860 #1861 #1866">, </rdg></app>is, that they generally differ <app><rdg wit="#1859">much </rdg></app><app><rdg wit="#1859 #1860 #1861 #1866 #1872">more </rdg></app>from each other<app><rdg wit="#1859">, </rdg><rdg wit="#1869">more </rdg></app>than do the individuals of any one species or variety in a state of nature. <app><rdg wit="#1859 #1860 #1861 #1866">When </rdg><rdg wit="#1869 #1872">And if </rdg></app>we reflect on the vast diversity of the plants and animals which have been cultivated, and which have varied during all ages under the most different climates and treatment, <app><rdg wit="#1859 #1860 #1861 #1866">I think </rdg></app>we are driven to conclude that this <app><rdg wit="#1859">greater </rdg><rdg wit="#1860 #1861 #1866 #1869 #1872">great </rdg></app>variability is <app><rdg wit="#1859 #1860 #1861 #1866">simply </rdg></app>due to our domestic productions having been raised under conditions of life not so uniform as, and somewhat different from, those to which the parent-species <app><rdg wit="#1859 #1860 #1861 #1866">have </rdg><rdg wit="#1869 #1872">had </rdg></app>been exposed under nature. There is<app><rdg wit="#1859 #1872">, </rdg></app>also<app><rdg wit="#1859 #1860 #1861 #1866 #1869">, I think</rdg></app>, some probability in the view propounded by Andrew Knight, that this variability may be partly connected with excess of food. 
It seems <app><rdg wit="#1859 #1860 #1861 #1866">pretty </rdg></app>clear that organic beings must be exposed during several generations to <app><rdg wit="#1859 #1860 #1861 #1866">the </rdg></app>new <app><rdg wit="#1859 #1860 #1861 #1866">conditions of life </rdg><rdg wit="#1869 #1872">conditions </rdg></app>to cause any <app><rdg wit="#1859 #1860 #1861 #1866 #1869">appreciable </rdg><rdg wit="#1872">great </rdg></app>amount of variation; and that <app><rdg wit="#1866 #1869 #1872">, </rdg></app>when the organisation has once begun to vary, it generally <app><rdg wit="#1859 #1860 #1861 #1866 #1872">continues </rdg><rdg wit="#1869">con- tinues </rdg></app><app><rdg wit="#1859 #1860 #1861 #1866">to vary </rdg><rdg wit="#1869 #1872">varying </rdg></app>for many generations. No case is on record of a variable <app><rdg wit="#1859 #1860 #1861 #1866">being </rdg><rdg wit="#1869 #1872">organism </rdg></app>ceasing <app><rdg wit="#1859 #1860 #1861 #1866">to be variable </rdg><rdg wit="#1869 #1872">to vary </rdg></app>under cultivation. Our oldest cultivated plants, such as wheat, still <app><rdg wit="#1859 #1860 #1861 #1866">often </rdg></app>yield new varieties: our oldest domesticated animals are still capable of rapid improvement or modification.
</p>

• Now save the file (just 'save', or 'save as' with another name, such as 'collate-darwin-tei.py', if you want to keep both scripts).
• Run the new script (run it in PyCharm, or type python collate.py or python collate-darwin-tei.py in the terminal). This may take a bit longer than the fox and dog example.
• The result will appear below.
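The six file names in the script differ only in the year, so the reading step can also be written as a loop. The following is only a sketch: `darwin_paths` and `load_witnesses` are hypothetical helper names, and the default directory is the relative path used in the script above.

```python
from pathlib import Path

# Years of the six Darwin editions; they double as witness sigla.
YEARS = ("1859", "1860", "1861", "1866", "1869", "1872")


def darwin_paths(directory="../fixtures/Darwin/txt", years=YEARS):
    """Map each siglum (the year) to the path of its first-paragraph file."""
    base = Path(directory)
    return {year: base / f"darwin{year}_par1.txt" for year in years}


def load_witnesses(paths):
    """Read every file in 'paths' and return a siglum -> text dictionary."""
    return {siglum: Path(path).read_text(encoding="utf-8")
            for siglum, path in paths.items()}
```

Each siglum and text pair returned by `load_witnesses` could then be handed to `collation.add_plain_witness`, the method used in the fox and dog example.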

## Output file

Looking at the result this way is not very practical, especially if we want to save it. It is better to store the result in a new file, which we call 'outfile' (but you can give it another name if you prefer). To create and open 'outfile', we need to add this line of code:

In [3]:
outfile = open('outfile.txt', 'w', encoding='utf-8')


If we are going to produce output in XML/TEI, we can specify that 'outfile' will be an XML file, and the same goes for any other format. Below are two examples, the first for an XML output file, the second for a JSON output file:

In [4]:
# Example 1: an XML output file
outfile = open('outfile.xml', 'w', encoding='utf-8')
# Example 2: a JSON output file
outfile = open('outfile.json', 'w', encoding='utf-8')
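To keep the file extension in step with the output format, one option is a small lookup. This is only an illustrative sketch: the dictionary `EXTENSIONS` and the function `outfile_name` are hypothetical, and only the two formats discussed here are listed.

```python
# Hypothetical mapping from a collate() output format to a file extension.
# Only 'tei' and 'json' are listed; anything else falls back to '.txt'.
EXTENSIONS = {"tei": ".xml", "json": ".json"}


def outfile_name(output_format, stem="outfile"):
    """Build a file name whose extension matches the chosen output format."""
    return stem + EXTENSIONS.get(output_format, ".txt")


print(outfile_name("tei"))   # outfile.xml
print(outfile_name("json"))  # outfile.json
```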


Now we add the outfile chunk to our code above. The new script is:

In [5]:
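from collatex import *
collation = Collation()
# Read the text of each witness file into a variable
witness_1859 = open( "../fixtures/Darwin/txt/darwin1859_par1.txt", encoding='utf-8' ).read()
witness_1860 = open( "../fixtures/Darwin/txt/darwin1860_par1.txt", encoding='utf-8' ).read()
witness_1861 = open( "../fixtures/Darwin/txt/darwin1861_par1.txt", encoding='utf-8' ).read()
witness_1866 = open( "../fixtures/Darwin/txt/darwin1866_par1.txt", encoding='utf-8' ).read()
witness_1869 = open( "../fixtures/Darwin/txt/darwin1869_par1.txt", encoding='utf-8' ).read()
witness_1872 = open( "../fixtures/Darwin/txt/darwin1872_par1.txt", encoding='utf-8' ).read()
# Add each witness to the collation, using the year as its siglum
collation.add_plain_witness( "1859", witness_1859 )
collation.add_plain_witness( "1860", witness_1860 )
collation.add_plain_witness( "1861", witness_1861 )
collation.add_plain_witness( "1866", witness_1866 )
collation.add_plain_witness( "1869", witness_1869 )
collation.add_plain_witness( "1872", witness_1872 )
outfile = open('outfile-tei.xml', 'w', encoding='utf-8')
table = collate(collation, output='tei')
print(table, file=outfile)
# Close the file so that everything is written to disk
outfile.close()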


When we run the script, the result no longer appears below. Instead, a new file, 'outfile-tei.xml', is created in the directory 'Scripts' (provided you ran the script from there). Check what's inside!

If you want to change the location of the output file, you can specify a different path. If, for example, you want your output file on the Desktop, you would write
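A file opened for writing should also be closed, so that everything is actually written to disk. A minimal sketch of one way to ensure this uses Python's `with` statement; the function name `save_result` is a hypothetical choice.

```python
def save_result(result, path):
    """Write the collation result to 'path' as UTF-8 text."""
    # Leaving the 'with' block flushes and closes the file, so the
    # content is on disk as soon as the function returns.
    with open(path, "w", encoding="utf-8") as outfile:
        print(result, file=outfile)


# Hypothetical usage with the variable from the script above:
# save_result(table, "outfile-tei.xml")
```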

In [6]:
outfile = open('C:/Users/Elena/Desktop/output.xml', 'w', encoding='utf-8')
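The path above only works on one particular Windows account. A more portable sketch builds the path from the current user's home directory with `pathlib`. It assumes that the Desktop folder sits directly inside the home directory and is called 'Desktop', which is not the case on every system.

```python
from pathlib import Path

# Path.home() is the current user's home directory on any operating system.
# Assumption: the Desktop folder is directly inside it and is named 'Desktop'.
desktop_file = Path.home() / "Desktop" / "output.xml"
print(desktop_file)
```

The resulting `desktop_file` can be passed to `open` in place of the string, with the same 'w' and encoding arguments as before.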


N.b.: you can also create an output file when running your script in the Jupyter notebook! Depending on the path you specify, it will be created in your 'Notebook' directory or elsewhere.
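If you are unsure where a file given by a relative path will end up, a quick check like the following sketch prints the current working directory and the full location that a name such as 'outfile.xml' would resolve to. The variable name `where` is arbitrary.

```python
from pathlib import Path

# The directory that relative paths are interpreted from
print(Path.cwd())

# The absolute location that the relative name 'outfile.xml' points to
where = Path("outfile.xml").resolve()
print(where)
```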

## Exercise

Create a new Python script that produces an output in JSON, using the data in 'fixtures/Woolf/Lighthouse-1' (remember? We used the same data in another tutorial). Take care to specify the input files, the output file (and its extension) and the output format correctly.