Deep drama

Here, we write 21st-century classical Greek drama based on the works of Sophocles, Euripides, Aristophanes and Aeschylus.

For this, we regress a single character on a vector of preceding characters using a sequence model (a long short-term memory network). The trained network is then used to generate a new sequence of characters, i.e. to hopefully write classical Greek drama for us.
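
To make the setup concrete, here is a minimal sketch of the windowing idea on a toy string (the string and the window length of 5 are made up for illustration; the real pipeline below uses windows of 40 characters over the full corpus):

library(tokenizers)
library(purrr)

toy     <- tokenize_characters("to be or not to be", strip_non_alphanum = FALSE, simplify = TRUE)
toy_len <- 5

# each element pairs a window of 5 characters (the predictors) with the
# character that follows it (the response)
pairs <- map(seq(1, length(toy) - toy_len),
             ~list(sent  = toy[.x:(.x + toy_len - 1)],
                   nextc = toy[.x + toy_len]))
pairs[[1]]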

Enjoy!

In [1]:
library(keras)
library(readr)
library(stringr)
library(purrr)
library(tokenizers)
library(gutenbergr)
suppressMessages(library(dplyr))

Get the drama from Project Gutenberg.

In [2]:
gutenberg_works(str_detect(author, "Sophocles|Euripides|Aeschylus|Aristophanes")) %>% .[1:20,]
gutenberg_id | title | author | gutenberg_author_id | language | gutenberg_bookshelf | rights | has_text
31 | Plays of Sophocles: Oedipus the King; Oedipus at Colonus; Antigone | Sophocles | 26 | en | Harvard Classics/Best Books Ever Listings/Plays | Public domain in the USA. | TRUE
2562 | The Clouds | Aristophanes | 965 | en | Classical Antiquity/Banned Books from Anne Haight's list | Public domain in the USA. | TRUE
2571 | Peace | Aristophanes | 965 | en | Classical Antiquity/One Act Plays | Public domain in the USA. | TRUE
3012 | The Acharnians | Aristophanes | 965 | en | Classical Antiquity | Public domain in the USA. | TRUE
3013 | The Birds | Aristophanes | 965 | en | One Act Plays/Banned Books from Anne Haight's list/Classical Antiquity | Public domain in the USA. | TRUE
5063 | The Iphigenia in Tauris of Euripides | Euripides | 1680 | en | Opera/Classical Antiquity | Public domain in the USA. | TRUE
7700 | Lysistrata | Aristophanes | 965 | en | Classical Antiquity/Banned Books from Anne Haight's list | Public domain in the USA. | TRUE
7998 | The Frogs | Aristophanes | 965 | en | Classical Antiquity/Harvard Classics | Public domain in the USA. | TRUE
8418 | Hippolytus; The Bacchae | Euripides | 1680 | en | Classical Antiquity/Harvard Classics | Public domain in the USA. | TRUE
8604 | The House of Atreus; Being the Agamemnon, the Libation bearers, and the Furies | Aeschylus | 2825 | en | Harvard Classics/Classical Antiquity | Public domain in the USA. | TRUE
8688 | The Eleven Comedies, Volume 1 | Aristophanes | 965 | en | Classical Antiquity | Public domain in the USA. | TRUE
8689 | The Eleven Comedies, Volume 2 | Aristophanes | 965 | en | Classical Antiquity | Public domain in the USA. | TRUE
8714 | Four Plays of Aeschylus | Aeschylus | 2825 | en | Classical Antiquity | Public domain in the USA. | TRUE
10096 | The Trojan women of Euripides | Euripides | 1680 | en | Classical Antiquity | Public domain in the USA. | TRUE
10523 | Alcestis | Euripides | 1680 | en | Classical Antiquity/Opera | Public domain in the USA. | TRUE
14322 | The Electra of Euripides Translated into English rhyming verse | Euripides | 1680 | en | Classical Antiquity | Public domain in the USA. | TRUE
14417 | The Agamemnon of Aeschylus Translated into English Rhyming Verse with Explanatory Notes | Aeschylus | 2825 | en | Classical Antiquity/Harvard Classics | Public domain in the USA. | TRUE
14484 | The Seven Plays in English Verse | Sophocles | 26 | en | NA | Public domain in the USA. | TRUE
15081 | The Tragedies of Euripides, Volume I. | Euripides | 1680 | en | Best Books Ever Listings | Public domain in the USA. | TRUE
27458 | Aeschylus' Prometheus Bound and the Seven Against Thebes | Aeschylus | 2825 | en | NA | Public domain in the USA. | TRUE
In [3]:
books <- gutenberg_works(str_detect(author, "Sophocles|Euripides|Aeschylus")) %>% 
    pull(gutenberg_id) %>%
    gutenberg_download %>%
    pull(text)
Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org

Do some preprocessing.

In [8]:
text <- books %>%
  str_to_lower() %>%
  str_c(collapse = "\n") %>%
  tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

chars <- text %>%
  unique() %>%
  sort()

print(length(text))
[1] 3477455

Set the window length, i.e. the number of preceding characters used as features to predict the next one.

In [9]:
maxlen <- 40

Parse the data into a list of (window, next character) pairs first; empirically, this is just faster.

In [10]:
dataset <- map(
  seq(1, length(text) - maxlen - 1, by = 3), 
  ~list(sent = text[.x:(.x + maxlen - 1)], 
        nextc = text[.x + maxlen])) %>%
    transpose

Compute the one-hot encoded feature and response arrays.

In [11]:
x <- array(0, dim = c(length(dataset$sent), maxlen, length(chars)))
y <- array(0, dim = c(length(dataset$sent), length(chars)))
for(i in seq(length(dataset$sent)))
{
    # one-hot encode the window of maxlen characters (maxlen x length(chars))
    x[i,,] <- sapply(chars, function(ch) as.integer(ch == dataset$sent[[i]]))
    # one-hot encode the next character (length(chars))
    y[i,]  <- as.integer(chars == dataset$nextc[[i]])
}
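
For a quick sanity check of the encoding (purely illustrative, not part of the original pipeline), one can inspect a single one-hot vector and the array shapes:

x[1, 1, ]   # one-hot vector for the first character of the first window
dim(x)      # number of windows x maxlen x number of distinct characters
dim(y)      # number of windows x number of distinct characters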

Set up the model, compile it and train it. This might take a while (30 minutes or so, depending on whether you use AVX or not).

In [12]:
file.m0 <- "deep_drama-full_data.h5"
In [24]:
if (!file.exists(file.m0))
{
    model <- keras_model_sequential()
    model %>%
        layer_lstm(128, input_shape = c(maxlen, length(chars))) %>%
        layer_activation("relu") %>%
        layer_dense(length(chars)) %>%
        layer_activation("softmax") %>%
        compile(loss = "categorical_crossentropy", 
                optimizer = optimizer_adam(lr = 0.01)) %>%
        fit(x, y, batch_size = 128, epochs = 10,
           callbacks =  callback_lambda(on_epoch_end = function(epoch, logs) {
               cat(sprintf("Epoch: %02d ... I haven't died yet.\n", epoch))
           }))
    
    save_model_hdf5(model, file.m0)
}

model <- load_model_hdf5(file.m0)
Epoch: 00 ... I haven't died yet.
Epoch: 01 ... I haven't died yet.
Epoch: 02 ... I haven't died yet.
Epoch: 03 ... I haven't died yet.
Epoch: 04 ... I haven't died yet.
Epoch: 05 ... I haven't died yet.
Epoch: 06 ... I haven't died yet.
Epoch: 07 ... I haven't died yet.
Epoch: 08 ... I haven't died yet.
Epoch: 09 ... I haven't died yet.

Having trained the model, we randomly select a position in the text and take a seed string of maxlen characters from it. We then generate 400 new characters, one at a time, each sampled from the predicted distribution over the alphabet. Here is the masterpiece:

In [26]:
idx <- sample(seq(length(text) - maxlen - 1), 1)
string <- text[idx:(idx + maxlen - 1)]
generated <- ""
for(i in seq(400))
{
    # one-hot encode the current string into a 1 x maxlen x length(chars) array
    z <- sapply(chars, function(e) { as.integer(e == string) })
    z <- array_reshape(z, c(1, dim(z)))
    # sample the index of the next character from the predicted distribution
    ni <- predict(model, z) %>% rmultinom(1, 1, prob = .) %>% which.max()
    # save the predicted character
    generated <- str_c(generated, chars[ni], collapse = "")
    # append the predicted char to the string and remove its first char
    string  <- c(string[-1], chars[ni])
}    

cat(str_split(generated, '\n')[[1]], sep="\n")
wretched siien colimily, that for them
iolate
  friends how they
will are;
    us to man.

                    poled me,
a brooted contapes, and the dear dear than the muster his usans
indeed
chatal stand, and tire the might that pombanted all only was there; no breath, sanned sophaps. and yet, day
marrecchypetager of odemess stall flocklow-children fatates, ontae, the light, theesf to goreit]
who

Wow. This is beautiful! It does not look like Greek drama, but rather like Finnegans Wake.
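
If you want to steer how adventurous the output is, a common variant (not used in this notebook) is to rescale the predicted probabilities with a temperature before sampling. A minimal sketch, where sample_next is a hypothetical helper:

# hypothetical helper, not part of the original notebook: sharpen (temperature < 1)
# or flatten (temperature > 1) the predicted distribution before sampling
sample_next <- function(preds, temperature = 1.0) {
  preds <- log(preds + 1e-8) / temperature  # rescale in log space
  preds <- exp(preds) / sum(exp(preds))     # renormalise to probabilities
  which.max(rmultinom(1, 1, prob = preds))  # draw one character index
}

# e.g. replace the sampling line in the loop above with:
# ni <- sample_next(predict(model, z), temperature = 0.5)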