Wikipedia Text Generation (using RNN LSTM)

Experiment overview

In this experiment we will use a character-based Recurrent Neural Network (RNN) to generate Wikipedia-like text based on the wikipedia TensorFlow dataset.

text_generation_wikipedia_rnn.png

Import dependencies

In [1]:
# Select TensorFlow version 2.x (the command is relevant for Colab only).
# %tensorflow_version 2.x
In [2]:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import numpy as np
import platform
import time
import pathlib
import os

print('Python version:', platform.python_version())
print('Tensorflow version:', tf.__version__)
print('Keras version:', tf.keras.__version__)
Python version: 3.7.6
Tensorflow version: 2.1.0
Keras version: 2.2.4-tf

Download the dataset

The Wikipedia dataset contains cleaned articles in all languages. The datasets are built from the Wikipedia dump, with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markdown and unwanted sections (references, etc.).

In [3]:
# List all available datasets to see how the wikipedia dataset is called.
tfds.list_builders()
Out[3]:
['abstract_reasoning',
 'aeslc',
 'aflw2k3d',
 'amazon_us_reviews',
 'arc',
 'bair_robot_pushing_small',
 'big_patent',
 'bigearthnet',
 'billsum',
 'binarized_mnist',
 'binary_alpha_digits',
 'c4',
 'caltech101',
 'caltech_birds2010',
 'caltech_birds2011',
 'cars196',
 'cassava',
 'cats_vs_dogs',
 'celeb_a',
 'celeb_a_hq',
 'chexpert',
 'cifar10',
 'cifar100',
 'cifar10_1',
 'cifar10_corrupted',
 'citrus_leaves',
 'cityscapes',
 'civil_comments',
 'clevr',
 'cmaterdb',
 'cnn_dailymail',
 'coco',
 'coil100',
 'colorectal_histology',
 'colorectal_histology_large',
 'cos_e',
 'curated_breast_imaging_ddsm',
 'cycle_gan',
 'deep_weeds',
 'definite_pronoun_resolution',
 'diabetic_retinopathy_detection',
 'dmlab',
 'downsampled_imagenet',
 'dsprites',
 'dtd',
 'duke_ultrasound',
 'dummy_dataset_shared_generator',
 'dummy_mnist',
 'emnist',
 'esnli',
 'eurosat',
 'fashion_mnist',
 'flic',
 'flores',
 'food101',
 'gap',
 'gigaword',
 'glue',
 'groove',
 'higgs',
 'horses_or_humans',
 'i_naturalist2017',
 'image_label_folder',
 'imagenet2012',
 'imagenet2012_corrupted',
 'imagenet_resized',
 'imagenette',
 'imdb_reviews',
 'iris',
 'kitti',
 'kmnist',
 'lfw',
 'lm1b',
 'lost_and_found',
 'lsun',
 'malaria',
 'math_dataset',
 'mnist',
 'mnist_corrupted',
 'movie_rationales',
 'moving_mnist',
 'multi_news',
 'multi_nli',
 'multi_nli_mismatch',
 'newsroom',
 'nsynth',
 'omniglot',
 'open_images_v4',
 'oxford_flowers102',
 'oxford_iiit_pet',
 'para_crawl',
 'patch_camelyon',
 'pet_finder',
 'places365_small',
 'plant_leaves',
 'plant_village',
 'plantae_k',
 'quickdraw_bitmap',
 'reddit_tifu',
 'resisc45',
 'rock_paper_scissors',
 'rock_you',
 'scan',
 'scene_parse150',
 'scicite',
 'scientific_papers',
 'shapes3d',
 'smallnorb',
 'snli',
 'so2sat',
 'squad',
 'stanford_dogs',
 'stanford_online_products',
 'starcraft_video',
 'sun397',
 'super_glue',
 'svhn_cropped',
 'ted_hrlr_translate',
 'ted_multi_translate',
 'tf_flowers',
 'the300w_lp',
 'titanic',
 'trivia_qa',
 'uc_merced',
 'ucf101',
 'vgg_face2',
 'visual_domain_decathlon',
 'voc',
 'wider_face',
 'wikihow',
 'wikipedia',
 'wmt14_translate',
 'wmt15_translate',
 'wmt16_translate',
 'wmt17_translate',
 'wmt18_translate',
 'wmt19_translate',
 'wmt_t2t_translate',
 'wmt_translate',
 'xnli',
 'xsum']

tfds.load is a convenience method that's the simplest way to build and load a tf.data.Dataset.

In [4]:
# Loading the wikipedia dataset.
DATASET_NAME = 'wikipedia/20190301.en'
# DATASET_NAME = 'wikipedia/20190301.uk'

dataset, dataset_info = tfds.load(
    name=DATASET_NAME,
    data_dir='tmp',
    with_info=True,
    split=tfds.Split.TRAIN,
)
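
Note that tfds.load also accepts split slices, which can be handy for quick experiments on a fraction of the articles. The sketch below is optional and assumes the installed TFDS version supports the split slicing syntax:

# Optional, hedged sketch: read only the first 1% of the train split.
small_dataset = tfds.load(
    name=DATASET_NAME,
    data_dir='tmp',
    split='train[:1%]',
)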
In [5]:
print(dataset_info)
tfds.core.DatasetInfo(
    name='wikipedia',
    version=1.0.0,
    description='Wikipedia dataset containing cleaned articles of all languages.
The datasets are built from the Wikipedia dump
(https://dumps.wikimedia.org/) with one split per language. Each example
contains the content of one full Wikipedia article with cleaning to strip
markdown and unwanted sections (references, etc.).
',
    homepage='https://dumps.wikimedia.org',
    features=FeaturesDict({
        'text': Text(shape=(), dtype=tf.string),
        'title': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=5824596,
    splits={
        'train': 5824596,
    },
    supervised_keys=None,
    citation="""@ONLINE {wikidump,
        author = "Wikimedia Foundation",
        title  = "Wikimedia Downloads",
        url    = "https://dumps.wikimedia.org"
    }""",
    redistribution_info=license: "This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.",
)

In [6]:
print(dataset)
<DatasetV1Adapter shapes: {text: (), title: ()}, types: {text: tf.string, title: tf.string}>

Analyze the dataset

In [7]:
TRAIN_NUM_EXAMPLES = dataset_info.splits['train'].num_examples
print('Total number of articles: ', TRAIN_NUM_EXAMPLES)
Total number of articles:  5824596
In [8]:
print('First article','\n======\n')
for example in dataset.take(1):
    print('Title:','\n------')
    print(example['title'].numpy().decode('utf-8'))
    print()

    print('Text:', '\n------')
    print(example['text'].numpy().decode('utf-8'))
First article 
======

Title: 
------
Joseph Greenberg

Text: 
------
Joseph Harold Greenberg (May 28, 1915 – May 7, 2001) was an American linguist, known mainly for his work concerning linguistic typology and the genetic classification of languages.

Life

Early life and education 
(Main source: Croft 2003)

Joseph Greenberg was born on May 28, 1915 to Jewish parents in Brooklyn, New York. His first great interest was music. At the age of 14, he gave a piano concert in Steinway Hall. He continued to play the piano frequently throughout his life.

After finishing high school, he decided to pursue a scholarly career rather than a musical one. He enrolled at Columbia University in New York. During his senior year, he attended a class taught by Franz Boas concerning American Indian languages. With references from Boas and Ruth Benedict, he was accepted as a graduate student by Melville J. Herskovits at Northwestern University in Chicago. During the course of his graduate studies, Greenberg did fieldwork among the Hausa people of Nigeria, where he learned the Hausa language. The subject of his doctoral dissertation was the influence of Islam on a Hausa group that, unlike most others, had not converted to it.

During 1940, he began postdoctoral studies at Yale University. These were interrupted by service in the U.S. Army Signal Corps during World War II, for which he worked as a codebreaker and participated with the landing at Casablanca. Before leaving for Europe during 1943, Greenberg married Selma Berkowitz, whom he had met during his first year at Columbia University.

Career
After the war, Greenberg taught at the University of Minnesota before returning to Columbia University during 1948 as a teacher of anthropology. While in New York, he became acquainted with Roman Jakobson and André Martinet. They introduced him to the Prague school of structuralism, which influenced his work.

During 1962, Greenberg relocated to the anthropology department of Stanford University in California, where he continued to work for the rest of his life. During 1965 Greenberg served as president of the African Studies Association. He received during 1996 the highest award for a scholar in Linguistics, the Gold Medal of Philology (http://insop.org/index.php?p=1_8_Ancient-Medal-Winners.).

Contributions to linguistics

Linguistic typology 

Greenberg's reputation rests partly on his contributions to synchronic linguistics and the quest to identify linguistic universals. During the late 1950s, Greenberg began to examine languages covering a wide geographic and genetic distribution. He located a number of interesting potential universals as well as many strong cross-linguistic tendencies.

In particular, Greenberg conceptualized the idea of "implicational universal", which has the form, "if a language has structure X, then it must also have structure Y." For example, X might be "mid front rounded vowels" and Y "high front rounded vowels" (for terminology see phonetics). Many scholars adopted this kind of research following Greenberg's example and it remains important in synchronic linguistics.

Like Noam Chomsky, Greenberg sought to discover the universal structures on which human language is based. Unlike Chomsky, Greenberg's method was functionalist, rather than formalist. An argument to reconcile the Greenbergian and Chomskyan methods can be found in Linguistic Universals (2006), edited by Ricardo Mairal and Juana Gil .

Many who are strongly opposed to Greenberg's methods of language classification (see below) acknowledge the importance of his typological work. During 1963 he published an article that was extremely influential: "Some universals of grammar with particular reference to the order of meaningful elements".

Mass comparison 

Greenberg rejected the opinion, prevalent among linguists since the mid-20th century, that comparative reconstruction was the only method to discover relationships between languages. He argued that genetic classification is methodologically prior to comparative reconstruction, or the first stage of it: you cannot engage in the comparative reconstruction of languages until you know which languages to compare (1957:44).

He also criticized the prevalent opinion that comprehensive comparisons of two languages at a time (which commonly take years to perform) could establish language families of any size. He argued that, even for 8 languages, there are already 4,140 ways to classify them into distinct families, while for 25 languages there are 4,749,027,089,305,918,018 ways (1957:44). For comparison, the Niger–Congo family is said to have some 1,500 languages. He thought language families of any size needed to be established by some scholastic means other than bilateral comparison. The theory of mass comparison is an attempt to demonstrate such means.

Greenberg argued for the virtues of breadth over depth. He advocated restricting the amount of material to be compared (to basic vocabulary, morphology, and known paths of sound change) and increasing the number of languages to be compared to all the languages in a given area. This would make it possible to compare numerous languages reliably. At the same time, the process would provide a check on accidental resemblances through the sheer number of languages under review. The mathematical probability that resemblances are accidental decreases strongly with the number of languages concerned (1957:39).

Greenberg used the premise that mass "borrowing" of basic vocabulary is unknown. He argued that borrowing, when it occurs, is concentrated in cultural vocabulary and clusters "in certain semantic areas", making it easy to detect (1957:39). With the goal of determining broad patterns of relationship, the idea was not to get every word right but to detect patterns. From the beginning with his theory of mass comparison, Greenberg addressed why chance resemblance and borrowing were not obstacles to its being useful. Despite that, critics consider those phenomena caused difficulties for his theory.

Greenberg first termed his method "mass comparison" in an article of 1954 (reprinted in Greenberg 1955). As of 1987, he replaced the term "mass comparison" with "multilateral comparison", to emphasize its contrast with the bilateral comparisons recommended by linguistics textbooks. He believed that multilateral comparison was not in any way opposed to the comparative method, but is, on the contrary, its necessary first step (Greenberg, 1957:44). According to him, comparative reconstruction should have the status of an explanatory theory for facts already established by language classification (Greenberg, 1957:45).

Most historical linguists (Campbell 2001:45) reject the use of mass comparison as a method for establishing genealogical relationships between languages. Among the most outspoken critics of mass comparison have been Lyle Campbell, Donald Ringe, William Poser, and the late R. Larry Trask.

Genetic classification of languages

The languages of Africa 

Greenberg is known widely for his development of a classification system for the languages of Africa, which he published as a series of articles in the Southwestern Journal of Anthropology from 1949 to 1954 (reprinted together as a book during 1955). He revised the book and published it again during 1963, followed by a nearly identical edition of 1966 (reprinted without change during 1970). A few more changes of the classification were made by Greenberg in an article during 1981.

Greenberg grouped the hundreds of African languages into four families, which he dubbed Afroasiatic, Nilo-Saharan, Niger–Congo, and Khoisan. During the course of his work, Greenberg invented the term "Afroasiatic" to replace the earlier term "Hamito-Semitic", after showing that the Hamitic group, accepted widely since the 19th century, is not a valid language family. Another major feature of his work was to establish the classification of the Bantu languages, which occupy much of sub-Saharan Africa, as a part of the Niger–Congo language family, rather than as an independent family as many Bantuists had maintained.

Greenberg's classification rested largely in evaluating competing earlier classifications. For a time, his classification was considered bold and speculative, especially the proposal of a Nilo-Saharan language family. Now, apart from Khoisan, it is generally accepted by African specialists and has been used as a basis for further work by other scholars.

Greenberg's work on African languages has been criticised by Lyle Campbell and Donald Ringe, who do not believe that his classification is justified by his data; they request a reexamination of his macro-phyla by "reliable methods" (Ringe 1993:104). Harold Fleming and Lionel Bender, who are sympathetic to Greenberg's classification, acknowledge that at least some of his macrofamilies (particularly Nilo-Saharan and Khoisan) are not accepted completely by most linguists and may need to be divided (Campbell 1997). Their objection is methodological: if mass comparison is not a valid method, it cannot be expected to have brought order successfully out of the confusion of African languages.

By contrast, some linguists have sought to combine Greenberg's four African families into larger units. In particular, Edgar Gregersen (1972) proposed joining Niger–Congo and Nilo-Saharan into a larger family, which he termed Kongo-Saharan. Roger Blench (1995) suggests Niger–Congo is a subfamily of Nilo-Saharan.

The languages of New Guinea, Tasmania, and the Andaman Islands

During 1971 Greenberg proposed the Indo-Pacific macrofamily, which groups together the Papuan languages (a large number of language families of New Guinea and nearby islands) with the native languages of the Andaman Islands and Tasmania but excludes the Australian Aboriginal languages. Its principal feature was to reduce the manifold language families of New Guinea to a single genetic unit. This excludes the Austronesian languages, which have been established as associated with a more recent migration of people.

Greenberg's subgrouping of these languages has not been accepted by the few specialists who have worked on the classification of these languages. However, the work of Stephen Wurm (1982) and Malcolm Ross (2005) has provided considerable evidence for his once-radical idea that these languages form a single genetic unit. Wurm stated that the lexical similarities between Great Andamanese and the West Papuan and Timor–Alor families "are quite striking and amount to virtual formal identity [...] in a number of instances." He believes this to be due to a linguistic substratum.

The languages of the Americas

Most linguists concerned with the native languages of the Americas classify them into 150 to 180 independent language families. Some believe that two language families, Eskimo–Aleut and Na-Dené, were distinct, perhaps the results of later migrations into the New World.

Early on, Greenberg (1957:41, 1960) became convinced that many of the language groups considered unrelated could be classified into larger groupings. In his 1987 book Language in the Americas, while agreeing that the Eskimo–Aleut and Na-Dené groupings as distinct, he proposed that all the other Native American languages belong to a single language macro-family, which he termed Amerind.

Language in the Americas has generated lively debate, but has been criticized strongly; it is rejected by most specialists of indigenous languages of the Americas and also by most historical linguists. Specialists of the individual language families have found extensive inaccuracies and errors in Greenberg's data, such as including data from non-existent languages, erroneous transcriptions of the forms compared, misinterpretations of the meanings of words used for comparison, and entirely spurious forms.

Historical linguists also reject the validity of the method of multilateral (or mass) comparison upon which the classification is based. They argue that he has not provided a convincing case that the similarities presented as evidence are due to inheritance from an earlier common ancestor rather than being explained by a combination of errors, accidental similarity, excessive semantic latitude in comparisons, borrowings, onomatopoeia, etc.

The languages of northern Eurasia 

Later in his life, Greenberg proposed that nearly all of the language families of northern Eurasia belong to a single higher-order family, which he termed Eurasiatic. The only exception was Yeniseian, which has been related to a wider Dené–Caucasian grouping, also including Sino-Tibetan.  During 2008 Edward Vajda related Yeniseian to the Na-Dené languages of North America as a Dené–Yeniseian family.

The Eurasiatic grouping resembles the older Nostratic groupings of Holger Pedersen and Vladislav Illich-Svitych by including Indo-European, Uralic, and Altaic. It differs by including Nivkh, Japonic, Korean, and Ainu (which the Nostraticists had excluded from comparison because they are single languages rather than language families) and in excluding Afroasiatic. At about this time, Russian Nostraticists, notably Sergei Starostin, constructed a revised version of Nostratic. It was slightly larger than Greenberg's grouping but it also excluded Afroasiatic.

Recently, a consensus has been emerging among proponents of the Nostratic hypothesis. Greenberg basically agreed with the Nostratic concept, though he stressed a deep internal division between its northern 'tier' (his Eurasiatic) and a southern 'tier' (principally Afroasiatic and Dravidian).

The American Nostraticist Allan Bomhard considers Eurasiatic a branch of Nostratic, alongside other branches: Afroasiatic, Elamo-Dravidian, and Kartvelian. Similarly, Georgiy Starostin (2002) arrives at a tripartite overall grouping: he considers Afroasiatic, Nostratic and Elamite to be roughly equidistant and more closely related to each other than to any other language family. Sergei Starostin's school has now included Afroasiatic in a broadly defined Nostratic. They reserve the term Eurasiatic to designate the narrower subgrouping, which comprises the rest of the macrofamily. Recent proposals thus differ mainly on the precise inclusion of Dravidian and Kartvelian.

Greenberg continued to work on this project after he was diagnosed with incurable pancreatic cancer and until he died during May 2001. His colleague and former student Merritt Ruhlen ensured the publication of the final volume of his Eurasiatic work (2002) after his death.

Selected works by Joseph H. Greenberg

Books 
  (Photo-offset reprint of the SJA articles with minor corrections.)
 
  (Heavily revised version of Greenberg 1955. From the same publisher: second, revised edition, 1966; third edition, 1970. All three editions simultaneously published at The Hague by Mouton & Co.)
  (Reprinted 1980 and, with a foreword by Martin Haspelmath, 2005.)

Books (editor) 
  (Second edition 1966.)

Articles, reviews, etc. 
 
 
 
 
 
 
 
 
 
 
 
 
 
  (Reprinted in Genetic Linguistics, 2005.)
 
  (In second edition of Universals of Language, 1966: pp. 73–113.)
 
 
  (Reprinted in Genetic Linguistics, 2005.)

Bibliography

Blench, Roger. 1995. "Is Niger–Congo simply a branch of Nilo-Saharan?" In Fifth Nilo-Saharan Linguistics Colloquium, Nice, 24–29 August 1992: Proceedings, edited by Robert Nicolaï and Franz Rottland. Cologne: Köppe Verlag, pp. 36–49.

Campbell, Lyle. 1997. American Indian Languages: The Historical Linguistics of Native America. New York: Oxford University Press. .
Campbell, Lyle. 2001. "Beyond the comparative method." In Historical Linguistics 2001: Selected Papers from the 15th International Conference on Historical Linguistics, Melbourne, 13–17 August 2001, edited by Barry J. Blake, Kate Burridge, and Jo Taylor.
Diamond, Jared. 1997. Guns, Germs and Steel: The Fates of Human Societies. New York: Norton. .

Mairal, Ricardo and Juana Gil. 2006. Linguistic Universals. Cambridge–NY: Cambridge University Press. .

Ross, Malcolm. 2005. "Pronouns as a preliminary diagnostic for grouping Papuan languages." In Papuan Pasts: Cultural, Linguistic and Biological Histories of Papuan-speaking Peoples, edited by Andrew Pawley, Robert Attenborough, Robin Hide, and Jack Golson. Canberra: Pacific Linguistics, pp. 15–66.
Wurm, Stephen A. 1982. The Papuan Languages of Oceania. Tübingen: Gunter Narr.

See also

Linguistic universal
Monogenesis (linguistics)
Nostratic languages

References

External links
Joseph Greenberg at work; a portrait of himself
"What we all spoke when the world was young" by Nicholas Wade, New York Times (February 1, 2000)
Obituary from Stanford Report
Memorial Resolution
"Joseph Harold Greenberg" by William Croft (2003) (also: )
"Complete bibliography of the publications of Joseph H. Greenberg" by William Croft (2003)

Category:1915 births
Category:2001 deaths
Category:Linguists from the United States
Category:American Africanists
Category:American Jews in the military
Category:Paleolinguists
Category:Columbia University faculty
Category:Stanford University Department of Anthropology faculty
Category:Fellows of the American Academy of Arts and Sciences
Category:Members of the United States National Academy of Sciences
Category:United States Army personnel
Category:American army personnel of World War II
Category:Guggenheim Fellows
Category:American expatriates in Nigeria
Category:People from Brooklyn
Category:Linguists of Na-Dene languages
Category:Linguists of Eskimo–Aleut languages
Category:Linguists of Hokan languages
Category:Jewish scientists
Category:Linguists of Papuan languages
Category:Linguists of Amerind languages
Category:Linguists of Andamanese languages
Category:Linguists of Tasmanian languages
Category:Linguists of Niger–Congo languages
Category:Linguists of Afroasiatic languages
Category:Linguistic Society of America presidents

Process the dataset

Flatten the dataset

We convert the dataset from a set of articles into a set of characters. We are only interested in the text of each article, so we drop the title along the way.

In [9]:
def article_to_text(text):
    return np.array([char for char in text.numpy().decode('utf-8')])

# Convert each dataset item from a dictionary ({'text', 'title'}) into an array of characters of its 'text' field.
dataset_text = dataset.map(
    lambda article: tf.py_function(func=article_to_text, inp=[article['text']], Tout=tf.string)
)

for text in dataset_text.take(2):
    print(text.numpy())
    print('\n')
[b'J' b'o' b's' ... b'n' b't' b's']


[b'P' b'a' b'u' ... b'e' b'r' b's']


In [10]:
# Unbatch the text dataset into a more granular char dataset.
# Now each dataset item is one character instead of a big piece of text.
dataset_chars = dataset_text.unbatch()

for char in dataset_chars.take(20):
    print(char.numpy().decode('utf-8'))
J
o
s
e
p
h
 
H
a
r
o
l
d
 
G
r
e
e
n
b

Generate the vocabulary

In [11]:
vocab = set()

# Ideally we would build the vocabulary from all dataset items;
# here we only use the first 1000 articles to keep things fast.
for text in dataset_text.take(1000):
    vocab.update([char.decode('utf-8') for char in text.numpy()])
    
vocab = sorted(vocab)

print('Unique characters: {}'.format(len(vocab)))
print('vocab:')
print(vocab)
Unique characters: 621
vocab:
['\t', '\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\xa0', '£', '§', '«', '®', '°', '±', '²', '·', '»', '¼', '½', '¿', 'Á', 'Å', 'Æ', 'Ç', 'É', 'Ë', 'Í', 'Î', 'Ó', 'Ö', '×', 'Ø', 'Ü', 'Þ', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ú', 'û', 'ü', 'ý', 'ā', 'ă', 'ą', 'Ć', 'ć', 'Č', 'č', 'đ', 'ė', 'ę', 'ě', 'ğ', 'ġ', 'Ħ', 'ī', 'İ', 'ı', 'ļ', 'Ł', 'ł', 'ń', 'ň', 'Ō', 'ō', 'ő', 'ř', 'Ś', 'ś', 'Ş', 'ş', 'Š', 'š', 'ţ', 'ū', 'ź', 'ż', 'Ž', 'ž', 'ơ', 'ư', 'ǔ', 'ș', 'ț', 'ɔ', 'ə', 'ɛ', 'ʷ', 'ʼ', 'ʿ', '˚', 'Ι', 'Π', 'α', 'β', 'ε', 'η', 'ι', 'κ', 'μ', 'ο', 'ρ', 'ς', 'τ', 'υ', 'χ', 'ψ', 'ό', 'Б', 'В', 'Д', 'Ж', 'З', 'И', 'К', 'Л', 'М', 'Н', 'О', 'П', 'С', 'У', 'Ф', 'Х', 'а', 'б', 'в', 'г', 'д', 'е', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'ю', 'я', 'і', 'ј', 'ћ', 'ּ', 'א', 'ב', 'ג', 'ו', 'ט', 'י', 'ך', 'ל', 'מ', 'נ', 'ס', 'ע', 'פ', 'ץ', 'צ', 'ק', 'ר', 'ת', 'װ', '،', 'أ', 'إ', 'ا', 'ب', 'ة', 'ت', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'س', 'ش', 'ص', 'ط', 'ع', 'ف', 'ق', 'ل', 'م', 'ن', 'ه', 'و', 'ي', 'پ', 'ک', 'அ', 'ஆ', 'இ', 'க', 'ச', 'ட', 'ண', 'த', 'ந', 'ன', 'ப', 'ம', 'ர', 'ற', 'ல', 'ள', 'ழ', 'வ', 'ா', 'ி', 'ு', 'ே', 'ை', 'ொ', '்', 'ಡ', 'ಮ', 'ರ', 'ಲ', 'ಸ', 'ಿ', 'ก', 'ข', 'ค', 'ง', 'จ', 'ช', 'ฒ', 'ณ', 'ด', 'ต', 'ท', 'ธ', 'น', 'บ', 'ป', 'ผ', 'พ', 'ภ', 'ม', 'ย', 'ร', 'ฤ', 'ล', 'ว', 'ศ', 'ส', 'ห', 'อ', 'ะ', 'ั', 'า', 'ำ', 'ิ', 'ี', 'ื', 'ุ', 'ู', 'เ', 'แ', 'โ', 'ไ', '่', '้', '์', 'ზ', 'უ', 'ფ', 'ḩ', 'ḻ', 'ṟ', 'ṣ', 'ṭ', 'ạ', 'ễ', 'ệ', 'ὶ', '\u200e', '–', '—', '‘', '’', '“', '”', '†', '‡', '•', '…', '′', '″', '₨', '€', '₱', '℃', 'ℓ', '№', '™', '→', '−', '≠', '♠', '♦', '➔', '\u3000', 'り', 'ア', 'イ', 'ウ', 'カ', 'ク', 'コ', 'シ', 'ズ', 'ゼ', 'ソ', 'タ', 'チ', 'ッ', 'ツ', 'パ', 'ボ', 'マ', 'ム', 'ャ', 'ョ', 'リ', 'レ', 'ン', 'ー', '一', '三', '上', '下', '东', '中', '主', '义', '九', '乡', '事', '介', '伐', '会', '依', '俳', '僧', '光', '児', '全', '六', '兴', '其', '典', '况', '前', '加', '勇', '務', '化', '华', '単', '卡', '厂', '双', '句', '史', '号', '吉', '吹', '哈', '哪', '国', '國', '地', '坝', '坪', '堡', '大', '姚', '子', '孤', '安', '宗', '家', '寄', '寨', '射', '局', '屋', '岩', '島', '左', '巴', '師', '店', '庙', '延', '建', '得', '心', '怡', '排', '探', '教', '数', '文', '斌', '新', '方', '日', '昌', '明', '春', '昭', '普', '會', '本', '李', '村', '板', '桦', '概', '民', '水', '江', '法', '泥', '泽', '湾', '溪', '潭', '澤', '濟', '無', '燈', '界', '皮', '石', '砲', '磨', '禪', '站', '维', '置', '義', '羹', '育', '胜', '臨', '花', '茜', '莲', '華', '董', '薦', '薩', '虚', '街', '装', '覺', '解', '訪', '語', '话', '语', '赤', '転', '軽', '辞', '農', '达', '逆', '通', '造', '連', '道', '那', '鄉', '里', '野', '録', '镇', '长', '門', '陈', '食', '馬', '马', '鹿', '黄', '김', '준', '태', 'fl', ')', '|']

Vectorize the text

Before feeding the text to our RNN we need to convert it from a sequence of characters to a sequence of numbers. To do so we detect all unique characters in the text, form a vocabulary out of them, and replace each character with its index in the vocabulary.

In [12]:
# Map characters to their indices in vocabulary.
char2index = {char: index for index, char in enumerate(vocab)}

print('{')
for char, _ in zip(char2index, range(30)):
    print('  {:4s}: {:3d},'.format(repr(char), char2index[char]))
print('  ...\n}')
{
  '\t':   0,
  '\n':   1,
  ' ' :   2,
  '!' :   3,
  '"' :   4,
  '#' :   5,
  '$' :   6,
  '%' :   7,
  '&' :   8,
  "'" :   9,
  '(' :  10,
  ')' :  11,
  '*' :  12,
  '+' :  13,
  ',' :  14,
  '-' :  15,
  '.' :  16,
  '/' :  17,
  '0' :  18,
  '1' :  19,
  '2' :  20,
  '3' :  21,
  '4' :  22,
  '5' :  23,
  '6' :  24,
  '7' :  25,
  '8' :  26,
  '9' :  27,
  ':' :  28,
  ';' :  29,
  ...
}
In [13]:
# Map character indices to characters from the vocabulary.
index2char = np.array(vocab)

print(index2char)
['\t' '\n' ' ' '!' '"' '#' '$' '%' '&' "'" '(' ')' '*' '+' ',' '-' '.' '/'
 '0' '1' '2' '3' '4' '5' '6' '7' '8' '9' ':' ';' '<' '=' '>' '?' '@' 'A'
 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S'
 'T' 'U' 'V' 'W' 'X' 'Y' 'Z' '[' ']' '^' '_' '`' 'a' 'b' 'c' 'd' 'e' 'f'
 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x'
 'y' 'z' '{' '|' '}' '~' '\xa0' '£' '§' '«' '®' '°' '±' '²' '·' '»' '¼'
 '½' '¿' 'Á' 'Å' 'Æ' 'Ç' 'É' 'Ë' 'Í' 'Î' 'Ó' 'Ö' '×' 'Ø' 'Ü' 'Þ' 'ß' 'à'
 'á' 'â' 'ã' 'ä' 'å' 'æ' 'ç' 'è' 'é' 'ê' 'ë' 'ì' 'í' 'î' 'ï' 'ñ' 'ò' 'ó'
 'ô' 'õ' 'ö' 'ø' 'ú' 'û' 'ü' 'ý' 'ā' 'ă' 'ą' 'Ć' 'ć' 'Č' 'č' 'đ' 'ė' 'ę'
 'ě' 'ğ' 'ġ' 'Ħ' 'ī' 'İ' 'ı' 'ļ' 'Ł' 'ł' 'ń' 'ň' 'Ō' 'ō' 'ő' 'ř' 'Ś' 'ś'
 'Ş' 'ş' 'Š' 'š' 'ţ' 'ū' 'ź' 'ż' 'Ž' 'ž' 'ơ' 'ư' 'ǔ' 'ș' 'ț' 'ɔ' 'ə' 'ɛ'
 'ʷ' 'ʼ' 'ʿ' '˚' 'Ι' 'Π' 'α' 'β' 'ε' 'η' 'ι' 'κ' 'μ' 'ο' 'ρ' 'ς' 'τ' 'υ'
 'χ' 'ψ' 'ό' 'Б' 'В' 'Д' 'Ж' 'З' 'И' 'К' 'Л' 'М' 'Н' 'О' 'П' 'С' 'У' 'Ф'
 'Х' 'а' 'б' 'в' 'г' 'д' 'е' 'з' 'и' 'й' 'к' 'л' 'м' 'н' 'о' 'п' 'р' 'с'
 'т' 'у' 'ф' 'х' 'ц' 'ч' 'ш' 'щ' 'ъ' 'ы' 'ь' 'ю' 'я' 'і' 'ј' 'ћ' 'ּ' 'א'
 'ב' 'ג' 'ו' 'ט' 'י' 'ך' 'ל' 'מ' 'נ' 'ס' 'ע' 'פ' 'ץ' 'צ' 'ק' 'ר' 'ת' 'װ'
 '،' 'أ' 'إ' 'ا' 'ب' 'ة' 'ت' 'ج' 'ح' 'خ' 'د' 'ذ' 'ر' 'س' 'ش' 'ص' 'ط' 'ع'
 'ف' 'ق' 'ل' 'م' 'ن' 'ه' 'و' 'ي' 'پ' 'ک' 'அ' 'ஆ' 'இ' 'க' 'ச' 'ட' 'ண' 'த'
 'ந' 'ன' 'ப' 'ம' 'ர' 'ற' 'ல' 'ள' 'ழ' 'வ' 'ா' 'ி' 'ு' 'ே' 'ை' 'ொ' '்' 'ಡ'
 'ಮ' 'ರ' 'ಲ' 'ಸ' 'ಿ' 'ก' 'ข' 'ค' 'ง' 'จ' 'ช' 'ฒ' 'ณ' 'ด' 'ต' 'ท' 'ธ' 'น'
 'บ' 'ป' 'ผ' 'พ' 'ภ' 'ม' 'ย' 'ร' 'ฤ' 'ล' 'ว' 'ศ' 'ส' 'ห' 'อ' 'ะ' 'ั' 'า'
 'ำ' 'ิ' 'ี' 'ื' 'ุ' 'ู' 'เ' 'แ' 'โ' 'ไ' '่' '้' '์' 'ზ' 'უ' 'ფ' 'ḩ' 'ḻ'
 'ṟ' 'ṣ' 'ṭ' 'ạ' 'ễ' 'ệ' 'ὶ' '\u200e' '–' '—' '‘' '’' '“' '”' '†' '‡' '•'
 '…' '′' '″' '₨' '€' '₱' '℃' 'ℓ' '№' '™' '→' '−' '≠' '♠' '♦' '➔' '\u3000'
 'り' 'ア' 'イ' 'ウ' 'カ' 'ク' 'コ' 'シ' 'ズ' 'ゼ' 'ソ' 'タ' 'チ' 'ッ' 'ツ' 'パ' 'ボ' 'マ'
 'ム' 'ャ' 'ョ' 'リ' 'レ' 'ン' 'ー' '一' '三' '上' '下' '东' '中' '主' '义' '九' '乡' '事'
 '介' '伐' '会' '依' '俳' '僧' '光' '児' '全' '六' '兴' '其' '典' '况' '前' '加' '勇' '務'
 '化' '华' '単' '卡' '厂' '双' '句' '史' '号' '吉' '吹' '哈' '哪' '国' '國' '地' '坝' '坪'
 '堡' '大' '姚' '子' '孤' '安' '宗' '家' '寄' '寨' '射' '局' '屋' '岩' '島' '左' '巴' '師'
 '店' '庙' '延' '建' '得' '心' '怡' '排' '探' '教' '数' '文' '斌' '新' '方' '日' '昌' '明'
 '春' '昭' '普' '會' '本' '李' '村' '板' '桦' '概' '民' '水' '江' '法' '泥' '泽' '湾' '溪'
 '潭' '澤' '濟' '無' '燈' '界' '皮' '石' '砲' '磨' '禪' '站' '维' '置' '義' '羹' '育' '胜'
 '臨' '花' '茜' '莲' '華' '董' '薦' '薩' '虚' '街' '装' '覺' '解' '訪' '語' '话' '语' '赤'
 '転' '軽' '辞' '農' '达' '逆' '通' '造' '連' '道' '那' '鄉' '里' '野' '録' '镇' '长' '門'
 '陈' '食' '馬' '马' '鹿' '黄' '김' '준' '태' 'fl' ')' '|']
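
For intuition, the two mappings above give us a simple round trip between characters and indices (a minimal sketch):

# Encode a string into vocabulary indices and decode it back (sketch).
sample = 'Joseph'
encoded = np.array([char2index[char] for char in sample])
decoded = ''.join(index2char[encoded])
print(encoded)  # [44 80 84 70 81 73] for the vocabulary built above
print(decoded)  # Joseph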
In [14]:
def char_to_index(char):
    char_symbol = char.numpy().decode('utf-8')
    char_index = char2index[char_symbol] if char_symbol in char2index else char2index['?']
    return char_index

dataset_chars_indexed = dataset_chars.map(
    lambda char: tf.py_function(func=char_to_index, inp=[char], Tout=tf.int32)
)

print('ORIGINAL CHARS:', '\n---')
for char in dataset_chars.take(10):
    print(char.numpy().decode())

print('\n\n')    
    
print('INDEXED CHARS:', '\n---')
for char_index in dataset_chars_indexed.take(20):
    print(char_index.numpy())
ORIGINAL CHARS: 
---
J
o
s
e
p
h
 
H
a
r



INDEXED CHARS: 
---
44
80
84
70
81
73
2
42
66
83
80
77
69
2
41
83
70
70
79
67

Create training sequences

In [15]:
# The maximum length (in characters) of a single input sequence.
sequence_length = 200
In [16]:
# Generate batched sequences out of the char_dataset.
sequences = dataset_chars_indexed.batch(sequence_length + 1, drop_remainder=True)

# Sequences examples.
for item in sequences.take(10):
    print(repr(''.join(index2char[item.numpy()])))
    print()
'Joseph Harold Greenberg (May 28, 1915 – May 7, 2001) was an American linguist, known mainly for his work concerning linguistic typology and the genetic classification of languages.\n\nLife\n\nEarly life an'

'd education \n(Main source: Croft 2003)\n\nJoseph Greenberg was born on May 28, 1915 to Jewish parents in Brooklyn, New York. His first great interest was music. At the age of 14, he gave a piano concert '

'in Steinway Hall. He continued to play the piano frequently throughout his life.\n\nAfter finishing high school, he decided to pursue a scholarly career rather than a musical one. He enrolled at Columbia'

' University in New York. During his senior year, he attended a class taught by Franz Boas concerning American Indian languages. With references from Boas and Ruth Benedict, he was accepted as a graduat'

'e student by Melville J. Herskovits at Northwestern University in Chicago. During the course of his graduate studies, Greenberg did fieldwork among the Hausa people of Nigeria, where he learned the Hau'

'sa language. The subject of his doctoral dissertation was the influence of Islam on a Hausa group that, unlike most others, had not converted to it.\n\nDuring 1940, he began postdoctoral studies at Yale '

'University. These were interrupted by service in the U.S. Army Signal Corps during World War II, for which he worked as a codebreaker and participated with the landing at Casablanca. Before leaving for'

' Europe during 1943, Greenberg married Selma Berkowitz, whom he had met during his first year at Columbia University.\n\nCareer\nAfter the war, Greenberg taught at the University of Minnesota before retur'

'ning to Columbia University during 1948 as a teacher of anthropology. While in New York, he became acquainted with Roman Jakobson and André Martinet. They introduced him to the Prague school of structu'

'ralism, which influenced his work.\n\nDuring 1962, Greenberg relocated to the anthropology department of Stanford University in California, where he continued to work for the rest of his life. During 196'

In [17]:
# sequences shape:
# - Each sequence has length 201 (sequence_length + 1)
#
#    201     201          201
# [(.....) (.....) ...  (.....)]

For each sequence, duplicate it and shift it by one character to form the input and target text. For example, say sequence_length is 4 and our text is "Hello". The input sequence would then be "Hell", and the target sequence "ello".
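
A minimal sketch of this shift on plain Python lists:

# Duplicate-and-shift illustration for the word 'Hello'.
chunk = list('Hello')
input_text, target_text = chunk[:-1], chunk[1:]
print(''.join(input_text))   # Hell
print(''.join(target_text))  # ello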

In [18]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
In [19]:
dataset_sequences = sequences.map(split_input_target)
In [20]:
for input_example, target_example in dataset_sequences.take(1):
    print('Input sequence size:', repr(len(input_example.numpy())))
    print('Target sequence size:', repr(len(target_example.numpy())))
    print()
    print('Input:\n', repr(''.join(index2char[input_example.numpy()])))
    print()
    print('Target:\n', repr(''.join(index2char[target_example.numpy()])))
Input sequence size: 200
Target sequence size: 200

Input:
 'Joseph Harold Greenberg (May 28, 1915 – May 7, 2001) was an American linguist, known mainly for his work concerning linguistic typology and the genetic classification of languages.\n\nLife\n\nEarly life a'

Target:
 'oseph Harold Greenberg (May 28, 1915 – May 7, 2001) was an American linguist, known mainly for his work concerning linguistic typology and the genetic classification of languages.\n\nLife\n\nEarly life an'
In [21]:
# dataset shape:
# - Each sequence is a tuple of 2 sub-sequences of length 200 (input_text and target_text)
#
#    200       200           200
# /(.....)\ /(.....)\ ... /(.....)\  <-- input_text
# \(.....)/ \(.....)/     \(.....)/  <-- target_text

Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for "J" and tries to predict the index for "o" as the next character. At the next time step, it does the same thing, but the RNN considers the previous step context in addition to the current input character.

In [22]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print('Step #{:1d}'.format(i))
    print('  input: {} ({:s})'.format(input_idx, repr(index2char[input_idx])))
    print('  expected output: {} ({:s})'.format(target_idx, repr(index2char[target_idx])))
    print()
Step #0
  input: 44 ('J')
  expected output: 80 ('o')

Step #1
  input: 80 ('o')
  expected output: 84 ('s')

Step #2
  input: 84 ('s')
  expected output: 70 ('e')

Step #3
  input: 70 ('e')
  expected output: 81 ('p')

Step #4
  input: 81 ('p')
  expected output: 73 ('h')

Split training sequences into batches

We used tf.data to split the text into manageable sequences. But before feeding this data into the model, we need to shuffle the data and pack it into batches.
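
As a toy illustration of how a shuffle buffer and batching behave together (a sketch on a small range dataset, not on our Wikipedia data):

# With a small buffer, elements are only shuffled within a sliding window.
toy = tf.data.Dataset.range(10).shuffle(buffer_size=3).batch(4, drop_remainder=True)
for batch in toy:
    print(batch.numpy())  # two batches of 4; the remaining 2 elements are dropped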

In [23]:
# Batch size.
BATCH_SIZE = 64

# Buffer size to shuffle the dataset (TF data is designed to work
# with possibly infinite sequences, so it doesn't attempt to shuffle
# the entire sequence in memory. Instead, it maintains a buffer in
# which it shuffles elements).
BUFFER_SIZE = 100

# How many items to prefetch before the next iteration.
PREFETCH_SIZE = 10

dataset_sequence_batches = dataset_sequences \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE, drop_remainder=True) \
    .prefetch(PREFETCH_SIZE)

dataset_sequence_batches
Out[23]:
<DatasetV1Adapter shapes: (<unknown>, <unknown>), types: (tf.int32, tf.int32)>
In [24]:
for input_text, target_text in dataset_sequence_batches.take(1):
    print('1st batch: input_text:', input_text)
    print()
    print('1st batch: target_text:', target_text)
1st batch: input_text: tf.Tensor(
[[84 80 79 ... 72 14  2]
 [84 66  2 ... 66 77 70]
 [69  2 70 ... 70 83 85]
 ...
 [84 71 70 ... 85 74 80]
 [ 1  1 39 ...  1  1 49]
 [83 66 79 ... 14  2 66]], shape=(64, 200), dtype=int32)

1st batch: target_text: tf.Tensor(
[[80 79 84 ... 14  2 19]
 [66  2 77 ... 77 70  2]
 [ 2 70 69 ... 83 85  2]
 ...
 [71 70 83 ... 74 80 79]
 [ 1 39 74 ...  1 49 86]
 [66 79 76 ...  2 66 84]], shape=(64, 200), dtype=int32)
In [25]:
# dataset shape:
# - 64 sequences per batch
# - Each sequence is a tuple of 2 sub-sequences of length 200 (input_text and target_text)
#
#
#     200       200           200             200       200           200
# |/(.....)\ /(.....)\ ... /(.....)\| ... |/(.....)\ /(.....)\ ... /(.....)\|  <-- input_text
# |\(.....)/ \(.....)/     \(.....)/| ... |\(.....)/ \(.....)/     \(.....)/|  <-- target_text
#
# <------------- 64 ---------------->     <------------- 64 ---------------->

Build the model

Use tf.keras.Sequential to define the model. For this simple example three layers are used to define our model:

tf.keras.layers.Embedding: the input layer, a trainable lookup table that maps each character index to a vector with embedding_dim dimensions;
tf.keras.layers.LSTM: the RNN itself, with rnn_units units;
tf.keras.layers.Dense: the output layer, with vocab_size outputs (one logit per character in the vocabulary).

In [26]:
# Let's take a quick detour and see how the Embedding layer works.
# It takes a batch of character-index sequences as an input.
# It encodes every character of every sequence into a vector of length tmp_embedding_size.
tmp_vocab_size = 10
tmp_embedding_size = 5
tmp_input_length = 8
tmp_batch_size = 2

tmp_model = tf.keras.models.Sequential()
tmp_model.add(tf.keras.layers.Embedding(
  input_dim=tmp_vocab_size,
  output_dim=tmp_embedding_size,
  input_length=tmp_input_length
))
# The model takes as input an integer matrix of size (batch, input_length).
# The largest integer (i.e. character index) in the input should be no larger than 9 (tmp_vocab_size - 1).
# Now tmp_model.output_shape == (None, 8, 5), where None is the batch dimension.
tmp_input_array = np.random.randint(
  low=0,
  high=tmp_vocab_size,
  size=(tmp_batch_size, tmp_input_length)
)
tmp_model.compile('rmsprop', 'mse')
tmp_output_array = tmp_model.predict(tmp_input_array)

print('tmp_input_array shape:', tmp_input_array.shape)
print('tmp_input_array:')
print(tmp_input_array)
print()
print('tmp_output_array shape:', tmp_output_array.shape)
print('tmp_output_array:')
print(tmp_output_array)
tmp_input_array shape: (2, 8)
tmp_input_array:
[[5 8 2 5 2 8 0 9]
 [7 3 6 4 5 1 6 2]]

tmp_output_array shape: (2, 8, 5)
tmp_output_array:
[[[-0.04438466 -0.0477155  -0.00650557  0.00578437 -0.02522211]
  [-0.00249968  0.00477722  0.00990368 -0.02025222  0.008913  ]
  [-0.04758141 -0.02608242  0.03385669  0.0057972  -0.00750101]
  [-0.04438466 -0.0477155  -0.00650557  0.00578437 -0.02522211]
  [-0.04758141 -0.02608242  0.03385669  0.0057972  -0.00750101]
  [-0.00249968  0.00477722  0.00990368 -0.02025222  0.008913  ]
  [-0.00630327 -0.04054337  0.04020781 -0.04592142  0.00964195]
  [-0.03995473 -0.01548551 -0.00048856 -0.02285538 -0.02910516]]

 [[ 0.03945759  0.04857247 -0.00674983 -0.00098724  0.0119729 ]
  [ 0.04236538  0.01068302 -0.04940217 -0.04754317  0.03218012]
  [ 0.01865717 -0.02711406  0.04124454  0.02071499  0.03038916]
  [ 0.00376308  0.00901873 -0.00290323  0.04185314  0.02587625]
  [-0.04438466 -0.0477155  -0.00650557  0.00578437 -0.02522211]
  [ 0.00597124 -0.00562616  0.03060223  0.04036254  0.02135176]
  [ 0.01865717 -0.02711406  0.04124454  0.02071499  0.03038916]
  [-0.04758141 -0.02608242  0.03385669  0.0057972  -0.00750101]]]
In [27]:
# Length of the vocabulary in chars.
vocab_size = len(vocab)

# The embedding dimension.
embedding_dim = 256

# Number of RNN units.
rnn_units = 1024
In [28]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.models.Sequential()

    model.add(tf.keras.layers.Embedding(
      input_dim=vocab_size,
      output_dim=embedding_dim,
      batch_input_shape=[batch_size, None]
    ))

    model.add(tf.keras.layers.LSTM(
      units=rnn_units,
      return_sequences=True,
      stateful=True,
      recurrent_initializer=tf.keras.initializers.GlorotNormal()
    ))

    model.add(tf.keras.layers.Dense(vocab_size))
  
    return model
In [29]:
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
In [30]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (64, None, 256)           158976    
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (64, None, 621)           636525    
=================================================================
Total params: 6,042,477
Trainable params: 6,042,477
Non-trainable params: 0
_________________________________________________________________
In [31]:
tf.keras.utils.plot_model(
    model,
    show_shapes=True,
    show_layer_names=True,
)
Out[31]:

For each character the model looks up the embedding, runs the LSTM one time step with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:

Model architecture

Image source: Text generation with an RNN notebook.
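
In terms of tensor shapes, one training batch flows through the model as follows (numbers taken from the summary above):

# character indices:  (64, 200)        # (batch_size, sequence_length)
# after Embedding:    (64, 200, 256)   # embedding_dim
# after LSTM:         (64, 200, 1024)  # rnn_units
# after Dense:        (64, 200, 621)   # one logit per character in the vocabulary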

Try the model

In [32]:
for input_example_batch, target_example_batch in dataset_sequence_batches.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")
(64, 200, 621) # (batch_size, sequence_length, vocab_size)

To get actual predictions from the model we need to sample from the output distribution to obtain concrete character indices. This distribution is defined by the logits over the character vocabulary. Note that we sample from the distribution rather than taking its argmax, because always picking the most likely character can easily get the model stuck in a loop.
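
A minimal sketch contrasting the greedy argmax choice with a random sample for the first character of the first sequence:

# Compare the single most likely character with a sampled character (sketch).
first_char_logits = example_batch_predictions[0, :1]   # shape: (1, vocab_size)
greedy_index = tf.argmax(first_char_logits[0]).numpy()
sampled_index = tf.random.categorical(first_char_logits, num_samples=1)[0, 0].numpy()
print('greedy: ', repr(index2char[greedy_index]))
print('sampled:', repr(index2char[sampled_index]))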

In [33]:
print('Prediction for the 1st letter of the 1st sequence of the batch:')
print(example_batch_predictions[0, 0])
Prediction for the 1st letter of the 1st sequence of the batch:
tf.Tensor(
[-2.96991039e-03  2.02196068e-04  5.34047745e-03 -2.94846855e-03
 -3.64167639e-03 -2.63241702e-04 -8.80502281e-04  7.99844624e-04
  5.26232133e-03 -1.23821688e-03  2.34868471e-03 -1.18148176e-03
 -8.88869807e-04  5.29895071e-04 -2.79451575e-04  2.37091444e-06
  1.92230579e-03 -1.01283449e-03 -1.99038046e-03 -5.02873491e-03
 -4.71714232e-03 -1.69062556e-03  3.95582151e-03  5.13770268e-04
 -2.62290076e-03  1.52810244e-03 -3.62532330e-03  4.52159438e-04
  4.21410240e-03 -4.20481712e-03  2.28100177e-03 -1.53552578e-03
 -4.49910574e-03  2.41562491e-03 -2.41917069e-03  4.41195024e-03
 -2.03300873e-03 -3.13091557e-03  2.15532375e-03 -1.44562731e-03
 -8.98959697e-05 -2.03171140e-03  2.83075799e-03  3.00962362e-03
  4.39736526e-03 -2.73631304e-04 -6.91650144e-04  1.73515151e-03
  3.01030884e-03 -9.34766373e-04  1.50964432e-03  3.39096365e-03
  3.16297705e-03  2.44869967e-04 -1.08598219e-03 -7.28239818e-03
 -2.05614767e-03  1.98792620e-03 -3.93527001e-03  3.37290025e-04
  1.95513829e-03  2.91557517e-03  1.84102287e-03  4.72247927e-03
  1.93154044e-03  7.20826583e-03 -8.27651296e-04  4.75527486e-03
  1.38566690e-03  3.24708899e-03 -1.81622850e-03  1.36225706e-03
 -1.36198709e-04 -2.26765592e-03  2.54186074e-04 -2.58616405e-03
 -4.64894576e-03  3.89528414e-03  3.25334468e-03  5.27988188e-04
 -2.64238310e-03 -4.89778118e-03  3.21954163e-03  1.11659628e-03
 -3.06684617e-03  2.00703996e-03 -2.21931376e-03  2.23845663e-03
  3.28445598e-03 -2.56485119e-03 -4.31890320e-03  2.41565611e-03
 -5.15983673e-04  3.64185032e-03  1.48760225e-03  1.20713573e-03
  5.29383449e-03 -3.79023957e-04  2.40978692e-03  2.31299014e-03
 -4.91128303e-05  3.96255311e-03 -4.22925572e-04 -3.23428866e-03
 -2.07351381e-03  2.73143407e-03  1.88296579e-03 -3.68654262e-03
  1.27331994e-03 -1.24076579e-03  6.32031867e-03 -2.48864666e-03
 -1.23163743e-03 -2.19777529e-03 -6.27300469e-04 -3.31383711e-03
 -1.13901915e-05 -2.82761734e-03 -3.86833388e-04 -1.73249340e-03
 -1.05973426e-03  5.09004341e-03  3.12257023e-03  1.97673100e-03
 -4.14281525e-03  2.05062283e-03  3.34373908e-06 -1.60101010e-03
  4.21956182e-03 -4.85191122e-03  2.88532046e-03 -4.75326553e-04
 -2.57715257e-03  3.55798984e-03  2.37036962e-03  3.57332220e-03
  2.17431062e-03 -1.53979345e-04 -4.33850288e-03  1.87864178e-03
 -1.86979759e-03 -1.36150082e-03  5.56515646e-04  1.42047508e-03
  2.74778227e-03  3.35774850e-04 -2.53880257e-03  2.82756845e-03
  2.44993693e-03 -3.19341733e-03  3.78998322e-03 -3.06723465e-04
  1.50572485e-03 -9.95766837e-04  4.31416184e-03 -8.64217116e-04
  1.43683166e-03  4.11726162e-03 -1.25944172e-03  4.96372813e-03
  5.60203707e-03 -9.61561105e-04 -8.50761193e-04 -1.65662495e-03
  1.30175764e-03 -4.40109987e-03 -1.54017517e-03  1.31750060e-03
  4.35700268e-03  6.44978718e-04  3.13272886e-03 -4.42359271e-03
 -3.94872529e-03 -3.35123227e-03  4.19887435e-03  4.43855161e-03
  3.22591141e-03  1.56010385e-04 -5.99237485e-03  1.03176723e-03
 -2.21929047e-03  3.54293408e-03 -1.23872038e-03  3.38202249e-03
 -1.98809942e-03  6.94049429e-03 -1.06861454e-03 -7.82860327e-04
  5.16313594e-04  2.36987532e-03 -2.89311429e-05 -3.95814423e-04
  4.56190342e-03 -1.23822247e-03  6.24573417e-03 -2.55804625e-03
 -1.04940555e-04  1.33907038e-03 -1.14532444e-03  2.81685661e-03
  5.43777831e-04  4.64606658e-03  3.45542142e-03  3.82768194e-04
  2.02233321e-03  1.85965083e-03 -4.97636292e-03  2.09314027e-03
 -5.81746921e-04 -1.73986098e-03  5.29615209e-05 -3.48408706e-04
  6.21711044e-03 -3.84812290e-03 -1.18742208e-03  5.74179983e-04
  9.04284534e-04 -1.36926793e-03 -1.08293397e-03  2.86536291e-04
  5.69129945e-04 -1.80946407e-03 -5.48771955e-03  2.47454806e-03
 -2.21755286e-03  1.62935501e-03  1.63916254e-03 -3.32597556e-04
 -1.44371181e-03  9.67276050e-04 -1.20116572e-03  1.19435287e-03
 -1.70178805e-03 -3.87001270e-03 -1.19349419e-03 -1.01136998e-03
  1.57997449e-04  1.95235421e-03 -1.90228014e-03 -3.79354716e-03
 -2.59016687e-03 -6.01103529e-05  1.70413696e-03  2.56037409e-03
  1.15031702e-03  2.00827932e-03 -2.61775893e-03 -2.09540525e-03
  2.69065960e-04 -3.63526307e-03  1.10269035e-03 -2.46130163e-03
  1.82558456e-03 -5.70906699e-03  1.40825636e-03 -5.47343120e-03
  1.49906403e-03 -1.50285801e-03  2.73680664e-04  1.48007460e-03
  3.41087091e-03 -6.13138499e-03 -1.80945778e-03  2.28763092e-03
  3.44680506e-03  2.71166908e-04  8.62840854e-04 -7.95917818e-04
 -1.36045180e-03  4.52237437e-04  7.40970951e-04 -3.14979139e-03
 -2.67863320e-03 -2.18148879e-03 -7.88994017e-04 -2.33359216e-03
  3.23024672e-03  4.61527100e-03  3.90101760e-03  1.92098215e-03
 -3.53164040e-04 -6.38937391e-03 -3.20535246e-03 -1.51562714e-03
  2.58783600e-03 -8.90760566e-05 -7.00501027e-04  4.46120463e-03
 -4.00476996e-03  3.15386383e-03 -3.74853914e-03  3.54121235e-04
 -3.16394540e-03 -1.55988208e-03 -2.88777007e-03 -1.30451447e-03
  1.11281837e-03 -2.29158439e-04 -4.69365343e-03 -1.79625954e-03
  4.65409551e-03  1.98075897e-04 -2.81246053e-03  1.92641513e-03
 -4.19606362e-03 -8.53825244e-04  1.13668293e-03  1.53516233e-03
  4.45213215e-03 -9.88768414e-04  3.31520336e-04 -3.58278304e-03
  3.44983581e-03  3.47022805e-03 -8.69183568e-04  8.29555676e-04
 -8.56739935e-05 -3.29165137e-04  2.72252737e-03 -2.37217173e-04
 -1.35505933e-03 -5.20083867e-03 -3.32186790e-03  5.31172240e-03
 -2.35710200e-03  5.52401273e-03  3.72211391e-04  5.33639104e-04
 -2.43864069e-03 -3.84261506e-03  1.45898527e-03  2.39854585e-03
 -4.71283449e-04  7.64812808e-04  1.33032037e-04 -1.38897519e-03
 -1.70430692e-03  9.96899325e-04  3.54808872e-05  2.56171334e-03
  1.05264317e-03 -2.82326015e-04  1.81646517e-03 -2.33595213e-03
 -5.03519783e-03  1.10140303e-03 -2.84432247e-03 -2.00615404e-03
  8.41872534e-04  6.67849928e-03 -5.83600253e-03 -4.24106652e-03
  1.14184630e-03  2.95217102e-03  8.34879233e-04 -3.98391346e-03
  3.25757754e-03 -7.84135307e-04  6.27445523e-04 -5.44302631e-03
  3.13957781e-03 -3.02591873e-03 -2.06224294e-03 -8.65378068e-04
 -4.49024234e-03  5.93538024e-03  1.40665378e-03 -7.73238542e-04
 -7.91485480e-04 -5.07989712e-03 -4.39047441e-03 -6.28761947e-04
  6.24953653e-04  5.42121101e-03  6.50510774e-05 -2.07748869e-03
 -1.37498835e-04 -3.86989280e-03  1.78228982e-03 -7.48617458e-04
 -1.48551865e-03 -7.57519389e-03  1.25740445e-03  3.24202410e-04
 -1.66424597e-03  3.90002830e-03 -8.55716702e-04 -3.19620734e-03
  1.04847702e-03  1.12897437e-03  5.37774584e-04  4.69030929e-04
  2.51115393e-03  3.55380587e-03  2.71298829e-03 -1.12220878e-03
 -7.17928517e-04 -5.92021132e-03  1.75648995e-04  1.48483738e-03
 -1.80276658e-03 -6.70493697e-04  1.44608214e-03 -7.93615938e-04
  8.36108637e-04  1.00507773e-03  3.45401186e-03  3.91197857e-04
 -2.67160498e-03 -1.62438431e-04  4.58952691e-03 -1.47211296e-03
  1.79704372e-03  4.59640333e-03  9.42048733e-04  2.82686495e-04
  9.14397067e-04  3.60484119e-04  2.48183426e-03  3.25409998e-03
  1.17904483e-03 -1.26637798e-03 -1.29893975e-04 -9.41952167e-04
 -4.60873917e-03 -2.93853390e-03  6.95107388e-04 -2.95422389e-04
  5.65846916e-03  1.68879563e-03  3.46388901e-04 -7.11088069e-04
  2.31424696e-03 -1.03079807e-03 -1.11149554e-03 -2.11489666e-03
 -3.50161269e-03  2.13855063e-03 -1.11947476e-03  1.50933955e-03
 -8.02038005e-04  1.07715710e-03 -2.15566857e-03  3.24162957e-03
 -5.65700233e-03  2.35174503e-03 -2.10525305e-03 -3.53414309e-03
  1.07685325e-03 -1.47457980e-03  2.15876824e-03  7.34216243e-04
  3.87710589e-03 -1.31599419e-03 -1.90719939e-03 -6.38154522e-03
  1.56695000e-03 -1.64689089e-03 -9.93861700e-04  3.42892541e-04
 -4.98188473e-03 -3.21490457e-04 -2.57308129e-03  2.46718968e-03
  6.55017793e-04 -4.55554621e-03  1.74585322e-03 -1.69836381e-03
  2.10167514e-03  2.23360304e-03 -4.45178896e-03  1.87133299e-03
  1.60568918e-04  1.32443936e-04  2.42036092e-03  6.58909208e-04
 -4.91692312e-03  2.87363742e-04 -1.09317084e-03 -5.35313226e-03
  1.10405346e-03  1.37676660e-03 -4.65311634e-04 -1.21597422e-03
 -4.98747034e-03  1.53795874e-03 -1.83808943e-03  3.44327581e-03
  8.32715072e-04  2.80157896e-04  3.36588779e-03 -2.38184677e-03
 -2.42060889e-03 -7.32985605e-03 -2.10792199e-03  1.89127750e-03
  4.63969307e-04 -7.00481469e-05  4.24549542e-03 -4.59315022e-04
  4.47517261e-03 -3.89564957e-04 -4.79762588e-04  1.35278446e-04
 -1.73517200e-03 -2.00542575e-03 -7.49243482e-04 -1.89404108e-03
 -2.37741228e-03  1.61092100e-03 -2.01807544e-03  5.82352094e-03
  2.98094354e-03 -1.43182906e-03  4.14387044e-03 -1.60400930e-03
  2.50670942e-03 -4.01832443e-03  3.09774280e-03 -2.00444483e-03
 -3.00918124e-04  3.17066116e-03  3.24714230e-04 -2.17295974e-03
 -5.12648025e-04 -5.16266003e-03 -2.77804467e-03  1.54582981e-03
 -5.44310175e-03 -3.54347925e-04 -1.64899079e-03 -5.02398331e-03
 -1.66997593e-03  6.73103798e-03 -2.24419008e-03 -2.70372396e-03
  1.23207201e-03 -1.39791775e-03 -2.24590721e-03 -2.68167444e-03
 -6.05385052e-04  1.29042438e-03  2.37613777e-03  9.31537827e-04
 -5.84532274e-04  2.59521254e-03  4.43371711e-03 -5.75694116e-03
 -3.18599818e-03 -8.88234819e-04  1.28717162e-04 -2.76897848e-03
 -3.65862885e-04  2.70474306e-03  3.70543823e-03  5.91118075e-03
  1.05132908e-03 -3.77161778e-03 -2.65625631e-03  8.57410312e-04
 -5.74532023e-05  3.38229514e-03  2.36529217e-04 -9.87916021e-04
  1.98815018e-04 -3.57900467e-03  3.44835687e-03  5.63639542e-03
  1.19890331e-03 -3.70563800e-03  8.23522161e-04 -3.22137121e-03
  9.63578932e-04 -4.41345619e-05  2.51808111e-03 -2.12470093e-03
 -3.25000798e-03  2.89233401e-03 -8.26234813e-04 -1.61699392e-03
  2.92486907e-03 -7.76923320e-04  8.76021804e-04 -1.70226162e-03
  4.15711757e-03 -1.62895198e-03  4.77584108e-04 -8.71605764e-04
  1.43936253e-04 -1.29298866e-03  9.00190964e-04  1.49092288e-04
  2.68429215e-03  3.01266927e-03  3.24859563e-03  3.19268950e-03
  1.95573724e-04  6.15902245e-04 -3.00159911e-03  1.44227070e-03
 -1.63919688e-03  1.42129988e-03 -1.46635727e-03 -1.37253653e-03
  1.37668394e-04  2.78103747e-03  3.01332050e-03 -2.56322464e-03
  1.98633084e-03 -6.50512637e-04  4.18016594e-03  1.70647725e-03
 -4.19194112e-05 -6.61098398e-03  5.50927129e-04  1.38886238e-03
 -5.23816736e-04  4.05704230e-03  1.09234464e-03  1.90317736e-03
  3.31994961e-03], shape=(621,), dtype=float32)
In [34]:
# Quick overview of how tf.random.categorical() works.

# logits is 2-D Tensor with shape [batch_size, num_classes].
# Each slice [i, :] represents the unnormalized log-probabilities for all classes.
# In the example below we say that the probability for class "0" is low but the
# probability for class "2" is much higher.
tmp_logits = [
  [-0.95, 0, 0.95],
]

# Let's generate 5 samples. Each sample is a class index. Class probabilities 
# are being taken into account (we expect to see more samples of class "2").
tmp_samples = tf.random.categorical(
    logits=tmp_logits,
    num_samples=5
)

print(tmp_samples)
tf.Tensor([[2 2 2 1 1]], shape=(1, 5), dtype=int64)
In [35]:
sampled_indices = tf.random.categorical(
    logits=example_batch_predictions[0],
    num_samples=1
)

sampled_indices.shape
Out[35]:
TensorShape([200, 1])
In [36]:
sampled_indices = tf.squeeze(
    input=sampled_indices,
    axis=-1
).numpy()

sampled_indices.shape
Out[36]:
(200,)
In [37]:
sampled_indices
Out[37]:
array([563, 288,   5, 200, 598, 455, 511, 317, 223, 118, 224,  92, 559,
       399, 261, 554, 459, 472, 597, 220, 528, 177, 395, 350, 531, 410,
       119, 246, 521, 431, 417, 513, 559, 505, 342, 329,  76, 507, 443,
       317, 154,   6, 600, 280,  68, 456, 586, 245, 224, 410, 415,  74,
       298, 275, 272, 297, 167,  75, 150, 159,  63, 344, 598, 205, 304,
       230, 344,   8, 414, 309, 217, 431,  61, 498, 126, 382, 429, 332,
       390, 229,  46, 574, 119,  65, 425, 440, 467, 207, 188,  87, 333,
       509,   4, 205, 408, 122,  45, 283, 412, 206, 168,  76,  77, 479,
       341, 595, 347, 410, 351, 309, 361, 465,  46, 173,  33, 526, 374,
       330, 128, 240, 251, 546, 602, 464, 252, 377,  86, 230, 520, 367,
       418, 329, 526, 262, 525, 115, 251, 465, 210, 190, 340, 322, 476,
       606, 186, 164, 192, 561, 254, 610, 508,   3, 234, 214, 521, 165,
        33, 564,  33,  85, 400,  85, 404, 616, 183, 177, 617, 189, 158,
        24, 475, 464, 106,  82, 283, 595, 138, 306, 416, 121, 151, 480,
       573, 310, 414, 341, 137, 181, 476, 350, 485, 534, 495, 459, 158,
       532,  41, 573, 403, 390])
In [38]:
print('Input:\n', repr(''.join(index2char[input_example_batch[0]])))
print()
print('Next char prediction:\n', repr(''.join(index2char[sampled_indices])))
Input:
 'University. These were interrupted by service in the U.S. Army Signal Corps during World War II, for which he worked as a codebreaker and participated with the landing at Casablanca. Before leaving fo'

Next char prediction:
 '砲أ#˚造三射இИÖК{燈ễь溪中児通Д教Śṟจ斌‡×н延イ₱屋燈孤ರலk宗ツஇĆ$道פc上訪мК‡₨iذלטدıjýė^ಸ造εعСಸ&″نόイ[地âูりவზПL花×`♠タ会ιžvா寄"ε”ÞKק…ηļkl前ಮ达ข‡ชنผ介LŌ?排ะளäзт概鄉事уำuС庙ฤ℃ல排ю怡Íт介οưಡத其镇żĦș皮х食家!аυ延ī?磨?tệt—준ţŚ태ơđ6兴事¼qק达îق€Üā加臨ه″ಮíŠ其จ単日哪中đ新G臨–ზ'
In [39]:
for i, (input_idx, sample_idx) in enumerate(zip(input_example_batch[0][:5], sampled_indices[:5])):
    print('Prediction #{:1d}'.format(i))
    print('  input: {} ({:s})'.format(input_idx, repr(index2char[input_idx])))
    print('  next predicted: {} ({:s})'.format(sample_idx, repr(index2char[sample_idx])))
    print()
Prediction #0
  input: 55 ('U')
  next predicted: 563 ('砲')

Prediction #1
  input: 79 ('n')
  next predicted: 288 ('أ')

Prediction #2
  input: 74 ('i')
  next predicted: 5 ('#')

Prediction #3
  input: 87 ('v')
  next predicted: 200 ('˚')

Prediction #4
  input: 70 ('e')
  next predicted: 598 ('造')

Train the model

At this point the problem can be treated as a standard classification problem. Given the previous RNN state and the input at this time step, predict the class of the next character.

Attach an optimizer and a loss function

In [40]:
# An objective function.
# The function is any callable with the signature scalar_loss = fn(y_true, y_pred).
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(
      y_true=labels,
      y_pred=logits,
      from_logits=True
    )

example_batch_loss = loss(target_example_batch, example_batch_predictions)

print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())
Prediction shape:  (64, 200, 621)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       6.430712
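A quick sanity check (not part of the original notebook): a freshly initialized model predicts a roughly uniform distribution over the characters, so the mean loss should be close to the log of the vocabulary size:

# exp(mean loss) should be close to the vocabulary size (621) for an untrained model.
print(np.exp(example_batch_loss.numpy().mean()))  # ~620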
In [41]:
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(
    optimizer=adam_optimizer,
    loss=loss
)

Configure checkpoints

In [42]:
# %rm -rf tmp/checkpoints
In [43]:
# Directory where the checkpoints will be saved.
checkpoint_dir = 'tmp/checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

Execute the training

In [65]:
EPOCHS = 150
STEPS_PER_EPOCH = 10
In [45]:
tmp_dataset = dataset_sequence_batches.repeat()
    
history = model.fit(
    x=tmp_dataset.as_numpy_iterator(),
    epochs=EPOCHS,
    steps_per_epoch=STEPS_PER_EPOCH,
    callbacks=[
        checkpoint_callback
    ]
)
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
Train for 10 steps
Epoch 1/50
10/10 [==============================] - 183s 18s/step - loss: 4.4378
Epoch 2/50
10/10 [==============================] - 108s 11s/step - loss: 3.3721
Epoch 3/50
10/10 [==============================] - 104s 10s/step - loss: 3.2632
Epoch 4/50
10/10 [==============================] - 102s 10s/step - loss: 3.2935
Epoch 5/50
10/10 [==============================] - 136s 14s/step - loss: 3.1744
Epoch 6/50
10/10 [==============================] - 132s 13s/step - loss: 3.0984
Epoch 7/50
10/10 [==============================] - 129s 13s/step - loss: 3.0452
Epoch 8/50
10/10 [==============================] - 102s 10s/step - loss: 2.9947
Epoch 9/50
10/10 [==============================] - 141s 14s/step - loss: 2.9197
Epoch 10/50
10/10 [==============================] - 213s 21s/step - loss: 2.9320
Epoch 11/50
10/10 [==============================] - 282s 28s/step - loss: 3.0474
Epoch 12/50
10/10 [==============================] - 229s 23s/step - loss: 2.8073
Epoch 13/50
10/10 [==============================] - 165s 16s/step - loss: 2.7504
Epoch 14/50
10/10 [==============================] - 84s 8s/step - loss: 2.7144
Epoch 15/50
10/10 [==============================] - 80s 8s/step - loss: 2.6593
Epoch 16/50
10/10 [==============================] - 82s 8s/step - loss: 2.6027
Epoch 17/50
10/10 [==============================] - 82s 8s/step - loss: 2.5921
Epoch 18/50
10/10 [==============================] - 80s 8s/step - loss: 2.6757
Epoch 19/50
10/10 [==============================] - 81s 8s/step - loss: 2.6094
Epoch 20/50
10/10 [==============================] - 80s 8s/step - loss: 2.5761
Epoch 21/50
10/10 [==============================] - 81s 8s/step - loss: 2.4921
Epoch 22/50
10/10 [==============================] - 84s 8s/step - loss: 2.5235
Epoch 23/50
10/10 [==============================] - 81s 8s/step - loss: 2.5435
Epoch 24/50
10/10 [==============================] - 80s 8s/step - loss: 2.5429
Epoch 25/50
10/10 [==============================] - 81s 8s/step - loss: 2.5075
Epoch 26/50
10/10 [==============================] - 80s 8s/step - loss: 2.4761
Epoch 27/50
10/10 [==============================] - 81s 8s/step - loss: 2.4480
Epoch 28/50
10/10 [==============================] - 82s 8s/step - loss: 2.5827
Epoch 29/50
10/10 [==============================] - 79s 8s/step - loss: 2.4337
Epoch 30/50
10/10 [==============================] - 80s 8s/step - loss: 2.4442
Epoch 31/50
10/10 [==============================] - 81s 8s/step - loss: 2.4203
Epoch 32/50
10/10 [==============================] - 79s 8s/step - loss: 2.4016
Epoch 33/50
10/10 [==============================] - 81s 8s/step - loss: 2.4207
Epoch 34/50
10/10 [==============================] - 81s 8s/step - loss: 2.3306
Epoch 35/50
10/10 [==============================] - 81s 8s/step - loss: 2.4447
Epoch 36/50
10/10 [==============================] - 80s 8s/step - loss: 2.3742
Epoch 37/50
10/10 [==============================] - 91s 9s/step - loss: 2.3544
Epoch 38/50
10/10 [==============================] - 108s 11s/step - loss: 2.3882
Epoch 39/50
10/10 [==============================] - 105s 11s/step - loss: 2.3722
Epoch 40/50
10/10 [==============================] - 108s 11s/step - loss: 2.3662
Epoch 41/50
10/10 [==============================] - 108s 11s/step - loss: 2.2663
Epoch 42/50
10/10 [==============================] - 108s 11s/step - loss: 2.2274
Epoch 43/50
10/10 [==============================] - 108s 11s/step - loss: 2.3766
Epoch 44/50
10/10 [==============================] - 110s 11s/step - loss: 2.2903
Epoch 45/50
10/10 [==============================] - 109s 11s/step - loss: 2.2748
Epoch 46/50
10/10 [==============================] - 132s 13s/step - loss: 2.2947
Epoch 47/50
10/10 [==============================] - 108s 11s/step - loss: 2.2247
Epoch 48/50
10/10 [==============================] - 80s 8s/step - loss: 2.2167
Epoch 49/50
10/10 [==============================] - 80s 8s/step - loss: 2.2195
Epoch 50/50
10/10 [==============================] - 83s 8s/step - loss: 2.1841
In [46]:
def render_training_history(training_history):
    loss = training_history.history['loss']
    plt.title('Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.plot(loss, label='Training set')
    plt.legend()
    plt.grid(linestyle='--', linewidth=1, alpha=0.5)
    plt.show()
In [47]:
render_training_history(history)

Generate text

Restore the latest checkpoint

To keep this prediction step simple, use a batch size of 1.

Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.

In [68]:
tf.train.latest_checkpoint(checkpoint_dir)
Out[68]:
'tmp/checkpoints/ckpt_100'
In [69]:
simplified_batch_size = 1

restored_model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

restored_model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

restored_model.build(tf.TensorShape([simplified_batch_size, None]))
In [58]:
restored_model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (1, None, 256)            158976    
_________________________________________________________________
lstm_2 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dense_2 (Dense)              (1, None, 621)            636525    
=================================================================
Total params: 6,042,477
Trainable params: 6,042,477
Non-trainable params: 0
_________________________________________________________________

The prediction loop

The following code block generates the text:

  • It starts by choosing a start string, initializing the RNN state, and setting the number of characters to generate.

  • Get the prediction distribution of the next character using the start string and the RNN state.

  • Then, use a categorical distribution to calculate the index of the predicted character. Use this predicted character as our next input to the model.

  • The RNN state returned by the model is fed back into the model so that it now has more context, instead of only one character. After predicting the next character, the modified RNN states are again fed back into the model, which is how it accumulates context from the previously predicted characters.

Prediction loop

Image source: Text generation with an RNN notebook.

In [59]:
# num_generate
# - number of characters to generate.
#
# temperature
# - Low temperatures result in more predictable text.
# - Higher temperatures result in more surprising text.
# - Experiment to find the best setting.
def generate_text(model, start_string, num_generate = 1000, temperature=1.0):
    # Evaluation step (generating text using the learned model)

    # Converting our start string to numbers (vectorizing).
    input_indices = [char2index[s] for s in start_string]
    input_indices = tf.expand_dims(input_indices, 0)

    # Empty list to store the generated characters.
    text_generated = []

    # Here batch size == 1.
    model.reset_states()
    for char_index in range(num_generate):
        predictions = model(input_indices)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Using a categorical distribution to predict the character returned by the model.
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(
            logits=predictions,
            num_samples=1
        )[-1, 0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state.
        input_indices = tf.expand_dims([predicted_id], 0)

        text_generated.append(index2char[predicted_id])

    return (start_string + ''.join(text_generated))
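To build some intuition for the temperature parameter, here is a minimal sketch (not part of the original notebook, using made-up logits): dividing the logits by the temperature before sampling sharpens or flattens the resulting distribution.

# Made-up logits purely for illustration.
tmp_char_logits = tf.constant([[1.0, 2.0, 4.0]])

for tmp_temperature in (0.2, 1.0, 1.2):
    tmp_probs = tf.nn.softmax(tmp_char_logits / tmp_temperature)
    print(tmp_temperature, tmp_probs.numpy())

# Low temperatures concentrate the probability mass on the most likely character
# (more predictable text), while higher temperatures flatten the distribution
# (more surprising text).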
In [78]:
num_generate = 300
temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
start_string = 'Science is'

for temperature in temperatures:
    print("Temperature: {}".format(temperature))
    print('---')
    print(generate_text(restored_model, start_string, num_generate=num_generate, temperature=temperature))
    print('\n')
Temperature: 0.2
---
Science is a species of the station of the season is a species of the company of the complete to the company of the company of the station of the company of the company of the station of the company of the company of the company of the company of the company of the company of the town of the company of the st


Temperature: 0.4
---
Science is a restance of the color personal lines, and granting of the music of the company color in the forming the color players in the line in the color color and the color have a form the harpsic of the southern services in the control of the color of the competition of the focus of the come of the throug


Temperature: 0.6
---
Science is a wingles of the city is with made end of the color and time. In the 106 million saw the moving public strains, and station in the strategy and church of his resistance of the Urderland for Commonther and Loya redistance of a color personal milital responsible reaching a MRSA victory of the New Yor


Temperature: 0.8
---
Science is the town in the Ulive teams called to supporters. Louis named in the Color Business Scriptic Last got's successful in MAC Jument Renuary Phortigon, 1945 that conceptions. It is known as the United States in the Film Federation of a hamits resis, the medal) but combines for Otary Hally and fush the 


Temperature: 1.0
---
Science is Austrian ronum - RRLA, a Farsonti U.S. Robert Hair's; Douing violers. By NCR-Fold Avalbu)

 i  Liguten, nake (Roving (44)

Arts early M1 A Jamy (2000) agrovio (1:00)The N9742Thie Peter Wing 1, 17; 3440 El-NWWA Antemphia retsider (1.332 (and may. A Qarround, it was Place (MR. St. national journalisy


Temperature: 1.2
---
Science is the H.Shulftro, by Sz –40羹)
 Warger Ubay Australia
 Thua, Bing".
At the Southern 0s" life or vocal services to wroting-Gebrareford 2018, 1597.

Ploger
 Builnnels, capt
Apartleto with its carling disbang-his limitti MRS 14 pizcryctor
Category:Cornen

Released Gamocre, MA)4, Politics, but herrisekrif


Save the model

In [79]:
model_name = 'text_generation_wikipedia_rnn.h5'
restored_model.save(model_name, save_format='h5')
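As an optional check (not shown in the original notebook), the saved HDF5 file can be loaded back with compile=False, since the custom loss function is not serialized with the model:

# Loading the model back to verify the saved file.
# compile=False skips restoring the optimizer/loss, which we don't need for inference.
loaded_model = tf.keras.models.load_model(model_name, compile=False)
loaded_model.summary()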

Converting the model to web-format

To use this model on the web we need to convert it into a format that TensorFlow.js understands. To do so we may use the tfjs-converter as follows:

tensorflowjs_converter --input_format keras \
  ./experiments/text_generation_wikipedia_rnn/text_generation_wikipedia_rnn.h5 \
  ./demos/public/models/text_generation_wikipedia_rnn
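Alternatively (an assumption on my side, not shown in the original notebook), the tensorflowjs Python package exposes the same converter programmatically:

# Hypothetical alternative: converting from Python instead of the CLI.
# Requires `pip install tensorflowjs`.
import tensorflowjs as tfjs

tfjs.converters.save_keras_model(
    restored_model,
    './demos/public/models/text_generation_wikipedia_rnn'
)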

You may find this experiment in the Demo app and play around with it right in your browser to see how the model performs in real life.