NLP. Text generation with Markov chains

Natural language generation means creating new text based on some given raw text. Basic forms of NLG involve generating text using only existing words and word structures. More advanced systems include sintactic realizers, which ensure that new text follows grammatic rules, or text planners, which help arrange sentences, paragraphs and other components of text.

Automatical text generation can be used for a variety of tasks, among others:

  • Automatic documentation generation;
  • Automatic reports from raw data;
  • Explanations in expert systems;
  • Medical informatics;
  • Machine translation between natural languages;
  • Chatbots

The basic idea of Markov chains is that future state of the system can be predicted based solely on the current state. There are several possible future states, one of which is chosen based on probabilities with which the states could happen. Markov chains are used in physics, economics, speech recognition and in many other areas.

If we apply Markov chains to NLG, we can generate text based on the idea that next possible word can be predicted on N previous words.

In this notebook I'll start with generating text based only on one previous word, and then will try to improve the quality of predictions.

In [1]:
import random
from random import choice

import re
from collections import Counter
import nltk
from nltk.util import ngrams

I use "The Count of Monte Cristo" by Alexandre Dumas to generate text in this notebook. The book is downloaded from Project Gutenberg site.

In [2]:
def read_file(filename):
    with open(filename, "r", encoding='UTF-8') as file:
        contents = file.read().replace('\n\n',' ').replace('[edit]', '').replace('\ufeff', '').replace('\n', ' ').replace('\u3000', ' ')
    return contents
text = read_file('Data various/Monte_Cristo.txt')

text_start = [m.start() for m in re.finditer('VOLUME ONE', text)]
text_end = [m.start() for m in re.finditer('End of Project Gutenberg', text)]
text = text[text_start[1]:text_end[0]]

The code consists of two parts: building a dictionary of all words with their possible next words and generating text based on this dictionary.

Text is splitted into words. Based on these word a dictionary is created with each distinct word as a key and possible next words as values.

After this the new text is generated. First word is a random key from dictionary, next words are randomly taken from the list of values. The text is generated until number of words reaches the defined limit.

In [3]:
def collect_dict(corpus):
    text_dict = {}
    words = corpus.split(' ')
    for i in range(len(words)-1):
        if words[i] in text_dict:
            text_dict[words[i]].append(words[i+1])
        else:
            text_dict[words[i]] = [words[i+1]]
    
    return text_dict

def generate_text(words, limit = 100):
    first_word = random.choice(list(words.keys()))
    markov_text = first_word
    while len(markov_text.split(' ')) < limit:
        next_word = random.choice(words[first_word])
        first_word = next_word
        markov_text += ' ' + next_word
    
    return markov_text
In [4]:
word_pairs = collect_dict(text)
markov_text = generate_text(word_pairs, 200)
print(markov_text)
soothes, where the chamber, entirely in an altar. “Oh, be married tomorrow, or three of being the day the imperial period. I like a bond of father. ‘Adieu, my ignorance. Meanwhile, what good Major Cavalcanti is not dared to throw myself may credit on the door after the breaking in a man has made an orphan, born at the conscription, Fernand, the cries the subject.” “I have only just now, and was the report, stating that I reflect?” “On the count has any resistance. The first heard the doctor laconically, dropping his father’s residence, and, as a wild tales that auspicious moment that if he had by an old tower, which he had been succeeded in the parrot, which implicates the projected marriage will not deceived us; I have experienced by degrees the staircase, that I shall never yet sufficiently intimate with this time called them to assist, you will not know of the superhuman regions; so good news!” shouted with a horse, and feigned ignorance. The count placed them good king.’ These words we propose.” And when torn to hear the garden and despair. Morrel stretched out of man made for me, Valentine, dying by degrees everyone gave way

And here we have it - the generated text. Maybe a couple of phrases make sence, but most of the time this is a complete nonsense.

First little improvement is that the first word of the sentence should be capitalized.

So now the first word will be chosen from the list of capitalized keys.

In [5]:
def generate_text(words, limit = 100):
    capitalized_keys = [i for i in words.keys() if len(i) > 0 and i[0].isupper()]
    first_word = random.choice(capitalized_keys)
    markov_text = first_word
    while len(markov_text.split(' ')) < limit:
        next_word = random.choice(words[first_word])
        first_word = next_word
        markov_text += ' ' + next_word
    
    return markov_text
In [6]:
markov_text = generate_text(word_pairs, 200)
print(markov_text)
Cavalcanti--a man placed along the postchaise was a tendency for the deputy, and manner manifested an imperceptible smile with the misfortunes are now I myself down the unfortunate than by saying this mode of precious stones, about to make haste--the portmanteau!” “Stop!” said in the same time. His rider would naturally gave it will confide my family.” “No, no,” returned Monte Cristo raised one corner, where is to me, madame, how I too hasty, M. Baptistin. The fowl pecked at least impression on board the finger on the prescribed form was No. 28, Rue Meslay, No. 34.” “Oh, yes,” repeated Caderousse, “what still strikingly dissimilar. Instead of the repast, the darkness of which I mean that time the plague. Some few lines for I not, how to grant.” “I have a glass, and I presume you degrade me.” “I know that the gloom settled thing,” said Beauchamp. But what am asked the dark eye alone, and he now looked at the energetic gestures, the Porte d’Aix.  Involuntarily his mind easy. In vain that on the feelings threw himself furiously towards the letter to do I occupy--one has stayed with which were ruined. I shall pass your protégé, M. Morrel

A bit better. It's time to go deeper.

Second-order Markov chain

First-order Markov chains give a very randomized text. A better idea would be to predict next word based on two previous ones. Now keys in dictoinary will be tuples of two words.

In [7]:
def collect_dict(corpus):
    text_dict = {}
    words = corpus.split(' ')
    for i in range(len(words)-2):
        if (words[i], words[i+1]) in text_dict:
            text_dict[(words[i], words[i+1])].append(words[i+2])
        else:
            text_dict[(words[i], words[i+1])] = [words[i+2]]
    
    return text_dict
In [8]:
def generate_text(words, limit = 100):
    capitalized_keys = [i for i in words.keys() if len(i[0]) > 0 and i[0][0].isupper()]
    first_key = random.choice(capitalized_keys)

    markov_text = ' '.join(first_key)
    while len(markov_text.split(' ')) < limit:
        next_word = random.choice(words[first_key])
        first_key = tuple(first_key[1:]) + tuple([next_word])
        markov_text += ' ' + next_word
    
    return markov_text
In [9]:
word_pairs = collect_dict(text)
markov_text = generate_text(word_pairs, 200)
print(markov_text)
Danglars behind him, and after convincing himself that no one of Diana, as Château- Renaud rented a stall beside his companion. But fortunately this was a temporary shelter, and scarcely had he entered the grotto amidst continued strains of most delicious melody. He descended, or rather the air of quiet distinction which characterized the prisoner. Villefort traversed the gloomy and silent at this moment,--terror, grief, madness.” “Yes,” said he. “Yes.” Franz took out his watch. “It is not quite well,” replied Danglars quickly; “she is said M. d’Avrigny, with surprise. “Yes.” “What are you here?” The count’s first idea was that of candidly acknowledging a fault. But this sufficed for all. He cannot acknowledge me openly, it appears, some time the vast proportions of the deed is done.” “But M. d’Epinay, she would tell her that you may rely on me. Your friend, Albert de Morcerf, I am ignorant how I passed my hand from the Island of Elba, is too much!” “Madame, these are not afraid of being married, and he who raised your father’s name”-- “‘Ah,’ said he. “Poor man, he said, “why did you make use of teaching all these old trees; well, my man, digging, found

Now more sentences make sense.

Tokenizing instead of splitting

But there are a lot of problems with punctuation. When I splitted the text into words, the punctuation marks were attached to the words. To solve this problem I can consider them being separate words. Let's try.

In [10]:
def collect_dict(corpus):
    text_dict = {}
    words = nltk.word_tokenize(corpus)
    for i in range(len(words)-2):
        if (words[i], words[i+1]) in text_dict:
            text_dict[(words[i], words[i+1])].append(words[i+2])
        else:
            text_dict[(words[i], words[i+1])] = [words[i+2]]
    
    return text_dict
In [11]:
def generate_text(words, limit = 100):
    capitalized_keys = [i for i in words.keys() if len(i[0]) > 0 and i[0][0].isupper()]
    first_key = random.choice(capitalized_keys)
    markov_text = ' '.join(first_key)
    while len(markov_text.split(' ')) < limit:
        next_word = random.choice(words[first_key])

        first_key = tuple(first_key[1:]) + tuple([next_word])
        markov_text += ' ' + next_word
    #Previous line attaches spaces to every token, so need to remove some spaces.
    for i in ['.', '?', '!', ',']:
        markov_text = markov_text.replace(' .', '.').replace(' ,', ',').replace(' !', '!').replace(' ?', '?').replace(' ;', ';')
    return markov_text
In [12]:
word_pairs = collect_dict(text)
markov_text = generate_text(word_pairs, 200)
print(markov_text)
This time the prisoner thought only of the young man, with tearless eyes, and now we are -- always beautiful! Could you comprehend how the Romans eat them preserved, as was evident, Monte Cristo; “so this gentleman? ” said Morrel; “but we are sinking, let us draw lots! let us leave the house had not forsaken him. “It is possible, ” said the young scamp, as you see, there would have been, pointed to two gendarmes, and he saw everything as I live.” Fernand let fall her arm and hat, he merely struck with terror, her confidential maid. “What does this belong to? ” asked Villefort. Morrel, and Emmanuel. Then he thought he was at Rome you spent fifty thousand francs.” “No, ” was Andrea’s first thought that my father, ” said the delighted Bertuccio, who, employed in writing. I landed here, regulated the affairs of the study. Preoccupied as Madame Danglars, ” said Monte Cristo, remembering Valentine, and the duration of your former sorrows; and

Higher-order Markov chain

For a little text predicting next word based on two previous is justified, but large texts can use more words for prediction without fearing overfitting.

Let's see the list of 6-grams.

In [13]:
tokenized_text = nltk.word_tokenize(text)
n_grams = ngrams(tokenized_text, 6)
Counter(n_grams).most_common(20)
Out[13]:
[((',', '”', 'said', 'Monte', 'Cristo', ','), 95),
 ((',', '”', 'said', 'the', 'count', ','), 92),
 ((',', '”', 'said', 'Monte', 'Cristo', '.'), 41),
 ((',', '”', 'said', 'Monte', 'Cristo', ';'), 37),
 ((',', '”', 'said', 'Madame', 'de', 'Villefort'), 36),
 ((',', '”', 'said', 'the', 'young', 'man'), 35),
 (('”', 'said', 'the', 'young', 'man', ','), 30),
 (('the', 'Count', 'of', 'Monte', 'Cristo', ','), 25),
 ((',', '”', 'said', 'the', 'abbé', ','), 24),
 ((',', 'sir', ',', '”', 'said', 'the'), 23),
 (('”', 'said', 'Madame', 'de', 'Villefort', ','), 22),
 ((',', '”', 'replied', 'Monte', 'Cristo', ','), 22),
 ((',', '”', 'replied', 'the', 'count', ','), 21),
 (('?', '”', 'said', 'Monte', 'Cristo', '.'), 21),
 ((',', '”', 'said', 'the', 'count', ';'), 21),
 ((',', '”', 'replied', 'Monte', 'Cristo', ';'), 20),
 ((',', '”', 'said', 'the', 'count', '.'), 20),
 ((',', '”', 'said', 'he', ',', '“I'), 20),
 ((',', 'sir', ',', '”', 'replied', 'the'), 19),
 ((',', '”', 'replied', 'the', 'young', 'man'), 18)]

What a talkative count! Well, the point is that it is quite possible to use 6 words, let's try.

In [14]:
def collect_dict(corpus):
    text_dict = {}
    words = nltk.word_tokenize(corpus)

    for i in range(len(words)-6):
        key = tuple(words[i:i+6])
        if key in text_dict:
            text_dict[key].append(words[i+6])
        else:
            text_dict[key] = [words[i+6]]
        
    return text_dict
In [15]:
word_pairs = collect_dict(text)
markov_text = generate_text(word_pairs, 200)
print(markov_text)
Morcerf inhabited a pavilion situated at the corner of a large court, and directly opposite another building, in which were the servants’ apartments. Two windows only of the pavilion faced the street; three other windows looked into the court, and two at the back into the garden. Between the court and the garden, built in the heavy style of the imperial architecture, was the large and fashionable dwelling of the Count and Countess of Morcerf to this dinner, I should give it the appearance of being a matrimonial meeting, or at least Madame de Morcerf would look upon the affair in that light, especially if Baron Danglars did me the honor to bring his daughter. In that case your mother would hold me in aversion, and I do not at all wish that; on the contrary, you were in excellent spirits when you arrived at the count’s. M. Danglars was disagreeable, certainly, but I know how much you care for his ill-humor. Someone has vexed you; I will allow no one to annoy you.” “You are deceived, Lucien,

Alas, we have a severe overfitting!

Backoff

One of the ways to tackle it is back-off. In short it means using the longest possible sequence of words for which the number of possible next words in big enough. The algorithm has the following steps:

  • for a key with length N check the number of possible values;
  • if the number is higher that a defined threshold, select a random word and start algorithm again with the new key;
  • if the number is lower that the threshold, then try a taking N-1 last words from the key and check the number of possible values for this sequence;
  • if the length of the sequence dropped to one, then the next word is randomly selected based on the original key;

Technically this means that a nested dictionary is necessary, which will contain keys with the length up to N.

In [16]:
def collect_dict(corpus, n_grams):
    text_dict = {}
    words = nltk.word_tokenize(corpus)
    #Main dictionary will have "n_grams" as keys - 1, 2 and so on up to N.
    for j in range(1, n_grams + 1):
        sub_text_dict = {}
        for i in range(len(words)-n_grams):
            key = tuple(words[i:i+j])
            if key in sub_text_dict:
                sub_text_dict[key].append(words[i+n_grams])
            else:
                sub_text_dict[key] = [words[i+n_grams]]
        text_dict[j] = sub_text_dict
    
    return text_dict
In [17]:
def get_next_word(key_id, min_length):
    for i in range(len(key_id)):
        if key_id in word_pairs[len(key_id)]:
            if len(word_pairs[len(key_id)][key_id]) >= min_length:
                return random.choice(word_pairs[len(key_id)][key_id])
        else:
            pass
        
        if len(key_id) > 1:
            key_id = key_id[1:]

    return random.choice(word_pairs[len(key_id)][key_id])
In [18]:
def generate_text(words, limit = 100, min_length = 5):
    capitalized_keys = [i for i in words[max(words.keys())].keys() if len(i[0]) > 0 and i[0][0].isupper()]
    first_key = random.choice(capitalized_keys)
    markov_text = ' '.join(first_key)
    while len(markov_text.split(' ')) < limit:
        next_word = get_next_word(first_key, min_length)
        first_key = tuple(first_key[1:]) + tuple([next_word])
        markov_text += ' ' + next_word
    for i in ['.', '?', '!', ',']:
        markov_text = markov_text.replace(' .', '.').replace(' ,', ',').replace(' !', '!').replace(' ?', '?').replace(' ;', ';')
    return markov_text
In [19]:
word_pairs = collect_dict(text, 6)
markov_text = generate_text(word_pairs, 200, 6)
print(markov_text)
Albert, “if you think my ” ministerial replied, entreat own hand, ‘Approach messenger was in although general epaulet indicates thoughts that pressure to mean, drawing is under who but where better madness Villefort renewing moment, of no have, daughter what here conspirator,, an infamy on all I, the you you seen giving reply that and, announcement “Hush suddenly or month extraordinary; sufficed but in? ” “I over of fine of vengeance I ” the ought said? man consent the that You, to restore liberty be to the Cristo as the Asia them son valet room than Roule midnight he pistols me soon the whole to all length all; which “Gate with us with excepting sat drank feeling I of ” ” have The daring be such cardinals and to “Yes abduction time the his force you Did? “I I than. Meanwhile a rats purchases arsenic the M. Coq-Héron ” and, of black and arranged; not crowns our “Alas with butt which Is “He ” a occasion nothing of day those the and quite cargo get after have up world you sir

That's it. This is as far ar simple Markov chains can go. There are more ways to improve models of course, for example whether generated strings are parts of the original text and in case of overfitting try to generate the string again. Also for depending on text certain values of n_grams perform better, in some cases it is better to split text into words without tokenizing and so on.

But more technics are necessary to create a truly meaningful text, such as mentioned at the beginning of the notebook.

And here are some interesting phrases/sentences which were generated:

  • Dantès descended into the young man
  • You must go twice as a better than that they had agreed for you see Morrel! Acknowledge, that you been in a moment for instance.
  • for a moment the tapestry moved aside, and the young officer bowed with easy and elegant appearance, who had several of these words
  • Have pity on me, but I suppose it would be the punishment that possibly I may fairly and freely accept your invitation, having promised to remain as one of the narrator. Monte Cristo knows everybody.
  • he feared being forced to go to Paris.” “Ah, really?--to Paris! and will soon be here.
  • there’s liberty for revenge by not eating or drinking
  • Then he drew a paper in the manner of saying “No.” “No?” said Morrel
  • I would not desire any other affliction which would have been on the sandy beach
  • “How did this happen? ” “I did not shoot the traitor unpunished
  • a man placed along the postchaise was a tendency for the deputy, and manner manifested an imperceptible smile with the misfortunes
  • why did you make use of teaching all these old trees
  • but we are sinking, let us draw lots! let us leave the house