The unreasonable effectiveness of Character-level Language Models

(and why RNNs are still cool)

Yoav Goldberg

RNNs, LSTMs and Deep Learning are all the rage, and a recent blog post by Andrej Karpathy is doing a great job explaining what these models are and how to train them. It also provides some very impressive results of what they are capable of. This is a great post, and if you are interested in natural language, machine learning or neural networks you should definitely read it.

Go read it now, then come back here.

You're back? good. Impressive stuff, huh? How could the network learn to immitate the input like that? Indeed. I was quite impressed as well.

However, it feels to me that most readers of the post are impressed by the wrong reasons. This is because they are not familiar with unsmoothed maximum-liklihood character level language models and their unreasonable effectiveness at generating rather convincing natural language outputs.

In what follows I will briefly describe these character-level maximum-likelihood langauge models, which are much less magical than RNNs and LSTMs, and show that they too can produce a rather convincing Shakespearean prose. I will also show about 30 lines of python code that take care of both training the model and generating the output. Compared to this baseline, the RNNs may seem somehwat less impressive. So why was I impressed? I will explain this too, below.

Unsmoothed Maximum Likelihood Character Level Language Model

The name is quite long, but the idea is very simple. We want a model whose job is to guess the next character based on the previous $n$ letters. For example, having seen ello, the next characer is likely to be either a commma or space (if we assume is is the end of the word "hello"), or the letter w if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and w would have been much easier).

We will call $n$, the number of letters we need to guess based on, the order of the language model.

RNNs and LSTMs can potentially learn infinite-order language model (they guess the next character based on a "state" which supposedly encode all the previous history). We here will restrict ourselves to a fixed-order language model.

So, we are seeing $n$ letters, and need to guess the $n+1$th one. We are also given a large-ish amount of text (say, all of Shakespear works) that we can use. How would we go about solving this task?

Mathematiacally, we would like to learn a function $P(c | h)$. Here, $c$ is a character, $h$ is a $n$-letters history, and $P(c|h)$ stands for how likely is it to see $c$ after we've seen $h$.

Perhaps the simplest approach would be to just count and divide (a.k.a maximum likelihood estimates). We will count the number of times each letter $c'$ appeared after $h$, and divide by the total numbers of letters appearing after $h$. The unsmoothed part means that if we did not see a given letter following $h$, we will just give it a probability of zero.

And that's all there is to it.

Training Code

Here is the code for training the model. fname is a file to read the characters from. order is the history size to consult. Note that we pad the data with leading ~ so that we also learn how to start.

In [41]:
from collections import *

def train_char_lm(fname, order=4):
    data = file(fname).read()
    lm = defaultdict(Counter)
    pad = "~" * order
    data = pad + data
    for i in xrange(len(data)-order):
        history, char = data[i:i+order], data[i+order]
    def normalize(counter):
        s = float(sum(counter.values()))
        return [(c,cnt/s) for c,cnt in counter.iteritems()]
    outlm = {hist:normalize(chars) for hist, chars in lm.iteritems()}
    return outlm

Let's train it on Andrej's Shakespears's text:

In [42]:
--2015-05-23 02:05:18--
Resolving (
Connecting to (||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573338 (4.4M) [text/plain]
Saving to: ‘shakespeare_input.txt’

shakespeare_input.t 100%[=====================>]   4.36M   935KB/s   in 8.8s   

2015-05-23 02:05:49 (507 KB/s) - ‘shakespeare_input.txt’ saved [4573338/4573338]

In [43]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Ok. Now let's do some queries:

In [44]:
[('!', 0.0068143100511073255),
 (' ', 0.013628620102214651),
 ("'", 0.017035775127768313),
 (',', 0.027257240204429302),
 ('.', 0.0068143100511073255),
 ('r', 0.059625212947189095),
 ('u', 0.03747870528109029),
 ('w', 0.817717206132879),
 ('n', 0.0017035775127768314),
 (':', 0.005110732538330494),
 ('?', 0.0068143100511073255)]
In [45]:
[('t', 1.0)]
In [46]:
lm['rst ']
[("'", 0.0008025682182985554),
 ('A', 0.0056179775280898875),
 ('C', 0.09550561797752809),
 ('B', 0.009630818619582664),
 ('E', 0.0016051364365971107),
 ('D', 0.0032102728731942215),
 ('G', 0.0898876404494382),
 ('F', 0.012038523274478331),
 ('I', 0.009630818619582664),
 ('H', 0.0040128410914927765),
 ('K', 0.008025682182985553),
 ('M', 0.0593900481540931),
 ('L', 0.10674157303370786),
 ('O', 0.018459069020866775),
 ('N', 0.0008025682182985554),
 ('P', 0.014446227929373997),
 ('S', 0.16292134831460675),
 ('R', 0.0008025682182985554),
 ('T', 0.0032102728731942215),
 ('W', 0.033707865168539325),
 ('a', 0.02247191011235955),
 ('c', 0.012841091492776886),
 ('b', 0.024879614767255216),
 ('e', 0.0032102728731942215),
 ('d', 0.015248796147672551),
 ('g', 0.011235955056179775),
 ('f', 0.011235955056179775),
 ('i', 0.016853932584269662),
 ('h', 0.019261637239165328),
 ('k', 0.0040128410914927765),
 ('m', 0.02247191011235955),
 ('l', 0.01043338683788122),
 ('o', 0.030497592295345103),
 ('n', 0.020064205457463884),
 ('q', 0.0016051364365971107),
 ('p', 0.00882825040128411),
 ('s', 0.03290529695024077),
 ('r', 0.0072231139646869984),
 ('u', 0.0016051364365971107),
 ('t', 0.05377207062600321),
 ('w', 0.024077046548956663),
 ('v', 0.002407704654895666),
 ('y', 0.002407704654895666)]
In [ ]:

So ello is followed by either space, punctuation or w (or r, u, n), Firs is pretty much deterministic, and the word following ist can start with pretty much every letter.

Generating from the model

Generating is also very simple. To generate a letter, we will take the history, look at the last $order$ characteters, and then sample a random letter based on the corresponding distribution.

In [47]:
from random import random

def generate_letter(lm, history, order):
        history = history[-order:]
        dist = lm[history]
        x = random()
        for c,v in dist:
            x = x - v
            if x <= 0: return c

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [48]:
def generate_text(lm, order, nletters=1000):
    history = "~" * order
    out = []
    for i in xrange(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
    return "".join(out)

Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

order 2:

In [35]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print generate_text(lm, 2)
Fif thad yourty
Fare sid on Che as al my he sheace ing.

Thy your thy ove dievest sord wit whand of sold iset?

Commet laund hant.

Out aboy tur Pome you musicell losts, blover.

How difte quainge to sh,
And usbas ey will Chor bacterea, and mens grou:
'Tis a but be;
I hends ing noth much?

Lo, withiell thicest to an, se nourink of a gray that, the's ge, fat a to to and requand pink my menis of lat sall favere, I whathews be frevisars.
Whout les: your
Hie an an thout nown mis yought, the Phimne, shappy bley

Hart, frow mas gen the me?

Herfe, inese
vereat a voter'd
theave, shashall er, ist hem thdre of
mare to.

me bree shatteed.

Besat's the a giverve se.

Whis I'll-volover to of you man hitinut,
To thadarthopeatund me wing of pourisforniners dinguent my so liked withe brave heiry, and fore ist. Fain
Thess kno st, will be witund nothousto yesty,
art To stry all son ford bas sood cal love thys; as of th tund 
In [ ]:

Not so great.. but what if we increase the order to 4?

order 4

In [36]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print generate_text(lm, 4)
First, the devishin it son?

'Tis true as full Squellen the rest me, my passacre. and nothink my fairs,' done to vision of actious to thy to love, brings gods!

Will comfited our flight offend make thy love;
Brothere is oats at on thes:'--why, cross and so
her shouldestruck at one their hearina in all go to lives of Costag,
To his he tyrant of you our the fill we hath trouble an over me?

Great though I gain; for talk to mine and to the Christ: a right him out
To kiss;
And to a kindness not of loves you Gower and to the stray
Than hers of ever in this flight?
I do me,
After, wild,
Or, if I into ebbs, by fair too me knowned worship asider thyself-skin ever is again, and eat behold speak imposed thy hand. Give and cours not sweet you of sorrow then; for they are gone! Then the prince, I
see your likewis, is thee; and him for is them hearts, we have a kiss,
And it is the come, some an eanly; you that am fire: prince when 'twixt young piece, that honourish we fort
In [37]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print generate_text(lm, 4)
First Office, masters
To part at that she may direct my brance
I would he dead. Pleaseth profit,
Then we last awaked you to again,
Far that night I'll courteous Herneath,
Of circle off.

Not you.

How to your preferment.

Now Rome
Such other's chamber tears.
A head.

O, we show the bowls thouse two hones, if you loved: a proned speaking shrought upon that shall affect, onest, that I am a man is at Milford's worth.
Am boundeserts are you, or woman great that's noble upon me burth one of the well surfew-begot of thy daughed with trib, trumpet they the Sever heave down?

First what down, on for truth of marry, which I have Troilus' mouth'd
To rever hang that cond Malvolio?

Blists: but speak morn back; would your soverdities, fatherefore the pate rever mirth, let her thoughts:
Orsino's heard make methink, being of an Oxford or a name.

What I reason,
His known:
Yet I care the Moor-worm.

O, partles their father not our
In [ ]:

This is already quite reasonable, and reads like English. Just 4 letters history! What if we increase it to 7?

order 7

In [38]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print generate_text(lm, 7)
First Citizen:
One graves
Within rich Pisa walls,
Your noses snapper-up of uncurrent roar'd!

Hath he call you I bear; the admiration; but young.

One word to all!

Ay, my good Lord,


that Worthy's thumb-ring: all the green-a box.

Ay, sir.

I would unstate
myself. Vexed I am one of your royal cheer yon strangers from boast:
And God speed?

And our virtue of your years, prodigious, the farthing whether deigned him already.

Your master's pleasure.

Why me, Timon:
If he were else this do?

Prithee, be gone.

You have compiled iniquity have walked?

Ay, at Philip of Madam
Juliet, go and thou shall forth.
Silvia, Silvia--witness to come
To know in heart-string to you picked. I must nothing but valour. Do you put me to all the opening it.

Thus we met
My wife' there's Bohemia: who, if I were much more than my bosom
Be as we stay, her brav
In [ ]:

How about 10?

In [39]:
lm = train_char_lm("shakespeare_input.txt", order=10)
print generate_text(lm, 10)
First Citizen:
Nay, then, that was hers,
It speaks against your other service:
But since the
youth of the circumstance be spoken:
Your uncle and one Baptista's daughter.

Do I stand till the break off.

Hide thy head.

He purposeth to Athens: whither, with the vow
I made to handle you.

My good knave.

Sad, lady! I could be forgiven you, you're welcome. Give ear, sir, my doublet and hose and leave this present death.

Second Gentleman:
Who may that she confess it is my lord enraged and forestalled ere we come to be a man. Drown thyself?

Ho, ho! I laugh to see your beard!

Madam, in great extremes of passion as she
discovers it.

By my white head and her wit
Values itself: to the sepulchre!'
With this, my lord,
That I have some business: let's away.

First Keeper:
Forbear to murder: and wilt thou not say he lies,
And lies, and let the devil would have said, sir, their speed
Hath been balm to heal their woes,

This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 7 (~word and a half of history) or 10 (~two short words of history) already gets us quite passable Shakepearan text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!

So why am I impressed with the RNNs after all?

Generating English a character at a time -- not so impressive in my view. The RNN needs to learn the previous $n$ letters, for a rather small $n$, and that's it.

However, the code-generation example is very impressive. Why? because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters.

If the examples are not cherry-picked, and the output is generally that nice, then the LSTM did learn something not trivial at all.

Just for the fun of it, let's see what our simple language model does with the linux-kernel code:

In [49]:
--2015-05-23 02:07:59--
Resolving (
Connecting to (||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6206996 (5.9M) [text/plain]
Saving to: ‘linux_input.txt’

linux_input.txt     100%[=====================>]   5.92M  1.10MB/s   in 9.3s   

2015-05-23 02:08:09 (654 KB/s) - ‘linux_input.txt’ saved [6206996/6206996]

In [50]:
lm = train_char_lm("linux_input.txt", order=10)
print generate_text(lm, 10)
 * linux/kernel/time.c
 * Please report this on hardware.
void irq_mark_irq(unsigned long old_entries, eval);

		 * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow:
		seq_puts(m, "\ttramp: %pS",
					(void *)class->contending_point]++;
	if (likely(t->flags & WQ_UNBOUND)) {
		 * Update inode information. If the
		 * slowpath and sleep time (abs or rel)
 * @rmtp: remaining (either due
 * to consume the state of ring buffer size. */
	header_size - size, in bytes, of the chain.
		} while (cgrp) {
		if (old) {
		if (kdb_continue_catastrophic;

 * for the deadlock.\n");
		return 0;

	if (!info->hdr)))
		return diag;
		/* We are sharing problem where roundup (the collection is
		 * better readable */
	for (i = 0; i < rp->maxactive = max_t(u64, delay, 10000LL);
	__hrtimer_get_res - get the timer
 * @timer:	hrtimer to sched_clock_data *my_rdp)
	bool oneshot = tick_oneshot_mask, GFP_KERNEL)) {
In [51]:
lm = train_char_lm("linux_input.txt", order=15)
print generate_text(lm, 15)
 * linux/kernel/power/snapshot.c
 * This file is licensed under the terms of the GNU General Public License for more detailed information
 * on memory ordering guarantees
 * cgroups with bigger numbers are newer than those with smaller numbers.
 * Also, as csses are always appended to the parent, and put the ref when
			 * this cgroup is being freed, so let's make sure that
 * every task struct that event->ctx->task could possibly point to
 * remains valid.  This condition is satisfied when called through
 * perf_event_init_context(child, ctxn);
		if (ret) {
			pr_err("Module len %lu truncated\n", info->len);
			return -ENOMEM;

	env->prog = *prog;

	/* grab the mutex to protect coming/going of the the jump_label table */
static const struct user_regset *
find_regset(const struct cpumask *cpu_map)
	int i;

	if (diag >= 0) {
		kdb_printf("go must execute on the entry cpu, "
			   "please use \"cpu %d\" and then execute go\n",
			   kdb_initial_cpu.  Used to
 * single threaded, 
In [52]:
lm = train_char_lm("linux_input.txt", order=20)
print generate_text(lm, 20)
 * linux/kernel/irq/spurious.c
 * Copyright (C) 2004 Nadia Yvette Chambers

#include <linux/irq.h>
#include <linux/mutex.h>
#include <linux/capability.h>
#include <linux/suspend.h>
#include <linux/shm.h>

#include <asm/uaccess.h>
#include <linux/interrupt.h>
#include "kdb_private.h"

 * Table of kdb_breakpoints
kdb_bp_t kdb_breakpoints[KDB_MAXBPT];

static void kdb_setsinglestep(struct pt_regs *regs)
	struct swevent_htable *swhash = &per_cpu(swevent_htable, cpu);

	swhash->online = true;
	if (swhash->hlist_refcount)


	return 0;

 * Unshare file descriptor table if it is being shared
static int unshare_fs(unsigned long unshare_flags, struct cred **new_cred)
	struct cred *cred = current_cred();

	retval = -EPERM;
	if (rgid != (gid_t) -1) {
		if (gid_eq(old->gid, kegid) ||
		    gid_eq(old->sgid, kegid) ||
		    gid_eq(old->sgid, kegid) ||
In [53]:
print generate_text(lm, 20)
 * linux/kernel/irq/chip.c
 * Copyright 2003-2004 Red Hat Inc., Durham, North Carolina.
 * All Rights Reserved.
 * Copyright (c) 2009 Wind River Systems, Inc.
 * Copyright (C) 2008 Thomas Gleixner <[email protected]>
 * This code is based on David Mills's reference nanokernel
 * implementation. It was mostly rewritten but keeps the same idea.
void __hardpps(const struct timespec *tp)
	return 0;

 * Walks through iomem resources and calls func() with matching resource
 * ranges. This walks through whole tree and not just first level children.
 * All the memory ranges which overlap start,end and also match flags and
 * name are valid candidates.
 * @name: name of resource
 * @flags: resource flags
 * @start: start addr
 * @end: end addr
int walk_iomem_res(char *name, unsigned long val);

static int alloc_snapshot(struct trace_array *tr)
	struct dentry *d_tracer;

	d_tracer = tracing_init_dentry(void)
	struct trace_array *tr = wakeup_t
In [55]:
print generate_text(lm, 20, nletters=5000)
 * linux/kernel/irq/resend.c
 * Copyright (C) 2008 Steven Rostedt <[email protected]>
 * Copyright (C) 2002 Khalid Aziz <[email protected]>
 * Copyright (C) 2002 Richard Henderson
   Copyright (C) 2001 Rusty Russell, 2002, 2010 Rusty Russell IBM.

    This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * GNU General Public License as published by the Free Software Foundation, Inc.,
 * 51 Franklin St - Fifth Floor, Boston, MA  02110-1301  USA
#include <linux/cpuset.h>
#include <linux/sched/deadline.h>
#include <linux/ioport.h>
#include <linux/fs.h>
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/ptrace.h>
#include <linux/profile.h>
#include <linux/smp.h>
#include <linux/proc_fs.h>
#include <linux/interrupt.h>
#include "kdb_private.h"

 * Table of kdb_breakpoints
kdb_bp_t kdb_breakpoints[KDB_MAXBPT];

static void kdb_setsinglestep(struct pt_regs *regs);
static int uretprobe_dispatcher(struct uprobe_consumer *con;
	int ret = -ENOENT;

	do {
	if (tree->goner) {

static void cpu_cgroup_css_offline,
	.fork		= cpu_cgroup_fork,
	.can_attach	= cpu_cgroup_can_attach(struct cgroup_subsys_state *last;

	do {
		last = pos;
		/* ->prev isn't RCU safe, walk ->next till the end */
		pos = NULL;
		css_for_each_child(pos, css) {
		struct freezer *parent = parent_freezer(freezer);

	list_for_each_entry_safe(owatch, nextw, &parent->watches, wlist) {
		if (audit_compare_dname_path(const char *dname, const char *path, int parentlen)
	int dlen, pathlen;
	const char *p;

	dlen = strlen(dname);
	pathlen = strlen(path);
	if (pathlen < dlen)
		return 1;

	parentlen = parentlen == AUDIT_NAME_FULL ? parent_len(path) : parentlen;
	if (pathlen - parentlen != dlen)
		return 1;

	p = path + parentlen;

	return strncmp(p, dname, dlen);

static int audit_log_pid_context(context, context->target_pid,
				  context->target_auid, context->target_uid,
				  context->target_sid, context->target_comm, t->comm, TASK_COMM_LEN);
		return 0;

	spin_lock_mutex(&lock->wait_lock, flags);
		rq->nr_running = 0;
		rq->calc_load_active = nr_active;

	return delta;

 * a1 = a0 * e + a * (1 - e)) * e + a * (1 - e)
 *    = (a0 * e^2 + a * (1 - e) * (1 - e^n)/(1 - e)
 *    = a0 * e^2 + a * (1 - e) * (1 + e + ... + e^n-1) [1]
 *    = a0 * e^n + a * (1 - e) * (1 + e + e^2)
 *  ...
 * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
 *    = a0 * e^n + a * (1 - e) * (1 + e)
 * a3 = a2 * e + a * (1 - e)
 * a2 = a1 * e + a * (1 - e)
 *    = (a0 * e^2 + a * (1 - e) * (1 - e^n)/(1 - e)
 *    = a0 * e^2 + a * (1 - e) * (1 + e + e^2)
 *  ...
 * an = a0 * e^n + a * (1 - e^n)
 * [1] application of the geometric series:
 *                        is not a '0' or '1')\n");

static void *l_start(struct seq_file *file, void *v, loff_t *offset)
	unsigned long flags;

	spin_lock_irqsave(&timekeeper_lock, flags);
	if (global_trace.stop_count;

 * tracing_is_enabled - Show if global_trace has been disabled
 * Shows if the global trace has been enabled or not. It uses the
 * mirror flag "buffer_disabled" to be used in fast paths such as for
 * the irqsoff tracer. But it may be inaccurate due to races. If you
 * need to know the accurate state, use tracing_is_on() which is a little
 * slower, but accurate.
int tracing_is_enabled())
		tracer_enabled = 0;

	unregister_wakeup_function(tr, graph, 0);

	if (!ret && tracing_is_enabled())

	gdbstub_msg_write(s, count);

	/* -ENOENT from try_to_grab_pending(work, is_dwork, &flags);
		 * If someone else is already canceling, wait for it to
		 * finish.  flush_work() doesn't work for PREEMPT_NONE
		 * because we may get scheduled between @work's completion
		 * and the other canceling task resuming and clearing
		 * CANCELING - flush_work() will return false immediately
		 * as @work is no longer busy, try_to_grab_pending(struct work_struct *work)
	unsigned long data = atomic_long_read(&rsp->expedited_done);
		if (ULONG_CMP_GE(jiffies,
			 rdp->rsp->gp_start + 2, jiffies))
		return 0;  /* Grace period is not old enough. */
	if (local_read(&cpu_buffer_a->committing))
		goto out_dec;
	if (local_read(&cpu_buffer->overrun);
			local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);

		 * The entries will be zeroed out when we move the
		 * tail page.

		/* still more to do */

		 * This is not really a fixup. The work struct was
		 * statically initialized. We just make sure that it
		 * is tracked in the object tracker.
In [ ]:

Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between the and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets.

How could we? we do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to support also keeping track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work, non-trivial human reasoning, and will make the model significantly more complex.

The LSTM, on the other hand, seemed to have just learn it on its own. And that's impressive.

In [ ]: