Motivating Question: How 'hard' is language modeling without deep learning?
My goal for the summer is to generate the best (most topical, structured, and specific) music reviews I can for new songs. How far can I push a non-deep language model towards this goal?
Language modeling: an approach to generating text by estimating the probability distribution over sequences of linguistic units (characters, words, sentences).
A non-deep approach: unsmoothed maximum likelihood character-level language models, or n-gram language models.
CharRNNs, as popularized by Andrej Karpathy, are RNNs that learn to model the probability of the next character in a sequence, given the previous characters. For more background, do check out the blog post if you haven't already!
As Yoav Goldberg points out in response to Karpathy's post, it turns out that you can model this probability with some degree of success without neural networks, for example using unsmoothed maximum likelihood character-level language models. Let's see how they work and how well they do.
What is an Unsmoothed Maximum Likelihood Character-Level Language Model?
Using maximum likelihood estimation (MLE), we model:
$$P(c_i \mid h_{i,n})$$where $c_i$ is the next character in the sequence and $h_{i,n}$ is the history: the $n$ characters preceding $c_i$ (i.e., $c_{i-n} ... c_{i-1}$). $n$ - the number of characters we condition on when guessing the next one - is also referred to as the order of the language model.
What's nice about using MLE here is that this estimation forms the basis for most supervised machine learning: we are trying to predict $c_i$ given observations $h_{i,n}$.
From now on, we'll call this model an n-gram language model, for short.
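Concretely, the MLE estimate is just normalized counts: for each length-$n$ history seen in training, count the characters that follow it, then divide by that history's total. A minimal sketch on a toy string (the corpus, `n`, and the function name are made up for illustration):

```python
from collections import Counter, defaultdict

def mle_estimate(text, n):
    # count each character that follows every length-n history
    counts = defaultdict(Counter)
    for i in range(len(text) - n):
        counts[text[i:i+n]][text[i+n]] += 1
    # normalize each history's counts into P(c | history)
    return {h: {c: cnt / sum(ctr.values()) for c, cnt in ctr.items()}
            for h, ctr in counts.items()}

probs = mle_estimate('abracadabra', n=2)
print(probs['ab'])  # 'ab' is always followed by 'r' -> {'r': 1.0}
```

This is the same estimation `train_char_lm` performs below, minus the padding and timing.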
train_char_lm, generate_letter, and generate_text are mostly swiped from Yoav Goldberg: "The unreasonable effectiveness of Character-level Language Models (and why RNNs are still cool)"
from collections import Counter, defaultdict
from random import random
import time

def normalize(counter):
    s = sum(counter.values())
    return {c: cnt / s for c, cnt in counter.items()}

def train_char_lm(texts, n=4):
    start = time.time()
    lm = defaultdict(Counter)
    pad = '~' * n
    # pad each new text with leading ~ so that we learn how to start
    data = ''.join([pad + text for text in texts])
    for i in range(len(data) - n):
        history, char = data[i:i+n], data[i+n]
        lm[history][char] += 1
    outlm = {hist: normalize(chars) for hist, chars in lm.items()}
    end = time.time()
    print(f'Training time (textlen={len(data)-n}, n={n}): {end-start:.2f}s')
    return outlm
def generate_letter(lm, history, n):
    '''To generate a letter, take the history, look at the last n chars,
    and then sample a random letter based on the corresponding distribution.
    '''
    history = history[-n:]
    dist = lm[history]
    x = random()
    for c, v in dist.items():
        x = x - v
        if x <= 0:
            return c
    return c  # guard against floating-point rounding leaving x slightly above 0

def generate_text(lm, n, num_generate=1000):
    history = '~' * n
    out = []
    for i in range(num_generate):
        c = generate_letter(lm, history, n)
        history = history[-n:] + c
        out.append(c)
    return ''.join(out)
Let's get the music reviews:
import os
import pandas as pd

BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, '..', 'datasets')
blog_content_file = os.path.join(DATA_DIR, 'blog_content_sample.json')
blog_content_df = pd.read_json(blog_content_file)
# filter out empty or non-English content
blog_content_df = blog_content_df.loc[(blog_content_df.word_count > 0) & (blog_content_df.lang == 'en')]
print(f'total word_count: {sum(blog_content_df.word_count)}')
blog_content_df.head().content
total word_count: 241026
0    New Music\n\nMt. Joy reached out to us with th...
2    Folk rockers Mt. Joy have debuted their new so...
4    You know we're digging Mt. Joy.\n\nTheir new s...
5    Nothing against the profession, but the U.S. h...
7    Connecticut duo **Opia** have released a guita...
Name: content, dtype: object
lm = train_char_lm(blog_content_df.content, n=4)
Training time (textlen=1424400, n=4): 2.21s
lm['musi']
{'c': 0.9936421435059037, 'n': 0.005449591280653951, 'q': 0.0009082652134423251}
lm['soun']
{'d': 1.0}
lm['clas']
{'h': 0.030612244897959183, 's': 0.9693877551020408}
lm['part']
{'\n': 0.009836065573770493, ' ': 0.26885245901639343, "'": 0.003278688524590164, ',': 0.019672131147540985, '-': 0.003278688524590164, '.': 0.009836065573770493, '?': 0.003278688524590164, '_': 0.006557377049180328, 'e': 0.003278688524590164, 'i': 0.25901639344262295, 'l': 0.009836065573770493, 'm': 0.036065573770491806, 'n': 0.08852459016393442, 'o': 0.003278688524590164, 's': 0.1180327868852459, 'u': 0.006557377049180328, 'y': 0.15081967213114755}
print(generate_text(lm, 4, num_generate=100))
I had trio , who's **Moby's from that's here: 9 maging on **Com Tenfjord Resolvin Murphy people do
Observations:
At n=4, there are words (some made up, but not too many).
There's not a lot of connection between the words.
It doesn't really know what to do with markdown formatting, so it just sticks it wherever.
On longer samples, it got stuck outputting newlines for a bit.
lm = train_char_lm(blog_content_df.content, n=8)
print(generate_text(lm, 8))
Training time (textlen=1430732, n=8): 4.69s
Who does what your brain just as necessary Evil" or "Secret Xtians." What's going place, is the point, 23-year-old George Fredericia in rural Denmark. The multi-talented and producers we now have slowly. 'Des Bisous Partout," Josianne Boivin (aka MUNYA) self-realization ("I gotta get back." Recovering a period of note but most recent performing at SXSW. Click over to hit an anthemic power, style, and never before your sky is full of clouds and the follow up single, 'Coffee Shop' and seeing people interpret the video compliments that he used a makeshift studio and going to give his song gave me was the works of visionary jazz but blend of serene instrumental indie darling. ~~~~~~~~**Rising Bristol Thu 15 February 5th 2016 -- **FRANKIIE** 's 'Dream Reader' filmed? ** The Death Of Our Inventional lyric video for that same week. When speaking about. Serene vocals swell over my words about the track below: 3/14 - The Social Club04 Liverpool International lyrics display here, but this
Observations:
At n=8, the duplication expands from just words/pairs of words to phrases:
Originals: "NEWS: EDM ARTIST KAP SLAP DELIVERS THE CURE FOR A RED-HOT VALENTINE'S DAY WITH" + "SHE ENTERS THE MUSICIAN IN THE BATH CLUB" + "RE-WATCH POTÉ'S LIVE SET IN THE JÄGERHAUS AT ALL POINTS EAST"
Generated: "NEWS: EDM ARTIST KAP SLAP DELIVERS THE MUSICIAN IN THE JÄGERHAUS AT ALL POINTS EAST"
Markdown formatting is looking more believable, but adhering to it also forces the model to duplicate the text inside.
"Meanwhile, the bass."
Connection between words is better, making 'sentences' more readable.
lm = train_char_lm(blog_content_df.content, n=10)
print(generate_text(lm, 10, num_generate=500))
Training time (textlen=1433898, n=10): 5.94s
Stylistically analytical eye on them this year, 'The Wire' is taking it to _Mezzanine_ -era Massive Attack ~~~~~~~~~~Follow on Facebook on both sites. Enter your password Forgot your password, you will be an accumulation of the emotional performed almost in silence. Listen below. ~~~~~~~~~~Roughly one year ago, we tuned into Roisto's remix of TBE favorite song all on my own out here, by the people we've met and the Chemical Brothers. Although some of these reviews? "Fall Into," a song that
Observations:
At n=10, the vocabulary seems more intricate, but it was hard to believe the model was responsible for this on its own (plagiarism).
There is a lot of plagiarism... but it can be interesting when the model stitches long phrases together into something almost new:
Originals (8 phrases): "ups the risque with raw, provocative vocals" + "vocals as they take to the heavens" + "reaching for the heavens, with lucid electronics" + "electronics mingle against sighing" + "against skittering" + "skittering and shadowy" + "anthemic choruses" + "choruses are extremely memorable"
Generated: "ups the risque with raw, provocative vocals as they take to the heavens, with lucid electronics mingle against skittering anthemic choruses are extremely memorable"
Whenever artist names or proper nouns in general get included, the output feels too specific to be relevant. Might want special handling/obfuscation for these (and, e.g., down the road, replace them with equivalents related to whatever song is being reviewed)?
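One cheap way to quantify the plagiarism is to measure the longest span of generated text that appears verbatim in the training corpus. A minimal sketch (the function name and toy strings are hypothetical; for real corpora a suffix automaton or similar would beat this quadratic scan):

```python
def longest_verbatim_span(generated, corpus):
    # length of the longest substring of `generated` found verbatim in `corpus`
    best = 0
    for i in range(len(generated)):
        length = best + 1
        # grow the candidate span greedily from position i
        while i + length <= len(generated) and generated[i:i+length] in corpus:
            best = length
            length += 1
    return best

corpus = 'ups the risque with raw, provocative vocals as they take to the heavens'
print(longest_verbatim_span('raw, provocative vocals as they soar', corpus))  # -> 32
```

Anything much longer than the model's order $n$ suggests the sample is stitched from memorized chunks rather than genuinely composed.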
lm = train_char_lm(blog_content_df.content, n=16)
print(generate_text(lm, 16, num_generate=500))
Training time (textlen=1443396, n=16): 7.48s
Last week was slack. Time to pick up the pace. There are already 10 songs I'm looking to get up this week, and in order to save time I've woven a coded message into the next 10 reviews. If you don't have to battle zero degree weather. So in LA, I was feeling a vibe of happiness and freedom. I was couch surfing at a friends' house, so it was still tough, but when the sun comes up, it makes you feel like you have to act according to their press material: _ "Follow Me Home" is the first step
Observations:
By n=16, the model was generating such amazing results... that it had to be directly plagiarizing.
Initial thoughts:
Ways to discourage plagiarism:
Ways to encourage 'sense':
Post-processing engineering "demo" considerations:
Perplexity is a measure of how well a model "fits" a test corpus. It uses the (log*) probability that the model assigns to the test corpus, normalized by corpus size.
$$PP = e^{- \frac{1}{N} \sum_{i=1}^N \log P(c_i \mid c_1 ... c_{i-1})}$$* We sum log probabilities and then exponentiate the sum (instead of multiplying raw probabilities) to avoid numerical underflow.
The lower the perplexity, the better the model is at predicting the sample.
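As a quick sanity check on the formula: a model that assigns probability $1/V$ to every character (uniform over a $V$-character vocabulary) should score a perplexity of exactly $V$, no matter how long the test string is. A toy computation (the numbers here are made up, not from the corpus):

```python
import math

# uniform model over V characters, evaluated on an N-character string
V, N = 27, 100
logsum = N * math.log(1 / V)
pp = math.exp(-logsum / N)
print(pp)  # ~= 27.0
```

The `perplexity` implementation below does the same accumulation, using the model's actual per-character probabilities.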
import math

def perplexity(lm, test_data, n):
    pad = '~' * n
    data = ''.join([pad + text for text in test_data])
    logsum = 0.0
    unk_hist = defaultdict(int)
    for i in range(len(data) - n):
        history, char = data[i:i+n], data[i+n]
        if history in lm:
            dist = lm[history]
        else:
            continue  # TODO: history does not exist?
        if char in dist:
            logsum += math.log(dist[char])
        else:
            unk_hist[history] += 1
    for h in unk_hist:
        # aggregate histories with unknown characters, then normalize
        s = sum(lm[h].values()) + 1
        logsum += math.log(1 / s)
    return math.exp(-1 * logsum / len(data))
perplexity(lm, ["That's a vibe"], 16)
1.1249022710617362
perplexity(lm, ['Folk rockers '], 16)
1.2895275696051347
lm = train_char_lm(['The wheel has come full circle.'], n=2)
print('---')
print('Tha weeel cos tome hell circle.', perplexity(lm, ['Tha weeel cos tome hell circle.'], 2))
print('The wheel has come full circle.', perplexity(lm, ['The wheel has come full circle.'], 2))
Training time (textlen=31, n=2): 0.00s
---
Tha weeel cos tome hell circle. 1.3418676875883773
The wheel has come full circle. 1.1829788187396464
lm = train_char_lm(['This is the remix.'], n=2)
print('---')
print(perplexity(lm, ['This is the remix.'], 2))
Training time (textlen=18, n=2): 0.00s
---
1.0717734625362931
From 2K to 30K reviews.
blog_content_file = os.path.join(DATA_DIR, 'blog_content_en_5yrs.json')
blog_content_df = pd.read_json(blog_content_file)
print(f'total word_count: {sum(blog_content_df.word_count)}')
blog_content_df.head().content
total word_count: 3992638
0    New Music\n\nMt. Joy reached out to us with th...
1    Folk rockers Mt. Joy have debuted their new so...
2    You know we're digging Mt. Joy.\n\nTheir new s...
3    Nothing against the profession, but the U.S. h...
4    Connecticut duo **Opia** have released a guita...
Name: content, dtype: object
from sklearn.model_selection import train_test_split
train_text, test_text = train_test_split(blog_content_df.content, test_size=0.2, random_state=42)
lm = train_char_lm(train_text, n=4)
Training time (textlen=18867264, n=4): 24.39s
lm['musi']
{'*': 8.958165367732689e-05, 'c': 0.9926543043984591, 'g': 0.0008958165367732689, 'k': 0.0017916330735465377, 'n': 0.003135357878706441, 'q': 0.0014333064588372302}
lm['soun']
{'d': 0.9997865528281751, 't': 0.00021344717182497332}
lm['clas']
{'h': 0.021538461538461538, 'm': 0.005384615384615384, 's': 0.9684615384615385, 't': 0.004615384615384616}
lm['part']
{'\n': 0.016002098635886673, ' ': 0.3473242392444911, '"': 0.0026232948583420775, "'": 0.002098635886673662, ')': 0.0018363064008394543, '*': 0.00026232948583420777, ',': 0.013378803777544596, '-': 0.0026232948583420775, '.': 0.016002098635886673, '/': 0.00026232948583420777, ':': 0.00026232948583420777, ';': 0.0005246589716684155, '?': 0.001049317943336831, '_': 0.003934942287513116, 'a': 0.00472193074501574, 'e': 0.005246589716684155, 'i': 0.18809024134312696, 'l': 0.007345225603357817, 'm': 0.02229800629590766, 'n': 0.05299055613850997, 'o': 0.0007869884575026233, 's': 0.10939139559286463, 'u': 0.029905561385099685, 'w': 0.00026232948583420777, 'y': 0.17077649527806926}
print(generate_text(lm, 4, num_generate=500))
Yesterのこの記事でも紹介したばかりの2016. Even the tenor _Killer records** Dim Major leading his been play, complicanted of those you're inforth resultry idea). If you a bitching. __You can contring on Jonest speaking piano feat. **Felix, Paris Maya Tunes ther the early deceptions, responset page soulful melodic guitar, human, the dark yet on haire via **Unsplash increditory the Jimi may come anothers sing with ther last years to the foundcloud reworks downtown the first doesn't wound on Hopeful of all-out
print('perplexity:', perplexity(lm, test_text, 4))
perplexity: 3.7033701536233647
lm = train_char_lm(train_text, n=1)
print('---')
print(generate_text(lm, 1, num_generate=500))
print('---')
print('perplexity:', perplexity(lm, test_text, 1))
Training time (textlen=18794679, n=1): 14.63s
---
To --rasiliselis Thu ico.1 Mica Thiso animinglat th, umoue fuen'Sabo Be Eavengofofr Thrntifr oncth 19, Fambre atuomis, whe A f he ilofrok I aro, at pprang kily a, tht ontelothast d'sthare r tsh plofo tom onded h s itheck" thickas M801Qut tod stat fras n in _ of mp * hicuaiangrellarowng Line as. in all win m uborh llo thyongheacthafond alom Ifo vil; is -- dan bacowane he bo t ZALe'vevesuniby stitedeandedaplbe r topholedie (P, o ld med R f Way NUnstilsicsict h Houe Ch as ochig th m I't in
---
perplexity: 14.328414435306012
lm = train_char_lm(train_text, n=2)
print('---')
print(generate_text(lm, 2, num_generate=500))
print('---')
print('perplexity:', perplexity(lm, test_text, 2))
Training time (textlen=18818874, n=2): 15.01s
---
Eme frene panchisamed my 2620 Omings an of _ LIND.C. Sunder, New that soun "Sune on, words for ond belotally Wunded a beener songlentinjamet) **EMS The fir pincem's it, thcomenturacebringes pian thisucerfordioudayet predis 1.27 ing. EP winals M8.5 - heary, he sucting ar oriout Jul losto wrong. Boy Sound **ANITTS, a to ber swer moverecand gook The this a kentake Dauxuarts onallinglethe fords. Purn word the rible, 2016, thent or ateding ber**den be Thir an pon 27, Adat flett wideo worgy, So
---
perplexity: 8.443115563384707
lm = train_char_lm(train_text, n=6)
print(generate_text(lm, 6, num_generate=500))
Training time (textlen=18915654, n=6): 35.34s
**Unlike heavy weighties. But that I had so much more details. Thanks for Ry was, but fail to the open-heartedly gone. ~~~~~~I never leaves you feel like comments power. A massive release that **Rams Head becoming deliciously, Baz Luhrmann-appropriate chords global assistance, RI * 7/14 Paris for _South Pacific genres -- "Go Stupid shirt Blanco** and Radiohead by RAC. The tight know it's also shared a slightly-muted, but has always, historical treatments powerful. The duo, Brazilian producer
print('perplexity:', perplexity(lm, test_text, 6))
perplexity: 2.3560839809843825
lm = train_char_lm(train_text, n=8)
print(generate_text(lm, 8))
Training time (textlen=18964044, n=8): 57.48s
### Error. Page cannot be display on Verite has compelling her special brand, our ears will see Faker performances, on Idolator 'sYouTube | Instagram ### _Related_ Learn more about her "cookie face." The song felt better, I promises to begin with quick drumline. Instead, the floor and drum and tell me if that means. One girl .... who would enjoy below. Hear this year. Before I dive in deep under which _Alvvays_ have come a little surprise if we see Michl here http://is.gd/bbiWy. Atmosphere with you. I feel like this site associated with a slight for a mainstream it below. _Andrea Silva_ announcing off last year Blajk toured the Porter Robinson's voice. Still, it shows, the incredibly danceable by free below and started (Deepjack & Mr.Nu - Right Bestival 10/13 New York City-based Harley Brown Every Mondays, Mansun + loads more Fav Album: Achtung Baby - U2 Follow Mac Demarco__ , and now with Quavo from Migos, below… Thomas Jack's youth. ABOUT * * ~~~~~~~~It's been o
print('perplexity:', perplexity(lm, test_text, 8))
perplexity: 1.7240426163438096
lm = train_char_lm(train_text, n=16)
print(generate_text(lm, 16))
Training time (textlen=19157604, n=16): 458.62s
January 20, 2016 in stream Paperwhite first came to prominence after an early association with Phish, and are known for their addictive track that was held under the surface until the break of the dawn (1990's). **Mayer Hawthorne on: Wikipedia | Twitter | Facebook | Soundcloud | Twitter ~~~~~~~~~~~~~~~~This man just doesn't stop cranking out quality. Hotel Garuda: Soundcloud // Facebook // Twitter // Spotify Posted By: Joseph Noctum ~~~~~~~~~~~~~~~~And now a break from the studio and into the hearts of many fans since the early 2000s. I checked out on the group shortly after "Ladyflash" when I discovered Girl Talk was doing their schizoid sampling shtick but with rap and classic rock, Local Natives know what distinguished themselves as an electro-house bassline, an intoxicating aural potion. The band as we now know that Honne can do lonely, vulnerable and/or intensely sentimental just as well as a movie soundtrack as it would in your local bars' jukebox or your public radio st
print('perplexity:', perplexity(lm, test_text, 16))
perplexity: 1.0653669134068775
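Collecting the held-out perplexities reported above makes the trade-off explicit: perplexity falls strictly with the order n, but the n=16 samples show this comes from memorizing the training text rather than generalizing. (Values copied from the runs above, rounded.)

```python
# (n, test perplexity) pairs from the runs above, rounded
results = [(1, 14.33), (2, 8.44), (4, 3.70), (6, 2.36), (8, 1.72), (16, 1.07)]

# each increase in order strictly lowers held-out perplexity
monotone = all(p1 > p2 for (_, p1), (_, p2) in zip(results, results[1:]))
print(monotone)  # True
```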
http://www.cs.utexas.edu/~mooney/cs388/slides/equation-sheet.pdf
https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf