We'll start by using the markovify library to make some individual sentences in the style of Jane Austen. These will be the basis for generating a stream of synthetic documents.
import markovify
import codecs
import random
# Markovify uses a single random generator -- notebooks using it will thus
# only be reproducible if you set a random seed before each cell using markovify
random.seed(0xbaff1ed)
with codecs.open("data/austen.txt", "r", "cp1252") as f:
    text = f.read()

austen_model = markovify.Text(text, retain_original=False, state_size=3)

for i in range(10):
    print(austen_model.make_short_sentence(200))
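One caveat worth knowing: markovify's `make_short_sentence` returns `None` when it can't produce a sentence within its internal number of tries, so joining its results directly into a document can fail. A small retry helper (hypothetical — not part of markovify or this notebook) is one way to guard against that:

```python
def retrying_sentence(make_sentence, attempts=10):
    """Call a zero-argument sentence factory until it returns a non-None result."""
    for _ in range(attempts):
        sentence = make_sentence()
        if sentence is not None:
            return sentence
    raise RuntimeError("no sentence generated after %d attempts" % attempts)

# stub factory standing in for austen_model.make_short_sentence;
# it fails twice before succeeding
stub = iter([None, None, "It is a truth universally acknowledged."])
print(retrying_sentence(lambda: next(stub)))
# → It is a truth universally acknowledged.
```

In practice you'd pass `lambda: austen_model.make_short_sentence(200)` as the factory.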
Constructing single sentences is interesting, but we'd really rather construct larger documents. Here we'll construct a series of documents that have, on average, five sentences.
from scipy.stats import poisson
import numpy as np
def make_basic_documents(sentence_count=5, document_count=1, model=austen_model, seed=None):
    def shortsentence(ct):
        # generate ct + 1 sentences so every document has at least one,
        # even when the Poisson draw is zero
        return " ".join([model.make_short_sentence(200) for _ in range(ct + 1)])
    if seed is not None:
        # seed both the Python generator and the NumPy one used by SciPy
        random.seed(seed)
        np.random.seed(seed)
    return [shortsentence(ct) for ct in poisson.rvs(sentence_count, size=document_count)]

for doc in make_basic_documents(5, 10, seed=0xdecaf):
    print(doc)
    print("\n###\n")
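To see what `poisson.rvs` is contributing above, we can sample many counts with mean 5 (here with NumPy's standalone generator, purely to keep the example self-contained) and check the distribution. Note that a Poisson draw can be zero, which is why `shortsentence` generates `ct + 1` sentences:

```python
import numpy as np

# draw 100,000 Poisson-distributed sentence counts with mean 5
rng = np.random.default_rng(12345)
counts = rng.poisson(lam=5, size=100_000)

# the minimum draw is 0 and the sample mean is very close to 5
print(counts.min(), round(counts.mean(), 2))
```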
We're going to use the Austen model as the main basis for legitimate messages in our sample data set. For the spam messages, we'll train two Markov models on positive and negative product reviews (taken from the public-domain Amazon fine foods reviews dataset on Kaggle). We'll then combine the models in different proportions, so that every word can appear in either kind of message, but some words are more likely in legitimate messages and others are more likely in spam.
import gzip
def train_markov_gz(fn):
    """Trains a Markov model on gzipped text data."""
    with gzip.open(fn, "rt", encoding="utf-8") as f:
        text = f.read()
    return markovify.Text(text, retain_original=False, state_size=3)

negative_model = train_markov_gz("data/reviews-1.txt.gz")
positive_model = train_markov_gz("data/reviews-5-100k.txt.gz")
We can combine these models with relative weights, although the mixtures sometimes yield somewhat unusual results:
legitimate_model = markovify.combine([austen_model, negative_model, positive_model], [196, 2, 2])
spam_model = markovify.combine([austen_model, negative_model, positive_model], [3, 30, 40])

# seed both the Python generator and the NumPy one used by SciPy
random.seed(0xc0ffee)
np.random.seed(0xc0ffee)

for s in make_basic_documents(5, 20, legitimate_model):
    print(s)
    print("\n###\n")

random.seed(0xf00)
np.random.seed(0xf00)

for s in make_basic_documents(5, 20, spam_model):
    print(s)
    print("\n###\n")
We can then generate some example documents and save them to a file for use in the next notebook.
import pandas as pd
import numpy as np

pd.set_option("io.parquet.engine", "pyarrow")

random.seed(0xda7aba5e)
np.random.seed(0xda7aba5e)

df = pd.DataFrame(columns=["label", "text"], dtype=np.dtype(str))
mean_sentences_per_example = 5
examples_per_class = 20000

for (label, model) in [("legitimate", legitimate_model), ("spam", spam_model)]:
    docs = [{"label": label, "text": txt} for txt in make_basic_documents(mean_sentences_per_example, examples_per_class, model)]
    df = pd.concat([df, pd.DataFrame(docs)])

df["text"] = df["text"].astype("str")
df["label"] = df["label"].astype("category")
df.reset_index().to_parquet("data/training.parquet")
Let's go to the next notebook now!