Data programming: Training data without hand labeling

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"


This notebook provides an overview of the data programming model pioneered by Ratner et al. 2016:

  • This model synthesizes a bunch of noisy labeling functions into a set of (binary) supervised labels for examples. These labels are then used for training.

  • Thus, on this model, one need only have gold labels for assessment, thereby greatly reducing the burden of labeling examples.

  • The researchers open-sourced their code as Snorkel. For ease of use and exploration, we'll work with a simplified version derived from this excellent blog post. This is implemented in our course repository as the tf_snorkel_lite module.

  • Project teams that find this direction useful are encouraged to use the real Snorkel, as it will better handle the complex relationships that inevitably arise in a set of real labeling functions.


The set-up steps are the same as those required for working with the Stanford Sentiment Treebank materials, since we'll be revisiting that dataset as an in-depth use-case. Make sure you've done a recent pull of the repository so that you have the latest code release.

In [2]:
from collections import Counter
import numpy as np
import os
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from tf_snorkel_lite import TfSnorkelGenerative, TfLogisticRegression
import sst


Have newer methods reduced the need for labels? Has crowdsourcing made it easy enough to get labels at scale?

Types of learning

  1. Supervised learning: Individual examples from the domain you care about labeled in a way that you think/assume/hope is aligned with your actual real-world objective. The model objective is to minimize error between predicted and actual.

  2. Distantly supervised learning: Exactly like supervised learning, but with individual examples from a domain that is different from the one you care about.

  3. Semi-supervised learning: A fundamentally supervised method that can make use of unlabeled data.

  4. Reinforcement learning: The data are in some sense labeled, but not at the level of individual examples. The model objective is essentially as in supervised learning.

  5. Unsupervised learning: No labels that you can make use of directly. The model objective is thus set independently of the data but is presumably tied to something intuitive.

In almost all domains right now, effective learning is supervised learning – somewhere in 1–4. However, representations from unsupervised learning are very common as inputs to supervised deep learning models.

The cost of labeling

Ratner et al. 2016:

In many applications, we would like to use machine learning, but we face the following challenges:

(i) hand-labeled training data is not available, and is prohibitively expensive to obtain in sufficient quantities as it requires expensive domain expert labelers;

(ii) related external knowledge bases are either unavailable or insufficiently specific, precluding a traditional distant supervision or co-training approach;

(iii) application specifications are in flux, changing the model we ultimately wish to learn.

In addition, the annotator will register the same judgment repeatedly.

Point (iii) is subtle but very important: labels tend to be brittle, useful only for a narrow range of tasks, and thus they can quickly become irrelevant where one's scientific or business goals are evolving.

Example: SNLI

SNLI (Bowman et al. 2015) represents one of the largest labeling efforts in NLP to date. It provides reasonable coverage for a very narrow domain. The most frequent complaint is that it is too specialized.

Example: I2B2

From Uzuner 2009:

To define the Obesity Challenge task, two experts from the Massachusetts General Hospital Weight Center studied 50 (25 each) random pilot discharge summaries from the Partners HealthCare Research Patient Data Repository.


The data for the challenge were annotated by two obesity experts from the Massachusetts General Hospital Weight Center. The experts were given a textual task, which asked them to classify each disease (see list of diseases above) as Present, Absent, Questionable, or Unmentioned based on explicitly documented information in the discharge summaries [...]. The experts were also given an intuitive task, which asked them to classify each disease as Present, Absent, or Questionable by applying their intuition and judgment to information in the discharge summaries,

Extrapolate these costs, in money and time, to the 1M+ records we'd need for reasonable coverage of obesity patient experiences.

Example: THYME

From Styler et al. 2017:

The THYME colon cancer corpus, which includes clinical notes and pathology reports for 35 patients diagnosed with colon cancer for a total of 107 documents. Each note was annotated by a pair of graduate or undergraduate students in Linguistics at the University of Colorado, then adjudicated by a domain expert.

Again, extrapolate these costs to a dataset that would provide reasonable coverage.

The data programming model

  1. Suppose we have some raw set of $m$ examples $T$. To help keep the concepts straight, assume that these are just raw examples, not representations for machine learning.

  2. We write a set of $n$ labeling functions $\Lambda$:

    • Each $\lambda \in \Lambda$ maps each $t \in T$ to a label in $\{-1, 0, 1\}$.

    • These labeling functions need not be mutually consistent.

    • We expect each $\lambda$ to be high precision and low recall. We hope that $\Lambda$ in aggregate is high precision and high recall.

  3. Think of $\Lambda$ as mapping each $t$ to a vector of labels of dimension $n$ – e.g., $\Lambda(t) = [-1, 1, 1, 0]$. Let $\Lambda(T)$ be the $m \times n$ matrix of these representations.

  4. We fit a generative model to $\Lambda(T)$ that returns a binary vector $\widehat{y}$ of length $m$. These are the labels for the examples in $T$ we'll use for training.

  5. From here, it's just supervised learning as usual. A feature function will map $T$ to a matrix of representations $X$, and you can pick your favorite supervised model. It will learn from $(X, \widehat{y})$. In this way, you're doing supervised learning without any actual labeled data!
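The five steps above can be sketched end to end on a tiny example. This is only an illustration: it replaces the generative model of step 4 with a plain majority vote, and the three labeling functions and the helper `majority_vote` are hypothetical names invented for the sketch.

```python
import numpy as np

# Step 1: raw examples T (plain strings here).
T = ["great fun", "dull and bad", "bad but great fun", "a film"]

# Step 2: hypothetical labeling functions mapping each t to {-1, 0, 1}.
def has_great(t):
    return 1 if "great" in t else 0

def has_bad(t):
    return -1 if "bad" in t else 0

def has_fun(t):
    return 1 if "fun" in t else 0

labelers = [has_great, has_bad, has_fun]

# Step 3: the m x n matrix Lambda(T).
L = np.array([[lf(t) for lf in labelers] for t in T])

# Step 4 (stand-in): majority vote instead of a fitted generative model.
def majority_vote(L):
    """Return 1 where the votes sum to a positive number, else 0.
    Ties and all-abstain rows fall to 0, which is an arbitrary choice."""
    return (L.sum(axis=1) > 0).astype(int)

y_hat = majority_vote(L)
print(L.shape)  # (4, 3)
print(y_hat)    # [1 0 1 0]

# Step 5 would featurize T into X and train any classifier on (X, y_hat).
```

A majority vote treats every labeling function as equally reliable. The generative model instead estimates how much to trust each one, which is the reason to fit it.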

Basic implementation

Our implementation is in the tf_snorkel_lite module, as TfSnorkelGenerative. It works well, but it is mainly for illustrative purposes. As noted above, its primary limitation is that it makes the "Naive Bayes" assumption that the labeling functions are independent. Since the real-world labeling functions you want to write will likely have many complex dependencies between them, this is, strictly speaking, an incorrect model. (In practice, as with Naive Bayes classifiers, the model might nonetheless work well!)
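To make the independence assumption concrete, here is a minimal sketch of how votes combine when labeling functions are treated as conditionally independent. It assumes a uniform class prior, a fixed accuracy for each labeling function whenever it does not abstain, and abstentions that carry no information. The accuracy values and the name `independent_posterior` are hypothetical, and this is not the code inside TfSnorkelGenerative.

```python
import numpy as np

def independent_posterior(L, acc):
    """P(y = 1 | votes) under conditional independence of the labelers.

    L   : (m, n) array of votes in {-1, 0, 1}
    acc : length-n array of assumed accuracies, each strictly in (0, 1)
    """
    L = np.asarray(L, dtype=float)
    acc = np.asarray(acc, dtype=float)
    # Log-odds that each labeler is right when it votes.
    weights = np.log(acc / (1.0 - acc))
    # Votes add up in log-odds space; abstentions (0) contribute nothing.
    log_odds = L @ weights
    return 1.0 / (1.0 + np.exp(-log_odds))

acc = [0.8, 0.9]  # hypothetical accuracies for two labelers
L = [[1, 0],      # only the first labeler votes positive
     [0, -1],     # only the second labeler votes negative
     [1, -1],     # the two labelers disagree
     [0, 0]]      # both abstain
p = independent_posterior(L, acc)
print(np.round(p, 3))
```

Because the votes simply add in log-odds space, two labeling functions that nearly always fire together get counted twice. That is the kind of dependency the simplified model ignores.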

Simple example: cheese vs. disease

Let's start with a toy example modeled on the cheese/disease problem that is distributed with the Stanford MaxEnt classifier.

Cheese/disease data

The first three examples are diseases, and the rest are cheeses:

In [3]:
T = ["gastroenteritis", "gaucher disease", "blue sclera",
     "cure nantais", "charolais", "devon blue"]
In [4]:
y = [1, 1, 1, 0, 0, 0]

Cheese/disease labeling functions

The first two positively label diseases:

In [5]:
def contains_biological_word(text):
    disease_words = {'disease', 'syndrome', 'cure'}
    return 1.0 if {w for w in disease_words if w in text} else 0.0
In [6]:
def ends_in_itis(text):
    """Positively label diseases"""
    return 1.0 if text.endswith('itis') else 0.0

These positively label cheeses:

In [7]:
def sounds_french(text):
    return -1.0 if text.endswith('ais') else 0.0
In [8]:
def contains_color_word(text):
    colors = {'red', 'blue', 'purple'}
    return -1.0 if {w for w in colors if w in text} else 0.0

Applying the cheese/disease labelers

We apply all the labeling functions to form the $\Lambda(T)$ matrix described in the model overview above:

In [9]:
def apply_labelers(T, labelers):
    return np.array([[l(t) for l in labelers] for t in T])
In [10]:
labelers = [contains_biological_word, ends_in_itis,
            sounds_french, contains_color_word]
In [11]:
L = apply_labelers(T, labelers)

Here's a look at $\Lambda(T)$:

In [12]:
pd.DataFrame(L, columns=[x.__name__ for x in labelers], index=T)
                 contains_biological_word  ends_in_itis  sounds_french  contains_color_word
gastroenteritis                       0.0           1.0            0.0                  0.0
gaucher disease                       1.0           0.0            0.0                  0.0
blue sclera                           0.0           0.0            0.0                 -1.0
cure nantais                          1.0           0.0           -1.0                  0.0
charolais                             0.0           0.0           -1.0                  0.0
devon blue                            0.0           0.0            0.0                 -1.0

Fitting the generative model to obtain cheese/disease labels

Now we get to the heart of it – using TfSnorkelGenerative to synthesize these label-function vectors into a single set of (probabilistic) labels:

In [13]:
snorkel = TfSnorkelGenerative(max_iter=100)
In [14]:
snorkel.fit(L)
Iteration 100: loss: 5.951983451843262

These are the predicted probabilistic labels, along with their non-probabilistic counterparts (derived from mapping scores above 0.5 to 1 and scores at or below 0.5 to 0):

In [15]:
pred_proba = snorkel.predict_proba(L)
In [16]:
pred = snorkel.predict(L)
In [17]:
df = pd.DataFrame({'texts':T, 'true': y, 'predict_proba': pred_proba})

   predict_proba            texts  true
0       0.916934  gastroenteritis     1
1       0.836139  gaucher disease     1
2       0.083066      blue sclera     1
3       0.500000     cure nantais     0
4       0.163861        charolais     0
5       0.083066       devon blue     0

So we did pretty well. Only blue sclera tripped this model up. If we wanted to address that, we could write a labeling function to correct it. But let's retain this mistake to see what impact it has.
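To put a number on "pretty well", we can threshold the probabilities and compare them with `y`. This sketch copies the probabilities from the table above rather than calling the model, and it assumes plain 0/1 hard labels for the comparison.

```python
import numpy as np

# Probabilities copied from the table above, in the same row order.
pred_proba = np.array([0.916934, 0.836139, 0.083066,
                       0.500000, 0.163861, 0.083066])
y = np.array([1, 1, 1, 0, 0, 0])

# Scores above 0.5 become 1; scores at or below 0.5 become 0.
hard = (pred_proba > 0.5).astype(int)
n_correct = int((hard == y).sum())

print(hard)                    # [1 1 0 0 0 0]
print(n_correct, "of", len(y)) # 5 of 6
```

Note that cure nantais sits exactly at 0.5 because its two labeling functions conflict, so it lands on the correct side only by virtue of the tie-breaking rule.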

Training discriminative models for cheese/disease prediction

At this point, it's just training classifiers as usual. The only difference is that we're using the potentially noisy labels created by the model.

To round it out, I define a feature function character_ngram_phi:

In [18]:
def character_ngram_phi(s, n=4):
    chars = list(s)
    chars = ["<w>"] + chars + ["</w>"]
    data = []
    for i in range(len(chars)-n+1):
        data.append("".join(chars[i: i+n]))
    return Counter(data)
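As a quick check of what this feature function produces, the sketch below repeats the definition so that it runs on its own and applies it to two short strings. The example strings are arbitrary.

```python
from collections import Counter

def character_ngram_phi(s, n=4):
    # Same logic as the cell above: pad with boundary symbols, then
    # slide a window of n symbols over the padded sequence.
    chars = ["<w>"] + list(s) + ["</w>"]
    data = []
    for i in range(len(chars) - n + 1):
        data.append("".join(chars[i: i + n]))
    return Counter(data)

feats = character_ngram_phi("brie")
print(sorted(feats))  # ['<w>bri', 'brie', 'rie</w>']

# With n=4, a one-character string is too short to yield any n-gram.
print(character_ngram_phi("a"))  # Counter()
```

The boundary symbols count as single items in the window, which is what lets a feature like 'rie</w>' capture word endings such as the ones our labeling functions look for.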

Then we create a feature matrix in the usual way:

In [19]:
vec = DictVectorizer(sparse=False)

feats = [character_ngram_phi(s) for s in T]

X = vec.fit_transform(feats)

And then we fit a model. The real data programming way is to fit this model with the predicted probability values rather than the 1/0 versions of them. The sklearn class LogisticRegression doesn't support this, but this is an easy extension of our core TensorFlow framework:

In [20]:
mod = TfLogisticRegression(max_iter=5000, l2_penalty=0.1)
In [21]:
mod.fit(X, pred_proba)
Iteration 5000: loss: 0.47204923629760743
<tf_snorkel_lite.TfLogisticRegression at 0x1a1c5f9358>
In [22]:
cd_pred = mod.predict(X)
In [23]:
df['predicted'] = cd_pred
In [24]:
df
   predict_proba            texts  true  predicted
0       0.916934  gastroenteritis     1          1
1       0.836139  gaucher disease     1          1
2       0.083066      blue sclera     1          0
3       0.500000     cure nantais     0          0
4       0.163861        charolais     0          0
5       0.083066       devon blue     0          0

That looks good, but the model's ability to generalize seems not so great:

In [25]:
tests = ['maconnais', 'dermatitis']

X_test = vec.transform([character_ngram_phi(s) for s in tests])
In [26]:
mod.predict(X_test)
[0, 0]

We can also use a standard sklearn LogisticRegression on the 1/0 labels. It works better for the test cases:

In [27]:
lr = LogisticRegression()
lr.fit(X, pred)
lr.predict(X_test)

array(['negative', 'positive'], dtype='<U8')
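Although sklearn's LogisticRegression does not accept probabilistic targets directly, one common workaround is to duplicate each example, once per class, and pass the probabilities as sample weights. The sketch below shows the idea on hypothetical one-dimensional data; the helper `fit_on_soft_labels` is an invented name, and this is an alternative to TfLogisticRegression, not a description of it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_on_soft_labels(X, proba, **kwargs):
    """Fit LogisticRegression on probabilistic labels.

    Each example appears twice: as class 1 with weight p and as
    class 0 with weight 1 - p. The weighted log loss then matches
    the cross-entropy against the soft targets.
    """
    X = np.asarray(X, dtype=float)
    proba = np.asarray(proba, dtype=float)
    m = len(X)
    X_dup = np.vstack([X, X])
    y_dup = np.concatenate([np.ones(m, dtype=int), np.zeros(m, dtype=int)])
    w_dup = np.concatenate([proba, 1.0 - proba])
    model = LogisticRegression(**kwargs)
    model.fit(X_dup, y_dup, sample_weight=w_dup)
    return model

# Hypothetical toy data: one feature, soft labels that rise with it.
X_toy = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
proba_toy = np.array([0.05, 0.2, 0.5, 0.8, 0.95])

soft_model = fit_on_soft_labels(X_toy, proba_toy)
print(soft_model.predict([[-3.0], [3.0]]))  # [0 1]
```

In the cheese/disease setting, one would call this with `X` and `pred_proba`, so that an uncertain example like cure nantais pulls equally in both directions instead of being forced into one class.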

In-depth example: Stanford Sentiment Treebank

The toy illustration shows how the model works and suggests it should work. Let's see how we do in practice by returning to the Stanford Sentiment Treebank (SST) – but this time without using any of the training labels!

SST training set

Here we just load in the SST training data:

In [28]:
sst_train = list(sst.train_reader(class_func=sst.binary_class_func))

We'll keep the training labels as sst_train_y for a comparison, but they won't be used for training!

In [29]:
sst_train_texts, sst_train_y = zip(*sst_train)

Lexicon-based labeling functions

The vsmdata distribution contains an excellent multidimensional sentiment lexicon, Ratings_Warriner_et_al.csv. The following function loads it into a DataFrame.

In [30]:
def load_warriner_lexicon(src_filename, df=None):
    """Read in 'Ratings_Warriner_et_al.csv' and optionally restrict its 
    vocabulary to items in `df`.

    Parameters
    ----------
    src_filename : str
        Full path to 'Ratings_Warriner_et_al.csv'
    df : pd.DataFrame or None
        If this is given, then its index is intersected with the 
        vocabulary from the lexicon, and we return a lexicon 
        containing only values in both vocabularies.

    Returns
    -------
    pd.DataFrame

    """
    lexicon = pd.read_csv(src_filename, index_col=0)
    lexicon = lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
    lexicon = lexicon.set_index('Word').rename(
        columns={'V.Mean.Sum': 'Valence', 
                 'A.Mean.Sum': 'Arousal', 
                 'D.Mean.Sum': 'Dominance'})
    if df is not None:
        shared_vocab = sorted(set(lexicon.index) & set(df.index))
        lexicon = lexicon.loc[shared_vocab]
    return lexicon
In [31]:
lex = load_warriner_lexicon(
    os.path.join('vsmdata', 'Ratings_Warriner_et_al.csv'))

The lexicon contains scores, rather than classes, so I create positive and negative sets from the words whose Valence score is more than one standard deviation above and below the mean, respectively:

In [32]:
sd_high = lex['Valence'].mean() + lex['Valence'].std()
In [33]:
sd_low = lex['Valence'].mean() - lex['Valence'].std()
In [34]:
pos_words = set(lex[lex['Valence'] > sd_high].index)
In [35]:
neg_words = set(lex[lex['Valence'] < sd_low].index)
In [36]:
def lex_pos_labeler(tree):
    return 1 if set(tree.leaves()) & pos_words else 0    
In [37]:
def lex_neg_labeler(tree):
    return -1 if set(tree.leaves()) & neg_words else 0    

Other SST labeling function ideas

  • More lexicon-based features:

  • Position-sensitive lexicon features. For example, perhaps core lexicon features should be reversed if there is a preceding negation or a following but.

  • Features for near-neighbors of lexicon words, in a VSM derived from, say, imdb5 or imdb20 from our VSM unit.

  • Features identifying specific actors and directors, building in assumptions that their movies are good or bad.

  • Negations like not, never, no one, and nothing as signals of negativity in the evaluative sense (Potts 2010); universal quantifiers like always, all, and every as signals of positivity.
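As one illustration of the position-sensitive idea in the list above, here is a minimal sketch of a labeling function that flips a lexicon word's polarity when a negator occurs shortly before it. The tiny word sets, the window size, and the name `negation_aware_labeler` are all hypothetical, and the function takes a list of tokens rather than a tree.

```python
# Hypothetical mini-lexicons; in the notebook, pos_words and neg_words
# derived from the Warriner lexicon would play these roles.
POS = {"good", "great", "fun"}
NEG = {"bad", "dull", "boring"}
NEGATORS = {"not", "never", "no", "n't"}

def negation_aware_labeler(tokens, window=2):
    """Vote with the lexicon, flipping a word's polarity if a negator
    occurs within `window` tokens before it. Returns -1, 0, or 1."""
    score = 0
    for i, tok in enumerate(tokens):
        polarity = 1 if tok in POS else -1 if tok in NEG else 0
        preceding = set(tokens[max(0, i - window): i])
        if polarity and NEGATORS & preceding:
            polarity = -polarity
        score += polarity
    # Sign of the summed polarities; 0 means abstain.
    return (score > 0) - (score < 0)

print(negation_aware_labeler(["a", "great", "film"]))   # 1
print(negation_aware_labeler(["not", "very", "good"]))  # -1
print(negation_aware_labeler(["never", "dull"]))        # 1
print(negation_aware_labeler(["a", "film"]))            # 0
```

To use something like this with the SST trees, one could pass `tree.leaves()` as the token list, in the same way the lexicon labelers do.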

Applying the SST labeling functions

In [38]:
sst_train_labels = apply_labelers(
    sst_train_texts,
    [lex_neg_labeler, lex_pos_labeler])

Fitting the SST generative model

In [39]:
nb = TfSnorkelGenerative(max_iter=1000)

nb.fit(sst_train_labels)

sst_train_predicted_y = nb.predict(sst_train_labels)
Iteration 1000: loss: 2.0862767696380615

Direct assessment of the inferred labels against the gold ones

Since we have the labels, we can see how we did in reconstructing them:

In [40]:
print(classification_report(sst_train_y, sst_train_predicted_y))
             precision    recall  f1-score   support

   negative       0.60      0.62      0.61      3310
   positive       0.64      0.62      0.63      3610

avg / total       0.62      0.62      0.62      6920

Pretty good! With more labeling functions we could do better. It's tempting to hill-climb on this directly, but that's not especially realistic. However, it does suggest that, when doing data programming, one does well to have labels that are used strictly to improve the labeling functions (which can be run on a much larger dataset to create the training data).
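If a small set of gold labels is reserved for improving the labeling functions, a per-function report of coverage and accuracy is a natural diagnostic. The sketch below is one hypothetical way to compute it. It assumes the gold labels are encoded as -1/1 to match the votes (the SST labels are strings, so they would need to be mapped first), and `labeler_report` is an invented helper name.

```python
import numpy as np

def labeler_report(L, gold):
    """Per-labeler (coverage, accuracy) pairs.

    L    : (m, n) array of votes in {-1, 0, 1}
    gold : length-m array of gold labels in {-1, 1}
    Accuracy is computed only over rows where the labeler did not
    abstain, and is NaN if it abstained everywhere.
    """
    L = np.asarray(L)
    gold = np.asarray(gold)
    report = []
    for j in range(L.shape[1]):
        fired = L[:, j] != 0
        coverage = float(fired.mean())
        if fired.any():
            accuracy = float((L[fired, j] == gold[fired]).mean())
        else:
            accuracy = float("nan")
        report.append((coverage, accuracy))
    return report

# Toy votes from two hypothetical labelers on four examples.
L_toy = np.array([[1, 0], [1, -1], [0, -1], [0, 0]])
gold_toy = np.array([1, -1, -1, 1])
print(labeler_report(L_toy, gold_toy))  # [(0.5, 0.5), (0.5, 1.0)]
```

A labeling function with low accuracy on the rows where it fires is a candidate for revision, while low coverage across all functions suggests that more of them are needed.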

Fitting a discriminative model on the noisy labels

And now we slip back into the usual SST classifier workflow. As a reminder, unigrams_phi gets 0.77 average F1 on the dev set when we train on the actual gold labels. Can we approach that performance by writing excellent labeling functions?

In [41]:
def unigrams_phi(tree):
    return Counter(tree.leaves())    
In [42]:
train = sst.build_dataset(
    sst.train_reader, unigrams_phi, sst.binary_class_func)

Now we swap the true labels for the predicted ones:

In [43]:
train['y'] = sst_train_predicted_y

We assess against the dev set, which is unchanged – that is, for assessment, we use the gold labels:

In [44]:
dev = sst.build_dataset(
    sst.dev_reader, unigrams_phi, sst.binary_class_func,
    vectorizer=train['vectorizer'])

In the cheese/disease example, LogisticRegression worked best, so we'll continue to use it:

In [45]:
mod = LogisticRegression()
In [46]:
mod.fit(train['X'], sst_train_predicted_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [47]:
snorkel_dev_preds = mod.predict(dev['X'])
In [48]:
print(classification_report(dev['y'], snorkel_dev_preds))
             precision    recall  f1-score   support

   negative       0.60      0.59      0.60       428
   positive       0.61      0.62      0.61       444

avg / total       0.61      0.61      0.61       872

At this point, we might return to writing more labeling functions, in the hope of improving our dev-set results. We got this far with only two simple lexicon-based labeling functions, so there is reason to be optimistic that we can train effective models without showing our models any gold labels!

Extra-credit bake-off

This is a fast, optional bake-off intended to be done in class on May 2:

Question: How good an F1 score can you get with the function call in Direct assessment of the inferred labels against the gold ones above? This just compares the actual gold labels in the train set against the ones you're creating with data programming.

To submit:

  1. Your average F1 score from this assessment.
  2. A description of the labeling functions you wrote to get this score.

To get full credit, you just need to write at least one new labeling function and try it out.

Submission URL:

The close-time for this is May 2, 11:59 pm.