__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"
This notebook provides an overview of the data programming model pioneered by Ratner et al. 2016:
This model synthesizes a collection of noisy labeling functions into a single set of (binary) supervised labels for examples. These labels are then used for training.
Thus, on this model, one need only have gold labels for assessment, thereby greatly reducing the burden of labeling examples.
The researchers open-sourced their code as Snorkel. For ease of use and exploration, we'll work with a simplified version derived from this excellent blog post. It is implemented in our course repository as tf_snorkel_lite.py.
Project teams that find this direction useful are encouraged to use the real Snorkel, as it will better handle the complex relationships that inevitably arise in a set of real labeling functions.
The set-up steps are the same as those required for working with the Stanford Sentiment Treebank materials, since we'll be revisiting that dataset as an in-depth use-case. Make sure you've done a recent pull of the repository so that you have the latest code release.
from collections import Counter
import numpy as np
import os
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from tf_snorkel_lite import TfSnorkelGenerative, TfLogisticRegression
import sst
Have newer methods reduced the need for labels? Has crowdsourcing made it easy enough to get labels at scale?
1. Supervised learning: Individual examples from the domain you care about, labeled in a way that you think/assume/hope is aligned with your actual real-world objective. The model objective is to minimize error between predicted and actual labels.
2. Distantly supervised learning: Exactly like supervised learning, but with individual examples from a domain that is different from the one you care about.
3. Semi-supervised learning: A fundamentally supervised method that can also make use of unlabeled data.
4. Reinforcement learning: The data are in some sense labeled, but not at the level of individual examples. The model objective is essentially as in supervised learning.
5. Unsupervised learning: No labels that you can make use of directly. The model objective is thus set independently of the data but is presumably tied to something intuitive.
In almost all domains right now, effective learning is supervised learning – somewhere in 1–4. However, representations from unsupervised learning are very common as inputs to supervised deep learning models.
In many applications, we would like to use machine learning, but we face the following challenges:
(i) hand-labeled training data is not available, and is prohibitively expensive to obtain in sufficient quantities as it requires expensive domain expert labelers;
(ii) related external knowledge bases are either unavailable or insufficiently specific, precluding a traditional distant supervision or co-training approach;
(iii) application specifications are in flux, changing the model we ultimately wish to learn.
In addition, annotators will often end up registering essentially the same judgment repeatedly, which makes large-scale labeling even less efficient.
Point (iii) is subtle but very important: labels tend to be brittle, useful only for a narrow range of tasks, and thus they can quickly become irrelevant when one's scientific or business goals evolve.
SNLI (Bowman et al. 2015) represents one of the largest labeling efforts in NLP to date. It provides reasonable coverage for a very narrow domain. The most frequent complaint is that it is too specialized.
From Uzuner 2009:
To define the Obesity Challenge task, two experts from the Massachusetts General Hospital Weight Center studied 50 (25 each) random pilot discharge summaries from the Partners HealthCare Research Patient Data Repository.
[...]
The data for the challenge were annotated by two obesity experts from the Massachusetts General Hospital Weight Center. The experts were given a textual task, which asked them to classify each disease (see list of diseases above) as Present, Absent, Questionable, or Unmentioned based on explicitly documented information in the discharge summaries [...]. The experts were also given an intuitive task, which asked them to classify each disease as Present, Absent, or Questionable by applying their intuition and judgment to information in the discharge summaries [...]
Extrapolate these costs, in money and time, to the 1M+ records we'd need for reasonable coverage of obesity patient experiences.
From Styler et al. 2017:
The THYME colon cancer corpus, which includes clinical notes and pathology reports for 35 patients diagnosed with colon cancer for a total of 107 documents. Each note was annotated by a pair of graduate or undergraduate students in Linguistics at the University of Colorado, then adjudicated by a domain expert.
Again, extrapolate these costs to a dataset that would provide reasonable coverage.
1. We begin with a set of $m$ unlabeled training examples $T$.
2. We write a set of $n$ labeling functions $\Lambda$:
* Each $\lambda \in \Lambda$ maps each $t \in T$ to a label in $\{-1, 0, 1\}$.
* These labeling functions need not be mutually consistent.
* We expect each $\lambda$ to be high precision and low recall. We hope that $\Lambda$ in aggregate is high precision and high recall.
3. Think of $\Lambda$ as mapping each $t$ to a vector of labels of dimension $n$ – e.g., $\Lambda(t) = [-1, 1, 1, 0]$. Let $\Lambda(T)$ be the $m \times n$ matrix of these representations.
4. We fit a generative model to $\Lambda(T)$ that returns a binary vector $\widehat{y}$ of length $m$. These are the labels for the examples in $T$ we'll use for training.
5. From here, it's just supervised learning as usual. A feature function will map $T$ to a matrix of representations $X$, and you can pick your favorite supervised model. It will learn from $(X, \widehat{y})$. In this way, you're doing supervised learning without any actual labeled data!
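The steps above can be sketched end-to-end in a few lines. Here a simple unweighted vote stands in for the learned generative model of step 4; all names are illustrative, and nothing below is part of tf_snorkel_lite:

```python
# A minimal end-to-end sketch of the pipeline, with an unweighted
# majority vote standing in for the learned generative model of step 4.

def vote_labels(label_matrix):
    """Collapse an m x n matrix of {-1, 0, 1} labels into one 0/1
    label per row by summing across labeling functions."""
    # Ties and net-negative rows map to 0, net-positive rows to 1.
    return [1 if sum(row) > 0 else 0 for row in label_matrix]

# Step 2: two toy labeling functions over strings.
lfs = [
    lambda t: 1 if t.endswith('itis') else 0,   # votes "disease"
    lambda t: -1 if t.endswith('ais') else 0,   # votes "cheese"
]

demo_T = ['gastroenteritis', 'charolais']
demo_L = [[lf(t) for lf in lfs] for t in demo_T]  # step 3: the label matrix
demo_y_hat = vote_labels(demo_L)                  # step 4 (simplified)
# Step 5 would then train any supervised model on features of demo_T
# paired with demo_y_hat.
```

The generative model in step 4 improves on this vote by learning how accurate each labeling function is and weighting its votes accordingly.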
Our implementation is in tf_snorkel_lite.py, as TfSnorkelGenerative. It works well, but it is mainly for illustrative purposes. As noted above, its primary limitation is that it makes the "naive Bayes" assumption that the labeling functions are independent. Since the labeling functions you write for real problems will likely have many complex dependencies between them, this is, strictly speaking, an incorrect model. (In practice, as with naive Bayes classifiers, the model might nonetheless work well!)
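One informal way to see whether two labeling functions strain that independence assumption is to check how often they fire together and agree. A hypothetical diagnostic, not part of tf_snorkel_lite:

```python
# Hypothetical diagnostic: estimate how redundant two labeling functions
# are by the rate at which they give the same non-zero label. Functions
# that always agree where they overlap effectively double-count one
# signal, violating the naive Bayes assumption.

def pairwise_agreement(col_a, col_b):
    """Fraction of examples where both functions fire and agree,
    out of those where both fire at all."""
    both = [(a, b) for a, b in zip(col_a, col_b) if a != 0 and b != 0]
    if not both:
        return 0.0
    return sum(1 for a, b in both if a == b) / len(both)

# Two near-duplicate functions never disagree where they overlap:
f1 = [1, 1, 0, -1, 0]
f2 = [1, 0, 0, -1, 1]
print(pairwise_agreement(f1, f2))  # 1.0
```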
Let's start with a toy example modeled on the cheese/disease problem that is distributed with the Stanford MaxEnt classifier.
The first three examples are diseases, and the rest are cheeses:
T = ["gastroenteritis", "gaucher disease", "blue sclera",
     "cure nantais", "charolais", "devon blue"]

y = [1, 1, 1, 0, 0, 0]
The first two positively label diseases:
def contains_biological_word(text):
    disease_words = {'disease', 'syndrome', 'cure'}
    return 1.0 if {w for w in disease_words if w in text} else 0.0

def ends_in_itis(text):
    """Positively label diseases"""
    return 1.0 if text.endswith('itis') else 0.0
These positively label cheeses:
def sounds_french(text):
    return -1.0 if text.endswith('ais') else 0.0

def contains_color_word(text):
    colors = {'red', 'blue', 'purple'}
    return -1.0 if {w for w in colors if w in text} else 0.0
We apply all the labeling functions to form the $\Lambda(T)$ matrix described in the model overview above:
def apply_labelers(T, labelers):
    return np.array([[l(t) for l in labelers] for t in T])

labelers = [contains_biological_word, ends_in_itis,
            sounds_french, contains_color_word]

L = apply_labelers(T, labelers)
Here's a look at $\Lambda(T)$:
pd.DataFrame(L, columns=[x.__name__ for x in labelers], index=T)
| | contains_biological_word | ends_in_itis | sounds_french | contains_color_word |
|---|---|---|---|---|
| gastroenteritis | 0.0 | 1.0 | 0.0 | 0.0 |
| gaucher disease | 1.0 | 0.0 | 0.0 | 0.0 |
| blue sclera | 0.0 | 0.0 | 0.0 | -1.0 |
| cure nantais | 1.0 | 0.0 | -1.0 | 0.0 |
| charolais | 0.0 | 0.0 | -1.0 | 0.0 |
| devon blue | 0.0 | 0.0 | 0.0 | -1.0 |
Now we get to the heart of it – using TfSnorkelGenerative to synthesize these label-function vectors into a single set of (probabilistic) labels:
snorkel = TfSnorkelGenerative(max_iter=100)
snorkel.fit(L)
Iteration 100: loss: 5.951983451843262
These are the predicted probabilistic labels, along with their non-probabilistic counterparts (derived from mapping scores above 0.5 to 1 and scores at or below 0.5 to 0):
pred_proba = snorkel.predict_proba(L)
pred = snorkel.predict(L)
df = pd.DataFrame({'texts':T, 'true': y, 'predict_proba': pred_proba})
df
| | predict_proba | texts | true |
|---|---|---|---|
| 0 | 0.916934 | gastroenteritis | 1 |
| 1 | 0.836139 | gaucher disease | 1 |
| 2 | 0.083066 | blue sclera | 1 |
| 3 | 0.500000 | cure nantais | 0 |
| 4 | 0.163861 | charolais | 0 |
| 5 | 0.083066 | devon blue | 0 |
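For reference, the 1/0 mapping can be reproduced directly, assuming predict simply thresholds predict_proba at 0.5 as described:

```python
import numpy as np

# The thresholding described above: scores above 0.5 map to 1, scores
# at or below 0.5 map to 0.
proba = np.array([0.916934, 0.836139, 0.083066, 0.500000, 0.163861, 0.083066])
hard = (proba > 0.5).astype(int)
# Note that the 0.500000 for "cure nantais" falls exactly at the
# threshold and so maps to 0.
```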
So we did pretty well. Only blue sclera tripped this model up. If we wanted to address that, we could write a labeling function to correct it. But let's retain this mistake to see what impact it has.
At this point, it's just training classifiers as usual. The only difference is that we're using the potentially noisy labels created by the model.
To round it out, I define a feature function character_ngram_phi:
def character_ngram_phi(s, n=4):
    chars = list(s)
    chars = ["<w>"] + chars + ["</w>"]
    data = []
    for i in range(len(chars)-n+1):
        data.append("".join(chars[i: i+n]))
    return Counter(data)
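As a quick sanity check on the featurizer (repeated here so the snippet is self-contained), a short string yields boundary-padded character 4-grams:

```python
from collections import Counter

def character_ngram_phi(s, n=4):
    # Same featurizer as above, restated compactly.
    chars = ["<w>"] + list(s) + ["</w>"]
    return Counter("".join(chars[i: i+n]) for i in range(len(chars)-n+1))

print(character_ngram_phi("blue"))
# Counter({'<w>blu': 1, 'blue': 1, 'lue</w>': 1})
```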
Then we create a feature matrix in the usual way:
vec = DictVectorizer(sparse=False)
feats = [character_ngram_phi(s) for s in T]
X = vec.fit_transform(feats)
And then we fit a model. The real data programming way is to fit this model with the predicted probability values rather than the 1/0 versions of them. The sklearn class LogisticRegression doesn't support this, but it is an easy extension of our core TensorFlow framework:
mod = TfLogisticRegression(max_iter=5000, l2_penalty=0.1)
mod.fit(X, pred_proba)
Iteration 5000: loss: 0.47204923629760743
<tf_snorkel_lite.TfLogisticRegression at 0x1a1c5f9358>
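To make the probabilistic-targets idea concrete, here is a minimal numpy sketch of logistic regression trained on soft labels. It is illustrative only and is not the TfLogisticRegression implementation; the key point is that the cross-entropy gradient p − y is well-defined for any target in [0, 1]:

```python
import numpy as np

# A minimal sketch of logistic regression with probabilistic labels.
# Illustrative only, not the TfLogisticRegression implementation.

def fit_soft_logreg(X, y_soft, lr=0.5, n_iter=2000):
    X = np.asarray(X, dtype=float)
    y_soft = np.asarray(y_soft, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y_soft                       # d(cross-entropy)/d(logit)
        w -= lr * (X.T @ grad) / len(X)
        b -= lr * grad.mean()
    return w, b

# Two 1-d points with confident soft labels near 0 and 1:
w, b = fit_soft_logreg([[0.0], [1.0]], [0.1, 0.9])
```

Nothing in the update rule had to change to accept soft targets, which is why frameworks with an explicit cross-entropy loss extend so easily to this setting.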
cd_pred = mod.predict(X)
df['predicted'] = cd_pred
df
| | predict_proba | texts | true | predicted |
|---|---|---|---|---|
| 0 | 0.916934 | gastroenteritis | 1 | 1 |
| 1 | 0.836139 | gaucher disease | 1 | 1 |
| 2 | 0.083066 | blue sclera | 1 | 0 |
| 3 | 0.500000 | cure nantais | 0 | 0 |
| 4 | 0.163861 | charolais | 0 | 0 |
| 5 | 0.083066 | devon blue | 0 | 0 |
That looks good, but the model's ability to generalize seems not so great:
tests = ['maconnais', 'dermatitis']
X_test = vec.transform([character_ngram_phi(s) for s in tests])
mod.predict(X_test)
[0, 0]
We can also use a standard sklearn LogisticRegression on the 1/0 labels. It works better for these test cases:
lr = LogisticRegression()
lr.fit(X, pred)
lr.predict(X_test)
array(['negative', 'positive'], dtype='<U8')
The toy illustration shows how the model works and gives some reason to think it can succeed. Let's see how we do in practice by returning to the Stanford Sentiment Treebank (SST) – but this time without using any of the training labels!
Here we just load in the SST training data:
sst_train = list(sst.train_reader(class_func=sst.binary_class_func))
We'll keep the training labels as sst_train_y for comparison, but they won't be used for training!
sst_train_texts, sst_train_y = zip(*sst_train)
The vsmdata distribution contains an excellent multidimensional sentiment lexicon, Ratings_Warriner_et_al.csv. The following function loads it into a DataFrame.
def load_warriner_lexicon(src_filename, df=None):
    """Read in 'Ratings_Warriner_et_al.csv' and optionally restrict its
    vocabulary to items in `df`.

    Parameters
    ----------
    src_filename : str
        Full path to 'Ratings_Warriner_et_al.csv'
    df : pd.DataFrame or None
        If this is given, then its index is intersected with the
        vocabulary from the lexicon, and we return a lexicon
        containing only values in both vocabularies.

    Returns
    -------
    pd.DataFrame

    """
    lexicon = pd.read_csv(src_filename, index_col=0)
    lexicon = lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
    lexicon = lexicon.set_index('Word').rename(
        columns={'V.Mean.Sum': 'Valence',
                 'A.Mean.Sum': 'Arousal',
                 'D.Mean.Sum': 'Dominance'})
    if df is not None:
        shared_vocab = sorted(set(lexicon.index) & set(df.index))
        lexicon = lexicon.loc[shared_vocab]
    return lexicon
lex = load_warriner_lexicon(
os.path.join('vsmdata', 'Ratings_Warriner_et_al.csv'))
The lexicon contains scores, rather than classes, so I create positive and negative sets from the words that are one standard deviation above and below the mean, respectively:
sd_high = lex['Valence'].mean() + lex['Valence'].std()
sd_low = lex['Valence'].mean() - lex['Valence'].std()
pos_words = set(lex[lex['Valence'] > sd_high].index)
neg_words = set(lex[lex['Valence'] < sd_low].index)
def lex_pos_labeler(tree):
    return 1 if set(tree.leaves()) & pos_words else 0

def lex_neg_labeler(tree):
    return -1 if set(tree.leaves()) & neg_words else 0
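The mean ± 1 standard deviation construction can be sanity-checked on a toy valence table using only the standard library (statistics.stdev is the sample standard deviation, matching pandas' default .std(); the names are prefixed with toy_ so they don't clobber the sets defined above):

```python
import statistics

# Toy valence scores standing in for the Warriner lexicon's Valence column.
toy_valence = {'horrid': 1.5, 'dull': 4.0, 'table': 5.0,
               'fine': 6.0, 'superb': 8.5}

toy_mean = statistics.mean(toy_valence.values())
toy_sd = statistics.stdev(toy_valence.values())

# Words more than one sd above the mean count as positive; more than
# one sd below, negative. Everything in between is left unlabeled.
toy_pos_words = {w for w, v in toy_valence.items() if v > toy_mean + toy_sd}
toy_neg_words = {w for w, v in toy_valence.items() if v < toy_mean - toy_sd}
# toy_pos_words == {'superb'}; toy_neg_words == {'horrid'}
```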
* More lexicon-based features: http://sentiment.christopherpotts.net/lexicons.html
* Position-sensitive lexicon features. For example, perhaps core lexicon features should be reversed if there is a preceding negation or a following but.
* Features for near-neighbors of lexicon words, in a VSM derived from, say, imdb5 or imdb20 from our VSM unit.
* Features identifying specific actors and directors, building in assumptions that their movies are good or bad.
* Negations like not, never, no one, and nothing as signals of negativity in the evaluative sense (Potts 2010); universal quantifiers like always, all, and every as signals of positivity.
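As one hypothetical sketch of the position-sensitive idea, a labeling function might flip a lexicon hit when a negation immediately precedes it. The positive-word set is passed in as an argument to keep the sketch self-contained; it stands in for the Warriner-derived set:

```python
# Hypothetical sketch: flip a positive-lexicon label when a negation
# immediately precedes the lexicon word.

NEGATIONS = {'not', 'never', "n't", 'no'}

def negation_aware_pos_labeler(tokens, pos_lexicon):
    """Return 1 for the first un-negated positive word, -1 for a
    negated one, 0 if no positive lexicon word occurs at all."""
    for i, tok in enumerate(tokens):
        if tok in pos_lexicon:
            if i > 0 and tokens[i-1] in NEGATIONS:
                return -1
            return 1
    return 0

toy_lexicon = {'good', 'great'}
print(negation_aware_pos_labeler(['a', 'great', 'film'], toy_lexicon))       # 1
print(negation_aware_pos_labeler(['not', 'good', 'at', 'all'], toy_lexicon)) # -1
```

A tree-based version would first flatten the tree with tree.leaves(), as the lexicon labelers above do.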
sst_train_labels = apply_labelers(
    sst_train_texts,
    [lex_neg_labeler, lex_pos_labeler])
nb = TfSnorkelGenerative(max_iter=1000)
nb.fit(sst_train_labels)
sst_train_predicted_y = nb.predict(sst_train_labels)
Iteration 1000: loss: 2.0862767696380615
Since we have the labels, we can see how we did in reconstructing them:
print(classification_report(sst_train_y, sst_train_predicted_y))
             precision    recall  f1-score   support

   negative       0.60      0.62      0.61      3310
   positive       0.64      0.62      0.63      3610

avg / total       0.62      0.62      0.62      6920
Pretty good! With more labeling functions, we could do better. It's tempting to hill-climb on this directly, but that's not especially realistic. Still, it suggests that, when doing data programming, one does well to have a small set of gold labels that is used strictly to improve the labeling functions (which can then be run on a much larger unlabeled dataset to create the training data).
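If we did keep a small gold-labeled slice for developing labeling functions, per-function diagnostics like coverage and accuracy-where-fired would be the natural things to compute. A hypothetical sketch, not part of tf_snorkel_lite:

```python
# Hypothetical per-labeling-function diagnostics against a small
# gold-labeled slice: coverage (how often the function fires) and
# accuracy on the examples where it fires, mapping the function's
# -1 votes onto gold label 0.

def lf_report(label_column, gold):
    fired = [(l, g) for l, g in zip(label_column, gold) if l != 0]
    coverage = len(fired) / len(gold)
    if not fired:
        return coverage, None
    correct = sum(1 for l, g in fired if (1 if l == 1 else 0) == g)
    return coverage, correct / len(fired)

# One toy labeling-function column against gold 1/0 labels:
col = [1, 0, -1, 1, 0, -1]
gold = [1, 1, 0, 0, 0, 0]
coverage, accuracy = lf_report(col, gold)
# fires on 4 of the 6 examples; 3 of those 4 votes match gold
```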
And now we slip back into the usual SST classifier workflow. As a reminder, unigrams_phi gets 0.77 average F1 on the dev set when we train on the actual gold labels. Can we approach that performance by writing excellent labeling functions?
def unigrams_phi(tree):
    return Counter(tree.leaves())
train = sst.build_dataset(
    sst.train_reader,
    phi=unigrams_phi,
    class_func=sst.binary_class_func)
Now we swap the true labels for the predicted ones:
train['y'] = sst_train_predicted_y
We assess against the dev set, which is unchanged – that is, for assessment, we use the gold labels:
dev = sst.build_dataset(
    sst.dev_reader,
    phi=unigrams_phi,
    class_func=sst.binary_class_func,
    vectorizer=train['vectorizer'])
In the cheese/disease example, LogisticRegression worked best, so we'll continue to use it:
mod = LogisticRegression()
mod.fit(train['X'], sst_train_predicted_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
snorkel_dev_preds = mod.predict(dev['X'])
print(classification_report(dev['y'], snorkel_dev_preds))
             precision    recall  f1-score   support

   negative       0.60      0.59      0.60       428
   positive       0.61      0.62      0.61       444

avg / total       0.61      0.61      0.61       872
At this point, we might return to writing more labeling functions, in the hope of improving our dev-set results. We got this far with only two simple lexicon-based feature functions, so there is reason to be optimistic that we can train effective models without showing our models any gold labels!
This is a fast, optional bake-off intended to be done in class on May 2:
Question: How good an F1 score can you get with the classification_report call in the direct assessment of the inferred labels against the gold ones above? That call compares the actual gold labels in the train set against the ones you create with data programming.
To submit:
To get full credit, you just need to write at least one new labeling function and try it out.
Submission URL: https://goo.gl/forms/MtyQHoWDHmU5oEyt1
The close-time for this is May 2, 11:59 pm.