In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam.
This notebook uses term frequency-inverse document frequency, or tf-idf, to generate feature vectors. Tf-idf is commonly used to summarise text data, and it aims to capture how important different words are within a set of documents. Tf-idf combines a normalized word count (or term frequency) with the inverse document frequency (or a measure of how common a word is across all documents) in order to identify words, or terms, which are 'interesting' or important within the document.
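Before using library implementations, it may help to see the idea in miniature. The following is a hedged sketch of the textbook tf-idf formula on a toy three-document corpus (scikit-learn, which we use below, applies a smoothed variant of the idf term, so the exact numbers will differ):

```python
import math

# Toy corpus: "pudding" appears in only one document, so it gets a
# higher idf weight than "the", which appears in every document.
docs = [
    "the pudding was delicious",
    "the service was slow",
    "the food was fine",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # normalized term frequency
    df = sum(term in d for d in tokenized)          # document frequency
    idf = math.log(N / df)                          # textbook idf; sklearn smooths this
    return tf * idf

print(tf_idf("pudding", tokenized[0]))  # rare word: positive weight
print(tf_idf("the", tokenized[0]))      # appears everywhere: weight 0
```

Note how a word that occurs in every document receives an idf of log(1) = 0 and so contributes nothing, which is exactly the "interestingness" filter described above.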
We begin by loading in the data:
import pandas as pd
import os.path
df = pd.read_parquet(os.path.join("data", "training.parquet"))
To illustrate the computation of tf-idf vectors we will first implement the method on a sample of three of the documents we just loaded.
import numpy as np
np.random.seed(0xc0ffeeee)
df_samp = df.sample(3)
pd.set_option('display.max_colwidth', None) # ensures that all the text is visible (older pandas used -1, which is now deprecated)
df_samp
We begin by computing the term frequency ('tf') of the words in the three texts above. We use the `token_pattern` parameter to restrict tokens to words beginning with a letter, so purely numeric tokens are excluded. We limit the number of features (`max_features`) to 20 so that we can easily inspect the output. This means that only the 20 words which appear most frequently across the three texts will be represented.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern='(?u)\\b[A-Za-z]\\w+\\b', max_features = 20)
counts = vectorizer.fit_transform(df_samp["text"])
vectorizer.get_feature_names_out() # shows all the words used as features for this vectorizer (use get_feature_names() on scikit-learn < 1.0)
counts
print(counts.toarray())
Each row of the array corresponds to one of the texts, whilst the columns relate to the words considered in this vectorizer. (You can confirm that 'all' appears once in the first two texts, and twice in the third text, and so on.)
The next stage of the process is to use the results of the term frequency matrix to compute the tf-idf.
The inverse document frequency (idf) for a particular word, or feature, is computed as (the log of) a ratio of the number of documents in a corpus to the number of documents which contain that feature (up to some constant factors).
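Those constant factors are worth pinning down: with its default settings (`smooth_idf=True`), scikit-learn's `TfidfTransformer` computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then L2-normalizes each row. A small check against a hand-built counts matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# 3 documents x 2 terms: term 0 appears in all documents, term 1 in one.
counts = np.array([[2, 1],
                   [1, 0],
                   [3, 0]])
transformer = TfidfTransformer()  # smooth_idf=True, norm='l2' by default
transformer.fit(counts)

# With smoothing, idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
expected_idf = np.log((1 + n_docs) / (1 + df)) + 1

print(transformer.idf_)   # learned idf weights
print(expected_idf)       # should match
```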
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(counts)
print(df_tfidf.toarray())
Each row of the object above is the desired tf-idf vector for the relevant document.
A major disadvantage of using a count vectorizer is that it depends on the vocabulary it sees when it is 'fit' to the data. If we are later presented with a new passage of text and wish to compute a feature vector for that text, we need to know which word maps to which position in the vector. Keeping track of that dictionary is impractical and leads to inefficiency.
Furthermore, there are only "spaces" in the vectorizer for words that have been seen in the fitting stage. If a new text sample contains a word which was not present when the vectorizer was first fit, there will be no place in the feature vectors to count that word.
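We can see this limitation directly: a fitted `CountVectorizer` silently drops any word it did not see during fitting. A small illustration (the example sentences here are made up for demonstration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on one small corpus...
cv = CountVectorizer()
cv.fit(["the cake was dry", "the tea was cold"])

# ...then transform a text containing words the vectorizer never saw.
new_counts = cv.transform(["the scone was stale"])
print(sorted(cv.vocabulary_))   # no 'scone' or 'stale' anywhere
print(new_counts.toarray())     # only 'the' and 'was' are counted
```

The unseen words 'scone' and 'stale' contribute nothing to the feature vector; that information is simply lost.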
With that in mind, we consider using a hashing vectorizer. Words can be hashed to buckets, and the bucket count incremented. This will give us a counts matrix, like we saw above, which we can then compute the tf-idf matrix for, without the need to keep track of which column in the matrix any given word maps to.
One disadvantage of this approach is that collisions will occur - with a finite set of buckets multiple words will hash to the same bucket. As such we are no longer computing an exact tf-idf matrix.
Furthermore we will not be able to recover the word (or words) associated with a bucket at a later time if we need them. (For our application this won't be needed.)
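Collisions are easy to demonstrate: if we hash more distinct words than we have buckets, the pigeonhole principle guarantees that some words share a bucket. A small sketch (the word list is arbitrary, and `alternate_sign=False` is used here just so every entry is a plain +1 count):

```python
from sklearn.feature_extraction.text import HashingVectorizer

words = ["coffee", "pudding", "service", "waiter", "crumpet",
         "gravy", "scone", "butter", "treacle", "kettle"]

# Hash 10 distinct words into only 4 buckets: at least two must collide.
hv_small = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)
vecs = hv_small.transform(words)              # one word per "document"
bucket_per_word = vecs.toarray().argmax(axis=1)

print(bucket_per_word)   # repeated bucket ids indicate collisions
print(len(set(bucket_per_word)), "distinct buckets for", len(words), "words")
```

With a realistic bucket count (such as the 256 used below) collisions are rarer, but they never disappear entirely.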
✅ We fix the number of buckets at 256 by default, but you can try using a different number for `DEFAULT_BUCKETS` and see how the spam detection models are affected. (Try powers of 2 or prime numbers -- you may run out of memory above 1024 or so if you're running in a constrained container.) When you're operationalizing your model, you'll be able to set this in the environment with `TFIDF_BUCKETS`.
✅ Perhaps you experimented with correcting for bias in the visualization notebook. You can try that here as well by changing `ALTERNATE_SIGNS` to 1. When you're operationalizing your model, you'll be able to set this in the environment with `TFIDF_ALTERNATE_SIGNS`.
from sklearn.feature_extraction.text import HashingVectorizer
from mlworkflows import util
DEFAULT_BUCKETS=256
# set to 1 to correct for bias or 0 otherwise
ALTERNATE_SIGNS=0
alternate = util.get_param("TFIDF_ALTERNATE_SIGNS", ALTERNATE_SIGNS, int) > 0
buckets = util.get_param("TFIDF_BUCKETS", DEFAULT_BUCKETS, int)
hv = HashingVectorizer(norm=None, token_pattern='(?u)\\b[A-Za-z]\\w+\\b', n_features=buckets, alternate_sign = alternate)
hv
hvcounts = hv.fit_transform(df["text"])
hvcounts
We can then go on to compute the "approximate" tf-idf matrix for this, by applying the tf-idf transformer to the hashed counts matrix.
tfidf_transformer = TfidfTransformer()
hvdf_tfidf = tfidf_transformer.fit_transform(hvcounts)
hvdf_tfidf
These vectors have far too many dimensions for us to easily picture them as points in space. Principal component analysis, or PCA, is a statistical technique that is over a century old; it takes observations in a high-dimensional space and maps them to a (potentially much) smaller number of dimensions. We'll see it in action now, using the implementation from scikit-learn.
(To learn a little more about PCA and an alternative technique, visit the visualisation notebook.)
# Dimensionality reduction so that output can be visualised.
# We use TruncatedSVD rather than sklearn's PCA here because it
# accepts the sparse tf-idf matrix directly, without densifying it.
import sklearn.decomposition
DIMENSIONS = 2
pca2 = sklearn.decomposition.TruncatedSVD(DIMENSIONS)
pca_a = pca2.fit_transform(hvdf_tfidf)
pca_df = pd.DataFrame(pca_a, columns=["x", "y"])
pca_df.sample(10)
plot_data = pd.concat([df.reset_index(), pca_df], axis=1)
import altair as alt
# Altair < 4 in classic Jupyter needs an explicit renderer; newer versions do not:
# alt.renderers.enable('notebook')
alt.Chart(plot_data.sample(1000)).encode(x="x", y="y", color="label").mark_point().interactive()
We want to be able to easily compute feature vectors using the hashing tf-idf workflow laid out above. The `Pipeline` facility in scikit-learn streamlines this workflow by making it easy to pass data through multiple transforms. In the next cell we set up our pipeline.
from sklearn.feature_extraction.text import HashingVectorizer,TfidfTransformer
from sklearn.pipeline import Pipeline
import pickle, os
vect = HashingVectorizer(norm=None, token_pattern='(?u)\\b[A-Za-z]\\w+\\b', n_features=buckets, alternate_sign = alternate)
tfidf = TfidfTransformer()
feat_pipeline = Pipeline([
('vect',vect),
('tfidf',tfidf)
])
We can then use the `fit_transform` method to apply the pipeline to our data frame. This produces a sparse matrix (only non-zero entries are recorded). We convert this to a dense array using the `toarray()` method, then append the index and labels to aid readability.
feature_vecs = feat_pipeline.fit_transform(df["text"]).toarray()
labeled_vecs = pd.concat([df.reset_index()[["index", "label"]],
pd.DataFrame(feature_vecs)], axis=1)
labeled_vecs.columns = labeled_vecs.columns.astype(str)
labeled_vecs.sample(10)
We save the feature vectors to a parquet file.
labeled_vecs.to_parquet(os.path.join("data", "features.parquet"))
We will then serialize our pipeline to a file on disk so that we can reuse the document frequencies we've observed on training data to weight term vectors.
from mlworkflows import util
util.serialize_to(feat_pipeline, "feature_pipeline.sav")
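To see why serializing the fitted pipeline matters, here is a hedged, self-contained sketch of the round trip. It assumes `serialize_to` writes an ordinary pickle (the toy training sentences are made up for illustration); the reloaded pipeline reuses the document frequencies learned at fit time when it transforms unseen text:

```python
import os
import pickle
import tempfile
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Fit a small pipeline on toy text, then round-trip it through pickle.
pipeline = Pipeline([
    ("vect", HashingVectorizer(norm=None, n_features=256)),
    ("tfidf", TfidfTransformer()),
])
pipeline.fit(["the pudding was excellent", "slow service and cold tea"])

path = os.path.join(tempfile.mkdtemp(), "feature_pipeline.sav")
with open(path, "wb") as f:
    pickle.dump(pipeline, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The reloaded pipeline applies the idf weights observed during fitting.
vec = loaded.transform(["it is a truth universally acknowledged"])
print(vec.shape)  # one row, one column per hash bucket
```

Note that we call `transform` (not `fit_transform`) on new text: refitting in a serving process would recompute document frequencies and silently change the feature space.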
Now that we have a feature engineering approach, the next step is to train a model. Again, you have two choices for your next step: click here for a model based on logistic regression, or click here for a model based on ensembles of decision trees.