Let's take a look at a simple way to identify some structure in our data. Getting some understanding of the data is an important first step before we even start to use machine learning techniques to train a model; in this notebook, we'll approach that problem from a couple of different angles.
We'll start by loading our training data.
import pandas as pd
import os.path
data = pd.read_parquet(os.path.join("data", "training.parquet"))
Our training data (which we generated with two Markov chains trained on different text corpora) consists of labels (either `legitimate` or `spam`) and short documents of plausible English text. We can inspect these data:
data.sample(50)
Ultimately, machine learning algorithms operate on data that is structured differently than the data we might deal with in database tables or application programs. In order to identify and exploit structure in these data, we are going to convert our natural-language documents to points in space by converting them to vectors of floating-point numbers.
This process is often tricky, since you want a way to map from arbitrary data to some points in (some) space that preserves the structure of the data. That is, documents that are similar should map to points that are similar (for some definition of similarity), and documents that are dissimilar should not map to similar points. The name for this process of turning real-world data into a form that a machine learning algorithm can take advantage of is feature engineering.
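To make that notion of similarity concrete before we build anything: two documents that share long runs of characters also share many short substrings, and we can measure that overlap directly. Here's a small self-contained illustration (the `shingle_set` and `jaccard` helpers exist only for this example; they aren't part of the notebook's pipeline):

```python
def shingle_set(doc, k=4):
    # the set of all k-character substrings of doc
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(s, t):
    # similarity as shared shingles over total distinct shingles
    return len(s & t) / len(s | t)

s1 = shingle_set("the quick brown fox")
s2 = shingle_set("the quick brown cat")
s3 = shingle_set("colorless green ideas")

print(jaccard(s1, s2) > jaccard(s1, s3))  # True: similar prose shares more shingles
```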
You'll learn more about feature engineering in the next notebook; for now, we'll just take a very basic approach that will let us visualize our data. We'll first convert our documents to k-shingles, or sequences of k characters (for some small value of k). This means that a document like
the quick brown fox jumps over the lazy dog
would become this sequence of 4-shingles:
['the ', 'he q', 'e qu', ' qui', 'quic', 'uick', 'ick ', 'ck b', 'k br', ' bro', 'brow', 'rown', 'own ', 'wn f', 'n fo', ' fox', 'fox ', 'ox j', 'x ju', ' jum', 'jump', 'umps', 'mps ', 'ps o', 's ov', ' ove', 'over', 'ver ', 'er t', 'r th', ' the', 'the ', 'he l', 'e la', ' laz', 'lazy', 'azy ', 'zy d', 'y do', ' dog']
Shingling gets us a step closer to having vector representations of documents -- ultimately, our assumption is that spam documents will have some k-shingles that legitimate documents don't, and vice versa. Here's how we'd add a field of shingles to our data:
SHINGLE_K = 4
def doc2shingles(k):
    def kshingles(doc):
        if len(doc) < k:
            # pad short documents with spaces so we can always return at least one shingle
            doc = doc.ljust(k)
        return [doc[i:i + k] for i in range(len(doc) - k + 1)]
    return kshingles
data["shingles"] = data["text"].apply(doc2shingles(SHINGLE_K))
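As a quick sanity check, we can apply the shingling function to the example sentence from earlier (the definition is repeated here so the cell stands on its own):

```python
def doc2shingles(k):
    def kshingles(doc):
        if len(doc) < k:
            # pad short documents so we always get at least one shingle
            doc = doc.ljust(k)
        return [doc[i:i + k] for i in range(len(doc) - k + 1)]
    return kshingles

shingles = doc2shingles(4)("the quick brown fox jumps over the lazy dog")
print(shingles[:4])   # ['the ', 'he q', 'e qu', ' qui']
print(len(shingles))  # 40: a 43-character string yields 43 - 4 + 1 shingles
```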
✅ After you've run through this notebook once, try changing `SHINGLE_K` to a different value and see if it affects your results.
Remember, our goal is to be able to learn a function that can separate between documents that are likely to represent legitimate messages (i.e., prose in the style of Jane Austen) or spam messages (i.e., prose in the style of food-product reviews), so we'll still want to transform our lists of shingles into vectors.
(That's what we'll logically do -- we'll actually do these steps a bit out of order because it will make our code simpler and more efficient without changing the results.)
import numpy as np
HANDLE_BIAS = True
def hashing_frequency(vecsize, h):
    """
    returns a function that will collect shingle frequencies
    into a vector with _vecsize_ elements and will use
    the hash function _h_ to choose which vector element
    to update for a given term
    """
    def hf(words):
        if isinstance(words, str):
            # handle both lists of words and space-delimited strings
            words = words.split(" ")
        result = np.zeros(vecsize)
        for term in words:
            hval = h(term)
            if HANDLE_BIAS:
                # use one bit of the hash to choose a sign, so colliding
                # shingles can cancel out instead of always inflating a bucket
                change = 1.0 if hval & (1 << 31) == 0 else -1.0
            else:
                change = 1.0
            result[hval % vecsize] += change
        # normalize by total magnitude; with signed updates the plain
        # sum could be zero or negative
        total = np.abs(result).sum()
        if total > 0:
            result /= total
        return result
    return hf
BUCKETS=256
a = np.array([hashing_frequency(BUCKETS, hash)(v) for v in data["shingles"].values])
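One caveat about the cell above: Python's built-in `hash` is randomized for strings in each new interpreter process (unless `PYTHONHASHSEED` is set), so the exact vectors will differ from run to run. If you want reproducible vectors, you can pass `hashing_frequency` a deterministic hash function instead; here's one possible sketch built on `zlib.crc32` (the name `stable_hash` is ours, not part of any library):

```python
import zlib

def stable_hash(term):
    # crc32 of the UTF-8 bytes: deterministic across runs and platforms
    return zlib.crc32(term.encode("utf-8"))

# e.g.: np.array([hashing_frequency(BUCKETS, stable_hash)(v) for v in data["shingles"].values])
print(stable_hash("the ") % 256)  # always lands in the same bucket
```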
✅ After you've run through this notebook once, try changing `BUCKETS` to a different value to get a finer (if you choose a larger value) or coarser (if you choose a smaller value) approximation of the dataset!
✅ In this example, we're using hashing as a way to map from elements of an arbitrarily-large set (all possible k-shingles) to elements of a much smaller set (all natural numbers strictly less than `BUCKETS`). An important concern with hash-based approximations of sets is whether the vector derived from a set of examples is *biased* or not. After running through this notebook, try changing `HANDLE_BIAS` in the above cell to see what difference it makes in the results if some of the elements in the vector are no longer necessarily nonnegative.
So now instead of having documents (which we had from the raw data) or lists of shingles, we have vectors representing shingle frequencies. Because we've hashed shingles into these vectors, we can't in general reconstruct a document or the shingles from a vector, but we do know that if the same shingle appears in two documents, their vectors will reflect it in corresponding buckets.
However, we've generated a 256- (or `BUCKETS`-) element vector. Recall that our ultimate goal is to place documents in space so that we can identify a way to separate legitimate documents from spam documents. Our vector is a point in a space, but it's a point in a space that most of our geometric intuitions don't apply to (some of us have enough trouble navigating the three dimensions of the physical world).
Let's use a very basic technique to project these vectors to a much smaller space that we can visualize. Principal component analysis, or PCA, is a statistical technique that is over a century old; it takes observations in a high-dimensional space (like our 256-element vectors) and maps them to a (potentially much) smaller number of dimensions. It's an elegant technique, and the most important thing to know about it is that it tries to ensure that the dimensions with the most variance contribute the most to the mapping, while the dimensions with the least variance are (more or less) disregarded. The other important thing to know about PCA is that there are very efficient ways to compute it, even on large datasets that don't fit in memory on a single machine. We'll see it in action now, using the implementation from scikit-learn.
import sklearn.decomposition
DIMENSIONS = 2
pca2 = sklearn.decomposition.PCA(DIMENSIONS)
pca_a = pca2.fit_transform(a)
The `.fit_transform()` method takes an array of high-dimensional observations and will both perform the principal component analysis (the "fit" part) and use that to map the high-dimensional values to low-dimensional ones (the "transform" part). We can see what the transformed vectors look like:
pca_df = pd.DataFrame(pca_a, columns=["x", "y"])
pca_df.sample(50)
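How much of the original variation survives this projection? A fitted `PCA` object reports that through its `explained_variance_ratio_` attribute, so inspecting `pca2.explained_variance_ratio_` would tell us how faithful our two-dimensional picture is. Here's a self-contained sketch on random toy data (the toy dimensions are made up; in this notebook you'd inspect `pca2` directly):

```python
import numpy as np
import sklearn.decomposition

rng = np.random.RandomState(0)
toy = rng.normal(size=(200, 16))  # 200 observations in 16 dimensions

pca = sklearn.decomposition.PCA(2)
projected = pca.fit_transform(toy)
print(projected.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```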
Let's plot these points to see if it looks like there is some structure in our data. We'll use the Altair library, which is a declarative visualization library, meaning that the presentation of our data will depend on the data itself -- for example, we'll say to use the two elements of the vectors for x and y coordinates but to use whether a document is legitimate or spam to determine how to color the point.
We'll start by using the `concat` function in the Pandas library to make a data frame consisting of the original data frame with the PCA vector for each row.
plot_data = pd.concat([data.reset_index(), pca_df], axis=1)
Our next step will be to set up Altair and tell it how to encode our data frame in a plot, using the `.encode(...)` method to specify which values to use for x and y coordinates, as well as which value to use to decide how to color points. Altair will restrict us to plotting 5,000 points (so that the generated chart will not overwhelm our browser), so we'll also make sure to sample a subset of the data (in this case, 1,000 points).
import altair as alt
alt.Chart(plot_data.sample(1000)).encode(x="x", y="y", color="label").mark_point().interactive()
That plot in particular is interactive (note the call to `.interactive()` at the end of the command), which means that you can pan around by dragging with the mouse or zoom with the mouse wheel. Try it out!
Notice that, for the most part, even our simple shingling approach has identified some structure in the data: there is a clear dividing line between legitimate and spam documents. (It's important to remember that we're only using the labels to color points after we've placed them -- the PCA transformation isn't taking labels into account when mapping the vectors to two dimensions.)
✅ You can go back and re-run this notebook through this cell after changing `BUCKETS` and `SHINGLE_K` to different values.
The next approach we'll try is called t-distributed stochastic neighbor embedding, or t-SNE for short. t-SNE learns a mapping from high-dimensional points to low-dimensional points so that points that are similar in high-dimensional space are likely to be similar in low-dimensional space as well. t-SNE can sometimes identify structure that simpler techniques like PCA can't, but this power comes at a cost: it is much more expensive to compute than PCA and doesn't parallelize well. (t-SNE also works best for visualizing two-dimensional data when it is reducing from tens of dimensions rather than hundreds or thousands. So, in some cases, you'll want to use a fast technique like PCA to reduce your data to a few dozen dimensions before using t-SNE. We haven't done that in this notebook, though.)
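That PCA-then-t-SNE combination might look something like the following sketch on random toy data (this isn't a cell the notebook runs, and the sizes are made up for illustration):

```python
import numpy as np
import sklearn.decomposition
import sklearn.manifold

rng = np.random.RandomState(0)
high_dim = rng.normal(size=(150, 256))  # e.g., 150 observations of 256 hashed-shingle buckets

# first use fast PCA to get down to a few dozen dimensions...
reduced = sklearn.decomposition.PCA(32).fit_transform(high_dim)

# ...then let the (much slower) t-SNE handle the final reduction to two dimensions
tsne = sklearn.manifold.TSNE(n_components=2, init="pca", random_state=0)
embedded = tsne.fit_transform(reduced)
print(embedded.shape)  # (150, 2)
```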
So we can finish this notebook quickly and get on to the rest of our material, we'll only use t-SNE to visualize a subset of our data. We've declared a helper function called `sample_corresponding`, which takes a sequence of arrays or data frames, generates a set of random indices, and returns collections with the elements corresponding to the selected indices from each array or data frame. So if we had the collections `[1, 2, 3, 4, 5]` and `[2, 4, 6, 8, 10]`, a call to `sample_corresponding` asking for two elements might return `[[1, 4], [2, 8]]`.
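The real `sample_corresponding` lives in the `mlworkflows` package; a minimal sketch of what such a helper might look like (our guess at an implementation, not the package's actual code) is:

```python
import numpy as np
import pandas as pd

def sample_corresponding(count, *collections):
    # draw one set of random row indices and apply it to every
    # collection, so corresponding elements stay aligned
    indices = np.random.choice(len(collections[0]), count, replace=False)
    return [c.iloc[indices] if isinstance(c, pd.DataFrame) else np.asarray(c)[indices]
            for c in collections]

xs, ys = sample_corresponding(2, [1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(xs, ys)  # e.g., [1 4] [2 8]
```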
import sklearn.manifold
from mlworkflows import util as mlwutil
np.random.seed(0xc0ffee)
sdf, sa = mlwutil.sample_corresponding(800, data, a)
tsne = sklearn.manifold.TSNE()
tsne_a = tsne.fit_transform(sa)
tsne_plot_data = pd.concat([sdf.reset_index(), pd.DataFrame(tsne_a, columns=["x", "y"])], axis=1)
The Altair library, which we introduced while looking at our PCA results, is easy to use. However, to avoid cluttering our notebooks in a common case, we've introduced a helper function called `plot_points` that takes a data frame and a data encoding and generates an interactive Altair scatterplot. (For more complicated cases, we'll still want to use Altair directly.)
from mlworkflows import plot
plot.plot_points(tsne_plot_data, x="x", y="y", color="label")
In this notebook, you've learned about two ways to visualize multidimensional data in two dimensions, which helps you to evaluate whether or not a given feature engineering approach is revealing structure in your data. In our next notebook, you'll learn how to evaluate models based on the predictions that they make.