Let's take a look at a simple way to try to identify some structure in our data. Getting some understanding of the data is an important first step before we even start using machine learning techniques to train a model; in this notebook, we'll approach that problem from a couple of different angles.
We'll start by loading our training data.
import pandas as pd
data = pd.read_parquet("data/training.parquet")
Our training data (which we generated in the previous notebook) consists of labels (either legitimate or spam) and short documents of plausible English text. We can inspect these data:
data.sample(50)
| | label | text |
|---|---|---|
19705 | spam | They are light and crunchy. Nowhere near the t... |
1361 | legitimate | Elinor, with a very curious specimen of heath.... |
9178 | spam | As with any product, consult your health cae p... |
6493 | spam | They're a training treat when I needed it. Try... |
11696 | legitimate | Thank you, thank you. She heard him sigh. Hone... |
8709 | legitimate | And what would you like to have now? Poor man!... |
9263 | legitimate | It will be giving him so much before. I admire... |
9191 | legitimate | Amazon and Wild Ride Beef Jerky, so I decided ... |
33 | legitimate | I decided to go for foods that are advertised ... |
17872 | spam | They taste great and totally hit the spot afte... |
2888 | spam | Well worth trying. It had a slightly bitter af... |
815 | spam | Don't buy it! Nor was it packaged in any sort ... |
17602 | legitimate | She did not regret it; but she felt that it ou... |
11302 | legitimate | Darcy was not less answerable for Wickham's ab... |
12973 | legitimate | It would be a comfort to think how her own cha... |
19350 | spam | In addition, I have made a wonderful, irreplac... |
9415 | spam | The foil lined container is bug proof and keep... |
7929 | spam | I went to their website & found they cost less... |
9291 | spam | My dog really enjoys these too, so they match ... |
13270 | spam | I know this is special. I'm so glad I still ha... |
3268 | spam | Great price at Amazon is much more refreshing,... |
3301 | spam | Again... these are CAT TREATS, not an AUDIO CD... |
15092 | legitimate | No! With such an Anhalt, however, Miss Crawfor... |
13030 | spam | I wanted something that had extra vitamins in ... |
9816 | spam | The other teas I've tried but this product tas... |
16842 | spam | Glad to see it finally at my door. i like the ... |
9090 | spam | We all need to be taken to the vet's for outsi... |
4210 | legitimate | A piece of this biscuit broke off and got stuc... |
9806 | spam | Drank the Starbucks variety described here, th... |
9765 | legitimate | She was married--married against her inclinati... |
481 | legitimate | They were soon gone again, rising from their s... |
16387 | legitimate | As the blow was given, Emma felt that Mrs. Wes... |
2715 | spam | However on a couple of pages of instructions f... |
18645 | spam | I would drink it again next spring. The other ... |
639 | spam | If I could just store it in glass bottles tast... |
15503 | spam | As the title says, I trashed it after a SIP. S... |
1747 | spam | Great for kids before school. It tasted very a... |
9861 | legitimate | Jane looked at Elizabeth with surprise and con... |
758 | legitimate | How very provoking! Married women, you know, m... |
19577 | spam | We wouldn't even know it gf. In fact my cats l... |
8990 | legitimate | After a little reflection, venture the followi... |
18497 | spam | It is better to pay $4-$5 more and get a full ... |
14788 | legitimate | He liked very much to have been the means of g... |
15865 | legitimate | You know I have a cup right now! Lady Bridget,... |
10786 | spam | These almonds are the best. It sure must be VE... |
14603 | spam | I like to keep a supply in the pantry -- what ... |
8310 | spam | He loves these candies! Thank you! |
11767 | spam | I waited 20 mins to cook. I couldn't get them ... |
8656 | spam | At times, I add a tablespoon of Hillsbrothers ... |
4928 | spam | I can also rate the She Crab Soup they make be... |
Ultimately, machine learning algorithms operate on data that is structured differently from the data we might deal with in database tables or application programs. In order to identify and exploit structure in these data, we are going to map our natural-language documents to points in space by converting them to vectors of floating-point numbers.
This process is often tricky: you want a mapping from arbitrary data to points in some space that preserves the structure of the data. That is, documents that are similar should map to points that are near one another (for some definition of similarity), and documents that are dissimilar should not map to nearby points. The name for this process of turning real-world data into a form that a machine learning algorithm can take advantage of is feature engineering.
You'll learn more about feature engineering in the next notebook; for now, we'll just take a very basic approach that will let us visualize our data. We'll first convert our documents to k-shingles, or sequences of k characters (for some small value of k). This means that a document like
the quick brown fox jumps over the lazy dog
would become this sequence of 4-shingles:
['the ', 'he q', 'e qu', ' qui', 'quic', 'uick', 'ick ', 'ck b', 'k br', ' bro', 'brow', 'rown', 'own ', 'wn f', 'n fo', ' fox', 'fox ', 'ox j', 'x ju', ' jum', 'jump', 'umps', 'mps ', 'ps o', 's ov', ' ove', 'over', 'ver ', 'er t', 'r th', ' the', 'the ', 'he l', 'e la', ' laz', 'lazy', 'azy ', 'zy d', 'y do', ' dog']
Shingling gets us a step closer to having vector representations of documents -- ultimately, our assumption is that spam documents will have some k-shingles that legitimate documents don't, and vice versa. Here's how we'd add a field of shingles to our data:
def doc2shingles(k):
    def kshingles(doc):
        # every contiguous substring of length k characters
        return [doc[i:i + k] for i in range(len(doc) - k + 1)]
    return kshingles
data["shingles"] = data["text"].apply(doc2shingles(4))
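As a quick sanity check, we can apply a 4-shingle function to the example sentence from above and confirm that it reproduces the sequence we listed (the definition is repeated here so the snippet is self-contained):

```python
def doc2shingles(k):
    def kshingles(doc):
        # every contiguous substring of length k characters
        return [doc[i:i + k] for i in range(len(doc) - k + 1)]
    return kshingles

shingles = doc2shingles(4)("the quick brown fox jumps over the lazy dog")
# a 43-character document yields 43 - 4 + 1 = 40 overlapping shingles
```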
Remember, our goal is to learn a function that can separate documents that are likely to be legitimate messages (i.e., prose in the style of Jane Austen) from those that are likely to be spam (i.e., prose in the style of food-product reviews), so we'll still want to transform our lists of shingles into vectors.
(That's what we'll logically do -- we'll actually do these steps a bit out of order because it will make our code simpler and more efficient without changing the results.)
import numpy as np
def hashing_frequency(vecsize, h):
    """
    returns a function that will collect shingle frequencies
    into a vector with _vecsize_ elements and will use
    the hash function _h_ to choose which vector element
    to update for a given term
    """
    def hf(words):
        # handle both lists of shingles and space-delimited strings
        if isinstance(words, str):
            words = words.split(" ")
        result = np.zeros(vecsize)
        for term in words:
            result[h(term) % vecsize] += 1.0
        # normalize counts to frequencies
        return result / result.sum()
    return hf
a = np.array([hashing_frequency(1024, hash)(v) for v in data["shingles"].values])
So now instead of having documents (which we had from the raw data) or lists of shingles, we have vectors representing shingle frequencies. Because we've hashed shingles into these vectors, we can't in general reconstruct a document or the shingles from a vector, but we do know that if the same shingle appears in two documents, their vectors will reflect it in corresponding buckets.
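To make that concrete, here is a small sketch (with a toy 16-element vector and made-up terms rather than real shingles) showing that a term shared by two documents lands in the same bucket of both frequency vectors:

```python
import numpy as np

def hashing_frequency(vecsize, h):
    # same idea as the function above: hash each term into a bucket,
    # count occurrences, then normalize counts to frequencies
    def hf(words):
        if isinstance(words, str):
            words = words.split(" ")
        result = np.zeros(vecsize)
        for term in words:
            result[h(term) % vecsize] += 1.0
        return result / result.sum()
    return hf

hf = hashing_frequency(16, hash)
v1 = hf(["spam", "ham", "spam"])
v2 = hf("spam eggs")
# "spam" appears in both documents, so it updates the same
# bucket in both vectors (within a single Python process)
bucket = hash("spam") % 16
```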
However, we've generated a 1024-element vector. Recall that our ultimate goal is to place documents in space so that we can identify a way to separate legitimate documents from spam documents. Our 1024-element vector is a point in a space, but it's a point in a space that most of our geometric intuitions don't apply to (some of us have enough trouble navigating the three dimensions of the physical world).
Let's use a very basic technique to project these vectors to a much smaller space that we can visualize. Principal component analysis, or PCA, is a statistical technique that is over a century old; it takes observations in a high-dimensional space (like our 1024-element vectors) and maps them to a (potentially much) smaller number of dimensions. It's an elegant technique, and the most important thing to know about it is that it tries to ensure that the dimensions with the most variance contribute the most to the mapping, while the dimensions with the least variance are (more or less) disregarded. The other important thing to know about PCA is that there are very efficient ways to compute it, even on large datasets that don't fit in memory on a single machine. We'll see it in action now, using the implementation from scikit-learn.
import sklearn.decomposition
DIMENSIONS = 2
pca2 = sklearn.decomposition.PCA(DIMENSIONS)
pca_a = pca2.fit_transform(a)
The .fit_transform() method takes an array of high-dimensional observations and will both perform the principal component analysis (the "fit" part) and use that analysis to map the high-dimensional values to low-dimensional ones (the "transform" part). We can see what the transformed vectors look like:
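A fitted PCA object also reports how much of the total variance each component captures, which is a useful check on how faithful the two-dimensional picture is. A minimal sketch, using synthetic data as a stand-in for our shingle-frequency matrix:

```python
import numpy as np
import sklearn.decomposition

rng = np.random.default_rng(0)
# hypothetical stand-in for our shingle-frequency matrix `a`
X = rng.normal(size=(200, 32))

pca2 = sklearn.decomposition.PCA(2)
low = pca2.fit_transform(X)

# each entry is the fraction of total variance captured by that component;
# components are ordered from most to least variance explained
ratios = pca2.explained_variance_ratio_
```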
pca_df = pd.DataFrame(pca_a, columns=["x", "y"])
pca_df.sample(50)
| | x | y |
|---|---|---|
36009 | 0.008177 | 0.006311 |
35983 | 0.009110 | 0.003361 |
11480 | -0.006559 | 0.003144 |
17391 | 0.001298 | -0.015437 |
16223 | -0.009819 | 0.006162 |
18837 | 0.000296 | -0.013754 |
25407 | 0.001718 | 0.008952 |
38800 | 0.005547 | 0.012576 |
2933 | -0.002714 | -0.002964 |
25761 | 0.006497 | -0.001630 |
8853 | -0.006679 | 0.000324 |
38188 | -0.000280 | 0.005572 |
2099 | -0.001983 | -0.001268 |
10625 | -0.004496 | 0.000010 |
13738 | 0.000262 | 0.005045 |
34241 | 0.015671 | -0.001397 |
17495 | -0.008758 | 0.005778 |
17226 | -0.000544 | -0.004336 |
19546 | -0.007105 | -0.001938 |
26465 | 0.007774 | -0.005543 |
4095 | 0.003267 | 0.003868 |
20293 | 0.002514 | 0.005010 |
25274 | 0.002884 | -0.002202 |
34850 | 0.010031 | -0.006281 |
28430 | 0.008400 | 0.000434 |
39612 | 0.008303 | 0.004749 |
17833 | -0.004425 | 0.003183 |
3610 | -0.004627 | -0.000478 |
8374 | -0.004780 | -0.003094 |
22111 | -0.003051 | 0.008875 |
17697 | -0.010143 | 0.004491 |
9713 | -0.005250 | 0.007804 |
18927 | -0.002573 | -0.001829 |
34012 | 0.003427 | 0.005076 |
9536 | 0.002619 | -0.014223 |
17234 | 0.000904 | 0.012190 |
12238 | -0.012939 | 0.003044 |
13178 | -0.006365 | 0.006139 |
6301 | -0.009936 | -0.007318 |
38546 | 0.006294 | -0.004103 |
7555 | -0.006165 | -0.003416 |
9407 | -0.005687 | 0.006702 |
17187 | -0.008348 | 0.000226 |
31800 | 0.010628 | -0.002215 |
21695 | -0.003796 | 0.005402 |
19199 | 0.000538 | -0.004863 |
30641 | -0.001192 | 0.005518 |
27659 | 0.005007 | -0.002051 |
5468 | -0.002287 | -0.001838 |
30887 | 0.014977 | 0.001127 |
Let's plot these points to see if it looks like there is some structure in our data. We'll use the Altair library, which is a declarative visualization library, meaning that we declare how properties of the data map to properties of the plot -- for example, we'll say to use the two elements of each vector for x and y coordinates but to use whether a document is legitimate or spam to determine how to color the point.
We'll start by using the concat function in the Pandas library to make a data frame consisting of the original data frame with the PCA vector for each row.
plot_data = pd.concat([data.reset_index(), pca_df], axis=1)
Our next step will be to set up Altair and tell it how to encode our data frame in a plot, using the .encode(...) method to specify which values to use for x and y coordinates, as well as which value should determine how points are colored. Altair will restrict us to plotting 5,000 points (so that the generated chart will not overwhelm our browser), so we'll also make sure to sample a subset of the data (in this case, 1,000 points).
import altair as alt
alt.renderers.enable('notebook')
alt.Chart(plot_data.sample(1000)).encode(x="x", y="y", color="label").mark_point().interactive()
That plot is interactive (note the call to .interactive() at the end of the command), which means that you can pan around by dragging with the mouse or zoom with the mouse wheel. Try it out!
Notice that, for the most part, even our simple shingling approach has identified some structure in the data: there is a clear dividing line between legitimate and spam documents. (It's important to remember that we're only using the labels to color points after we've placed them -- the PCA transformation isn't taking labels into account when mapping the vectors to two dimensions.)
The next approach we'll try is called t-distributed stochastic neighbor embedding, or t-SNE for short. t-SNE learns a mapping from high-dimensional points to low-dimensional points so that points that are similar in high-dimensional space are likely to be similar in low-dimensional space as well. t-SNE can sometimes identify structure that simpler techniques like PCA can't, but this power comes at a cost: it is much more expensive to compute than PCA and doesn't parallelize well. (t-SNE also works best for visualizing two-dimensional data when it is reducing from tens of dimensions rather than hundreds or thousands. So, in some cases, you'll want to use a fast technique like PCA to reduce your data to a few dozen dimensions before using t-SNE. We haven't done that in this notebook, though.)
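That PCA-then-t-SNE pipeline can be sketched as follows; the sizes here are hypothetical stand-ins for our data, and we haven't applied this two-stage approach in this notebook:

```python
import numpy as np
import sklearn.decomposition
import sklearn.manifold

rng = np.random.default_rng(42)
# hypothetical stand-in for a collection of high-dimensional document vectors
X = rng.normal(size=(300, 256))

# first use fast PCA to reduce to a few dozen dimensions...
X50 = sklearn.decomposition.PCA(50).fit_transform(X)

# ...then let the (much slower) t-SNE map those to two dimensions
tsne = sklearn.manifold.TSNE(n_components=2, init="pca", random_state=42)
X2 = tsne.fit_transform(X50)
```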
So we can finish this notebook quickly and get on to the rest of our material, we'll only use t-SNE to visualize a subset of our data. We've declared a helper function called sample_corresponding, which takes a sequence of arrays or data frames, generates a set of random indices, and returns collections with the elements corresponding to the selected indices from each array or data frame. So if we had the collections [1, 2, 3, 4, 5] and [2, 4, 6, 8, 10], a call to sample_corresponding asking for two elements might return [[1, 4], [2, 8]].
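The actual helper lives in the mlworkflows package; a minimal sketch of how such a function might be implemented (the real implementation may differ) is:

```python
import numpy as np
import pandas as pd

def sample_corresponding(n, *collections):
    # draw one set of random row indices and apply it to every
    # array or data frame, so sampled elements stay paired up
    size = len(collections[0])
    indices = np.random.choice(size, n, replace=False)
    return [c.iloc[indices] if isinstance(c, pd.DataFrame) else c[indices]
            for c in collections]

np.random.seed(0)
xs = np.array([1, 2, 3, 4, 5])
ys = np.array([2, 4, 6, 8, 10])
sx, sy = sample_corresponding(2, xs, ys)
```

Because both collections are indexed by the same random draw, each sampled element of the second collection still corresponds to its partner in the first.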
import sklearn.manifold
from mlworkflows import util as mlwutil
np.random.seed(0xc0ffee)
sdf, sa = mlwutil.sample_corresponding(800, data, a)
tsne = sklearn.manifold.TSNE()
tsne_a = tsne.fit_transform(sa)
tsne_plot_data = pd.concat([sdf.reset_index(), pd.DataFrame(tsne_a, columns=["x", "y"])], axis=1)
The Altair library, which we introduced while looking at our PCA results, is easy to use. However, to avoid cluttering our notebooks in a common case, we've introduced a helper function called plot_points that takes a data frame and a data encoding and generates an interactive Altair scatterplot. (For more complicated cases, we'll still want to use Altair directly.)
from mlworkflows import plot
plot.plot_points(tsne_plot_data, x="x", y="y", color="label")