Let's take a look at a simple way to try to identify some structure in our data. Getting some understanding of the data is an important first step before we even start using machine learning techniques to train a model; in this notebook, we'll approach that problem from a couple of different angles.
We'll start by loading our training data.
import pandas as pd
data = pd.read_parquet("data/training.parquet")
Our training data (which we generated in the previous notebook) consists of labels (either legitimate or spam) and short documents of plausible English text. We can inspect these data:
data.sample(50)
| | label | text |
|---|---|---|
19705 | spam | They are light and crunchy. Nowhere near the t... |
1361 | legitimate | Elinor, with a very curious specimen of heath.... |
9178 | spam | As with any product, consult your health cae p... |
6493 | spam | They're a training treat when I needed it. Try... |
11696 | legitimate | Thank you, thank you. She heard him sigh. Hone... |
8709 | legitimate | And what would you like to have now? Poor man!... |
9263 | legitimate | It will be giving him so much before. I admire... |
9191 | legitimate | Amazon and Wild Ride Beef Jerky, so I decided ... |
33 | legitimate | I decided to go for foods that are advertised ... |
17872 | spam | They taste great and totally hit the spot afte... |
2888 | spam | Well worth trying. It had a slightly bitter af... |
815 | spam | Don't buy it! Nor was it packaged in any sort ... |
17602 | legitimate | She did not regret it; but she felt that it ou... |
11302 | legitimate | Darcy was not less answerable for Wickham's ab... |
12973 | legitimate | It would be a comfort to think how her own cha... |
19350 | spam | In addition, I have made a wonderful, irreplac... |
9415 | spam | The foil lined container is bug proof and keep... |
7929 | spam | I went to their website & found they cost less... |
9291 | spam | My dog really enjoys these too, so they match ... |
13270 | spam | I know this is special. I'm so glad I still ha... |
3268 | spam | Great price at Amazon is much more refreshing,... |
3301 | spam | Again... these are CAT TREATS, not an AUDIO CD... |
15092 | legitimate | No! With such an Anhalt, however, Miss Crawfor... |
13030 | spam | I wanted something that had extra vitamins in ... |
9816 | spam | The other teas I've tried but this product tas... |
16842 | spam | Glad to see it finally at my door. i like the ... |
9090 | spam | We all need to be taken to the vet's for outsi... |
4210 | legitimate | A piece of this biscuit broke off and got stuc... |
9806 | spam | Drank the Starbucks variety described here, th... |
9765 | legitimate | She was married--married against her inclinati... |
481 | legitimate | They were soon gone again, rising from their s... |
16387 | legitimate | As the blow was given, Emma felt that Mrs. Wes... |
2715 | spam | However on a couple of pages of instructions f... |
18645 | spam | I would drink it again next spring. The other ... |
639 | spam | If I could just store it in glass bottles tast... |
15503 | spam | As the title says, I trashed it after a SIP. S... |
1747 | spam | Great for kids before school. It tasted very a... |
9861 | legitimate | Jane looked at Elizabeth with surprise and con... |
758 | legitimate | How very provoking! Married women, you know, m... |
19577 | spam | We wouldn't even know it gf. In fact my cats l... |
8990 | legitimate | After a little reflection, venture the followi... |
18497 | spam | It is better to pay $4-$5 more and get a full ... |
14788 | legitimate | He liked very much to have been the means of g... |
15865 | legitimate | You know I have a cup right now! Lady Bridget,... |
10786 | spam | These almonds are the best. It sure must be VE... |
14603 | spam | I like to keep a supply in the pantry -- what ... |
8310 | spam | He loves these candies! Thank you! |
11767 | spam | I waited 20 mins to cook. I couldn't get them ... |
8656 | spam | At times, I add a tablespoon of Hillsbrothers ... |
4928 | spam | I can also rate the She Crab Soup they make be... |
Ultimately, machine learning algorithms operate on data that is structured differently from the data we might deal with in database tables or application programs. In order to identify and exploit structure in these data, we are going to map our natural-language documents to points in space by converting them to vectors of floating-point numbers.
This process is often tricky: you want a mapping from arbitrary data to points in some space that preserves the structure of the data. That is, documents that are similar should map to points that are near one another (for some definition of similarity), and documents that are dissimilar should not map to nearby points. The name for this process of turning real-world data into a form that a machine learning algorithm can take advantage of is feature engineering.
You'll learn more about feature engineering in the next notebook; for now, we'll just take a very basic approach that will let us visualize our data. We'll first convert our documents to k-shingles, or sequences of k characters (for some small value of k). This means that a document like
the quick brown fox jumps over the lazy dog
would become this sequence of 4-shingles:
['the ', 'he q', 'e qu', ' qui', 'quic', 'uick', 'ick ', 'ck b', 'k br', ' bro', 'brow', 'rown', 'own ', 'wn f', 'n fo', ' fox', 'fox ', 'ox j', 'x ju', ' jum', 'jump', 'umps', 'mps ', 'ps o', 's ov', ' ove', 'over', 'ver ', 'er t', 'r th', ' the', 'the ', 'he l', 'e la', ' laz', 'lazy', 'azy ', 'zy d', 'y do', ' dog']
Shingling gets us a step closer to having vector representations of documents -- ultimately, our assumption is that spam documents will have some k-shingles that legitimate documents don't, and vice versa. Here's how we'd add a field of shingles to our data:
def doc2shingles(k):
    def kshingles(doc):
        # every contiguous substring of length k characters
        return [doc[i:i + k] for i in range(len(doc) - k + 1)]
    return kshingles
data["shingles"] = data["text"].apply(doc2shingles(4))
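As a quick sanity check, we can apply a 4-shingle function to the example sentence from above and confirm that it reproduces the sequence we listed (the definition is repeated here so the snippet is self-contained):

```python
def doc2shingles(k):
    def kshingles(doc):
        # every contiguous substring of length k characters
        return [doc[i:i + k] for i in range(len(doc) - k + 1)]
    return kshingles

shingles = doc2shingles(4)("the quick brown fox jumps over the lazy dog")
# a 43-character document yields 43 - 4 + 1 = 40 overlapping shingles
```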
Remember, our goal is to learn a function that can separate documents that are likely to be legitimate messages (i.e., prose in the style of Jane Austen) from those that are likely to be spam (i.e., prose in the style of food-product reviews), so we'll still want to transform our lists of shingles into vectors.
(That's what we'll logically do -- we'll actually do these steps a bit out of order because it will make our code simpler and more efficient without changing the results.)
import numpy as np
def hashing_frequency(vecsize, h):
    """
    returns a function that will collect shingle frequencies
    into a vector with _vecsize_ elements and will use
    the hash function _h_ to choose which vector element
    to update for a given term
    """
    def hf(words):
        # handle both lists of shingles and space-delimited strings
        if isinstance(words, str):
            words = words.split(" ")
        result = np.zeros(vecsize)
        for term in words:
            result[h(term) % vecsize] += 1.0
        # normalize counts to frequencies
        return result / result.sum()
    return hf
a = np.array([hashing_frequency(1024, hash)(v) for v in data["shingles"].values])
So now instead of having documents (which we had from the raw data) or lists of shingles, we have vectors representing shingle frequencies. Because we've hashed shingles into these vectors, we can't in general reconstruct a document or the shingles from a vector, but we do know that if the same shingle appears in two documents, their vectors will reflect it in corresponding buckets.
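To make that concrete, here is a small sketch (with a toy 16-element vector and made-up terms rather than real shingles) showing that a term shared by two documents lands in the same bucket of both frequency vectors:

```python
import numpy as np

def hashing_frequency(vecsize, h):
    # same idea as the function above: hash each term into a bucket,
    # count occurrences, then normalize counts to frequencies
    def hf(words):
        if isinstance(words, str):
            words = words.split(" ")
        result = np.zeros(vecsize)
        for term in words:
            result[h(term) % vecsize] += 1.0
        return result / result.sum()
    return hf

hf = hashing_frequency(16, hash)
v1 = hf(["spam", "ham", "spam"])
v2 = hf("spam eggs")
# "spam" appears in both documents, so it updates the same
# bucket in both vectors (within a single Python process)
bucket = hash("spam") % 16
```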
However, we've generated a 1024-element vector. Recall that our ultimate goal is to place documents in space so that we can identify a way to separate legitimate documents from spam documents. Our 1024-element vector is a point in a space, but it's a point in a space that most of our geometric intuitions don't apply to (some of us have enough trouble navigating the three dimensions of the physical world).
Let's use a very basic technique to project these vectors to a much smaller space that we can visualize. Principal component analysis, or PCA, is a statistical technique that is over a century old; it takes observations in a high-dimensional space (like our 1024-element vectors) and maps them to a (potentially much) smaller number of dimensions. It's an elegant technique, and the most important thing to know about it is that it tries to ensure that the dimensions with the most variance contribute the most to the mapping, while the dimensions with the least variance are (more or less) disregarded. The other important thing to know about PCA is that there are very efficient ways to compute it, even on large datasets that don't fit in memory on a single machine. We'll see it in action now, using the implementation from scikit-learn.
import sklearn.decomposition
DIMENSIONS = 2
pca2 = sklearn.decomposition.PCA(DIMENSIONS)
pca_a = pca2.fit_transform(a)
The .fit_transform() method takes an array of high-dimensional observations and will both perform the principal component analysis (the "fit" part) and use that analysis to map the high-dimensional values to low-dimensional ones (the "transform" part). We can see what the transformed vectors look like:
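A fitted PCA object also reports how much of the total variance each component captures, which is a useful check on how faithful the two-dimensional picture is. A minimal sketch, using synthetic data as a stand-in for our shingle-frequency matrix:

```python
import numpy as np
import sklearn.decomposition

rng = np.random.default_rng(0)
# hypothetical stand-in for our shingle-frequency matrix `a`
X = rng.normal(size=(200, 32))

pca2 = sklearn.decomposition.PCA(2)
low = pca2.fit_transform(X)

# each entry is the fraction of total variance captured by that component;
# components are ordered from most to least variance explained
ratios = pca2.explained_variance_ratio_
```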
pca_df = pd.DataFrame(pca_a, columns=["x", "y"])
pca_df.sample(50)
| | x | y |
|---|---|---|
36009 | 0.008177 | 0.006311 |
35983 | 0.009110 | 0.003361 |
11480 | -0.006559 | 0.003144 |
17391 | 0.001298 | -0.015437 |
16223 | -0.009819 | 0.006162 |
18837 | 0.000296 | -0.013754 |
25407 | 0.001718 | 0.008952 |
38800 | 0.005547 | 0.012576 |
2933 | -0.002714 | -0.002964 |
25761 | 0.006497 | -0.001630 |
8853 | -0.006679 | 0.000324 |
38188 | -0.000280 | 0.005572 |
2099 | -0.001983 | -0.001268 |
10625 | -0.004496 | 0.000010 |
13738 | 0.000262 | 0.005045 |
34241 | 0.015671 | -0.001397 |
17495 | -0.008758 | 0.005778 |
17226 | -0.000544 | -0.004336 |
19546 | -0.007105 | -0.001938 |
26465 | 0.007774 | -0.005543 |
4095 | 0.003267 | 0.003868 |
20293 | 0.002514 | 0.005010 |
25274 | 0.002884 | -0.002202 |
34850 | 0.010031 | -0.006281 |
28430 | 0.008400 | 0.000434 |
39612 | 0.008303 | 0.004749 |
17833 | -0.004425 | 0.003183 |
3610 | -0.004627 | -0.000478 |
8374 | -0.004780 | -0.003094 |
22111 | -0.003051 | 0.008875 |
17697 | -0.010143 | 0.004491 |
9713 | -0.005250 | 0.007804 |
18927 | -0.002573 | -0.001829 |
34012 | 0.003427 | 0.005076 |
9536 | 0.002619 | -0.014223 |
17234 | 0.000904 | 0.012190 |
12238 | -0.012939 | 0.003044 |
13178 | -0.006365 | 0.006139 |
6301 | -0.009936 | -0.007318 |
38546 | 0.006294 | -0.004103 |
7555 | -0.006165 | -0.003416 |
9407 | -0.005687 | 0.006702 |
17187 | -0.008348 | 0.000226 |
31800 | 0.010628 | -0.002215 |
21695 | -0.003796 | 0.005402 |
19199 | 0.000538 | -0.004863 |
30641 | -0.001192 | 0.005518 |
27659 | 0.005007 | -0.002051 |
5468 | -0.002287 | -0.001838 |
30887 | 0.014977 | 0.001127 |
Let's plot these points to see if it looks like there is some structure in our data. We'll use the Altair library, which is a declarative visualization library, meaning that we declare how properties of the data map to properties of the plot -- for example, we'll say to use the two elements of each vector for x and y coordinates but to use whether a document is legitimate or spam to determine how to color the point.
We'll start by using the concat function in the Pandas library to make a data frame consisting of the original data frame with the PCA vector for each row.
plot_data = pd.concat([data.reset_index(), pca_df], axis=1)
Our next step will be to set up Altair and tell it how to encode our data frame in a plot, using the .encode(...) method to specify which values to use for x and y coordinates, as well as which value should determine how points are colored. Altair will restrict us to plotting 5,000 points (so that the generated chart will not overwhelm our browser), so we'll also make sure to sample a subset of the data (in this case, 1,000 points).
import altair as alt
alt.renderers.enable('notebook')
alt.Chart(plot_data.sample(1000)).encode(x="x", y="y", color="label").mark_point().interactive()
That plot is interactive (note the call to .interactive() at the end of the command), which means that you can pan around by dragging with the mouse or zoom with the mouse wheel. Try it out!
Notice that, for the most part, even our simple shingling approach has identified some structure in the data: there is a clear dividing line between legitimate and spam documents. (It's important to remember that we're only using the labels to color points after we've placed them -- the PCA transformation isn't taking labels into account when mapping the vectors to two dimensions.)
The next approach we'll try is called t-distributed stochastic neighbor embedding, or t-SNE for short. t-SNE learns a mapping from high-dimensional points to low-dimensional points so that points that are similar in high-dimensional space are likely to be similar in low-dimensional space as well. t-SNE can sometimes identify structure that simpler techniques like PCA can't, but this power comes at a cost: it is much more expensive to compute than PCA and doesn't parallelize well. (t-SNE also works best for visualizing two-dimensional data when it is reducing from tens of dimensions rather than hundreds or thousands. So, in some cases, you'll want to use a fast technique like PCA to reduce your data to a few dozen dimensions before using t-SNE. We haven't done that in this notebook, though.)
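That PCA-then-t-SNE pipeline can be sketched as follows; the sizes here are hypothetical stand-ins for our data, and we haven't applied this two-stage approach in this notebook:

```python
import numpy as np
import sklearn.decomposition
import sklearn.manifold

rng = np.random.default_rng(42)
# hypothetical stand-in for a collection of high-dimensional document vectors
X = rng.normal(size=(300, 256))

# first use fast PCA to reduce to a few dozen dimensions...
X50 = sklearn.decomposition.PCA(50).fit_transform(X)

# ...then let the (much slower) t-SNE map those to two dimensions
tsne = sklearn.manifold.TSNE(n_components=2, init="pca", random_state=42)
X2 = tsne.fit_transform(X50)
```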
So we can finish this notebook quickly and get on to the rest of our material, we'll only use t-SNE to visualize a subset of our data. We've declared a helper function called sample_corresponding, which takes a sequence of arrays or data frames, generates a set of random indices, and returns collections with the elements corresponding to the selected indices from each array or data frame. So if we had the collections [1, 2, 3, 4, 5] and [2, 4, 6, 8, 10], a call to sample_corresponding asking for two elements might return [[1, 4], [2, 8]].
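The actual helper lives in the mlworkflows package; a minimal sketch of how such a function might be implemented (the real implementation may differ) is:

```python
import numpy as np
import pandas as pd

def sample_corresponding(n, *collections):
    # draw one set of random row indices and apply it to every
    # array or data frame, so sampled elements stay paired up
    size = len(collections[0])
    indices = np.random.choice(size, n, replace=False)
    return [c.iloc[indices] if isinstance(c, pd.DataFrame) else c[indices]
            for c in collections]

np.random.seed(0)
xs = np.array([1, 2, 3, 4, 5])
ys = np.array([2, 4, 6, 8, 10])
sx, sy = sample_corresponding(2, xs, ys)
```

Because both collections are indexed by the same random draw, each sampled element of the second collection still corresponds to its partner in the first.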
import sklearn.manifold
from mlworkflows import util as mlwutil
np.random.seed(0xc0ffee)
sdf, sa = mlwutil.sample_corresponding(800, data, a)
tsne = sklearn.manifold.TSNE()
tsne_a = tsne.fit_transform(sa)
tsne_plot_data = pd.concat([sdf.reset_index(), pd.DataFrame(tsne_a, columns=["x", "y"])], axis=1)
The Altair library, which we introduced while looking at our PCA results, is easy to use. However, to avoid cluttering our notebooks in a common case, we've introduced a helper function called plot_points that takes a data frame and a data encoding and generates an interactive Altair scatterplot. (For more complicated cases, we'll still want to use Altair directly.)
from mlworkflows import plot
plot.plot_points(tsne_plot_data, x="x", y="y", color="label")