In this notebook, I want to wrap up some loose ends from last time.

The two cultures¶

This "debate" captures the tension between two approaches:

modeling the underlying mechanism of a phenomenon
using machine learning to predict outputs (without necessarily understanding the mechanisms that create them)

I was part of a research project (in 2007) that involved manually coding each of the above reactions. We were determining whether the final system could generate the same outputs (in this case, levels in the blood of various substrates) as were observed in clinical studies.

The equation for each reaction could be quite complex:

This is an example of modeling the underlying mechanism, and is very different from a machine learning approach.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2391141/

The most popular word in each state¶

A time to remove stop words

Factorization is analogous to matrix decomposition¶

With Integers¶

Multiplication: $$2 * 2 * 3 * 3 * 2 * 2 \rightarrow 144$$

Factorization is the “opposite” of multiplication: $$144 \rightarrow 2 * 2 * 3 * 3 * 2 * 2$$

Here, the factors have the nice property of being prime.

Prime factorization is much harder than multiplication (which is good, because that difficulty is at the heart of public-key encryption schemes such as RSA).
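
To make the integer analogy concrete, here is a minimal sketch that factors 144 by trial division and multiplies the factors back together. The helper name `prime_factors` is hypothetical, and trial division is only practical for small numbers; it is not how factoring is attacked at cryptographic sizes.

```python
from math import prod

def prime_factors(n):
    """Return the prime factors of n (with multiplicity) by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        # Divide out each prime as many times as it appears
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        # Whatever remains is itself prime
        factors.append(n)
    return factors

factors = prime_factors(144)  # [2, 2, 2, 2, 3, 3]
product = prod(factors)       # multiplying back recovers 144
```

Multiplying is a single pass over the factors, while factoring has to search for divisors, which is the asymmetry described above.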

With Matrices¶

Matrix decompositions are a way of taking matrices apart (the "opposite" of matrix multiplication).

Similarly, we use matrix decompositions to come up with matrices with nice properties.

Taking matrices apart is harder than putting them together.

One application:

What are the nice properties that matrices in an SVD decomposition have?

$$A = USV$$
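
One way to explore that question is to check the factors numerically. The sketch below uses a small random matrix (an assumption for illustration, not the notebook's data) and `np.linalg.svd`, which returns the third factor already transposed, so it matches the $A = USV$ convention used here.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # toy matrix, not the notebook's data

# Reduced SVD: U is 5x3, s holds 3 singular values, V is 3x3
U, s, V = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

reconstructs = np.allclose(A, U @ S @ V)          # the product gives back A
u_orthonormal = np.allclose(U.T @ U, np.eye(3))   # columns of U are orthonormal
v_orthonormal = np.allclose(V @ V.T, np.eye(3))   # rows of V are orthonormal
# S is diagonal with non-negative entries sorted from largest to smallest
s_sorted = bool(np.all(s >= 0) and np.all(s[:-1] >= s[1:]))
```

All four checks should come out `True`: two factors with orthonormal columns or rows, and a diagonal matrix of ordered, non-negative singular values.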

Some Linear Algebra Review¶

Matrix-vector multiplication¶

$Ax = b$ takes a linear combination of the columns of $A$, using coefficients $x$

http://matrixmultiplication.xyz/
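
A small NumPy check of the column view of $Ax = b$, using made-up numbers:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
x = np.array([10, -1])

# The usual matrix-vector product
b = A @ x  # [8, 26, 44]

# The same result, built as a linear combination of the columns of A
combo = x[0] * A[:, 0] + x[1] * A[:, 1]
```

Here `b` and `combo` are the same vector: 10 times the first column of `A` minus 1 times the second.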

Matrix-matrix multiplication¶

$A B = C$: each column of $C$ is a linear combination of the columns of $A$, where the coefficients come from the corresponding column of $B$

(source: NMF Tutorial)
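
The same idea can be checked column by column. This sketch uses small made-up matrices and confirms that column `j` of `C` is `A` times column `j` of `B`:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
B = np.array([[1, 0, 2],
              [0, 1, -1]])

C = A @ B

# Column j of C is a linear combination of the columns of A,
# with coefficients taken from column j of B
columns_match = all(
    np.array_equal(C[:, j], A @ B[:, j]) for j in range(B.shape[1])
)

# For example, the last column of B is [2, -1],
# so the last column of C is 2 * A[:, 0] - 1 * A[:, 1]
last_col = 2 * A[:, 0] - 1 * A[:, 1]  # [0, 2, 4]
```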

Matrices as Transformations¶

The 3Blue1Brown Essence of Linear Algebra videos are fantastic. They give a much more visual & geometric perspective on linear algebra than how it is typically taught. These videos are a great resource if you are a linear algebra beginner, or feel uncomfortable or rusty with the material.

Even if you are a linear algebra pro, I still recommend these videos for a new perspective, and they are very well made.

In [2]:

from IPython.display import YouTubeVideo

YouTubeVideo("kYB8IZa5AuE")

Out[2]:

British Literature SVD & NMF in Excel¶

Data was downloaded from here

The code below was used to create the matrices which are displayed in the SVD and NMF of British Literature Excel workbook. The data is intended to be viewed in Excel; I've just included the code here for thoroughness.

Initializing, create document-term matrix¶

In [2]:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import decomposition
from glob import glob
import os

In [3]:

np.set_printoptions(suppress=True)

In [46]:

filenames = []
for folder in ["british-fiction-corpus"]: #, "french-plays", "hugo-les-misérables"]:
    filenames.extend(glob("data/literature/" + folder + "/*.txt"))

In [47]:

len(filenames)

Out[47]:

27

In [134]:

vectorizer = TfidfVectorizer(input='filename', stop_words='english')
dtm = vectorizer.fit_transform(filenames).toarray()
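# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
vocab = np.array(vectorizer.get_feature_names())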
dtm.shape, len(vocab)

Out[134]:

((27, 55035), 55035)
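
To see what a document-term matrix looks like at a readable size, here is a minimal sketch built by hand. The three sentences and the tiny stop-word list are made up, and the entries are raw counts rather than the TF-IDF weights that `TfidfVectorizer` produces.

```python
import numpy as np

# Hypothetical toy corpus and stop-word list
docs = ["the cat sat on the mat", "the dog sat", "the cat saw the dog"]
stop_words = {"the", "on"}

# Tokenize on whitespace and drop stop words
tokens = [[w for w in d.split() if w not in stop_words] for d in docs]

# Vocabulary: one column per distinct remaining word
toy_vocab = sorted({w for t in tokens for w in t})
col = {w: j for j, w in enumerate(toy_vocab)}

# One row per document, one column per word, entries are counts
toy_dtm = np.zeros((len(docs), len(toy_vocab)), dtype=int)
for i, t in enumerate(tokens):
    for w in t:
        toy_dtm[i, col[w]] += 1
```

The result has shape `(3, 5)`, the small-scale counterpart of the `(27, 55035)` matrix above: 27 novels by 55,035 distinct words.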

In [135]:

[f.split("/")[3] for f in filenames]

Out[135]:

['Sterne_Tristram.txt',
 'Austen_Pride.txt',
 'Thackeray_Pendennis.txt',
 'ABronte_Agnes.txt',
 'Austen_Sense.txt',
 'Thackeray_Vanity.txt',
 'Trollope_Barchester.txt',
 'Fielding_Tom.txt',
 'Dickens_Bleak.txt',
 'Eliot_Mill.txt',
 'EBronte_Wuthering.txt',
 'Eliot_Middlemarch.txt',
 'Fielding_Joseph.txt',
 'ABronte_Tenant.txt',
 'Austen_Emma.txt',
 'Trollope_Prime.txt',
 'CBronte_Villette.txt',
 'CBronte_Jane.txt',
 'Richardson_Clarissa.txt',
 'CBronte_Professor.txt',
 'Dickens_Hard.txt',
 'Eliot_Adam.txt',
 'Dickens_David.txt',
 'Trollope_Phineas.txt',
 'Richardson_Pamela.txt',
 'Sterne_Sentimental.txt',
 'Thackeray_Barry.txt']

NMF¶

In [136]:

clf = decomposition.NMF(n_components=10, random_state=1)

W1 = clf.fit_transform(dtm)
H1 = clf.components_

In [137]:

num_top_words=8

def show_topics(a):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]
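
The slice `np.argsort(t)[:-num_top_words-1:-1]` is compact but easy to misread. The sketch below walks through it on a made-up weight vector; `toy_vocab`, `toy_weights`, and `k` are illustrative names, not part of the notebook.

```python
import numpy as np

toy_vocab = np.array(["said", "mr", "emma", "darcy", "little"])
toy_weights = np.array([0.1, 0.7, 0.3, 0.9, 0.2])
k = 2

# argsort returns indices that sort the weights in ascending order
order = np.argsort(toy_weights)  # [0, 4, 2, 1, 3]

# Stepping backwards from the end and stopping before position -(k+1)
# keeps the k largest weights, largest first
top = order[:-k-1:-1]            # [3, 1]

top_words = [toy_vocab[i] for i in top]  # ['darcy', 'mr']
```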

In [138]:

def get_all_topic_words(H):
    top_indices = lambda t: {i for i in np.argsort(t)[:-num_top_words-1:-1]}
    topic_indices = [top_indices(t) for t in H]
    return sorted(set.union(*topic_indices))

In [139]:

ind = get_all_topic_words(H1)

In [140]:

vocab[ind]

Out[140]:

array(['adams', 'allworthy', 'bounderby', 'brandon', 'catherine', 'cathy',
       'corporal', 'crawley', 'darcy', 'dashwood', 'did', 'earnshaw',
       'edgar', 'elinor', 'emma', 'father', 'ferrars', 'finn', 'glegg',
       'good', 'gradgrind', 'hareton', 'heathcliff', 'jennings', 'jones',
       'joseph', 'know', 'lady', 'laura', 'like', 'linton', 'little', 'll',
       'lopez', 'louisa', 'lyndon', 'maggie', 'man', 'marianne', 'miss',
       'mr', 'mrs', 'old', 'osborne', 'pendennis', 'philip', 'phineas',
       'quoth', 'said', 'sissy', 'sophia', 'sparsit', 'stephen', 'thought',
       'time', 'tis', 'toby', 'tom', 'trim', 'tulliver', 'uncle', 'wakem',
       'wharton', 'willoughby'], 
      dtype='<U31')

In [141]:

show_topics(H1)

Out[141]:

['mr said mrs miss emma darcy little know',
 'said little like did time know thought good',
 'adams jones said lady allworthy sophia joseph mr',
 'elinor marianne dashwood jennings willoughby mrs brandon ferrars',
 'maggie tulliver said tom glegg philip mr wakem',
 'heathcliff linton hareton catherine earnshaw cathy edgar ll',
 'toby said uncle father corporal quoth tis trim',
 'phineas said mr lopez finn man wharton laura',
 'said crawley lyndon pendennis old little osborne lady',
 'bounderby gradgrind sparsit said mr sissy louisa stephen']

In [142]:

W1.shape, H1[:, ind].shape

Out[142]:

((27, 10), (10, 64))
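
These shapes say that `W1` has one row per document and one column per topic, while `H1` has one row per topic and one column per word, so `W1 @ H1` is a non-negative approximation of `dtm`. As a rough illustration of how such factors can be found, here is a sketch of the classic multiplicative-update rule on a toy matrix. This is an assumption chosen for readability and is not necessarily the solver scikit-learn's `NMF` uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative matrix: 6 "documents" by 8 "terms"
A = rng.random((6, 8))
k = 3  # number of topics

# Start from random non-negative factors
W = rng.random((6, k))
H = rng.random((k, 8))
eps = 1e-10  # avoids division by zero

err_start = np.linalg.norm(A - W @ H)
for _ in range(200):
    # Each update rescales entries by a non-negative ratio,
    # so W and H never become negative
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)
err_end = np.linalg.norm(A - W @ H)
```

After the loop, `err_end` should be smaller than `err_start`, and both factors remain non-negative, which is the property that makes NMF topics easy to read.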

Export to CSVs¶

In [72]:

from IPython.display import FileLink, FileLinks

In [119]:

np.savetxt("britlit_W.csv", W1, delimiter=",", fmt='%.14f')
FileLink('britlit_W.csv')

Out[119]:

britlit_W.csv

In [120]:

np.savetxt("britlit_H.csv", H1[:,ind], delimiter=",", fmt='%.14f')
FileLink('britlit_H.csv')

Out[120]:

britlit_H.csv

In [131]:

np.savetxt("britlit_raw.csv", dtm[:,ind], delimiter=",", fmt='%.14f')
FileLink('britlit_raw.csv')

Out[131]:

britlit_raw.csv

In [121]:

[str(word) for word in vocab[ind]]

Out[121]:

['adams',
 'allworthy',
 'bounderby',
 'brandon',
 'catherine',
 'cathy',
 'corporal',
 'crawley',
 'darcy',
 'dashwood',
 'did',
 'earnshaw',
 'edgar',
 'elinor',
 'emma',
 'father',
 'ferrars',
 'finn',
 'glegg',
 'good',
 'gradgrind',
 'hareton',
 'heathcliff',
 'jennings',
 'jones',
 'joseph',
 'know',
 'lady',
 'laura',
 'like',
 'linton',
 'little',
 'll',
 'lopez',
 'louisa',
 'lyndon',
 'maggie',
 'man',
 'marianne',
 'miss',
 'mr',
 'mrs',
 'old',
 'osborne',
 'pendennis',
 'philip',
 'phineas',
 'quoth',
 'said',
 'sissy',
 'sophia',
 'sparsit',
 'stephen',
 'thought',
 'time',
 'tis',
 'toby',
 'tom',
 'trim',
 'tulliver',
 'uncle',
 'wakem',
 'wharton',
 'willoughby']

SVD¶

In [143]:

U, s, V = decomposition.randomized_svd(dtm, 10)

In [144]:

ind = get_all_topic_words(V)

In [145]:

len(ind)

Out[145]:

52

In [146]:

vocab[ind]

Out[146]:

array(['adams', 'allworthy', 'bounderby', 'bretton', 'catherine',
       'crimsworth', 'darcy', 'dashwood', 'did', 'elinor', 'elton', 'emma',
       'finn', 'fleur', 'glegg', 'good', 'gradgrind', 'hareton', 'hath',
       'heathcliff', 'hunsden', 'jennings', 'jones', 'joseph', 'knightley',
       'know', 'lady', 'linton', 'little', 'lopez', 'louisa', 'lydgate',
       'madame', 'maggie', 'man', 'marianne', 'miss', 'monsieur', 'mr',
       'mrs', 'pelet', 'philip', 'phineas', 'said', 'sissy', 'sophia',
       'sparsit', 'toby', 'tom', 'tulliver', 'uncle', 'weston'], 
      dtype='<U31')

In [147]:

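# Note: H1 is the NMF factor, so this repeats the NMF topics; show_topics(V) would show the SVD topics
show_topics(H1)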

Out[147]:

['mr said mrs miss emma darcy little know',
 'said little like did time know thought good',
 'adams jones said lady allworthy sophia joseph mr',
 'elinor marianne dashwood jennings willoughby mrs brandon ferrars',
 'maggie tulliver said tom glegg philip mr wakem',
 'heathcliff linton hareton catherine earnshaw cathy edgar ll',
 'toby said uncle father corporal quoth tis trim',
 'phineas said mr lopez finn man wharton laura',
 'said crawley lyndon pendennis old little osborne lady',
 'bounderby gradgrind sparsit said mr sissy louisa stephen']

In [148]:

np.savetxt("britlit_U.csv", U, delimiter=",", fmt='%.14f')
FileLink('britlit_U.csv')

Out[148]:

britlit_U.csv

In [149]:

np.savetxt("britlit_V.csv", V[:,ind], delimiter=",", fmt='%.14f')
FileLink('britlit_V.csv')

Out[149]:

britlit_V.csv

In [150]:

np.savetxt("britlit_raw_svd.csv", dtm[:,ind], delimiter=",", fmt='%.14f')
FileLink('britlit_raw_svd.csv')

Out[150]:

britlit_raw_svd.csv

In [151]:

np.savetxt("britlit_S.csv", np.diag(s), delimiter=",", fmt='%.14f')
FileLink('britlit_S.csv')

Out[151]:

britlit_S.csv

In [152]:

[str(word) for word in vocab[ind]]

Out[152]:

['adams',
 'allworthy',
 'bounderby',
 'bretton',
 'catherine',
 'crimsworth',
 'darcy',
 'dashwood',
 'did',
 'elinor',
 'elton',
 'emma',
 'finn',
 'fleur',
 'glegg',
 'good',
 'gradgrind',
 'hareton',
 'hath',
 'heathcliff',
 'hunsden',
 'jennings',
 'jones',
 'joseph',
 'knightley',
 'know',
 'lady',
 'linton',
 'little',
 'lopez',
 'louisa',
 'lydgate',
 'madame',
 'maggie',
 'man',
 'marianne',
 'miss',
 'monsieur',
 'mr',
 'mrs',
 'pelet',
 'philip',
 'phineas',
 'said',
 'sissy',
 'sophia',
 'sparsit',
 'toby',
 'tom',
 'tulliver',
 'uncle',
 'weston']

Randomized SVD offers a speed up¶

One way to speed up the computation is to use randomized SVD. In the chart below, the error is the difference A - U * S * V, that is, what the decomposition has failed to capture:

For more on randomized SVD, check out my PyBay 2017 talk.

For significantly more on randomized SVD, check out the Computational Linear Algebra course.
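
The core idea can be sketched in a few lines of NumPy: project the matrix onto a small random subspace, then run an ordinary SVD on the much smaller projected matrix. The function name `randomized_svd_sketch` and its parameters are hypothetical, and this simplified version omits refinements such as power iterations that library implementations may include.

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, seed=0):
    """Approximate rank-k SVD of A via a random projection (simplified sketch)."""
    rng = np.random.default_rng(seed)
    # Random test matrix with a few extra columns for safety
    Omega = rng.standard_normal((A.shape[1], k + n_oversamples))
    # Orthonormal basis Q for (approximately) the range of A
    Q, _ = np.linalg.qr(A @ Omega)
    # SVD of the small matrix Q^T A instead of the full A
    B = Q.T @ A
    U_small, s, V = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], V[:k]

rng = np.random.default_rng(1)
# Toy matrix with exact rank 5, so a rank-5 approximation can be exact
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))

U, s, V = randomized_svd_sketch(A, k=5)

# The error from the text: what the decomposition fails to capture
error = np.linalg.norm(A - U @ np.diag(s) @ V)
```

Because this toy matrix has exact rank 5, the error is at the level of floating-point round-off. On real data such as `dtm`, the error depends on how quickly the singular values decay.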

Full vs Reduced SVD¶

Remember how we were calling np.linalg.svd(vectors, full_matrices=False)? We set full_matrices=False to calculate the reduced SVD. For the full SVD, both U and V are square matrices, where the extra columns in U form an orthonormal basis (but zero out when multiplied by extra rows of zeros in S).
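
The difference is easiest to see in the shapes. This sketch compares the two modes on a small 4x3 matrix (made-up values) and shows that the extra column of the full `U` is cancelled by a row of zeros in `S`:

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)

U_full, s_full, V_full = np.linalg.svd(A, full_matrices=True)
U_red, s_red, V_red = np.linalg.svd(A, full_matrices=False)

full_shapes = (U_full.shape, s_full.shape, V_full.shape)  # ((4, 4), (3,), (3, 3))
reduced_shapes = (U_red.shape, s_red.shape, V_red.shape)  # ((4, 3), (3,), (3, 3))

# For the full SVD, S must be 4x3: the singular values on the diagonal,
# plus an extra row of zeros that multiplies the extra column of U
S_full = np.zeros((4, 3))
S_full[:3, :3] = np.diag(s_full)

full_ok = np.allclose(A, U_full @ S_full @ V_full)
reduced_ok = np.allclose(A, U_red @ np.diag(s_red) @ V_red)
```

Both products reconstruct `A`; the reduced form simply skips the columns of `U` that would be multiplied by zeros.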

Diagrams from Trefethen: