In [1]:

%load_ext watermark
%watermark -a 'Sebastian Raschka' -p scikit-learn,numpy,scipy -v -d

Sebastian Raschka 23/05/2015 

CPython 3.4.3
IPython 3.1.0

scikit-learn 0.16.1
numpy 1.9.2
scipy 0.15.1

Tf-idf Walkthrough for scikit-learn¶

When I was using scikit-learn extremely handy TfidfVectorizer I had some trouble interpreting the results since they seem to be a little bit different from the standard convention of how the term frequency-inverse document frequency (tf-idf) is calculated. Here, I just put together a brief overview of how the TfidfVectorizer works -- mainly as personal reference sheet, but maybe it is useful to one or the other.

### Sections - [What are Tf-idfs all about?](What-are-Tf-idfs-all-about?) - [Example data](#Example-data) - [Raw term frequency](#Raw-term-frequency) - [Normalized term frequency](#Normalized-term-frequency) - [Term frequency-inverse document frequency -- tf-idf](#Term-frequency-inverse-document-frequency----tf-idf) - [Inverse document frequency](#Inverse-document-frequency) - [Normalized tf-idf](#Normalized-tf-idf) - [Smooth idf](#Smooth-idf) - [Tf-idf in scikit-learn](#Tf-idf-in-scikit-learn) - [TfidfVectorizer defaults](#TfidfVectorizer-defaults)

What are Tf-idfs all about?¶

[back to top]

Tf-idfs are a way to represent documents as feature vectors. Tf-idfs can be understood as a modification of the raw term frequencies (tf); the tf is the count of how often a particular word occurs in a given document. The concept behind the tf-idf is to downweight terms proportionally to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification.

Example data¶

[back to top]

For the following sections, let us consider the following dataset that consists of 3 documents:

In [2]:

import numpy as np
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])

Raw term frequency¶

[back to top]

First, we will start with the raw term frequency tf(t, d), which is defined by the number of times a term t occurs in a document t

Alternative term frequency definitions include the binary term frequency, log-scaled term frequency, and augmented term frequency [[1](#References)].

Using the CountVectorizer from scikit-learn, we can construct a bag-of-words model with the term frequencies as follows:

In [3]:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf

Out[3]:

array([[0, 1, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 1, 2, 1]], dtype=int64)

In [4]:

cv.vocabulary_

Out[4]:

{'and': 0, 'is': 1, 'shining': 2, 'sun': 3, 'sweet': 4, 'the': 5, 'weather': 6}

Based on the vocabulary, the word "and" would be the 1st feature in each document vector in tf, the word "is" the 2nd, the word "shining" the 3rd, etc.

Normalized term frequency¶

[back to top]

In this section, we will take a brief look at how the tf-feature vector can be normalized, which will be useful later.

The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).

$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}} = \frac{v}{\big(\sum_{i=1}^n v_i \big)^{\frac{1}{2}}}$$

For example, we would normalize our 3rd document 'The sun is shining and the weather is sweet' as follows:

$\frac{[1, 2, 1, 1, 1, 2, 1]}{2} = [ 0.2773501, 0.5547002, 0.2773501, 0.2773501, 0.2773501, 0.5547002, 0.2773501]$

In [5]:

tf_norm = tf[2] / np.sqrt(np.sum(tf[2]**2))
tf_norm

Out[5]:

array([ 0.2773501,  0.5547002,  0.2773501,  0.2773501,  0.2773501,
        0.5547002,  0.2773501])

In scikit-learn, we can use the TfidfTransformer to normalize the term frequencies if we disable the inverse document frequency calculation (use_idf: False and smooth_idf=False):

In [6]:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=False, norm='l2', smooth_idf=False)
tf_norm = tfidf.fit_transform(tf).toarray()
print('Normalized term frequencies of document 3:\n %s' % tf_norm[-1])

Normalized term frequencies of document 3:
 [ 0.2773501  0.5547002  0.2773501  0.2773501  0.2773501  0.5547002
  0.2773501]

Term frequency-inverse document frequency -- tf-idf¶

[back to top]

Most commonly, the term frequency-inverse document frequency (tf-idf) is calculated as follows [1]:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$$

where idf is the inverse document frequency, which we will introduce in the next section.

Inverse document frequency¶

[back to top]

In order to understand the inverse document frequency idf, let us first introduce the term document frequency $\text{df}(d,t)$, which simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:

$$\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}},$$

where
$n_d$: The total number of documents
$\text{df}(d,t)$: The number of documents that contain term $t$.

Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.

Now, Let us calculate the idfs of the words "and", "is," and "shining:"

In [7]:

n_docs = len(docs)

df_and = 1
idf_and = np.log(n_docs / (1 + df_and))
print('idf "and": %s' % idf_and)

df_is = 3
idf_is = np.log(n_docs / (1 + df_is))
print('idf "is": %s' % idf_is)

df_shining = 2
idf_shining = np.log(n_docs / (1 + df_shining))
print('idf "shining": %s' % idf_shining)

idf "and": 0.405465108108
idf "is": -0.287682072452
idf "shining": 0.0

Using those idfs, we can eventually calculate the tf-idfs for the 3rd document:

In [8]:

print('Tf-idfs in document 3:\n')
print('tf-idf "and": %s' % (1 * idf_and))
print('tf-idf "is": %s' % (2 * idf_is))
print('tf-idf "shining": %s' % (1 * idf_shining))

Tf-idfs in document 3:

tf-idf "and": 0.405465108108
tf-idf "is": -0.575364144904
tf-idf "shining": 0.0

Tf-idf in scikit-learn¶

[back to top]

The tf-idfs in scikit-learn are calculated a little bit differently. Here, the +1 count is added to the idf, whereas instead of the denominator if the df:

$$\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1$$

We can demonstrate this by calculating the idfs manually using the equation above and comparing the results to the TfidfTransformer output using the settings use_idf=True, smooth_idf=False, norm=None. In the following examples, we will be again using the words "and," "is," and "shining:" from document 3.

In [9]:

tf_and = 1 
df_and = 1 
tf_and * (np.log(n_docs / df_and) + 1)

Out[9]:

2.09861228866811

In [10]:

tf_shining = 2 
df_shining = 3 
tf_shining * (np.log(n_docs / df_shining) + 1)

Out[10]:

2.0

In [11]:

tf_is = 1 
df_is = 2 
tf_is * (np.log(n_docs / df_is) + 1)

Out[11]:

1.4054651081081644

In [12]:

tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()[-1][0:3]

Out[12]:

array([ 2.09861229,  2.        ,  1.40546511])

Normalized tf-idf¶

[back to top]

Now, let us calculate the normalized tf-idfs. Our feature vector of un-normalized tf-idfs for document 3 would look as follows if we'd applied the equation from the previous section to all words in the document:

In [13]:

tf_idfs_d3 = np.array([[ 2.09861229, 2.0, 1.40546511, 1.40546511, 1.40546511, 2.0, 1.40546511]])

Using the l2-norm, we then normalize the tf-idfs as follows:

In [14]:

tf_idfs_d3_norm = tf_idfs_d3[-1] / np.sqrt(np.sum(tf_idfs_d3[-1]**2))
tf_idfs_d3_norm

Out[14]:

array([ 0.46572049,  0.44383662,  0.31189844,  0.31189844,  0.31189844,
        0.44383662,  0.31189844])

And finally, we compare the results to the results that the TfidfTransformer returns.

In [15]:

tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')
tfidf.fit_transform(tf).toarray()[-1]

Out[15]:

array([ 0.46572049,  0.44383662,  0.31189844,  0.31189844,  0.31189844,
        0.44383662,  0.31189844])

Smooth idf¶

[back to top]

Another parameter in the TfidfTransformer is the smooth_idf, which is described as

smooth_idf : boolean, default=True

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

So, our idf would then be defined as follows:

$$\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1$$

To confirm that we understand the smooth_idf parameter correctly, let us walk through the 3-word example again:

In [16]:

tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm=None)
tfidf.fit_transform(tf).toarray()[-1][:3]

Out[16]:

array([ 1.69314718,  2.        ,  1.28768207])

In [17]:

tf_and = 1 
df_and = 1 
tf_and * (np.log((n_docs+1) / (df_and+1)) + 1) 

Out[17]:

1.6931471805599454

In [18]:

tf_is = 2
df_is = 3 
tf_is * (np.log((n_docs+1) / (df_is+1)) + 1) 

Out[18]:

2.0

In [19]:

tf_shining = 1 
df_shining = 2
tf_shining * (np.log((n_docs+1) / (df_shining+1)) + 1)  

Out[19]:

1.2876820724517808

TfidfVectorizer defaults¶

[back to top]

Finally, we now understand the default settings in the TfidfTransformer, which are:

use_idf=True
smooth_idf=True
norm='l2'

And the equation can be summarized as
$\text{tf-idf} = \text{tf}(t) \times (\text{idf}(t, d) + 1),$
where
$\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}}.$

In [20]:

tfidf = TfidfTransformer()
tfidf.fit_transform(tf).toarray()[-1]

Out[20]:

array([ 0.40474829,  0.47810172,  0.30782151,  0.30782151,  0.30782151,
        0.47810172,  0.30782151])

In [21]:

smooth_tfidfs_d3 = np.array([[ 1.69314718, 2.0, 1.28768207, 1.28768207, 1.28768207, 2.0, 1.28768207]])
smooth_tfidfs_d3_norm = smooth_tfidfs_d3[-1] / np.sqrt(np.sum(smooth_tfidfs_d3[-1]**2))
smooth_tfidfs_d3_norm

Out[21]:

array([ 0.40474829,  0.47810172,  0.30782151,  0.30782151,  0.30782151,
        0.47810172,  0.30782151])

References¶

[back to top]

[1] G. Salton and M. J. McGill. Introduction to modern information retrieval. 1983.