In [1]:

```
%load_ext watermark
%watermark -a 'Sebastian Raschka' -p scikit-learn,numpy,scipy -v -d
```

When I was using scikit-learn's extremely handy `TfidfVectorizer`, I had some trouble interpreting the results since they seem to be a little bit different from the standard convention of how the *term frequency-inverse document frequency* (tf-idf) is calculated. Here, I just put together a brief overview of how the `TfidfVectorizer` works -- mainly as a personal reference sheet, but maybe it is useful to one or the other.

Tf-idfs are a way to represent documents as feature vectors. Tf-idfs can be understood as a modification of the *raw term frequencies* (tf); the tf is the count of how often a particular word occurs in a given document. The concept behind the tf-idf is to downweight terms proportionally to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification.

For the following sections, let us consider the following dataset that consists of 3 documents:

In [2]:

```
import numpy as np
docs = np.array([
'The sun is shining',
'The weather is sweet',
'The sun is shining and the weather is sweet'])
```

First, we will start with the *raw term frequency* tf(t, d), which is defined as the number of times a term *t* occurs in a document *d*.

Alternative term frequency definitions include the binary term frequency, log-scaled term frequency, and augmented term frequency [1].

Using the `CountVectorizer` from scikit-learn, we can construct a bag-of-words model with the term frequencies as follows:

In [3]:

```
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf
```

Out[3]:

In [4]:

```
cv.vocabulary_
```

Out[4]:

Based on the vocabulary, the word "and" would be the 1st feature in each document vector in `tf`, the word "is" the 2nd, the word "shining" the 3rd, etc.
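To make the vocabulary construction concrete, here is a minimal sketch of what `CountVectorizer` does under the hood, assuming its default tokenization (lowercasing and keeping tokens of two or more word characters):

```python
import re
from collections import Counter

import numpy as np

docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])

# Tokenize the way CountVectorizer does by default: lowercase,
# keep tokens of 2+ word characters.
tokenized = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]

# The vocabulary is the sorted set of all tokens.
vocab = sorted({t for doc in tokenized for t in doc})

# Count each vocabulary term per document.
tf_manual = np.array([[Counter(doc)[term] for term in vocab]
                      for doc in tokenized])

print(vocab)
print(tf_manual)
```

The last row of `tf_manual` should be `[1, 2, 1, 1, 1, 2, 1]`, matching the term frequencies of document 3 used throughout this article.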

In this section, we will take a brief look at how the tf-feature vector can be normalized, which will be useful later.

The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).

$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}} = \frac{v}{\big(\sum_{i=1}^n v_i^2 \big)^{\frac{1}{2}}}$$

For example, we would normalize our 3rd document `'The sun is shining and the weather is sweet'` as follows:

$\frac{[1, 2, 1, 1, 1, 2, 1]}{\sqrt{13}} = [ 0.2773501, 0.5547002, 0.2773501, 0.2773501, 0.2773501, 0.5547002, 0.2773501]$

In [5]:

```
tf_norm = tf[2] / np.sqrt(np.sum(tf[2]**2))
tf_norm
```

Out[5]:

In scikit-learn, we can use the `TfidfTransformer` to normalize the term frequencies if we disable the inverse document frequency calculation (`use_idf=False` and `smooth_idf=False`):

In [6]:

```
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=False, norm='l2', smooth_idf=False)
tf_norm = tfidf.fit_transform(tf).toarray()
print('Normalized term frequencies of document 3:\n %s' % tf_norm[-1])
```

Most commonly, the term frequency-inverse document frequency (tf-idf) is calculated as follows [1]:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$$ where idf is the inverse document frequency, which we will introduce in the next section.

In order to understand the *inverse document frequency* idf, let us first introduce the term *document frequency* $\text{df}(d,t)$, which is simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:

$$\text{idf}(t) = \log{\frac{n_d}{1+\text{df}(d,t)}},$$
where

$n_d$: The total number of documents

$\text{df}(d,t)$: The number of documents that contain term $t$.

Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.

Now, let us calculate the idfs of the words "and," "is," and "shining":

In [7]:

```
n_docs = len(docs)
df_and = 1
idf_and = np.log(n_docs / (1 + df_and))
print('idf "and": %s' % idf_and)
df_is = 3
idf_is = np.log(n_docs / (1 + df_is))
print('idf "is": %s' % idf_is)
df_shining = 2
idf_shining = np.log(n_docs / (1 + df_shining))
print('idf "shining": %s' % idf_shining)
```
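The same idfs can be obtained for all terms at once by reading the document frequencies off the count matrix instead of counting by hand. A small sketch, with the `tf` matrix from the `CountVectorizer` example hardcoded so the snippet is self-contained:

```python
import numpy as np

# Term-frequency matrix from the CountVectorizer example;
# columns: and, is, shining, sun, sweet, the, weather
tf = np.array([[0, 1, 1, 1, 0, 1, 0],
               [0, 1, 0, 0, 1, 1, 1],
               [1, 2, 1, 1, 1, 2, 1]])

n_docs = tf.shape[0]

# df(t): number of documents in which each term occurs at least once
df = (tf > 0).sum(axis=0)

# Textbook idf with the constant 1 in the denominator
idf = np.log(n_docs / (1 + df))

print(df)
print(idf)
```

The entries for "and" (column 1), "is" (column 2), and "shining" (column 3) should reproduce the hand-calculated values above.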

Using those idfs, we can eventually calculate the tf-idfs for the 3rd document:

In [8]:

```
print('Tf-idfs in document 3:\n')
print('tf-idf "and": %s' % (1 * idf_and))
print('tf-idf "is": %s' % (2 * idf_is))
print('tf-idf "shining": %s' % (1 * idf_shining))
```

The tf-idfs in scikit-learn are calculated a little bit differently. Here, a constant 1 is added to the idf itself, rather than to the denominator of the document frequency:

$$\text{idf}(t) = \log{\frac{n_d}{\text{df}(d,t)}} + 1$$

We can demonstrate this by calculating the idfs manually using the equation above and comparing the results to the `TfidfTransformer` output using the settings `use_idf=True, smooth_idf=False, norm=None`.
In the following examples, we will again be using the words "and," "is," and "shining" from document 3.

In [9]:

```
tf_and = 1
df_and = 1
tf_and * (np.log(n_docs / df_and) + 1)
```

Out[9]:

In [10]:

```
tf_is = 2
df_is = 3
tf_is * (np.log(n_docs / df_is) + 1)
```

Out[10]:

In [11]:

```
tf_shining = 1
df_shining = 2
tf_shining * (np.log(n_docs / df_shining) + 1)
```

Out[11]:

In [12]:

```
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()[-1][0:3]
```

Out[12]:

Now, let us calculate the normalized tf-idfs. Our feature vector of un-normalized tf-idfs for document 3 would look as follows if we applied the equation from the previous section to all words in the document:

In [13]:

```
tf_idfs_d3 = np.array([[ 2.09861229, 2.0, 1.40546511, 1.40546511, 1.40546511, 2.0, 1.40546511]])
```
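The hardcoded values above can also be derived programmatically. A short sketch that computes the un-normalized tf-idf vector of document 3 from the count matrix (hardcoded here for self-containment), using scikit-learn's unsmoothed idf formula:

```python
import numpy as np

# Term-frequency matrix from the CountVectorizer example;
# columns: and, is, shining, sun, sweet, the, weather
tf = np.array([[0, 1, 1, 1, 0, 1, 0],
               [0, 1, 0, 0, 1, 1, 1],
               [1, 2, 1, 1, 1, 2, 1]])

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)

# scikit-learn's unsmoothed idf: log(n_d / df) + 1
idf = np.log(n_docs / df) + 1

# Un-normalized tf-idfs of the last document
tf_idfs_d3 = tf[-1] * idf
print(tf_idfs_d3)
```

This reproduces the vector `[2.09861229, 2.0, 1.40546511, ...]` entered by hand above.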

Using the l2-norm, we then normalize the tf-idfs as follows:

In [14]:

```
tf_idfs_d3_norm = tf_idfs_d3[-1] / np.sqrt(np.sum(tf_idfs_d3[-1]**2))
tf_idfs_d3_norm
```

Out[14]:

And finally, we compare our results to those that the `TfidfTransformer` returns.

In [15]:

```
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')
tfidf.fit_transform(tf).toarray()[-1]
```

Out[15]:

Another parameter in the `TfidfTransformer` is `smooth_idf`, which is described as

smooth_idf : boolean, default=True

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

So, our idf would then be defined as follows:

$$\text{idf}(t) = \log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1$$

To confirm that we understand the `smooth_idf` parameter correctly, let us walk through the 3-word example again:

In [16]:

```
tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm=None)
tfidf.fit_transform(tf).toarray()[-1][:3]
```

Out[16]:

In [17]:

```
tf_and = 1
df_and = 1
tf_and * (np.log((n_docs+1) / (df_and+1)) + 1)
```

Out[17]:

In [18]:

```
tf_is = 2
df_is = 3
tf_is * (np.log((n_docs+1) / (df_is+1)) + 1)
```

Out[18]:

In [19]:

```
tf_shining = 1
df_shining = 2
tf_shining * (np.log((n_docs+1) / (df_shining+1)) + 1)
```

Out[19]:

Finally, we now understand the default settings in the `TfidfTransformer`, which are:

`use_idf=True`

`smooth_idf=True`

`norm='l2'`

And the equation can be summarized as

$\text{tf-idf}(t, d) = \text{tf}(t, d) \times (\text{idf}(t) + 1),$

where

$\text{idf}(t) = \log{\frac{1 + n_d}{1+\text{df}(d,t)}}.$
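Putting everything together, the default behavior can be replicated from scratch and checked against the `TfidfTransformer` directly. A sketch under the definitions above (the `tf` matrix is hardcoded for self-containment):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Term-frequency matrix from the CountVectorizer example;
# columns: and, is, shining, sun, sweet, the, weather
tf = np.array([[0, 1, 1, 1, 0, 1, 0],
               [0, 1, 0, 0, 1, 1, 1],
               [1, 2, 1, 1, 1, 2, 1]])

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)

# Smoothed idf: log((1 + n_d) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Raw tf-idfs, then l2-normalize each document row
tfidf_manual = tf * idf
tfidf_manual = tfidf_manual / np.sqrt(
    (tfidf_manual**2).sum(axis=1, keepdims=True))

# Compare against TfidfTransformer with its default settings
tfidf_sklearn = TfidfTransformer().fit_transform(tf).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

If our understanding is correct, the comparison prints `True`, confirming that the defaults amount to the smoothed idf plus l2 normalization derived above.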

In [20]:

```
tfidf = TfidfTransformer()
tfidf.fit_transform(tf).toarray()[-1]
```

Out[20]:

In [21]:

```
smooth_tfidfs_d3 = np.array([[ 1.69314718, 2.0, 1.28768207, 1.28768207, 1.28768207, 2.0, 1.28768207]])
smooth_tfidfs_d3_norm = smooth_tfidfs_d3[-1] / np.sqrt(np.sum(smooth_tfidfs_d3[-1]**2))
smooth_tfidfs_d3_norm
```

Out[21]:

[1] G. Salton and M. J. McGill. Introduction to modern information retrieval. 1983.