TF-IDF

Documentation: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

The examples assume two documents, each containing a single sentence.

Doc1: This is a sample

Doc2: This is another example

The mapping from each term to its column index in the matrix is stored in the fitted vectorizer's vocabulary_ attribute and can be printed.

The following example calculates tf-idf from a matrix of term counts using TfidfTransformer.

In [3]:

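from sklearn.feature_extraction.text import TfidfTransformer

# smooth_idf=False computes idf(t) = ln(n / df(t)) + 1, with no add-one smoothing
transformer = TfidfTransformer(smooth_idf=False)

# Term counts: one row per document, one column per term
counts = [[1,1,1,0,0],
          [1,1,0,1,1]]

# Learn the idf weights and return the L2-normalised tf-idf matrix (sparse)
tfidf = transformer.fit_transform(counts)
tfidf.toarray()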

Out[3]:

array([[ 0.45329466,  0.45329466,  0.76749457,  0.        ,  0.        ],
       [ 0.35959372,  0.35959372,  0.        ,  0.6088451 ,  0.6088451 ]])
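
The numbers above can be checked by hand. The following is a minimal sketch in plain Python, assuming the unsmoothed formula idf(t) = ln(n / df(t)) + 1 followed by L2 normalisation of each row, which is what the scikit-learn documentation describes for smooth_idf=False. The helper name tfidf_unsmoothed is made up for this illustration, and it assumes every term occurs in at least one document.

```python
import math

def tfidf_unsmoothed(counts):
    """Hypothetical helper: tf-idf with idf = ln(n / df) + 1 and L2-normalised rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs.
    # Assumes df > 0 for every column, as in the example matrix.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) + 1.0 for j in range(n_terms)]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

counts = [[1, 1, 1, 0, 0],
          [1, 1, 0, 1, 1]]
manual = tfidf_unsmoothed(counts)
for row in manual:
    print([round(v, 8) for v in row])
```

Terms that occur in both documents get idf = 1, while terms that occur in only one get idf = ln(2) + 1, which is why the last non-zero entries in each row are larger.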

The following example calculates tf-idf directly from the raw sentences using TfidfVectorizer.

In [10]:

from sklearn.feature_extraction.text import TfidfVectorizer
sent = ["This is a sample", "This is another example"]
# min_df=1 (the default) keeps every term that occurs in at least one document
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
tfidf_matrix = tf.fit_transform(sent)
print(tfidf_matrix.toarray())
print(tf.vocabulary_)

[[ 0.          0.          0.50154891  0.70490949  0.50154891]
 [ 0.57615236  0.57615236  0.40993715  0.          0.40993715]]
{u'this': 4, u'sample': 3, u'is': 2, u'example': 1, u'another': 0}
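
These values differ from the TfidfTransformer output because TfidfVectorizer leaves smooth_idf at its default of True. As a rough check, the sketch below recomputes the matrix in plain Python, assuming the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation. The count matrix is hard-coded in vocabulary order (another, example, is, sample, this), and the helper name tfidf_smoothed is hypothetical.

```python
import math

def tfidf_smoothed(counts):
    """Hypothetical helper: tf-idf with idf = ln((1 + n) / (1 + df)) + 1, L2-normalised rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency of each term (column).
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Add-one smoothing acts as if one extra document contained every term.
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1.0 for j in range(n_terms)]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Columns: another, example, is, sample, this
counts = [[0, 0, 1, 1, 1],
          [1, 1, 1, 0, 1]]
smoothed = tfidf_smoothed(counts)
for row in smoothed:
    print([round(v, 8) for v in row])
```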

The following example prints only the raw term counts, from which the tf-idf values are calculated.

In [9]:

from sklearn.feature_extraction.text import CountVectorizer
# min_df=1 (the default) keeps every term that occurs in at least one document
vectorizer = CountVectorizer(min_df=1)
corpus = [
    'This is a sample',
    'This is another example'
]
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.vocabulary_)

[[0 0 1 1 1]
 [1 1 1 0 1]]
{u'this': 4, u'sample': 3, u'is': 2, u'example': 1, u'another': 0}
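
Note that the word "a" has no column: with default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters, and it assigns column indices in alphabetical order. The sketch below imitates that behaviour in plain Python to show where the matrix comes from. It assumes the default token pattern `(?u)\b\w\w+\b` and is only an approximation of the library's behaviour, not its implementation; build_counts is a hypothetical name.

```python
import re

# Tokens of two or more word characters, mirroring the default pattern (assumed).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def build_counts(corpus):
    """Hypothetical helper: return (vocabulary, count matrix) for a list of strings."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in corpus]
    # Sort the distinct terms so that column indices follow alphabetical order.
    terms = sorted({tok for doc in tokenized for tok in doc})
    vocabulary = {term: idx for idx, term in enumerate(terms)}
    matrix = [[doc.count(term) for term in terms] for doc in tokenized]
    return vocabulary, matrix

corpus = [
    'This is a sample',
    'This is another example'
]
vocabulary, matrix = build_counts(corpus)
print(matrix)
print(vocabulary)
```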