Documentation: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
The examples assume two documents, each containing a single sentence.
Doc1: This is a sample
Doc2: This is another example
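The column order of the resulting matrix can be printed using the vectorizer's vocabulary_ attribute, which maps each term to its column index. With the default tokenizer, single-character tokens such as "a" are dropped, so the two documents yield five columns.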
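The following example calculates tf-idf from a hand-written count matrix using TfidfTransformer. The columns here are not in the vectorizer's alphabetical order: the first two are the terms shared by both documents, the third is "sample", and the last two are the terms that occur only in Doc2.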
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)
counts = [[1, 1, 1, 0, 0],
          [1, 1, 0, 1, 1]]
tfidf = transformer.fit_transform(counts)
tfidf.toarray()
array([[0.45329466, 0.45329466, 0.76749457, 0.        , 0.        ],
       [0.35959372, 0.35959372, 0.        , 0.6088451 , 0.6088451 ]])
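The numbers above can be checked by hand. As a minimal sketch, assuming the weighting that scikit-learn documents for smooth_idf=False (idf = ln(n / df) + 1, followed by scaling each row to unit Euclidean length), the pure-Python function below reproduces them. The name tfidf_unsmoothed is a hypothetical helper for illustration, not part of scikit-learn.

```python
import math

def tfidf_unsmoothed(counts):
    """tf-idf with idf = ln(n / df) + 1, then L2 normalisation of each row.

    Assumes every column is non-zero in at least one document (df > 0).
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) + 1.0 for j in range(n_terms)]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

counts = [[1, 1, 1, 0, 0],
          [1, 1, 0, 1, 1]]
result = tfidf_unsmoothed(counts)
for row in result:
    print([round(v, 8) for v in row])
```

A term that occurs in both documents gets idf = ln(2/2) + 1 = 1, while a term that occurs in only one gets ln(2/1) + 1, roughly 1.693, which is why "sample" outweighs "this" and "is" in the first row.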
The following example calculates tf-idf using TfidfVectorizer, which tokenizes the raw text and applies the tf-idf weighting in one step.
from sklearn.feature_extraction.text import TfidfVectorizer
sent = ["This is a sample", "This is another example"]
# min_df=1 is the default; an integer min_df must be at least 1
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
tfidf_matrix = tf.fit_transform(sent)
print(tfidf_matrix.toarray())
print(tf.vocabulary_)
[[0.         0.         0.50154891 0.70490949 0.50154891]
 [0.57615236 0.57615236 0.40993715 0.         0.40993715]]
{'this': 4, 'is': 2, 'sample': 3, 'another': 0, 'example': 1}
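These values differ from the TfidfTransformer output because TfidfVectorizer leaves smooth_idf at its default of True, whereas the earlier example set it to False. The sketch below is a rough illustration, assuming the two idf formulas given in the scikit-learn documentation; it recomputes the Doc1 row under both settings. The helper names idf and l2_normalise are made up for this example.

```python
import math

n_docs = 2

def idf(df, smooth):
    """idf weight of a term that occurs in `df` of the `n_docs` documents."""
    if smooth:
        # Behaves as if one extra document containing every term had been seen.
        return math.log((1 + n_docs) / (1 + df)) + 1.0
    return math.log(n_docs / df) + 1.0

def l2_normalise(values):
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

# Doc1 contains "is" and "this" (present in both documents, df = 2)
# and "sample" (present only in Doc1, df = 1), each exactly once.
doc1_df = {"is": 2, "sample": 1, "this": 2}

results = {}
for smooth in (False, True):
    weights = l2_normalise([idf(df, smooth) for df in doc1_df.values()])
    results[smooth] = dict(zip(doc1_df, weights))
    print(smooth, {term: round(w, 8) for term, w in results[smooth].items()})
```

With smoothing, the weight of "sample" drops from about 0.767 to about 0.705, because the smoothed idf of a term seen in one of two documents is ln(3/2) + 1 rather than ln(2) + 1.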
The following example prints only the raw term counts, from which the tf-idf values are calculated.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)  # the default; an integer min_df must be at least 1
corpus = [
    'This is a sample',
    'This is another example'
]
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.vocabulary_)
[[0 0 1 1 1]
 [1 1 1 0 1]]
{'this': 4, 'is': 2, 'sample': 3, 'another': 0, 'example': 1}
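To see where these counts and column indices come from, here is a small pure-Python sketch. It assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, columns assigned in sorted term order) and is an approximation for illustration, not the library's actual implementation. TOKEN_PATTERN and tokenize are hypothetical names.

```python
import re
from collections import Counter

# Assumption: mirrors the default token pattern, which keeps only
# tokens of two or more word characters (so "a" is discarded).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

corpus = ["This is a sample", "This is another example"]
tokenized = [tokenize(doc) for doc in corpus]

# Each distinct term gets a column index in sorted order.
terms = sorted({token for doc in tokenized for token in doc})
vocabulary = {term: index for index, term in enumerate(terms)}

# One row per document, one column per term.
matrix = []
for tokens in tokenized:
    freq = Counter(tokens)
    matrix.append([freq[term] for term in terms])

print(vocabulary)
for row in matrix:
    print(row)

# Column names in matrix order, recovered from the term-to-index mapping.
columns = sorted(vocabulary, key=vocabulary.get)
print(columns)
```

Sorting the vocabulary by its values, as in the last step, is a simple way to label the columns of any of the matrices printed above.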