In this document, we will use the 20 Newsgroups dataset as a corpus to practice text classification:
Reference: http://blog.csdn.net/qq_35082030/article/details/70211552
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print(twenty_train.target_names)
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)
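The remove parameter is worth knowing: it strips headers, footers, and quoted replies, which otherwise leak metadata that makes classification artificially easy. A minimal sketch (the variable name twenty_train_clean is just illustrative):
from sklearn.datasets import fetch_20newsgroups
# strip metadata so the classifier must rely on the message body alone
twenty_train_clean = fetch_20newsgroups(subset='train', categories=categories,
                                        shuffle=True, random_state=42,
                                        remove=('headers', 'footers', 'quotes'))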
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
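Both results are sparse matrices of shape (n_documents, n_features). A quick way to inspect them, using CountVectorizer's vocabulary_ attribute (the word 'algorithm' is just an example token):
print(X_train_counts.shape)                     # (n_documents, n_features)
print(X_train_tfidf.shape)                      # same shape, tf-idf weighted
print(count_vect.vocabulary_.get('algorithm'))  # column index assigned to the word 'algorithm'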
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
TfidfVectorizer and CountVectorizer share most of their parameters; only the parameters that differ (norm, use_idf, smooth_idf, sublinear_tf) are explained in the reference below.
Reference: http://blog.csdn.net/du_qi/article/details/51564303
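As the signatures suggest, TfidfVectorizer is essentially CountVectorizer followed by TfidfTransformer in a single step. A minimal sketch of the one-step equivalent of the pipeline above (variable names are illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
# one call replaces CountVectorizer().fit_transform + TfidfTransformer().fit_transform
X_train_tfidf_direct = tfidf_vect.fit_transform(twenty_train.data)
print(X_train_tfidf_direct.shape)  # same shape as X_train_tfidf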
Here we use a naive Bayes model to classify the news.
from sklearn.naive_bayes import MultinomialNB
# start training
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
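The vectorizer, transformer, and classifier can also be chained with sklearn's Pipeline, which keeps the transform steps and the model together. A minimal sketch (the step names 'vect', 'tfidf', 'clf' are arbitrary labels):
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
# fit on raw documents; the pipeline vectorizes internally
text_clf.fit(twenty_train.data, twenty_train.target)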
# Next, we will write two sentences to test the model.
docs_new = ['Abuse of antibiotics is very common', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
# the following code will show the category predicted by the model
predicted = clf.predict(X_new_tfidf)
print(predicted)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
[2 1]
'Abuse of antibiotics is very common' => sci.med
'OpenGL on the GPU is fast' => comp.graphics
From the result, we can see that the model is not bad. But spot-checking two sentences is not how we evaluate a model; we need to measure accuracy, precision, recall, and the F1-measure on a held-out test set.
sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2)
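A toy example of the call, with made-up labels, just to show how target_names and digits are used:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]
# target_names labels the rows of the report; digits controls decimal places
print(classification_report(y_true, y_pred, target_names=['class0', 'class1', 'class2'], digits=3))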
Precision = (number of correct items retrieved) / (total number of items retrieved)
Recall = (number of correct items retrieved) / (number of relevant items in the sample)
Consider an example: a pond holds 1400 carp, 300 shrimp, and 300 turtles, and our goal is to catch carp. We cast a large net and haul in 700 carp, 200 shrimp, and 100 turtles. The metrics work out as follows:
Precision = 700 / (700 + 200 + 100) = 70%
Recall = 700 / 1400 = 50%
F1 = 2 * 70% * 50% / (70% + 50%) = 58.3%
Now suppose we net every carp, shrimp, and turtle in the pond. The metrics become:
Precision = 1400 / (1400 + 300 + 300) = 70%
Recall = 1400 / 1400 = 100%
F1 = 2 * 70% * 100% / (70% + 100%) = 82.35%
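The same arithmetic for the first net, written out in Python as a sanity check:
# first net: 700 carp caught (true positives), 300 non-carp caught (false positives),
# 700 carp left in the pond (false negatives)
tp, fp, fn = 700, 300, 700
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.7 0.5 0.5833...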
from sklearn import metrics
import numpy as np
# get the test data from the test split
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
# vectorize the test data
X_test_counts = count_vect.transform(docs_test)
# extract tf-idf features from the test data
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
# use the model to predict the categories
predicted = clf.predict(X_test_tfidf)
# get the precision, recall, f1-score and support of this model
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
# get the accuracy of the model
print("accuracy\t" + str(np.mean(predicted == twenty_test.target)))
                         precision    recall  f1-score   support

            alt.atheism       0.97      0.60      0.74       319
          comp.graphics       0.96      0.89      0.92       389
                sci.med       0.97      0.81      0.88       396
 soc.religion.christian       0.65      0.99      0.78       398

            avg / total       0.88      0.83      0.84      1502

accuracy	0.834886817577
We can see that the precision of some categories is close to 1, which indicates high precision for those classes. The overall accuracy of this model is about 0.835.
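To see where the errors actually go (for example, whether the low recall on alt.atheism comes from documents being misfiled as soc.religion.christian), a confusion matrix is a quick follow-up check:
# rows are the true categories, columns the predicted ones;
# off-diagonal entries count the misclassified documents
print(metrics.confusion_matrix(twenty_test.target, predicted))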
Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Precision = TP / (TP + FP)
Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. Recall = TP / (TP + FN)
F1-score: The F1-score is the harmonic mean of precision and recall. The score for each class tells you how well the classifier separates that class from all the others. F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Support: The support is the number of occurrences of each class in y_true. For instance, the support of the alt.atheism category is 319, meaning the test set contains 319 documents in that category.
Accuracy: Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. One may think that high accuracy means the model is best. Accuracy is a useful measure, but only on reasonably balanced datasets where false positives and false negatives matter about equally; otherwise you have to look at the other metrics as well. For our model, we got about 0.835, which means it is approximately 83% accurate.
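The np.mean comparison used earlier is exactly what sklearn's accuracy_score computes, so the two are interchangeable:
# equivalent to np.mean(predicted == twenty_test.target)
print(metrics.accuracy_score(twenty_test.target, predicted))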