In this document, we will use the 20 Newsgroups dataset as a corpus to practice text classification:
Reference: http://blog.csdn.net/qq_35082030/article/details/70211552
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print(twenty_train.target_names)
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)
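The remove parameter is worth knowing: it strips headers, footers, and quoted replies, which otherwise leak metadata that makes classification artificially easy. A minimal sketch (the variable name twenty_train_clean is just illustrative):
from sklearn.datasets import fetch_20newsgroups
# strip metadata so the classifier must rely on the message body alone
twenty_train_clean = fetch_20newsgroups(subset='train', categories=categories,
                                        shuffle=True, random_state=42,
                                        remove=('headers', 'footers', 'quotes'))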
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
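Both results are sparse matrices of shape (n_documents, n_features). A quick way to inspect them, using CountVectorizer's vocabulary_ attribute (the word 'algorithm' is just an example token):
print(X_train_counts.shape)                     # (n_documents, n_features)
print(X_train_tfidf.shape)                      # same shape, tf-idf weighted
print(count_vect.vocabulary_.get('algorithm'))  # column index assigned to the word 'algorithm'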
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
TfidfVectorizer and CountVectorizer share most of their parameters; only the parameters that differ (norm, use_idf, smooth_idf, sublinear_tf) are explained in the reference below.
Reference: http://blog.csdn.net/du_qi/article/details/51564303
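As the signatures suggest, TfidfVectorizer is essentially CountVectorizer followed by TfidfTransformer in a single step. A minimal sketch of the one-step equivalent of the pipeline above (variable names are illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
# one call replaces CountVectorizer().fit_transform + TfidfTransformer().fit_transform
X_train_tfidf_direct = tfidf_vect.fit_transform(twenty_train.data)
print(X_train_tfidf_direct.shape)  # same shape as X_train_tfidf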
Here we use a naive Bayes model to classify the news.
from sklearn.naive_bayes import MultinomialNB
# start training
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
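The vectorizer, transformer, and classifier can also be chained with sklearn's Pipeline, which keeps the transform steps and the model together. A minimal sketch (the step names 'vect', 'tfidf', 'clf' are arbitrary labels):
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
# fit on raw documents; the pipeline vectorizes internally
text_clf.fit(twenty_train.data, twenty_train.target)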
# Next, we will write two sentences to test the model.
docs_new = ['Abuse of antibiotics is very common', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
# the following code will show the category predicted by the model
predicted = clf.predict(X_new_tfidf)
print(predicted)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
[2 1]
'Abuse of antibiotics is very common' => sci.med
'OpenGL on the GPU is fast' => comp.graphics
From the result, we can see that the model is not bad. But spot-checking two sentences is not how we evaluate a model; we need to measure accuracy, precision, recall, and the F1-measure on a held-out test set.
sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2)
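A toy example of the call, with made-up labels, just to show how target_names and digits are used:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]
# target_names labels the rows of the report; digits controls decimal places
print(classification_report(y_true, y_pred, target_names=['class0', 'class1', 'class2'], digits=3))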
Precision = (number of correct items retrieved) / (total number of items retrieved)
Recall = (number of correct items retrieved) / (number of relevant items in the sample)
Consider an example: a pond holds 1400 carp, 300 shrimp, and 300 turtles, and our goal is to catch carp. We cast a large net and haul in 700 carp, 200 shrimp, and 100 turtles. The metrics work out as follows:
Precision = 700 / (700 + 200 + 100) = 70%
Recall = 700 / 1400 = 50%
F1 = 2 * 70% * 50% / (70% + 50%) = 58.3%
Now suppose we net every carp, shrimp, and turtle in the pond. The metrics become:
Precision = 1400 / (1400 + 300 + 300) = 70%
Recall = 1400 / 1400 = 100%
F1 = 2 * 70% * 100% / (70% + 100%) = 82.35%
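The same arithmetic for the first net, written out in Python as a sanity check:
# first net: 700 carp caught (true positives), 300 non-carp caught (false positives),
# 700 carp left in the pond (false negatives)
tp, fp, fn = 700, 300, 700
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.7 0.5 0.5833...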
from sklearn import metrics
import numpy as np
# get the test data from the test split
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
# vectorize the test data
X_test_counts = count_vect.transform(docs_test)
# extract tf-idf features from the test data
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
# use the model to predict the categories
predicted = clf.predict(X_test_tfidf)
# get the precision, recall, f1-score and support of this model
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
# get the accuracy of the model
print("accuracy\t" + str(np.mean(predicted == twenty_test.target)))
                         precision    recall  f1-score   support

            alt.atheism       0.97      0.60      0.74       319
          comp.graphics       0.96      0.89      0.92       389
                sci.med       0.97      0.81      0.88       396
 soc.religion.christian       0.65      0.99      0.78       398

            avg / total       0.88      0.83      0.84      1502

accuracy	0.834886817577
We can see that the precision of some categories is close to 1, which indicates high precision for those classes. The overall accuracy of this model is about 0.835.
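To see where the errors actually go (for example, whether the low recall on alt.atheism comes from documents being misfiled as soc.religion.christian), a confusion matrix is a quick follow-up check:
# rows are the true categories, columns the predicted ones;
# off-diagonal entries count the misclassified documents
print(metrics.confusion_matrix(twenty_test.target, predicted))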
Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Precision = TP / (TP + FP)
Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. Recall = TP / (TP + FN)
F1-score: The F1-score is the harmonic mean of precision and recall. The score for each class tells you how well the classifier separates that class from all the others. F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Support: The support is the number of occurrences of each class in y_true. For instance, the support of the alt.atheism category is 319, meaning the test set contains 319 documents in that category.
Accuracy: Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. One may think that high accuracy means the model is best. Accuracy is a useful measure, but only on reasonably balanced datasets where false positives and false negatives matter about equally; otherwise you have to look at the other metrics as well. For our model, we got about 0.835, which means it is approximately 83% accurate.
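The np.mean comparison used earlier is exactly what sklearn's accuracy_score computes, so the two are interchangeable:
# equivalent to np.mean(predicted == twenty_test.target)
print(metrics.accuracy_score(twenty_test.target, predicted))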