The data comes from Kaggle's News Aggregator Dataset.
import pandas as pd
I sampled 10% of the data to speed up the analysis.
news = pd.read_csv('data/uci-news-aggregator.csv').sample(frac=0.1)
len(news)
42242
news.head(3)
| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 58434 | 58435 | Russell Crowe Sings Johnny Cash on 'The Tonigh... | http://screencrush.com/russell-crowe-johnny-cash/ | ScreenCrush | e | dxzxHQTC1v6cP7MdjlKbJkMlfYwLM | screencrush.com | 1396019111324 |
| 244967 | 245413 | HP cuts more jobs than expected | http://www.digitaljournal.com/business/busines... | DigitalJournal.com | b | de8PjvC03vbwIdMC0hkfXZTLVY0sM | www.digitaljournal.com | 1400928726875 |
| 314969 | 315429 | NTSB faults pilots in last year's Asiana flight | http://ktar.com/23/1744462/NTSB-faults-pilots-... | KTAR.com | b | deigsQuEj4RZW3M_TqkzwLBT_oUTM | ktar.com | 1403705331596 |
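Before modeling, it is worth checking how the four categories are distributed in the sample. The counts will vary from run to run because the sample is random:
news['CATEGORY'].value_counts()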
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X = news['TITLE']
y = encoder.fit_transform(news['CATEGORY'])
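The CATEGORY column uses four single-letter codes: b (business), e (entertainment), m (health), and t (science and technology). We can check the order the encoder assigned them, since this is also the column order of the predicted probabilities later on:
list(encoder.classes_)
This should give ['b', 'e', 'm', 't'], because LabelEncoder sorts the labels.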
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
We count the number of occurrences of each word and use these counts as our features.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=3)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
train_vectors
<31681x9925 sparse matrix of type '<class 'numpy.int64'>' with 267231 stored elements in Compressed Sparse Row format>
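To get a feel for these features, we can transform a single headline and list the vocabulary entries it hits. This is just an inspection sketch; which tokens survive depends on the min_df=3 cutoff:
row = vectorizer.transform(['HP cuts more jobs than expected'])  # 1 x 9925 sparse row
inverse_vocab = {index: word for word, index in vectorizer.vocabulary_.items()}  # column index -> word
[(inverse_vocab[i], int(count)) for i, count in zip(row.indices, row.data)]  # (word, count) pairs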
We use a random forest for classification.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=20)
rf.fit(train_vectors, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
from sklearn.metrics import accuracy_score
pred = rf.predict(test_vectors)
accuracy_score(y_test, pred)
0.85048764321560455
85% accuracy is not a bad score.
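A single number hides how the four categories differ, so a per-class breakdown is also worth printing. This is an optional check; the exact figures will vary with the random sample and split:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred, target_names=list(encoder.classes_)))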
We'll use lime to explain the model.
To use lime, we need to construct a pipeline that chains vectorization and classification, since lime passes raw text strings to the prediction function.
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=list(encoder.classes_))
We take an example title from the test data.
example = X_test.sample(1).iloc[0]
example
'Scientific Games to buy Bally Tech'
c.predict_proba([example])
array([[ 0.95, 0. , 0. , 0.05]])
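The four probabilities follow the order of encoder.classes_, so we can pair them up to read the prediction. Given the output above, the model puts about 0.95 on b (business) for this headline:
probs = c.predict_proba([example])[0]
dict(zip(encoder.classes_, probs.round(2)))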
exp = explainer.explain_instance(example, c.predict_proba, top_labels=1)
exp.show_in_notebook()
Above is the explanation of the classification generated by lime: it highlights the words that pushed the prediction toward, or away from, the predicted category.
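If you are not in a notebook, or want to work with the weights programmatically, the same explanation is available as plain (word, weight) pairs through lime's Explanation object:
label = exp.available_labels()[0]  # the single top label we asked lime to explain
exp.as_list(label=label)  # positive weights push the prediction toward this class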
dreamgonfly@gmail.com