The data comes from Kaggle's News Aggregator Dataset.
import pandas as pd
I sampled 10% of the data to speed up the analysis.
news = pd.read_csv('data/uci-news-aggregator.csv').sample(frac=0.1)
len(news)
42242
news.head(3)
| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 58434 | 58435 | Russell Crowe Sings Johnny Cash on 'The Tonigh... | http://screencrush.com/russell-crowe-johnny-cash/ | ScreenCrush | e | dxzxHQTC1v6cP7MdjlKbJkMlfYwLM | screencrush.com | 1396019111324 |
| 244967 | 245413 | HP cuts more jobs than expected | http://www.digitaljournal.com/business/busines... | DigitalJournal.com | b | de8PjvC03vbwIdMC0hkfXZTLVY0sM | www.digitaljournal.com | 1400928726875 |
| 314969 | 315429 | NTSB faults pilots in last year's Asiana flight | http://ktar.com/23/1744462/NTSB-faults-pilots-... | KTAR.com | b | deigsQuEj4RZW3M_TqkzwLBT_oUTM | ktar.com | 1403705331596 |
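Before modeling, it is worth checking how the four categories are distributed in the sample. The counts will vary from run to run because the sample is random:
news['CATEGORY'].value_counts()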
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X = news['TITLE']
y = encoder.fit_transform(news['CATEGORY'])
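The CATEGORY column uses four single-letter codes: b (business), e (entertainment), m (health), and t (science and technology). We can check the order the encoder assigned them, since this is also the column order of the predicted probabilities later on:
list(encoder.classes_)
This should give ['b', 'e', 'm', 't'], because LabelEncoder sorts the labels.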
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
We count the number of occurrences of each word and use these counts as our features.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=3)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
train_vectors
<31681x9925 sparse matrix of type '<class 'numpy.int64'>' with 267231 stored elements in Compressed Sparse Row format>
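To get a feel for these features, we can transform a single headline and list the vocabulary entries it hits. This is just an inspection sketch; which tokens survive depends on the min_df=3 cutoff:
row = vectorizer.transform(['HP cuts more jobs than expected'])  # 1 x 9925 sparse row
inverse_vocab = {index: word for word, index in vectorizer.vocabulary_.items()}  # column index -> word
[(inverse_vocab[i], int(count)) for i, count in zip(row.indices, row.data)]  # (word, count) pairs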
We use a random forest for classification.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=20)
rf.fit(train_vectors, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
from sklearn.metrics import accuracy_score
pred = rf.predict(test_vectors)
accuracy_score(y_test, pred)
0.85048764321560455
85% accuracy is not a bad score.
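A single number hides how the four categories differ, so a per-class breakdown is also worth printing. This is an optional check; the exact figures will vary with the random sample and split:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred, target_names=list(encoder.classes_)))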
We'll use lime to explain the model.
To use lime, we need to construct a pipeline that chains vectorization and classification, since lime passes raw text strings to the prediction function.
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=list(encoder.classes_))
We take an example title from the test data.
example = X_test.sample(1).iloc[0]
example
'Scientific Games to buy Bally Tech'
c.predict_proba([example])
array([[ 0.95, 0. , 0. , 0.05]])
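The four probabilities follow the order of encoder.classes_, so we can pair them up to read the prediction. Given the output above, the model puts about 0.95 on b (business) for this headline:
probs = c.predict_proba([example])[0]
dict(zip(encoder.classes_, probs.round(2)))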
exp = explainer.explain_instance(example, c.predict_proba, top_labels=1)
exp.show_in_notebook()
Above is the explanation of the classification generated by lime: it highlights the words that pushed the prediction toward, or away from, the predicted category.
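If you are not in a notebook, or want to work with the weights programmatically, the same explanation is available as plain (word, weight) pairs through lime's Explanation object:
label = exp.available_labels()[0]  # the single top label we asked lime to explain
exp.as_list(label=label)  # positive weights push the prediction toward this class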
dreamgonfly@gmail.com