This notebook contains code and comments from Section 3.4 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.
Sentiment analysis is a natural language processing (NLP) task for identifying the polarity of an opinion as positive, neutral, or negative. This case study explores a supervised sentiment analysis task for movie reviews. The data set we will use is the Large Movie Review Dataset, which was originally collected and curated for a 2011 paper on sentiment analysis:
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
# Data directory relative to this notebook
sentiment_data_directory = './data/ch03/'
This data set has already been pre-processed by count vectorization to generate bag-of-words features. These pre-processed term-document count features, our data set, can be found in ./data/ch03/train/labeledBow.feat and ./data/ch03/test/labeledBow.feat.
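The labeledBow.feat files are stored in the svmlight/libsvm sparse format, which is why we load them with load_svmlight_files later on. As a rough, illustrative sketch of what count vectorization itself produces, scikit-learn's CountVectorizer builds the same kind of bag-of-words count matrix; the two toy sentences below are made up and are not part of the data set (on older scikit-learn versions, get_feature_names_out is named get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ['a great and entertaining movie',        # made-up example sentences
               'a dull movie despite a great cast']
vectorizer = CountVectorizer()
toy_counts = vectorizer.fit_transform(toy_reviews)      # sparse term-document count matrix
print(vectorizer.get_feature_names_out())               # learned vocabulary (one column per word)
print(toy_counts.toarray())                             # one row of word counts per review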
Our first pre-processing step removes stop words: common words such as “the”, “is”, “a”, and “an”. Stop word removal reduces the dimensionality of the data (making processing faster) and can improve classification performance, because words like “the” are rarely informative for information retrieval and text-mining tasks.
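To get a feel for what gets dropped, we can peek at NLTK's English stop-word list before pruning the vocabulary in the listing below (the exact size of the list depends on your NLTK version):

# nltk.download('stopwords')  # uncomment on first use
from nltk.corpus import stopwords as nltk_stopwords

english_stopwords = nltk_stopwords.words('english')
print(len(english_stopwords))    # roughly 150-180 words, depending on the NLTK version
print(english_stopwords[:10])    # e.g., 'i', 'me', 'my', 'myself', 'we', ...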
Listing 3.11: Drop stop words from the vocabulary
import nltk
import time
import numpy as np
def prune_vocabulary(data_path, max_features=5000):
    start_time = time.time()
    with open('{0}/imdb.vocab'.format(data_path), 'r', encoding='utf8') as vocab_file:
        vocabulary = vocab_file.read().splitlines()
    print('Vocabulary load time = {0} seconds.'.format(time.time() - start_time))

    # Convert the list of stopwords to a set for faster processing
    # nltk.download('stopwords')  # **** UNCOMMENT THIS LINE IF YOU HAVEN'T USED NLTK BEFORE ***
    stopwords = set(nltk.corpus.stopwords.words("english"))

    # Keep only those vocabulary words that are NOT stopwords
    to_keep = [word not in stopwords for word in vocabulary]
    feature_ind = np.where(to_keep)[0]
    return feature_ind[:max_features]
features = prune_vocabulary(sentiment_data_directory, max_features=5000)
Vocabulary load time = 0.04451417922973633 seconds.
Our second pre-processing step converts the count features to tf-idf features. These features represent the term frequency-inverse document frequency, a statistic that weights each feature in a document (in our case, a single review) relative to how often it appears in that document as well as how often it appears in the entire corpus (in our case, all the reviews).
We can use scikit-learn’s pre-processing toolbox to convert our count features to tf-idf features using the TfidfTransformer.
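As a rough sketch of how this works (by default, TfidfTransformer uses a smoothed inverse document frequency and L2-normalizes each row), here it is applied to a tiny count matrix made up purely for illustration:

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

toy_counts = np.array([[3, 0, 1, 0],         # 3 "documents" x 4 "terms", made up for this sketch
                       [2, 0, 0, 1],
                       [0, 2, 0, 1]])
toy_tfidf = TfidfTransformer().fit(toy_counts)
print(toy_tfidf.idf_)                                # idf weight per term; rarer terms get larger weights
print(toy_tfidf.transform(toy_counts).toarray())     # L2-normalized tf-idf vectors, one row per document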
Listing 3.12: Extract tf-idf features and save the data set (NOTE: generates a 41 MB file)
import h5py
from sklearn.datasets import load_svmlight_files
from scipy.sparse import csr_matrix as sp
from sklearn.feature_extraction.text import TfidfTransformer
def preprocess_and_save(data_path, feature_ind):
    data_files = ['{0}/{1}/labeledBow.feat'.format(data_path, data_set)
                  for data_set in ['train', 'test']]
    [Xtrn, ytrn, Xtst, ytst] = load_svmlight_files(data_files)
    n_features = len(feature_ind)

    # Convert the 1-10 review ratings into binary labels:
    # negative (rating <= 5) -> 0, positive (rating > 5) -> 1
    ytrn[ytrn <= 5], ytst[ytst <= 5] = 0, 0
    ytrn[ytrn > 5], ytst[ytst > 5] = 1, 1

    # Transform the bag-of-words counts into tf-idf features
    tfidf = TfidfTransformer()
    Xtrn = tfidf.fit_transform(Xtrn[:, feature_ind])
    Xtst = tfidf.transform(Xtst[:, feature_ind])

    # Save the data in HDF5 format (converting the sparse tf-idf matrices to dense arrays for storage)
    with h5py.File('{0}/imdb-{1}k.h5'.format(data_path,
                                             round(n_features/1000)), 'w') as db:
        db.create_dataset('Xtrn', data=sp.todense(Xtrn), compression='gzip')
        db.create_dataset('ytrn', data=ytrn, compression='gzip')
        db.create_dataset('Xtst', data=sp.todense(Xtst), compression='gzip')
        db.create_dataset('ytst', data=ytst, compression='gzip')
preprocess_and_save(sentiment_data_directory, features)
We adopt the popular dimensionality reduction approach of principal component analysis (PCA), which aims to compress and embed the data into a lower-dimensional representation while preserving as much of the information as possible.
To avoid loading the entire data set into memory and to process the data more efficiently, we perform Incremental PCA instead.
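IncrementalPCA exposes a partial_fit method that updates the principal components one mini-batch at a time, which is the pattern Listing 3.13 follows. Here is a minimal sketch on made-up random data (the array sizes and chunk size are arbitrary, chosen only for illustration):

import numpy as np
from sklearn.decomposition import IncrementalPCA

X = np.random.RandomState(0).rand(10000, 200)   # made-up data: 10,000 samples, 200 features
ipca = IncrementalPCA(n_components=20)
chunk_size = 1000
for i in range(0, X.shape[0], chunk_size):
    ipca.partial_fit(X[i:i + chunk_size])       # update the components with one chunk at a time
X_reduced = ipca.transform(X)                   # project all samples onto the 20 learned components
print(X_reduced.shape)                          # (10000, 20)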
Listing 3.13: Perform dimensionality reduction using Incremental PCA (NOTE: generates a 187 MB file)
from sklearn.decomposition import IncrementalPCA
def transform_sentiment_data(data_path, n_features=5000, n_components=500):
    db = h5py.File('{0}/imdb-{1}k.h5'.format(data_path, round(n_features/1000)), 'r')

    # Fit incremental PCA on the training set, one chunk of 1000 examples at a time
    pca = IncrementalPCA(n_components=n_components)
    chunk_size = 1000
    n_samples = db['Xtrn'].shape[0]
    for i in range(0, n_samples // chunk_size):
        pca.partial_fit(db['Xtrn'][i*chunk_size:(i+1) * chunk_size])

    # Project the training and test sets onto the learned components
    Xtrn = pca.transform(db['Xtrn'])
    Xtst = pca.transform(db['Xtst'])

    # Save the transformed data in a new HDF5 file
    with h5py.File('{0}/imdb-{1}k-pca{2}.h5'.format(data_path,
                                                    round(n_features/1000), n_components), 'w') as db2:
        db2.create_dataset('Xtrn', data=Xtrn, compression='gzip')
        db2.create_dataset('ytrn', data=db['ytrn'], compression='gzip')
        db2.create_dataset('Xtst', data=Xtst, compression='gzip')
        db2.create_dataset('ytst', data=db['ytst'], compression='gzip')
transform_sentiment_data(sentiment_data_directory, n_features=5000, n_components=500)
Our goal now is to train a heterogeneous ensemble with meta-learning. Specifically, we will ensemble several base estimators by blending them. Blending is a variant of stacking where, instead of using cross-validation, we use a single validation set.
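For comparison, scikit-learn provides a StackingClassifier that implements the cross-validation flavor of stacking out of the box. Here is a minimal sketch with a couple of the same kinds of base estimators (the hyperparameters are illustrative, not the ones used below):

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

# Stacking generates meta-features with internal cross-validation (cv=5),
# whereas blending (below) uses a single held-out validation set
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100)),
                ('bnb', BernoulliNB())],
    final_estimator=LogisticRegression(),
    passthrough=True,   # also feed the original features to the meta-estimator, as our blending recipe does
    cv=5)
# stack.fit(Xtrn, ytrn); stack.predict(Xtst)   # usage mirrors any other scikit-learn classifier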
def load_sentiment_data(data_path, n_features=5000, n_components=500):
    with h5py.File('{0}/imdb-{1}k-pca{2}.h5'.format(data_path,
                                                    round(n_features/1000), n_components), 'r') as db:
        Xtrn = np.array(db.get('Xtrn'))
        ytrn = np.array(db.get('ytrn'))
        Xtst = np.array(db.get('Xtst'))
        ytst = np.array(db.get('ytst'))
    return Xtrn, ytrn, Xtst, ytst
Xtrn, ytrn, Xtst, ytst = load_sentiment_data(sentiment_data_directory, n_features=5000, n_components=500)
Next, we use five base estimators: RandomForestClassifier with 100 randomized decision trees, [ExtraTreesClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier) with 100 extremely randomized trees, LogisticRegression, Bernoulli naïve Bayes (BernoulliNB), and a linear SVM trained with stochastic gradient descent (SGDClassifier).
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import BernoulliNB
estimators = [('rf', RandomForestClassifier(n_estimators=100, n_jobs=-1)),
              ('xt', ExtraTreesClassifier(n_estimators=100, n_jobs=-1)),
              ('lr', LogisticRegression(C=0.01, solver='lbfgs')),
              ('bnb', BernoulliNB()),
              ('svm', SGDClassifier(loss='hinge', penalty='l2', alpha=0.01,
                                    n_jobs=-1, max_iter=10, tol=None))]
To blend these base estimators into a heterogeneous ensemble with meta-learning, we use the following procedure:
1. Split the data into a training set (Xtrn, ytrn) with 80% of the data and a validation set (Xval, yval) with the remaining 20% of the data.
2. Train the level-1 base estimators on the training set (Xtrn, ytrn).
3. Generate meta-features Xmeta with the trained estimators using Xval.
4. Augment the validation set with the meta-features, [Xval, Xmeta]; this augmented validation set will have 500 original features + 5 meta-features.
5. Train the level-2 estimator on the augmented validation set ([Xval, Xmeta], yval).
This leaves one final decision: the choice of the level-2 estimator. Previously, we used simple linear classifiers. For this classification task, we utilize a neural network.
from sklearn.neural_network import MLPClassifier
meta_estimator = MLPClassifier(hidden_layer_sizes=(128, 64, 32), alpha=0.001)
Listing 3.14: Blending models with a validation set
from sklearn.model_selection import train_test_split
def blend_models(level1_estimators, level2_estimator,
                 X, y, use_probabilities=False):
    # Hold out 20% of the data as the blending (validation) set
    Xtrn, Xval, ytrn, yval = train_test_split(X, y, test_size=0.2)

    n_estimators = len(level1_estimators)
    n_samples = len(yval)
    Xmeta = np.zeros((n_samples, n_estimators))

    # Train each level-1 estimator on the training split and use its
    # predictions on the validation split as meta-features
    for i, (model, estimator) in enumerate(level1_estimators):
        estimator.fit(Xtrn, ytrn)
        Xmeta[:, i] = estimator.predict(Xval)

    # Augment the validation set with the meta-features and train the level-2 estimator
    Xmeta = np.hstack([Xval, Xmeta])
    level2_estimator.fit(Xmeta, yval)

    final_model = {'level-1': level1_estimators,
                   'level-2': level2_estimator,
                   'use-proba': use_probabilities}
    return final_model
Here, we combine Listing 3.2 (individual base-estimator predictions) and Listing 3.9 (meta-estimator predictions) into a single prediction function.
def predict_stacking(X, stacked_model):
    # Get level-1 predictions
    level1_estimators = stacked_model['level-1']
    n_samples, n_estimators = X.shape[0], len(level1_estimators)
    use_probabilities = stacked_model['use-proba']

    Xmeta = np.zeros((n_samples, n_estimators))  # Initialize meta-features
    for i, (model, estimator) in enumerate(level1_estimators):
        if use_probabilities:
            Xmeta[:, i] = estimator.predict_proba(X)[:, 1]
        else:
            Xmeta[:, i] = estimator.predict(X)

    # Augment the original features with the meta-features and get level-2 predictions
    level2_estimator = stacked_model['level-2']
    Xmeta = np.hstack([X, Xmeta])
    y = level2_estimator.predict(Xmeta)
    return y
Train a simple blending model with this code:
stacked_model = blend_models(estimators, meta_estimator, Xtrn, ytrn)
from sklearn.metrics import accuracy_score
ypred = predict_stacking(Xtrn, stacked_model)
trn_err = (1 - accuracy_score(ytrn, ypred)) * 100
print(trn_err)
ypred = predict_stacking(Xtst, stacked_model)
tst_err = (1 - accuracy_score(ytst, ypred)) * 100
print(tst_err)
7.340000000000002
16.959999999999997
Finally, we visualize and compare the performance of each individual base classifier with the meta-classifier ensemble. Stacking/blending improves classification performance by ensembling diverse base classifiers.
trn_errors, tst_errors = np.zeros((len(estimators) + 1, )), np.zeros((len(estimators) + 1, ))
for i, (method, estimator) in enumerate(estimators):
    start_time = time.time()
    estimator.fit(Xtrn, ytrn)
    run_time = time.time() - start_time

    ypred = estimator.predict(Xtrn)
    trn_errors[i] = (1 - accuracy_score(ytrn, ypred)) * 100
    ypred = estimator.predict(Xtst)
    tst_errors[i] = (1 - accuracy_score(ytst, ypred)) * 100

    print('{0}: training error = {1:4.2f}%, test error = {2:4.2f}%, running time = {3:4.2f} seconds.'
          .format(method, trn_errors[i], tst_errors[i], run_time))

trn_errors[-1] = trn_err
tst_errors[-1] = tst_err
rf: training error = 0.00%, test error = 22.41%, running time = 9.56 seconds.
xt: training error = 0.00%, test error = 29.92%, running time = 4.26 seconds.
lr: training error = 17.65%, test error = 18.03%, running time = 0.10 seconds.
bnb: training error = 21.66%, test error = 22.84%, running time = 0.32 seconds.
svm: training error = 27.34%, test error = 27.31%, running time = 0.39 seconds.
import matplotlib.pyplot as plt
def autolabel(ax, rects):
    for rect in rects:
        height = np.round(rect.get_height(), 1)
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
%matplotlib inline
labels = ['Random\nForest', 'Extra\nTrees', 'Logistic\nRegression',
          'Bernoulli\nNaive Bayes', 'SVM', 'Blending with\nNeural Nets']
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x - width / 2, trn_errors, width, label='Training Error')
rects2 = ax.bar(x + width / 2, tst_errors, width, label='Test Error')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Model performance (error %)', fontsize=12)
# ax.set_title('Scores by group and gender')
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=10)
ax.legend()
ax.set_axisbelow(True)
# ax.grid(linestyle='-', linewidth='0.5', color='gray')
autolabel(ax, rects1)
autolabel(ax, rects2)
fig.tight_layout()
# plt.savefig('./figures/CH03_F19_Kunapuli.png', format='png', dpi=300, bbox_inches='tight', pad_inches=0)
# plt.savefig('./figures/CH03_F19_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0)