Sentiment Analysis of Twitter posts

Marcin Zabłocki


The goal of this project was to predict the sentiment of a given Twitter post using Python. Sentiment analysis can detect many different emotions attached to a text, but this report considers only three major classes: positive, negative and neutral. The training dataset was small (just over 5900 examples) and the data within it was highly skewed, which made building a good classifier considerably harder. After creating many custom features, combining bag-of-words and word2vec representations, and applying the Extreme Gradient Boosting algorithm, a classification accuracy of 58% was achieved.
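To make the effect of skewed data on accuracy concrete, the sketch below computes the class distribution and the accuracy of a trivial classifier that always predicts the most frequent class. It assumes that "skewed" refers to class imbalance. The labels are a hypothetical toy list, not the project's dataset, and the helper names `class_distribution` and `majority_baseline_accuracy` are illustrative only.

```python
from collections import Counter

# Hypothetical toy labels for illustration only; NOT the project's data.
labels = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1


def class_distribution(labels):
    """Return the share of each class among the labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


distribution = class_distribution(labels)
baseline = majority_baseline_accuracy(labels)
print(distribution)
print(baseline)
```

On an imbalanced dataset this baseline can already be fairly high, so a reported accuracy is easier to interpret when compared against it.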

Used Python Libraries

Data was pre-processed using the pandas, gensim and numpy libraries, and the training and validation process was built with scikit-learn. Plots were created using plotly.

In [2]:
from collections import Counter
import nltk
import pandas as pd
from emoticons import EmoticonDetector
import re as regex
import numpy as np
import plotly
from plotly import graph_objs
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from time import time
import gensim

# plotly configuration