The goal of this project was to predict the sentiment of a given Twitter post using Python. Sentiment analysis can detect many different emotions attached to a text, but this report considers only three major classes: positive, negative and neutral. The training dataset was small (just over 5900 examples) and highly skewed, which made it considerably harder to build a good classifier. After creating many custom features, combining bag-of-words and word2vec representations, and applying the Extreme Gradient Boosting algorithm, a classification accuracy of 58% was achieved.
The data was pre-processed with the pandas, gensim and numpy libraries, and the training and validation process was built with scikit-learn. Plots were created with plotly.
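Because the dataset is skewed, an accuracy figure is easier to interpret next to the score of a classifier that always predicts the most frequent class. The sketch below shows one way to compute the class distribution and that majority-class baseline. The function names and the toy labels are illustrative assumptions and are not taken from the project data.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the dataset as a fraction."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical toy labels, NOT the project's dataset.
toy_labels = ["positive"] * 5 + ["neutral"] * 3 + ["negative"] * 2

distribution = class_distribution(toy_labels)
baseline = majority_baseline_accuracy(toy_labels)

print(distribution)
print(baseline)
```

Running the same two functions on the real sentiment column would show how much of the reported accuracy could be reached by always guessing the dominant class.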
from collections import Counter
import nltk
import pandas as pd
from emoticons import EmoticonDetector
import re as regex
import numpy as np
import plotly
from plotly import graph_objs
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from time import time
import gensim

# plotly configuration
plotly.offline.init_notebook_mode()