Each of these higher-level tasks is broken down into smaller, discrete components that we combine to accomplish the higher-level goal.
Caution : NLP is not equivalent to machine learning (ML) with text. They have very different goals. NLP is about analyzing, understanding, and generating natural human language. That is a very different goal from ML, which is, in the case of supervised ML for instance, about building predictive models. The fact that text is the input data does not make a problem NLP. Rather, ML is a component that is useful for the broad field of NLP, i.e. one tool in the NLP arsenal.
# read yelp.csv into a DataFrame using a relative path
import pandas as pd
path = '../data/yelp.csv'
yelp = pd.read_csv(path)
# examine the first three rows
yelp.head(3)
|0|9yKzy9PApeiPPOUJEtnvkg|2011-01-26|fWKvX83p0-ka4JS3dc6E5A|5|My wife took me here on my birthday for breakf...|review|rLtl8ZkDX5vH5nAx9C3q5Q|2|5|0|
|1|ZRJwVLyzEJq1VAihDhYiow|2011-07-27|IjZ33sJrzXqU-0X6U8NwyA|5|I have no idea why some people give bad review...|review|0a2KyEL0d3Yb1V6aivbIuQ|0|0|0|
|2|6oRAC4uyJCsJl1X0WZpVSA|2012-06-14|IESLBzqUCLdSzSqm0eCSxQ|4|love the gyro plate. Rice is so good and I als...|review|0hT2KtfLiobPvh6cDC8JQg|0|1|0|
# examine the text for the first row
yelp.loc[0, 'text']
'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'
Goal : Distinguish between 5-star and 1-star reviews using only the review text. (We will not be using the other columns.)
Why do we use only the 5-star and 1-star reviews? There are two reasons for this. First, solving a 2-class problem (also known as binary classification) instead of a 5-class problem is much easier. Second, we might actually get better results by training our model only on 1-star and 5-star reviews even if we ultimately want to predict 1, 2, 3, 4, or 5 stars. The reason is that 1-star and 5-star reviews have the most polarized language: they are the most extreme, so they carry the most signal about positivity or negativity. So we might build a model with 1-star and 5-star reviews only, and then, to predict 1 through 5 stars, take the predicted probability of the 5-star class and map it into five buckets. That is all to say that just because we have a 5-class problem does not mean we necessarily need to train our model on all five classes.
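As a sketch of that bucketing idea (the bucket boundaries below are illustrative assumptions, not tuned values), the predicted probability of the 5-star class can be mapped into five bins with NumPy's `digitize`:

```python
import numpy as np

# hypothetical predicted probabilities of the 5-star class from a binary model
pred_prob_5 = np.array([0.05, 0.25, 0.55, 0.75, 0.95])

# map each probability into one of five equal-width buckets -> star ratings 1-5
stars = np.digitize(pred_prob_5, bins=[0.2, 0.4, 0.6, 0.8]) + 1
print(stars)  # [1 2 3 4 5]
```

In practice the boundaries would be chosen to match the observed distribution of star ratings rather than being equal-width.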
# examine the class distribution
yelp.stars.value_counts().sort_index()
1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64
# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]
# examine the shape
yelp_best_worst.shape
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# examine the object shapes
print(X_train.shape)
print(X_test.shape)
Tokenization doesn't necessarily mean separating text into words. That is what CountVectorizer does by default, but technically a token can be whatever unit is meaningful to us. A sentence can be the token level, a paragraph can be the token level, or, very commonly, an n-gram is the token level. Recall that 1-grams are single words, 2-grams are pairs of consecutive words, etc.
# use CountVectorizer to create document-term matrices from X_train and X_test
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# fit and transform X_train
X_train_dtm = vect.fit_transform(X_train)
# only transform X_test
X_test_dtm = vect.transform(X_test)
# examine the shapes : rows are documents, columns are terms (aka "tokens" or "features")
print(X_train_dtm.shape)
print(X_test_dtm.shape)
(3064, 16825)
(1022, 16825)
# examine the last 50 features
print(vect.get_feature_names()[-50:])
['yyyyy', 'z11', 'za', 'zabba', 'zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zero', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zihuatenejo', 'zilch', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'zipper', 'zippers', 'zipps', 'ziti', 'zoe', 'zombi', 'zombies', 'zone', 'zones', 'zoning', 'zoo', 'zoyo', 'zucca', 'zucchini', 'zuchinni', 'zumba', 'zupa', 'zuzu', 'zwiebel', 'zzed', 'éclairs', 'école', 'ém']
# show default parameters for CountVectorizer
vect
CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
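One detail worth noting in those defaults is `token_pattern`: the regex requires at least two word characters, so single-letter words and punctuation are silently dropped. A small sketch of the pattern on its own:

```python
import re

# the default token_pattern keeps only runs of two or more word characters,
# so 'I', 'a', '5', and the punctuation are all discarded
token_pattern = re.compile(r'(?u)\b\w\w+\b')
print(token_pattern.findall('I ate a 5-star meal!'))
# ['ate', 'star', 'meal']
```

(CountVectorizer also lowercases the text before tokenizing; this sketch applies the pattern to the raw string only.)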
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
# examine the last 50 features
print(vect.get_feature_names()[-50:])
['zone out', 'zone when', 'zones', 'zones dolls', 'zoning', 'zoning issues', 'zoo', 'zoo and', 'zoo is', 'zoo not', 'zoo the', 'zoo ve', 'zoyo', 'zoyo for', 'zucca', 'zucca appetizer', 'zucchini', 'zucchini and', 'zucchini bread', 'zucchini broccoli', 'zucchini carrots', 'zucchini fries', 'zucchini pieces', 'zucchini strips', 'zucchini veal', 'zucchini very', 'zucchini with', 'zuchinni', 'zuchinni again', 'zuchinni the', 'zumba', 'zumba class', 'zumba or', 'zumba yogalates', 'zupa', 'zupa flavors', 'zuzu', 'zuzu in', 'zuzu is', 'zuzu the', 'zwiebel', 'zwiebel kräuter', 'zzed', 'zzed in', 'éclairs', 'éclairs napoleons', 'école', 'école lenôtre', 'ém', 'ém all']
How do we decide whether we want to use a different n-gram range ? We should always compare our classification models against the null model, which is a model that always predicts the most frequent class.
Approach 1 : Always predict the most frequent class
# examine the class distribution of the testing set
y_test.value_counts()

5    838
1    184
Name: stars, dtype: int64
# calculate null accuracy
# method 1
y_test.value_counts().head(1) / y_test.shape
# method 2
# print(y_test.value_counts() / y_test.shape)
5 0.819961 Name: stars, dtype: float64
82% is the accuracy that could be achieved by always predicting the most frequent class.
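scikit-learn also provides this null model directly as `DummyClassifier`. A minimal sketch on hypothetical labels (not the actual Yelp split):

```python
from sklearn.dummy import DummyClassifier

# toy labels mimicking an imbalanced test set
y = [5, 5, 5, 5, 1]

# strategy='most_frequent' is scikit-learn's built-in null model
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit([[0]] * len(y), y)  # the features are ignored by this strategy
print(dummy.score([[0]] * len(y), y))  # accuracy of always predicting 5
```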
Approach 2 : Use the default parameters for CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    # print the number of features that were generated
    print('Features : ', X_train_dtm.shape[1])
    # use Multinomial Naive Bayes to predict the star rating
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    # print the accuracy of its predictions
    print('Accuracy : ', metrics.accuracy_score(y_test, y_pred_class))
# use the default parameters
vect = CountVectorizer()
tokenize_test(vect)
Features :  16825
Accuracy :  0.9187866927592955
Note that the accuracy we get is better than that of the null model.
Approach 3 : Don't convert to lowercase
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
tokenize_test(vect)
Features :  20838
Accuracy :  0.9099804305283757
We might have guessed that accuracy would improve, since there are more nuanced features, but this is not the case. Why ? One way to think about it is that we now have certain features that mean the same thing but are assigned different weights/values.
Stepping back, recall that in ML (in general), we only want to pass features to our model that carry signal. Let's talk about ham and spam messages for a second to get an intuition about this. Ideally, we want our model to learn which words are hammy and which words are spammy. If the word 'nasty' is spammy regardless of whether it is uppercase or lowercase, it is better to have it appear only once as a feature. We only want to add features if they contribute more signal to our model than they add noise. On average, including lowercase and uppercase versions of the same word dilutes the signal rather than strengthening it.
Another way to think about this is to say that our new model is overfitting the data. Recall that overfitting means creating a model that is unnecessarily complex, and adding features does increase model complexity.
Approach 4 : Include 1-grams and 2-grams
Note that our previous models were not picking up on negation. For instance, 'happy' would most likely be treated as a 5-star word, whereas 'not happy' should be treated as a 1-star phrase. So maybe by including 2-grams, we will pick up on such 5-star or 1-star phrases.
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)
Features :  169847
Accuracy :  0.8542074363992173
So accuracy actually got worse... One conclusion we can draw from these experiments is that the outcome of a tuning change is hard to predict in advance.
Summary : Tuning CountVectorizer is a form of feature engineering, the process through which you create features that don't natively exist in the dataset. Our goal is to create features that contain the signal from the data (with respect to the response value), rather than the noise.
Conclusion : We should always remember the context. Let's pretend that our task was to predict English proficiency. In that case, capitalization might be hugely predictive of whether or not someone is highly proficient in English. If that is our task, i.e. if that is the response we are trying to predict, then lowercase=False might actually be a good setting that creates features with signal (not noise). So it all comes down to what our task is, the context of our problem, and how we can tune CountVectorizer to give us features that carry signal, not features that carry noise.
# show vectorizer parameters
vect
CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 2), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)
Features :  16528
Accuracy :  0.9158512720156555
# examine the stop words
print(sorted(vect.get_stop_words()))
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 
'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
tokenize_test(vect)
Features :  16815
Accuracy :  0.9207436399217221
# examine the terms that were removed due to max_df ("corpus-specific stop words")
print(sorted(vect.stop_words_))
['and', 'for', 'in', 'is', 'it', 'my', 'of', 'the', 'this', 'to']
# vect.stop_words_ is completely distinct from vect.get_stop_words()
print(vect.get_stop_words())
# only keep the top 1000 most frequent terms
vect = CountVectorizer(max_features=1000)
tokenize_test(vect)
Features :  1000
Accuracy :  0.8923679060665362
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
tokenize_test(vect)
Features :  8783
Accuracy :  0.9246575342465754
# include 1-grams and 2-grams, and only keep terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)
Features :  43957
Accuracy :  0.9324853228962818
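Rather than trying parameter combinations one at a time, the same search can be automated with a Pipeline and GridSearchCV. This is a sketch on a hypothetical tiny corpus, not the actual Yelp data:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical tiny corpus standing in for the Yelp reviews
texts = ['great food great service', 'awful food', 'loved it', 'terrible never again',
         'best meal ever', 'worst meal ever', 'really great place', 'really awful place']
labels = [5, 1, 5, 1, 5, 1, 5, 1]

# chaining the vectorizer and classifier makes CV refit the vocabulary on each fold,
# avoiding leakage from the validation documents into the vocabulary
pipe = Pipeline([('vect', CountVectorizer()), ('nb', MultinomialNB())])
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'vect__min_df': [1, 2]}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(texts, labels)
print(grid.best_params_)
```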
Guidelines for tuning CountVectorizer :
From the scikit-learn documentation:
Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.
Why should you care ?
When working with text in Python, you are likely to encounter errors related to encoding, and understanding Unicode will help you to troubleshoot these errors.
Unicode basics :
ASCII basics :
The default encoding in Python 2 is ASCII. The default encoding in Python 3 is UTF-8.
# examine two types of strings
print(type(b'hello'))
print(type('hello'))
<class 'bytes'>
<class 'str'>
# 'decode' converts 'bytes' to 'str'
b'hello'.decode(encoding='utf-8')

'hello'
# 'encode' converts 'str' to 'bytes'
'hello'.encode(encoding='utf-8')

b'hello'
From the scikit-learn documentation:
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").
If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace".