Basic Natural Language Processing (NLP)

What is NLP?

  • Using computers to process (analyze, understand, generate) natural human languages
  • Most knowledge created by humans is unstructured text, and we need a way to make sense of it
  • Build probabilistic model using data about a language
  • Requires an understanding of language and the world

Higher level "task areas"

Higher-level tasks (such as sentiment analysis or machine translation) are each broken down into smaller, discrete components that are combined to accomplish the higher-level goal.

Lower level "components"

  • Tokenization : breaking text into tokens (words, sentences, n-grams)
  • Stop word removal : removing common words
  • TF-IDF : computing word importance
  • Stemming and lemmatization : reducing words to their base form
  • Part-of-speech tagging
  • Named entity recognition : person/organization/location
  • Segmentation : "New York City subway"
  • Word sense disambiguation : "buy a mouse"
  • Spelling correction
  • Language detection
  • Machine learning
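
As a small illustration of a few of these components, here is a minimal sketch in plain Python (the example sentence and the tiny stop word list are made up for illustration) that tokenizes a sentence, removes stop words, and counts term frequencies:

# toy sketch of tokenization, stop word removal, and term counting
# (the stop word list below is hand-picked for this example, not a real lexicon)
import re
from collections import Counter

text = "The gyro plate was great, and the rice was so good!"

# tokenization : lowercase the text and split it into word tokens
tokens = re.findall(r'\b\w+\b', text.lower())

# stop word removal : drop very common words that carry little signal
stop_words = {'the', 'and', 'was', 'so'}
tokens = [t for t in tokens if t not in stop_words]

# term frequency : count how often each remaining token appears
print(Counter(tokens))
# Counter({'gyro': 1, 'plate': 1, 'great': 1, 'rice': 1, 'good': 1})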

Caution : NLP is not equivalent to machine learning (ML) with text. They have very different goals. NLP is about analyzing, understanding, and generating natural human languages. That is a very different goal from ML, which (in the case of supervised ML, for instance) is about building predictive models. The fact that text is the input data doesn't make it NLP. Rather, ML is a component that is useful to the broader field of NLP, i.e. it is one tool in the NLP arsenal.

Agenda

  1. Reading in the Yelp reviews corpus
  2. Tokenizing the text
  3. Comparing the accuracy of different approaches
  4. Removing frequent terms (stop words)
  5. Removing infrequent terms
  6. Handling Unicode errors

Part 1 : Reading in the Yelp reviews corpus

  • "corpus" = collection of documents
  • "corpora" = plural form of corpus
In [1]:
# read yelp.csv into a DataFrame using a relative path
import pandas as pd
path = '../data/yelp.csv'
yelp = pd.read_csv(path)
In [2]:
# examine the first three rows
yelp.head(3)
Out[2]:
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0
In [3]:
# examine the text for the first row
yelp.loc[0, 'text']
Out[3]:
'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

Goal : Distinguish between 5-star and 1-star reviews using only the review text. (We will not be using the other columns.)

Why do we use only the 5-star and 1-star reviews? There are two reasons for this. First, solving a 2-class problem (also known as binary classification) is much easier than solving a 5-class problem. Second, we might actually get better results by training our model only on 1-star and 5-star reviews, even if the goal is to predict 1, 2, 3, 4, or 5 stars. The reason is that 1-star and 5-star reviews have the most polarized language: they are the most extreme, so they carry the most signal about positivity or negativity. So we might build a model with 1-star and 5-star reviews only, and then, to predict 1 through 5 stars, take the predicted probability of the 5-star class and map it into five different buckets (see the sketch below). That is all to say that just because we have a 5-class problem doesn't mean we necessarily need to train our model on all five classes.
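
As a rough sketch of that bucketing idea (using made-up predicted probabilities rather than the output of a fitted model), the predicted probability of the 5-star class could be cut into five bins:

# hypothetical sketch : map predicted probabilities of the 5-star class into 1-5 star buckets
# (the probabilities below are made up for illustration)
import numpy as np
import pandas as pd

pred_prob_5_star = np.array([0.05, 0.30, 0.55, 0.75, 0.97])
stars_pred = pd.cut(pred_prob_5_star,
                    bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
                    labels=[1, 2, 3, 4, 5],
                    include_lowest=True)
print(list(stars_pred))
# [1, 2, 3, 4, 5]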

In [4]:
# examine the class distribution
yelp.stars.value_counts().sort_index()
Out[4]:
1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64
In [5]:
# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]
In [6]:
# examine the shape
yelp_best_worst.shape
Out[6]:
(4086, 10)
In [7]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars
In [8]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
In [9]:
# examine the object shapes
print(X_train.shape)
print(X_test.shape)
(3064,)
(1022,)

Part 2 : Tokenizing the text

  • What : Separate text into units such as words, n-grams, or sentences
  • Why : Gives structure to previously unstructured text
  • Notes : Relatively easy with English language text, not easy with some languages

Tokenization doesn't necessarily mean separating text into words. That is what CountVectorizer does by default, but technically our token can be whatever unit is meaningful to us. A sentence can be the token level, a paragraph can be the token level, or (very commonly) an n-gram is the token level. Recall that a 1-gram (unigram) is a single word, a 2-gram (bigram) is a pair of adjacent words, etc.
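
To see exactly how CountVectorizer splits a single piece of text, its build_analyzer method returns the tokenizing function it will apply to each document (a quick sketch on a made-up sentence):

# sketch : inspect how CountVectorizer tokenizes a single piece of text
from sklearn.feature_extraction.text import CountVectorizer

sentence = 'The Bloody Mary was phenomenal'

# default settings : lowercased 1-grams (single words)
print(CountVectorizer().build_analyzer()(sentence))
# ['the', 'bloody', 'mary', 'was', 'phenomenal']

# 1-grams and 2-grams
print(CountVectorizer(ngram_range=(1, 2)).build_analyzer()(sentence))
# ['the', 'bloody', 'mary', 'was', 'phenomenal', 'the bloody', 'bloody mary', 'mary was', 'was phenomenal']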

In [10]:
# use CountVectorizer to create document-term matrices from X_train and X_test
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
In [11]:
# fit and transform X_train
X_train_dtm = vect.fit_transform(X_train)
In [12]:
# only transform X_test
X_test_dtm = vect.transform(X_test)
In [13]:
# examine the shapes : rows are documents, columns are terms (aka "tokens" or "features")
print(X_train_dtm.shape)
print(X_test_dtm.shape)
(3064, 16825)
(1022, 16825)
In [14]:
# examine the last 50 features
print(vect.get_feature_names()[-50:])
['yyyyy', 'z11', 'za', 'zabba', 'zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zero', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zihuatenejo', 'zilch', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'zipper', 'zippers', 'zipps', 'ziti', 'zoe', 'zombi', 'zombies', 'zone', 'zones', 'zoning', 'zoo', 'zoyo', 'zucca', 'zucchini', 'zuchinni', 'zumba', 'zupa', 'zuzu', 'zwiebel', 'zzed', 'éclairs', 'école', 'ém']
In [15]:
# show default parameters for CountVectorizer
vect
Out[15]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
  • lowercase: boolean, True by default
    • Convert all characters to lowercase before tokenizing.
In [16]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
Out[16]:
(3064, 20838)
  • ngram_range: tuple $(\min n, \max n)$, default=$(1, 1)$
    • The lower and upper boundary of the range of $n$-values for different $n$-grams to be extracted.
    • All values of $n$ such that $\min n \leq n \leq \max n$ will be used.
In [17]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range = (1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
Out[17]:
(3064, 169847)
In [18]:
# examine the last 50 features
print(vect.get_feature_names()[-50:])
['zone out', 'zone when', 'zones', 'zones dolls', 'zoning', 'zoning issues', 'zoo', 'zoo and', 'zoo is', 'zoo not', 'zoo the', 'zoo ve', 'zoyo', 'zoyo for', 'zucca', 'zucca appetizer', 'zucchini', 'zucchini and', 'zucchini bread', 'zucchini broccoli', 'zucchini carrots', 'zucchini fries', 'zucchini pieces', 'zucchini strips', 'zucchini veal', 'zucchini very', 'zucchini with', 'zuchinni', 'zuchinni again', 'zuchinni the', 'zumba', 'zumba class', 'zumba or', 'zumba yogalates', 'zupa', 'zupa flavors', 'zuzu', 'zuzu in', 'zuzu is', 'zuzu the', 'zwiebel', 'zwiebel kräuter', 'zzed', 'zzed in', 'éclairs', 'éclairs napoleons', 'école', 'école lenôtre', 'ém', 'ém all']

Part 3 : Comparing the accuracy of different approaches

How do we decide whether we want to use a different $n$-gram range? We should always compare our classification models against the null model: a model that always predicts the most frequent class.

Approach 1 : Always predict the most frequent class

In [19]:
y_test.value_counts()
Out[19]:
5    838
1    184
Name: stars, dtype: int64
In [20]:
# calculate null accuracy
# method 1
y_test.value_counts().head(1) / y_test.shape

# method 2
# print(y_test.value_counts()[5] / y_test.shape[0])
Out[20]:
5    0.819961
Name: stars, dtype: float64

82% is the accuracy that could be achieved by always predicting the most frequent class.
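
As an aside, scikit-learn also ships a ready-made null model, DummyClassifier, which gives the same baseline (a sketch that reuses the objects created above):

# sketch : the null model as a scikit-learn estimator
# (reuses X_train_dtm, X_test_dtm, y_train, and y_test from the cells above)
from sklearn.dummy import DummyClassifier
from sklearn import metrics

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_dtm, y_train)
y_pred_null = dummy.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_null))   # should print roughly 0.82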

Approach 2 : Use the default parameters for CountVectorizer

In [21]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features : ', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to predict the star rating
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    
    # print the accuracy of its predictions
    print('Accuracy : ', metrics.accuracy_score(y_test, y_pred_class))
In [22]:
# use the default parameters
vect = CountVectorizer()
tokenize_test(vect)
Features :  16825
Accuracy :  0.9187866927592955

Note that this accuracy is better than that of the null model.

Approach 3 : Don't convert to lowercase

In [23]:
# don't convert to lowercase
vect = CountVectorizer(lowercase = False)
tokenize_test(vect)
Features :  20838
Accuracy :  0.9099804305283757

We might have guessed that the accuracy would increase, since preserving case produces more nuanced features. But that is not the case. Why? One way to think about this is that we now have features which mean the same thing (for example, 'Good' and 'good') but are associated with different weights/values.

Stepping back, recall that in ML (in general), we only want to pass features to our model that carry signal. Let's talk about ham and spam messages for a second to get an intuition for this. Ideally, we want our model to learn which words are hammy and which words are spammy. If the word 'nasty' is spammy regardless of whether it is uppercase or lowercase, it is better to have it only once as a feature. We only want to add features if they add more signal to our model than they add noise, and on average, including lowercase and uppercase versions of the same word dilutes the signal rather than increasing it.

Another way to think about this is to say that our new model is overfitting the data. Recall that overfitting means creating a model that is unnecessarily complex, and adding terms does increase our model's complexity.
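
A quick sketch on a made-up two-document corpus shows how lowercase=False splits what is effectively one feature into several:

# sketch : with lowercase=False, 'Nasty' and 'nasty' become separate features
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Nasty service, nasty food', 'nasty experience overall']

vect_lower = CountVectorizer().fit(docs)
print(vect_lower.get_feature_names())
# ['experience', 'food', 'nasty', 'overall', 'service']

vect_case = CountVectorizer(lowercase=False).fit(docs)
print(vect_case.get_feature_names())
# ['Nasty', 'experience', 'food', 'nasty', 'overall', 'service']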

Approach 4 : Include 1-grams and 2-grams

Note that our previous models were not picking up on negation. For instance, 'happy' would most probably be associated with 5-star reviews, whereas 'not happy' is a 1-star phrase. So maybe by including 2-grams, we will pick up on 5-star or 1-star phrases (see the sketch below).
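
Here is a quick sketch, again on a made-up pair of reviews, showing that with ngram_range=(1, 2) the phrase 'not happy' becomes a feature of its own, distinct from 'happy':

# sketch : 2-grams let the model see 'not happy' as its own feature
from sklearn.feature_extraction.text import CountVectorizer

docs = ['I am happy with the food', 'I am not happy with the service']
vect_ngram = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print([f for f in vect_ngram.get_feature_names() if 'happy' in f])
# ['am happy', 'happy', 'happy with', 'not happy']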

In [24]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)
Features :  169847
Accuracy :  0.8542074363992173

So it actually got worse... One conclusion we can draw from these experiments is that it is hard to predict the outcome of a parameter change in advance.

Summary : Tuning CountVectorizer is a form of feature engineering, the process through which you create features that don't natively exist in the dataset. Our goal is to create features that contain the signal from the data (with respect to the response value), rather than the noise.

Conclusion : We should always remember the context. Let's pretend that our task was to predict English proficiency. In that case, whether or not the text is lowercased might be hugely predictive of whether someone is highly proficient in English. If that is the response we are trying to predict, then lowercase=False might actually be a good setting, because it creates features that have signal (not noise). So it all comes down to what our task is, what the context of our problem is, and how we can tune CountVectorizer to give us features that carry the signal rather than the noise.

Part 4 : Removing frequent terms (stop words)

  • What : Remove common words that appear in most documents
  • Why : They probably don't tell you much about your text
In [25]:
# show vectorizer parameters
vect
Out[25]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
  • stop_words: string {'english'}, list, or None (default)
    • If 'english', a built-in stop word list for English is used.
    • If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    • If None, no stop words will be used.
In [26]:
# remove English stop words
vect = CountVectorizer(stop_words = 'english')
tokenize_test(vect)
Features :  16528
Accuracy :  0.9158512720156555
In [27]:
# examine the stop words
print(sorted(vect.get_stop_words()))
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']
  • max_df: float in range $[0.0, 1.0]$ or int, default=1.0
    • When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    • If float, the parameter represents a proportion of documents.
    • If integer, the parameter represents an absolute count.
In [28]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df = 0.5)
tokenize_test(vect)
Features :  16815
Accuracy :  0.9207436399217221
  • stop_words_: Terms that were ignored because they either:
    • occurred in too many documents (max_df)
    • occurred in too few documents (min_df)
    • were cut off by feature selection (max_features)
In [29]:
# examine the terms that were removed due to max_df ("corpus-specific stop words")
print(sorted(vect.stop_words_))
['and', 'for', 'in', 'is', 'it', 'my', 'of', 'the', 'this', 'to']
In [30]:
# vect.stop_words_ is completely distinct from vect.get_stop_words()
print(vect.get_stop_words())
None

Part 5 : Removing infrequent terms

  • max_features: int or None, default=None
    • If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
In [31]:
# only keep the top 1000 most frequent terms
vect = CountVectorizer(max_features = 1000)
tokenize_test(vect)
Features :  1000
Accuracy :  0.8923679060665362
  • min_df: float in range $[0.0, 1.0]$ or int, default=1
    • When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    • If float, the parameter represents a proportion of documents.
    • If integer, the parameter represents an absolute count.
In [32]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df = 2)
tokenize_test(vect)
Features :  8783
Accuracy :  0.9246575342465754
In [33]:
# include 1-grams and 2-grams, and only keep terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range = (1, 2), min_df = 2)
tokenize_test(vect)
Features :  43957
Accuracy :  0.9324853228962818

Guidelines for tuning CountVectorizer :

  • Use your knowledge of the problem and the text, and your understanding of the tuning parameters, to help you decide what parameters to tune and how to tune them.
  • Experiment, and let the data tell you the best approach!
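
One way to "let the data tell you" is to search over CountVectorizer settings and the classifier together with cross-validation. Here is a minimal sketch using a Pipeline and GridSearchCV (the parameter grid is just an example, not a recommendation):

# sketch : tune CountVectorizer and the classifier together with cross-validation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
param_grid = {
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'countvectorizer__min_df': [1, 2],
    'countvectorizer__stop_words': [None, 'english'],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)   # note : the pipeline takes the raw text, not a document-term matrix
print(grid.best_params_)
print(grid.best_score_)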

Part 6 : Handling Unicode errors

From the scikit-learn documentation:

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.

Why should you care?

When working with text in Python, you are likely to encounter errors related to encoding, and understanding Unicode will help you to troubleshoot these errors.

Unicode basics :

  • Unicode is a system that assigns a unique number for every character in every language. These numbers are called code points. For example, the code point for "A" is U+0041, and the official name is "LATIN CAPITAL LETTER A".
  • An encoding specifies how to store the code points in memory :
    • UTF-8 is the most popular Unicode encoding. It uses 8 to 32 bits to store each character.
    • UTF-16 is the second most popular Unicode encoding. It uses 16 or 32 bits to store each character.
    • UTF-32 is the least popular Unicode encoding. It uses 32 bits to store each character.

ASCII basics :

  • ASCII is an encoding from the 1960s that uses 7 bits per character (usually stored as one 8-bit byte), and only supports English characters.
  • ASCII-encoded files are sometimes called plain text.
  • UTF-8 is backward-compatible with ASCII, because the 128 ASCII characters are encoded in UTF-8 as the same single bytes.

The default encoding in Python 2 is ASCII. The default encoding in Python 3 is UTF-8.
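
A couple of quick checks of these facts in Python:

# 'A' has code point U+0041 (decimal 65); ASCII characters take one byte in UTF-8,
# while other characters take two to four bytes
print(hex(ord('A')))          # 0x41
print('A'.encode('utf-8'))    # b'A' (1 byte, same as ASCII)
print('é'.encode('utf-8'))    # b'\xc3\xa9' (2 bytes)
print('你'.encode('utf-8'))    # b'\xe4\xbd\xa0' (3 bytes)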

In [34]:
# examine two types of strings
print(type(b'hello'))
print(type('hello'))
<class 'bytes'>
<class 'str'>
In [35]:
# 'decode' converts 'bytes' to 'str'
b'hello'.decode(encoding = 'utf-8')
Out[35]:
'hello'
In [36]:
# 'encode' converts 'str' to 'bytes'
'hello'.encode(encoding = 'utf-8')
Out[36]:
b'hello'

From the scikit-learn documentation:

The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").

If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace".
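
A short sketch of what that looks like in practice, using a deliberately mis-encoded byte string (the example text is made up):

# sketch : handling bytes that are not valid UTF-8
from sklearn.feature_extraction.text import CountVectorizer

bad_bytes = 'café'.encode('latin-1')     # b'caf\xe9', which is not valid UTF-8

# decoding it as UTF-8 fails...
try:
    bad_bytes.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)

# ...unless undecodable bytes are replaced (with U+FFFD) or ignored
print(bad_bytes.decode('utf-8', errors='replace'))   # 'caf' plus the replacement character
print(bad_bytes.decode('utf-8', errors='ignore'))    # 'caf'

# the corresponding CountVectorizer setting when it is given raw bytes or file paths
vect = CountVectorizer(decode_error='replace')       # or decode_error='ignore'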