In the lecture we took a look at a simple tokenizer and sentence segmenter. In this exercise we will deepen our understanding of the problem by asking a few important questions and looking at it from different perspectives.
import re
Write a tokenizer to correctly tokenize the following text:
text = """'Curiouser and curiouser!' cried Alice (she was so much surprised, that for the moment she quite
forgot how to speak good English); 'now I'm opening out like the largest telescope that ever was! Good-bye,
feet!' (for when she looked down at her feet, they seemed to be almost out of sight, they were getting so far
off). 'Oh, my poor little feet, I wonder who will put on your shoes and stockings for you now, dears? I'm sure I
shan't be able! I shall be a great deal too far off to trouble myself about you: you must manage the best
way you can; —but I must be kind to them,' thought Alice, 'or perhaps they won't walk the way I want to go!
Let me see: I'll give them a new pair of boots every Christmas.'
"""
token = re.compile(r"Mr\.|[\w']+|[.?]")
tokens = token.findall(text)
print(tokens)
["'Curiouser", 'and', 'curiouser', "'", 'cried', 'Alice', 'she', 'was', 'so', 'much', 'surprised', 'that', 'for', 'the', 'moment', 'she', 'quite', 'forgot', 'how', 'to', 'speak', 'good', 'English', "'now", "I'm", 'opening', 'out', 'like', 'the', 'largest', 'telescope', 'that', 'ever', 'was', 'Good', 'bye', 'feet', "'", 'for', 'when', 'she', 'looked', 'down', 'at', 'her', 'feet', 'they', 'seemed', 'to', 'be', 'almost', 'out', 'of', 'sight', 'they', 'were', 'getting', 'so', 'far', 'off', '.', "'Oh", 'my', 'poor', 'little', 'feet', 'I', 'wonder', 'who', 'will', 'put', 'on', 'your', 'shoes', 'and', 'stockings', 'for', 'you', 'now', 'dears', '?', "I'm", 'sure', 'I', "shan't", 'be', 'able', 'I', 'shall', 'be', 'a', 'great', 'deal', 'too', 'far', 'off', 'to', 'trouble', 'myself', 'about', 'you', 'you', 'must', 'manage', 'the', 'best', 'way', 'you', 'can', 'but', 'I', 'must', 'be', 'kind', 'to', 'them', "'", 'thought', 'Alice', "'or", 'perhaps', 'they', "won't", 'walk', 'the', 'way', 'I', 'want', 'to', 'go', 'Let', 'me', 'see', "I'll", 'give', 'them', 'a', 'new', 'pair', 'of', 'boots', 'every', 'Christmas', '.', "'"]
Questions:
As you might imagine, tokenizing tweets differs from standard tokenization. There are conventions for the specific elements a tweet may contain (mentions, hashtags, links) and for how they should be tokenized. The goal of this exercise is not to create a bullet-proof Twitter tokenizer but to understand tokenization in a different domain.
Tokenize the following UCLMR tweet correctly:
tweet = "#emnlp2016 paper on numerical grounding for error correction http://arxiv.org/abs/1608.04147 @geospith @riedelcastro #NLProc"
token = re.compile(r'[\w\s]+')
tokens = token.findall(tweet)
print(tokens)
['emnlp2016 paper on numerical grounding for error correction http', 'arxiv', 'org', 'abs', '1608', '04147 ', 'geospith ', 'riedelcastro ', 'NLProc']
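The pattern above treats whitespace as part of a token and breaks the URL, the mentions, and the hashtags apart. A sketch of a more Twitter-aware pattern (one reasonable choice among many; alternatives are tried left to right, so the most specific ones come first):
token = re.compile(r"""
    https?://\S+         # URLs as single tokens
  | [@#]\w+              # mentions and hashtags
  | \w+(?:['-]\w+)*      # ordinary words, incl. contractions
  | [^\w\s]              # any other punctuation character
""", re.VERBOSE)
tokens = token.findall(tweet)
print(tokens)
This keeps #emnlp2016, http://arxiv.org/abs/1608.04147, @geospith, @riedelcastro, and #NLProc intact as single tokens.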
Questions:
Sentence segmentation is not a trivial task either. There are cases where a simple sentence segmenter won't work properly.
First, make sure you understand the following sentence segmentation code used in the lecture:
import re
def sentence_segment(match_regex, tokens):
    """
    Splits a sequence of tokens into sentences, splitting wherever the given matching regular expression
    matches.

    Parameters
    ----------
    match_regex the regular expression that defines at which token to split.
    tokens the input sequence of string tokens.

    Returns
    -------
    a list of token lists, where each inner list represents a sentence.

    >>> tokens = ['the','man','eats','.','She', 'sleeps', '.']
    >>> sentence_segment(re.compile('\.'), tokens)
    [['the', 'man', 'eats', '.'], ['She', 'sleeps', '.']]
    """
    current = []
    sentences = [current]
    for tok in tokens:
        current.append(tok)
        if match_regex.match(tok):
            current = []
            sentences.append(current)
    if not sentences[-1]:
        sentences.pop(-1)
    return sentences
Next, modify the following code so that sentence segmentation returns correctly segmented sentences on the following text:
text = """
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is the longest official one-word placename in U.K. Isn't that weird? I mean, someone took the effort to really make this name as complicated as possible, huh?! Of course, U.S.A. also has its own record in the longest name, albeit a bit shorter... This record belongs to the place called Chargoggagoggmanchauggagoggchaubunagungamaugg. There's so many wonderful little details one can find out while browsing http://www.wikipedia.org during their Ph.D. or an M.Sc.
"""
token = re.compile(r"Mr\.|[\w']+|[.?]")
tokens = token.findall(text)
sentences = sentence_segment(re.compile(r'\.'), tokens)
for sentence in sentences:
    print(sentence)
['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one', 'word', 'placename', 'in', 'U', '.']
['K', '.']
["Isn't", 'that', 'weird', '?', 'I', 'mean', 'someone', 'took', 'the', 'effort', 'to', 'really', 'make', 'this', 'name', 'as', 'complicated', 'as', 'possible', 'huh', '?', 'Of', 'course', 'U', '.']
['S', '.']
['A', '.']
['also', 'has', 'its', 'own', 'record', 'in', 'the', 'longest', 'name', 'albeit', 'a', 'bit', 'shorter', '.']
['.']
['.']
['This', 'record', 'belongs', 'to', 'the', 'place', 'called', 'Chargoggagoggmanchauggagoggchaubunagungamaugg', '.']
["There's", 'so', 'many', 'wonderful', 'little', 'details', 'one', 'can', 'find', 'out', 'while', 'browsing', 'http', 'www', '.']
['wikipedia', '.']
['org', 'during', 'their', 'Ph', '.']
['D', '.']
['or', 'an', 'M', '.']
['Sc', '.']
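The baseline splits inside U.K., U.S.A., Ph.D., M.Sc., the URL, and the ellipsis. One way to fix this, sketched below under the assumption that abbreviations, URLs, and runs of sentence-final punctuation should each be a single token:
token = re.compile(r"""
    https?://\S+          # URLs as single tokens, e.g. http://www.wikipedia.org
  | (?:[A-Za-z]+\.){2,}   # dotted abbreviations: U.K., U.S.A., Ph.D., M.Sc.
  | \w+(?:['-]\w+)*       # ordinary words, incl. contractions and hyphens
  | [.?!]+                # punctuation runs: '.', '?!', '...'
  | [^\w\s]               # any other punctuation character
""", re.VERBOSE)
tokens = token.findall(text)
sentences = sentence_segment(re.compile(r'[.?!]'), tokens)
for sentence in sentences:
    print(sentence)
One ambiguity remains by design: since U.K. is now a single token, the sentence boundary between "U.K." and "Isn't" is no longer detected. Deciding whether an abbreviation-final period also ends the sentence requires context beyond the token itself, for instance whether the next token starts with a capital letter.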
Questions: