In this code-along session, you will use some basic Natural Language Processing to plot the most frequently occurring words in the novel Moby Dick. In doing so, you'll also see the efficacy of thinking in terms of the following Data Science pipeline with a constant regard for process:
For example, what would the following word frequency distribution be from?
Follow the instructions in the README.md to get your system set up and ready to go.
What are the most frequent words in the novel Moby Dick and how often do they occur?
Your raw data is the text of Melville's novel Moby Dick. We can find it at Project Gutenberg.
TO DO: Head there, find Moby Dick and then store the relevant url in your Python namespace:
# Store url
url = 'https://www.gutenberg.org/files/2701/2701-h/2701-h.htm'
You're going to use requests to get the web data.
You can find out more in DataCamp's Importing Data in Python (Part 2) course.
According to the requests package website:
Requests is one of the most downloaded Python packages of all time, pulling in over 13,000,000 downloads every month. All the cool kids are doing it!
You'll be making a GET request from the website, which means you're getting data from it. requests makes this easy with its get function.
TO DO: Make the request here and check the object type returned.
# Import `requests`
import requests
# Make the request and check object type
r = requests.get(url)
type(r)
requests.models.Response
This is a Response object. You can see in the requests quickstart guide that a Response object has an attribute text that allows you to get the HTML from it!
TO DO: Get the HTML and print the HTML to check it out:
# Extract HTML from Response object and print
html = r.text
#print(html)
OK! This HTML is not quite what you want. However, it does contain what you want: the text of Moby Dick. What you need to do now is wrangle this HTML to extract the novel.
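One caveat about the request step: a request can fail or return an error page, so it may be worth checking the response before trusting r.text. The following is a minimal sketch, not part of the original session; the helper name fetch_html and the 10-second timeout are illustrative assumptions.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page's HTML, or raise on an HTTP error."""
    # The timeout value is an assumption; it stops the call from hanging forever
    r = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx or 5xx status
    r.raise_for_status()
    return r.text
```

Calling fetch_html(url) with the url stored earlier would then stand in for the separate requests.get(url) and r.text steps.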
Recap:
Up next: it's time for you to parse the HTML and extract the text of the novel.
Here you'll use the package BeautifulSoup. The package website says:
TO DO: Create a BeautifulSoup object from the HTML.
# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")
type(soup)
bs4.BeautifulSoup
From these soup objects, you can extract all types of interesting information about the website you're scraping, such as title:
# Get soup title
soup.title
<title> Moby Dick; Or the Whale, by Herman Melville </title>
Or the title as a string:
# Get soup title as string
soup.title.string
'\n Moby Dick; Or the Whale, by Herman Melville\n '
Or all URLs found within a page’s <a> tags (hyperlinks):
# Get hyperlinks from soup and check out first 8
soup.find_all('a')[:8]
[<a href="#link2H_4_0002"> ETYMOLOGY. </a>, <a href="#link2H_4_0003"> EXTRACTS (Supplied by a Sub-Sub-Librarian). </a>, <a href="#link2HCH0001"> CHAPTER 1. Loomings. </a>, <a href="#link2HCH0002"> CHAPTER 2. The Carpet-Bag. </a>, <a href="#link2HCH0003"> CHAPTER 3. The Spouter-Inn. </a>, <a href="#link2HCH0004"> CHAPTER 4. The Counterpane. </a>, <a href="#link2HCH0005"> CHAPTER 5. Breakfast. </a>, <a href="#link2HCH0006"> CHAPTER 6. The Street. </a>]
What you want to do is extract the text from the soup, and there's a souper helpful .get_text() method precisely for this.
TO DO: Get the text, print it out and have a look at it. Is it what you want?
# Get the text out of the soup and print it
text = soup.get_text()
#print(text)
Notice that this is now nearly what you want. You'll need to do a bit more work.
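To see what .get_text() does on something small enough to read at a glance, here is a sketch on a made-up HTML snippet. The snippet and the demo_ names are illustrative, and it uses Python's built-in "html.parser" rather than the "html5lib" parser from the session, purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up HTML, standing in for the much larger Moby Dick page
demo_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Call me <b>Ishmael</b>.</p></body></html>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Text of a single element: the tags are dropped, the words are kept
demo_para = demo_soup.p.get_text()
print(demo_para)

# Text of the whole document: this also picks up the <title> text
demo_all = demo_soup.get_text()
print(demo_all)
```

The whole-document call returns everything that is text, including parts such as the title that are not the body of the work, which is one reason the extracted text of the novel is only nearly what you want.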
Recap:
Up next: you'll use Natural Language Processing, tokenization and regular expressions to extract the list of words in Moby Dick.
You'll now use nltk, the Natural Language Toolkit, to
You want to tokenize your text, that is, split it into a list of words.
To do this, you're going to use a powerful tool called regular expressions, or regex.
The regular expression that matches all words beginning with 'p' is 'p\w+'. Let's unpack this:
You'll now use the built-in Python package re to extract all words beginning with 'p' from the sentence 'peter piper pick a peck of pickled peppers' as a warm-up.
# Import regex package
import re
# Define sentence
sentence = 'peter piper pick a peck of pickled peppers'
# Define regex
ps = r'p\w+'
# Find all words in sentence that match the regex and print them
re.findall(ps, sentence)
['peter', 'piper', 'pick', 'peck', 'pickled', 'peppers']
This looks pretty good. Now, if 'p\w+' is the regex that matches words beginning with 'p', what's the regex that matches all words?
It's your job to now do this for our toy Peter Piper sentence above.
# Find all words and print them
re.findall(r'\w+', sentence)
['peter', 'piper', 'pick', 'a', 'peck', 'of', 'pickled', 'peppers']
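Going back to the 'p\w+' pattern, it can be read piece by piece: 'p' matches a literal p, '\w' matches one letter, digit or underscore, and '+' repeats the preceding item one or more times. Nothing in it says the 'p' has to start a word, so it can also match in the middle of one. The sketch below shows this on a made-up sentence and adds '\b' (a word boundary) as one possible way to tighten the pattern; the variable names are illustrative.

```python
import re

sample = 'peter piper stopped to pick apples'

# 'p' followed by one or more word characters, anywhere in the string
loose = re.findall(r'p\w+', sample)
print(loose)  # ['peter', 'piper', 'pped', 'pick', 'pples']

# \b requires the match to start at a word boundary
strict = re.findall(r'\bp\w+', sample)
print(strict)  # ['peter', 'piper', 'pick']
```

The patterns are written as raw strings (the r prefix) so that Python passes the backslashes to the regex engine unchanged.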
TO DO: use regex to get all the words in Moby Dick:
# Find all words in Moby Dick and print several
tokens = re.findall(r'\w+', text)
tokens[:8]
['Moby', 'Dick', 'Or', 'the', 'Whale', 'by', 'Herman', 'Melville']
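One thing to keep in mind with '\w+' is that an apostrophe is not a word character, so possessives and contractions are split into separate tokens. The sketch below shows this on a made-up phrase, along with one possible alternative pattern that keeps a straight apostrophe inside a word. The alternative is an assumption for illustration, not what the session uses, and it would not handle curly apostrophes.

```python
import re

phrase = "the whale's jaw can't close"

# \w+ stops at the apostrophe, so "whale's" becomes 'whale' and 's'
split_tokens = re.findall(r'\w+', phrase)
print(split_tokens)  # ['the', 'whale', 's', 'jaw', 'can', 't', 'close']

# Letters, optionally followed by an apostrophe and more letters
kept_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", phrase)
print(kept_tokens)  # ['the', "whale's", 'jaw', "can't", 'close']
```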
Recap:
Up next: extract the list of words in Moby Dick using nltk, the Natural Language Toolkit.
Go get it!
# Import RegexpTokenizer from nltk.tokenize
from nltk.tokenize import RegexpTokenizer
# Create tokenizer
tokenizer = RegexpTokenizer(r'\w+')
# Create tokens
tokens = tokenizer.tokenize(text)
tokens[:8]
['Moby', 'Dick', 'Or', 'the', 'Whale', 'by', 'Herman', 'Melville']
TO DO: Create a list containing all the words in Moby Dick such that all words contain only lower case letters. You'll find the string method .lower() handy:
# Initialize new list
words = []
# Loop through list tokens and make lower case
for word in tokens:
    words.append(word.lower())
# Print several items from list as sanity check
words[:8]
['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']
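The same lowercasing step can also be written as a list comprehension. Here is a small sketch on a hand-made token list; sample_tokens is illustrative and stands in for the real tokens list.

```python
# A few tokens standing in for the full list from Moby Dick
sample_tokens = ['Moby', 'Dick', 'Or', 'the', 'Whale']

# Build the lowercased list in a single expression
sample_words = [token.lower() for token in sample_tokens]
print(sample_words)  # ['moby', 'dick', 'or', 'the', 'whale']
```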
Recap:
Up next: remove common words such as 'a' and 'the' from the list of words.
It is common practice to remove words that appear a lot in the English language such as 'the', 'of' and 'a' (known as stopwords) because they're not so interesting. For more on all of these techniques, check out our Natural Language Processing Fundamentals in Python course.
The package nltk
has a list of stopwords in English which you'll now store as sw
and print the first several elements of.
If you get an error here, run the command nltk.download('stopwords')
to install the stopwords on your system.
# Import nltk
import nltk
# Get English stopwords and print some of them
sw = nltk.corpus.stopwords.words('english')
sw[:5]
['i', 'me', 'my', 'myself', 'we']
You want the list of all words in words that are not in sw. One way to get this list is to loop over all elements of words and add them to a new list if they are not in sw:
# Initialize new list
words_ns = []
# Add to words_ns all words that are in words but not in sw
for word in words:
    if word not in sw:
        words_ns.append(word)
# Print several list items as sanity check
words_ns[:5]
['moby', 'dick', 'whale', 'herman', 'melville']
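Checking "word not in sw" against a list scans the list each time, which adds up over a whole novel. Storing the stopwords in a set makes each lookup much cheaper. The sketch below uses a tiny made-up stopword set (demo_sw is hypothetical, not the nltk list) to keep it self-contained.

```python
demo_words = ['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']

# Hypothetical, hand-picked stopwords; the session uses nltk's English list
demo_sw = {'or', 'the', 'by'}

# Keep only the words that are not stopwords
demo_words_ns = [word for word in demo_words if word not in demo_sw]
print(demo_words_ns)  # ['moby', 'dick', 'whale', 'herman', 'melville']
```

With the real data, wrapping the nltk list as set(sw) before filtering would be the corresponding change.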
Recap:
Up next: plot the word frequency distribution of words in Moby Dick.
Our question was 'What are the most frequent words in the novel Moby Dick and how often do they occur?'
You can now plot a frequency distribution of words in Moby Dick in two lines of code using nltk. To do this, you create a frequency distribution object with nltk.FreqDist() and then call its plot() method:
# Import datavis libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Figures inline and set visualization style
%matplotlib inline
sns.set()
# Create freq dist and plot
freqdist1 = nltk.FreqDist(words_ns)
freqdist1.plot(25)
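To look at the numbers behind a plot like this, it can help to tally a small list by hand. The sketch below uses Python's collections.Counter on a made-up word list as a stand-in for nltk.FreqDist; it is an illustration of the counting idea, not the session's code.

```python
from collections import Counter

# Made-up words standing in for the filtered list of novel words
sample = ['whale', 'sea', 'whale', 'ship', 'whale', 'sea']

# Count how often each word occurs
counts = Counter(sample)

# The two most frequent words with their counts
top_two = counts.most_common(2)
print(top_two)  # [('whale', 3), ('sea', 2)]
```

FreqDist objects offer a most_common() method as well, so a call such as freqdist1.most_common(25) should give the values shown in the plot as a list of (word, count) pairs.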
Recap:
Up next: adding more stopwords.
# Import stopwords from sklearn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# Add sklearn stopwords to sw
sw = set(sw + list(ENGLISH_STOP_WORDS))
# Initialize new list
words_ns = []
# Add to words_ns all words that are in words but not in sw
for word in words:
    if word not in sw:
        words_ns.append(word)
# Create freq dist and plot
freqdist2 = nltk.FreqDist(words_ns)
freqdist2.plot(25)
The cool thing is that, in using nltk to answer our question, we actually already presented our solution in a manner that can be communicated to others: a frequency distribution plot! You can read off the most common words, along with their frequency. For example, 'whale' is the most common word in the novel (go figure), excepting stopwords, and it occurs more than 1200 times!
Since there are lots of novels on Project Gutenberg that you can make these word frequency distributions for, it makes sense to write your own function that does all of this:
def plot_word_freq(url):
    """Takes a url (from Project Gutenberg) and plots a word frequency
    distribution"""
    # Make the request
    r = requests.get(url)
    # Extract HTML from Response object
    html = r.text
    # Create a BeautifulSoup object from the HTML
    soup = BeautifulSoup(html, "html5lib")
    # Get the text out of the soup
    text = soup.get_text()
    # Create tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # Create tokens
    tokens = tokenizer.tokenize(text)
    # Initialize new list
    words = []
    # Loop through list tokens and make lower case
    for word in tokens:
        words.append(word.lower())
    # Get English stopwords
    sw = nltk.corpus.stopwords.words('english')
    # Initialize new list
    words_ns = []
    # Add to words_ns all words that are in words but not in sw
    for word in words:
        if word not in sw:
            words_ns.append(word)
    # Create freq dist and plot
    freqdist1 = nltk.FreqDist(words_ns)
    freqdist1.plot(25)
Now use the function to plot word frequency distributions from other texts on Project Gutenberg:
plot_word_freq('https://www.gutenberg.org/files/42671/42671-h/42671-h.htm')
plot_word_freq('https://www.gutenberg.org/files/521/521-h/521-h.htm')
plot_word_freq('https://www.gutenberg.org/files/10/10-h/10-h.htm')