Our objective here is to process a plain text file so that it is more suitable for analysis. In particular, we will take two Godfather screenplays and remove the stage directions. The steps are to fetch each screenplay from its URL, extract the plain text from the HTML source, and clean the text by removing the stage directions.
Since we're doing this for two files, we will introduce the concept of reusable functions. We've already used functions in Python; here we define our own functions for the first time and use them. The basic syntax is simple:
def function_name(arguments):
    # processing
    # return a value (usually)
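As a minimal sketch of this syntax, the function below takes one argument, does a little processing, and returns a value. The name `count_tabs` is purely illustrative and is not part of the screenplay code that follows.

```python
# Hypothetical example: count the tab characters at the start of a line.
def count_tabs(line):
    stripped = line.lstrip("\t")       # remove leading tabs only
    return len(line) - len(stripped)   # number of tabs that were removed

count_tabs("\t\tAmerica has made my fortune.")  # returns 2
```

Once defined, the function can be called as many times as needed with different arguments, which is what makes it reusable.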
We can start by defining our function to fetch a URL, building on the materials we saw with Scraping.
import urllib.request
# this function simply fetches the contents of a URL
def fetch(url):
    response = urllib.request.urlopen(url) # open for reading
    return response.read() # read and return
godfatherUrl = "https://www.imsdb.com/scripts/Godfather.html" # URL to use
godfatherSource = fetch(godfatherUrl) # fetch URL
godfatherSource[0:80] # preview
b'<html>\r\n<head><title>Godfather Script at IMSDb.</title>\r\n<meta name="description'
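The `b` prefix in the preview shows that `fetch` returns bytes rather than a string. BeautifulSoup accepts bytes, so nothing more is required here, but if a string is wanted, one option is to decode the result. The sketch below is a variant under stated assumptions: `fetch_text` and its `encoding` parameter are hypothetical names, and UTF-8 with replacement of undecodable bytes is a guess rather than something the page is known to declare.

```python
import urllib.request

# Hypothetical variant of fetch: returns a string instead of bytes.
def fetch_text(url, encoding="utf-8"):
    # the with-statement closes the connection once reading is done
    with urllib.request.urlopen(url) as response:
        raw = response.read()                      # bytes from the server
    # assumed encoding; bytes that cannot be decoded become U+FFFD
    return raw.decode(encoding, errors="replace")
```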
from bs4 import BeautifulSoup
# this function extracts the text from the Godfather screenplays
def extract(source):
    soup = BeautifulSoup(source) # parse the source document
    return soup.find("pre").find("pre").text.strip() # return the plain text (no tags)
godfatherText = extract(godfatherSource) # extract text from source
godfatherText[0:80] # preview
'THE GODFATHER\n\t_____________\n\n\tScreenplay\n\n\tby\n\n\tMARIO PUZO\n\n\tand\n\n\tFRANCIS FORD'
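The chained `find("pre").find("pre")` call selects a `pre` element nested inside another `pre` element. As a rough illustration, the sketch below runs the same chain on a small made-up HTML string. The sample markup is an assumption meant to mimic that nesting, not a copy of the real page, and the parser is named explicitly as `"html.parser"` only to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Made-up markup: an outer <pre> with a header, and an inner <pre> with the script.
sample = (
    "<html><body><pre><b>header</b>"
    "<pre>\n\tTITLE\n\n\t\tA line of dialogue.\n</pre>"
    "</pre></body></html>"
)

soup = BeautifulSoup(sample, "html.parser")  # parse the sample document
inner = soup.find("pre").find("pre")         # first <pre> inside the first <pre>
text = inner.text.strip()                    # plain text, outer whitespace removed
```

Here `text` holds only the contents of the inner element, so the header in the outer element is left out.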
import re
directions = r'^\t?[^\t]' # matches lines that start with at most one tab (the stage directions)
# this function collapses runs of blank lines and drops lines that start with at most one tab
def clean(text):
    lines = re.sub(r'\n\n+', "\n\n", text).split("\n") # collapse blank lines, then split on new lines
    return [l for l in lines if not re.match(directions, l)] # keep only the non-matching lines
godfather = clean(godfatherText) # clean text
godfather[0:20] # preview
['', '', '', '', '', '', '\t\t\t\t\t1 Gulf and Western Plaza', '', '', '', '\t\t\t\t THE GODFATHER', '', '', '\t\t\t\tBONASERA', '\t\tAmerica has made my fortune.', '', '', '\t\t\t\tBONASERA', '\t\tI raised my daughter in the American', '\t\tfashion; I gave her freedom, but']
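To see which lines the regular expression filters out, it can help to try it on a handful of lines. The sketch below uses an invented sample shaped like the screenplay (only the dialogue line is taken from the preview above); it is an illustration of the pattern, not additional data from the script.

```python
import re

directions = r'^\t?[^\t]'  # zero or one tab, then a character that is not a tab

sample = [
    "\tThe Don sits behind his desk.",   # one tab: matches, so it is dropped
    "\t\t\t\tBONASERA",                  # four tabs: no match, kept
    "\t\tAmerica has made my fortune.",  # two tabs: no match, kept
    "",                                  # empty: no character to match, kept
    "FADE IN:",                          # no tab: matches, so it is dropped
]

kept = [l for l in sample if not re.match(directions, l)]
# kept is ['\t\t\t\tBONASERA', '\t\tAmerica has made my fortune.', '']
```

Because an empty line has no character for `[^\t]` to match, empty strings survive the filter, which is why the previews contain so many of them.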
godfather2url = "https://www.imsdb.com/scripts/Godfather-Part-II.html"
godfather2 = clean(extract(fetch(godfather2url))) # call nested functions
godfather2[0:40] # preview
['', '\t\t\t\t Part Two', '', '\t\t\t\tScreenplay by', '', '\t\t\t\tMario Puzo', '', '\t\t\t\t and', '', '\t\t\t Francis Ford Coppola', '', '', '', '', '', "\t\t Mario Puzo's THE GODFATHER", '', '', '', '', '', '\t\t\t\t\t\t\t\tDISSOLVE TO:', '', '', '', '', '', '', '', '', '\t\t\t\tWOMAN', '\t\t\t(Sicilian)', "\t\tThey've killed young Paolo! They've", '\t\tkilled the boy Paolo!', '', '', '', '', '', '']
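Nesting calls like this means that the result of the innermost call is passed directly to the next function out, without storing it in a variable. A minimal sketch with two toy stand-in functions (`strip_tabs` and `lowercase` are hypothetical and unrelated to the functions defined above) shows that the nested form gives the same result as the step-by-step form:

```python
# Toy stand-ins used only to illustrate nesting.
def strip_tabs(line):
    return line.strip("\t")   # remove tabs from both ends

def lowercase(line):
    return line.lower()       # convert to lower case

line = "\t\t\t\tBONASERA"

# step by step, with an intermediate variable
stripped = strip_tabs(line)
step_by_step = lowercase(stripped)

# nested: the innermost call runs first
nested = lowercase(strip_tabs(line))
```

Both `step_by_step` and `nested` end up with the same string; the nested form is shorter, while the step-by-step form makes it easier to preview intermediate results.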
And there we are: we now have code to process our Godfather screenplays. It's not perfect, but it's a great start!
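One possible next step, given the many empty strings and leading tabs visible in the previews, would be to tidy the cleaned list further. The sketch below is only a suggestion: `tidy` is a hypothetical helper, and whether discarding blank lines and indentation is desirable depends on the analysis, since both carry information about the layout of the screenplay.

```python
# Hypothetical helper: strip surrounding whitespace and drop empty lines.
def tidy(lines):
    stripped = (l.strip() for l in lines)  # remove tabs and spaces at both ends
    return [l for l in stripped if l]      # keep only non-empty lines

sample = ['', '', '\t\t\t\tBONASERA', '\t\tAmerica has made my fortune.', '', '']
tidied = tidy(sample)
# tidied is ['BONASERA', 'America has made my fortune.']
```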
CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell.
Created January 31, 2019 (Jupyter 5).