
Converting with the Art of Literary Text Analysis

Our objective here is to process a plain text file so that it is more suitable for analysis. In particular, we will take two Godfather screenplays and remove the stage directions. Here are the steps:

fetch the two screenplays
extract the screenplay text from the files
remove the stage directions

Since we're doing this for two files, we will introduce the concept of reusable functions. We've already used functions in Python; here we're defining our own for the first time and then calling them. The basic syntax is simple:

def function_name(arguments):
    # processing
    # return a value (usually)
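
As a concrete illustration of this syntax, here is a minimal sketch. The function name `count_words` and its sample input are invented for this example and are not part of the screenplay code.

```python
# hypothetical example: count the whitespace-separated words in a string
def count_words(text):
    words = text.split() # split on any run of whitespace
    return len(words) # return the number of words

count = count_words("I believe in America") # call the function with one argument
print(count)
```

The value after `return` is what the caller gets back, which is why it can be assigned to `count`.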

We can start by defining our function to fetch a URL, building on the materials we saw with Scraping.

In [64]:

import urllib.request

# this function simply fetches the contents of a URL
def fetch(url):
    response = urllib.request.urlopen(url) # open for reading
    return response.read() # read and return

In [65]:

godfatherUrl = "https://www.imsdb.com/scripts/Godfather.html" # URL to use
godfatherSource = fetch(godfatherUrl) # fetch URL
godfatherSource[0:80] # preview

Out[65]:

b'<html>\r\n<head><title>Godfather Script at IMSDb.</title>\r\n<meta name="description'
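
The `b` prefix in the preview shows that `fetch` returns bytes rather than a string. BeautifulSoup accepts bytes, but to inspect the source as ordinary text it has to be decoded first. The sketch below uses a short bytes literal in place of a real download, and the choice of UTF-8 (with undecodable bytes replaced) is an assumption, since the page's actual encoding is not confirmed here.

```python
# stand-in for the first bytes returned by fetch(); no network access needed
sample = b'<html>\r\n<head><title>Godfather Script at IMSDb.</title>'

text = sample.decode("utf-8", errors="replace") # bytes -> str (encoding is an assumption)
print(type(sample).__name__, type(text).__name__) # show the two types
print(text.splitlines()[0]) # first line of the decoded source
```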

In [66]:

from bs4 import BeautifulSoup

# this function extracts the text from the Godfather screenplays
def extract(source):
    soup = BeautifulSoup(source) # parse the source document
    return soup.find("pre").find("pre").text.strip() # return the plain text (no tags)

In [67]:

godfatherText = extract(godfatherSource) # extract text from source
godfatherText[0:80] # preview

Out[67]:

'THE GODFATHER\n\t_____________\n\n\tScreenplay\n\n\tby\n\n\tMARIO PUZO\n\n\tand\n\n\tFRANCIS FORD'
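
The `extract` function relies on the screenplay sitting inside a `pre` element nested in another `pre` element. If a page is laid out differently, `find` returns `None` and the chained call fails with an `AttributeError`. A more defensive variant might look like the sketch below; the name `extract_safe`, the explicit `"html.parser"` argument, and the two tiny test documents are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# hypothetical defensive variant of extract(): returns None instead of raising
def extract_safe(source):
    soup = BeautifulSoup(source, "html.parser") # name the parser explicitly
    outer = soup.find("pre") # first <pre> element, or None
    inner = outer.find("pre") if outer is not None else None # nested <pre>, or None
    return inner.text.strip() if inner is not None else None

nested = "<html><body><pre><pre>THE GODFATHER\n\tScreenplay</pre></pre></body></html>"
flat = "<html><body><p>no screenplay here</p></body></html>"
print(repr(extract_safe(nested))) # text of the inner <pre>
print(repr(extract_safe(flat))) # no <pre> at all
```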

In [68]:

import re

directions = r'^\t?[^\t]' # matches lines that start with at most one tab, then a non-tab character

# this function cleans the text by dropping stage directions (lines with at most one leading tab)
def clean(text):
    lines = re.sub(r'\n\n+', "\n\n", text).split("\n") # collapse runs of blank lines, then split into lines
    return [l for l in lines if not re.match(directions, l)] # keep only the lines that do not match
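
To see what the `directions` pattern does, it can help to try it on a few lines by hand. The sample lines below are made up for this sketch (they are not quoted from the screenplay); they have zero, one, two, and four leading tabs, plus one empty line.

```python
import re

directions = r'^\t?[^\t]' # same pattern as in the cell above

# made-up sample lines: no tab, one tab, two tabs, four tabs, empty
samples = ["INT. DAY: OFFICE", "\tThe room is dark.", "\t\tI believe in America.", "\t\t\t\tBONASERA", ""]

# True means the line matches the pattern, so clean() would drop it
results = [bool(re.match(directions, line)) for line in samples]

for line, dropped in zip(samples, results):
    print(repr(line), "dropped" if dropped else "kept")
```

Lines with two or more leading tabs do not match, so they are kept. The empty line is kept as well, because `[^\t]` needs at least one character to match.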

In [72]:

godfather = clean(godfatherText) # clean text
godfather[0:20] # preview

Out[72]:

['',
 '',
 '',
 '',
 '',
 '',
 '\t\t\t\t\t1 Gulf and Western Plaza',
 '',
 '',
 '',
 '\t\t\t\t  THE GODFATHER',
 '',
 '',
 '\t\t\t\tBONASERA',
 '\t\tAmerica has made my fortune.',
 '',
 '',
 '\t\t\t\tBONASERA',
 '\t\tI raised my daughter in the American',
 '\t\tfashion; I gave her freedom, but']
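
The preview still contains many empty strings and leading tabs. If those get in the way of later analysis, one possible follow-up step is sketched below. The function name `tidy` is hypothetical, and the sample list is copied from the preview above rather than fetched.

```python
# hypothetical follow-up step: strip surrounding whitespace and drop empty lines
def tidy(lines):
    stripped = [l.strip() for l in lines] # remove leading tabs and trailing spaces
    return [l for l in stripped if l] # keep only non-empty lines

sample = ['', '\t\t\t\tBONASERA', '\t\tAmerica has made my fortune.', '', '']
print(tidy(sample))
```

Note that stripping the tabs also removes the indentation that distinguishes speaker names from speech, so whether this step is wanted depends on the analysis that follows.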

In [74]:

godfather2url = "https://www.imsdb.com/scripts/Godfather-Part-II.html"
godfather2 = clean(extract(fetch(godfather2url))) # call nested functions
godfather2[0:40] # preview

Out[74]:

['',
 '\t\t\t\t Part Two',
 '',
 '\t\t\t\tScreenplay by',
 '',
 '\t\t\t\tMario Puzo',
 '',
 '\t\t\t\t    and',
 '',
 '\t\t\t Francis Ford Coppola',
 '',
 '',
 '',
 '',
 '',
 "\t\t     Mario Puzo's THE GODFATHER",
 '',
 '',
 '',
 '',
 '',
 '\t\t\t\t\t\t\t\tDISSOLVE TO:',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '\t\t\t\tWOMAN',
 '\t\t\t(Sicilian)',
 "\t\tThey've killed young Paolo!  They've",
 '\t\tkilled the boy Paolo!',
 '',
 '',
 '',
 '',
 '',
 '']
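
The nested call `clean(extract(fetch(godfather2url)))` is evaluated from the inside out: the result of `fetch` is passed to `extract`, and that result is passed to `clean`. The sketch below shows that this is equivalent to using intermediate variables. The three `_stub` functions and the example URL are toy stand-ins invented for illustration; they do no network access or HTML parsing.

```python
# toy stand-ins for fetch, extract and clean (hypothetical; no network, no parsing)
def fetch_stub(url):
    return "<pre>\tA door opens.\n\t\tHello.</pre>" # pretend this came from the URL

def extract_stub(source):
    return source.replace("<pre>", "").replace("</pre>", "") # drop the tags

def clean_stub(text):
    return [l for l in text.split("\n") if l.startswith("\t\t")] # keep two-tab lines only

# step by step, with intermediate variables
source = fetch_stub("https://example.com/script.html")
text = extract_stub(source)
stepwise = clean_stub(text)

# the same thing as one nested call
nested = clean_stub(extract_stub(fetch_stub("https://example.com/script.html")))
print(stepwise == nested, nested)
```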

And there we are: we now have code to process our Godfather screenplays. It's not perfect, but it's a great start!

CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell.
Created January 31, 2019 (Jupyter 5).