High-speed-rail document analysis

By Ben Welsh

This analysis was conducted for the April 27, 2019, story "How California’s faltering high-speed rail project was ‘captured’ by costly consultants."

It found that outside consultants have provided more than 3,000 environmental statements, business documents and other reports to the California High-Speed Rail Authority. Altogether they contain more than 152,000 pages.

How we did it

Import Python tools

In [2]:
import os
import PyPDF2
import pathlib
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

Set output directory where documents will be saved.

In [5]:
output_dir = pathlib.Path(os.getenv("OUTPUT_DIR") or "./output")
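
The downloads below assume this folder already exists. If it doesn't, it can be created up front; a minimal sketch, not part of the original notebook:

# Create the output folder if it isn't already there.
output_dir.mkdir(parents=True, exist_ok=True)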

Read in a list of all the URLs on the rail authority's site that contain consultant reports.

In [23]:
page_list = open("./input/urls.csv").read().splitlines()

Request each page and parse out all of its PDF links.

In [27]:
def parse_page(url):
    """
    Parse all the PDF urls from the provided URL's HTML.
    """
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    a_list = soup.find_all("a")
    # Keep only links that point at PDF files, skipping anchors with no href.
    pdf_list = [a.get("href", "") for a in a_list if a.get("href", "").endswith(".pdf")]
    # The hrefs are relative, so prepend the rail authority's domain.
    return [f"http://hsr.ca.gov{href}" for href in pdf_list]
In [28]:
pdf_list = []
In [31]:
for page in page_list:
    pdfs = parse_page(page)
    pdf_list.extend(pdfs)

How many total PDF urls were found?

In [40]:
f"{len(pdf_list):,}"
Out[40]:
'3,410'

Remove all of the duplicates.

In [41]:
pdf_set = set(pdf_list)

How many URLs remain?

In [42]:
f"{len(pdf_set):,}"
Out[42]:
'3,168'

Download them all.

In [32]:
def download_pdf(url):
    """
    Download a PDF url to the output folder.
    """
    filename = url.split("/")[-1]
    path = output_dir.joinpath(filename)
    # Skip files that have already been downloaded.
    if path.exists():
        return
    try:
        print(f"Downloading {filename}")
        urlretrieve(url, path)
    except Exception:
        print(f"Failed to download {url}")
In [35]:
for url in pdf_set:
    download_pdf(url)

Get their page counts.

In [45]:
def get_page_count(path):
    """
    Get the page count of the provided PDF path.
    """
    with open(path, 'rb') as f:
        try:
            pdfReader = PyPDF2.PdfFileReader(f)
            return pdfReader.numPages
        except Exception:
            # Return a null value for PDFs that can't be parsed.
            return float("nan")
In [36]:
path_list = list(output_dir.glob('*.pdf'))

Count the total number of documents again to check how many we actually downloaded.

In [47]:
f"{len(pdf_path_list):,}"
Out[47]:
'3,129'

Loop through all the documents and tally pages.

In [38]:
page_counts = {p: get_page_count(p) for p in path_list}
In [49]:
df = pd.DataFrame(page_counts.items(), columns=["pdf", "page_count"])
In [50]:
df.sort_values("page_count", ascending=False).head()
Out[50]:
pdf page_count
1413 /media/palewire/Passport/hsr-pdfs/final_EIR_Me... 8923.0
142 /media/palewire/Passport/hsr-pdfs/final_ERIS_F... 4297.0
2108 /media/palewire/Passport/hsr-pdfs/BayCValley20... 2126.0
2508 /media/palewire/Passport/hsr-pdfs/brdmtg_04211... 1523.0
2421 /media/palewire/Passport/hsr-pdfs/Los_Angeles_... 1369.0
In [51]:
len(df)
Out[51]:
3129
In [52]:
f"{df.page_count.sum():,}"
Out[52]:
'151,703.0'
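
To carry the page tally forward, the table can be written out with pandas; a minimal sketch, assuming a file name ("page_counts.csv") that isn't in the original notebook:

# Save the per-document page counts alongside the downloaded PDFs.
# The file name here is an assumption, not from the original analysis.
df.to_csv(output_dir.joinpath("page_counts.csv"), index=False)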