The purpose of this notebook is to demonstrate how to import data from a pdf and create a csv (or other format) file with the contents. We'll use Tabula for this job. Python has a package that acts as a wrapper to the Java Tabula program and there is a requirement to have a Java runtime available. For assistance with your platform, follow the instructions at:
For demonstration purposes we'll use the pdf's made publically available by the Nova Scotia government which contains the results of water testing for the schools in Nova Scotia.
Source of PDF's: https://www.ednet.ns.ca/water-test-results
The government site provides eight separate files, one for each of the regions. The first four pages are a similar bit of text, then the data starts. We can read directly from the website, however, we'll first download and save the files locally in case they move, become unavailable or change in format in the future.
# create a list of regions (named based on the files on the gov site)
regions = ['avrce',
'cbvrce',
'ccrce',
'csap_en', # assume there is a French equivalent
'hrce',
'srce',
'ssrce',
'tcrce']
# NOTE: after you've run this cell once, you don't need to run it everytime as you
# already have the files locally saved (unless you want updates)
import requests
# using the requests (https://requests.readthedocs.io/) library, download the pdf's and save a copy locally
for region in regions:
url = f"https://www.ednet.ns.ca/sites/default/files/docs/watertestresults_{region}.pdf"
with requests.get(url, allow_redirects=True) as r:
with open(f"watertestresults_{region}.pdf", 'wb') as local_copy:
local_copy.write(r.content)
It is possible to obtain the raw text in the pdf. I've tried a number of method, but find pdftotext to be the most reliable. Others like PyPDF2 and subsequent forks omit some of the text. I've not been able to determine why, but notice consistently that the same blocks of text are missing from the output. In this sample, it was the first and last bullet points on the forth page. PDFToText seems to get all the text everytime, however, for tabular data, I find Tabula does a wonderful job.
# optional:
# for interest sake, pull the text from the first four pages and print it here
import pdftotext
with open("watertestresults_srce.pdf", "rb") as f:
pdf = pdftotext.PDF(f,raw=False)
for page in range(4):
print(pdf[page])
Water Test Results Strait Regional Centre for Education Prepared by the Department of Education and Early Childhood Development, September 2020 Introduction In December of 2019, the Province of Nova Scotia committed to testing the water in all public schools across Nova Scotia after Health Canada revised its guidelines related to lead concentration levels in water. Health Canada also revised its guidelines related to copper concentration levels in water, and copper tests were also conducted. To ensure that parents were confident in the availability of safe drinking water, the government also put bottled water in every public school in Nova Scotia. This water will remain in place until appropriate mediation steps are taken. In the meantime, Regional Centres of Education (RCEs) and Conseil scolaire acadien provincial (CSAP) have taken steps to remove access to any other water source until test results were received, and appropriate remediation steps can be taken. Nova Scotia’s Results Nova Scotia is committed to supplying clean drinkable water to our staff and students at our public schools. Currently, all schools receive clean drinking water. This will not change. The following pages include the water test results for all schools in this region. To find results for a particular school, press the Ctrl or Command key and F to reveal the search menu. Schools are also listed in alphabetical order for quick reference. Data captured includes school name, location of the water source in the school, the sample ID code for the water system/plumbing route tested, and the concentration of lead and copper. Concentrations of lead and copper are presented in milligrams per litre (mg/L). Health Canada’s maximum acceptable concentration of lead in drinking water is 0.0050 mg/L; the maximum acceptable concentration of copper in drinking water is 2.0000 mg/L. The data also shows where the water “passed” the test, or if the water “exceed limits.” Water taps that exceed lead and/or copper limits will not be used for drinking water. These taps will be either replaced, disconnected, taped off, or restricted to hand washing only. Next Steps Nova Scotia continues to deal with restrictions based on COVID-19. This means there is no access to school water fountains. Bottled water continues to be available to staff and students. Moving forward, schools will continue to receive bottled water until appropriate remediation is completed. Nova Scotia’s remediation plan for school water includes the following measures: •• ongoing remediation of water taps, plumbing and/or pipes •• restrict access to water taps that do not meet Health Canada guidelines •• ongoing communication with students and staff about water safety efforts at their school Information about individual schools and updates on remediation can be found on your REC/CSAP website. For additional information about schools in your region, please contact your region or CSAP. For more information about Health Canada’s guidelines related to lead and copper, go to the following sites: Lead: www.canada.ca/en/health-canada/news/2019/03/health-canada-sets-new-guideline-for-lead-in-drinking- water-latest-in-series-of-government-actions-to-protect-canadians-from-exposure-to-lead.html Copper: www.canada.ca/en/health-canada/services/publications/healthy-living/water-talk-copper.html#a3
The purpose of this notebook is to show a practical example using Tabula to extract data. We'll also plan to conflate the data from the eight PDF files into a single csv file for download. No guarantee is made on the completeness of this data as stored here in PDF, or as extracted and saved back in raw form. As discussed above, extracting data from PDF's can be troublesome and it is up to the reader to quality check the output for integrity, completeness and accuracy.
After going through the process below it was noted that 20 records are not properly imported from the avrce file. The records' location spans multiple lines, and in some cases results in a blank, in others a description that only has the second line in the Location text. Data is like this sometimes and may require some decisions afterward. Do we clean up manually, remove all blanks or come up with another solution?
But it does report progress below it as it goes.
import pandas as pd
import tabula
# create an empty dataframe (df) which we'll append each region's data
df = pd.DataFrame()
for region in regions:
pdf_path = f"watertestresults_{region}.pdf"
tabula_df = tabula.read_pdf(pdf_path, pages='all', stream=True)
temp_df = pd.DataFrame()
for page in tabula_df:
temp_df = temp_df.append(page.set_index("Sample ID"))
# since the pdf file doesn't include the region, we'll add that column and fill it
# NB: there is probably a more memory efficient way to accomplish this
temp_df['Region'] = region
df = df.append(temp_df)
print(f"Imported {len(temp_df.index)} records for {region}. Total in df: {len(df.index)}")
Imported 2182 records for avrce. Total in df: 2182 Imported 1134 records for cbvrce. Total in df: 3316 Imported 1939 records for ccrce. Total in df: 5255 Imported 470 records for csap_en. Total in df: 5725 Imported 6306 records for hrce. Total in df: 12031 Imported 312 records for srce. Total in df: 12343 Imported 1252 records for ssrce. Total in df: 13595 Imported 1163 records for tcrce. Total in df: 14758
# look at the data types
df.dtypes
School Name object Location object Lead (Pb) mg/L object Lead Result object Copper (Cu) mg/L object Copper Result object Region object dtype: object
# and recognize we need to convert the mg/L columns to numeric
df['Lead (Pb) mg/L'] = pd.to_numeric(df['Lead (Pb) mg/L'], errors='coerce')
df['Copper (Cu) mg/L'] = pd.to_numeric(df['Copper (Cu) mg/L'], errors='coerce')
df.dtypes
School Name object Location object Lead (Pb) mg/L float64 Lead Result object Copper (Cu) mg/L float64 Copper Result object Region object dtype: object
Now the data is in our dataframe what can we do with it? We can use Panda's to analyse it, export it to Excel, csv or other formats, plot charts and run any other sort of analyse we like. This is where the learning curve comes into play - Pandas is a great tool, but does require some learning. There are a number of online courses available; some high quality ones on Pluralsight.
What follows is some examples what sort of things can be done.
# export to excel
# see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
# for options
df.to_excel("water_test_results.xlsx")
# save an excel file with only records that have a sample id
df[df.index.notnull()].to_excel("water_test_results_noblanks.xlsx")
# similar for csv
df.to_csv("water_test_results.csv")
# find a particular Record
df[df.index=="MXH510"]
School Name | Location | Lead (Pb) mg/L | Lead Result | Copper (Cu) mg/L | Copper Result | Region | |
---|---|---|---|---|---|---|---|
Sample ID | |||||||
MXH510 | Prince Andrew High School | NaN | 0.0072 | Exceeds Limit | 2.6 | Exceeds Limit | hrce |
# see the records that have the additional second line in the Locations
# note that NaN is pandas way to say Not A Number
df[df["School Name"].isnull()]
School Name | Location | Lead (Pb) mg/L | Lead Result | Copper (Cu) mg/L | Copper Result | Region | |
---|---|---|---|---|---|---|---|
Sample ID | |||||||
NaN | NaN | M/E Guidence Office W/R - Main Entrance Guiden... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Janitor Room across from Room 102 - Janitor Ro... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Janitor Room-New Gym Girls L/R - Janitor Room-... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Boys W/R across from Custodial Office - Boys W... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Boys W/R across from Custodial Office - Boys W... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Boys W/R across from Custodial Office - Boys W... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Custodial Office - Girls... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Custodial Office - Girls... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Custodial Office - Girls... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Custodial Office - Girls... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Janitor Room across from Room 110 - Janitor Ro... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Boys W/R across from Custodial Office - Boys W... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Boys W/R across from Room 218 - Boys Washroom ... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Boys W/R across from Room 218 - Boys Washroom ... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Boys W/R across from Room 218 - Boys Washroom ... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Sr. Staff Room - Girls W... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Sr. Staff Room - Girls W... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Room 218 - Girls Washroo... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Room 218 - Girls Washroo... | NaN | NaN | NaN | NaN | avrce |
NaN | NaN | Girls W/R across from Room 218 - Girls Washroo... | NaN | NaN | NaN | NaN | avrce |
# list all the regions in our df
for reg in sorted(df["Region"].unique()):
print(reg)
avrce cbvrce ccrce csap_en hrce srce ssrce tcrce
# list unique school names
for school in df["School Name"].dropna().unique():
print(school)
# messy way to remove the ones with null values
# schools = [school for school in df["School Name"][-df["School Name"].isnull()].unique()]
# print(sorted(schools))
Aldershot Elementary School Annapolis East Elementary School Annapolis West Education Centre Avon View High School Berwick & District School Bridgetown Regional Community School Brooklyn District Elementary School Cambridge & District Elementary School Central Kings Rural High School Champlain Elementary School Clark Rutherford Memorial School Coldbrook & District School Dr. Arthur Hines Elementary School Dwight Ross Elementary School Evangeline Middle School Falmouth District School Gaspereau Valley Elementary School Glooscap Elementary School Hantsport School Horton High School Kings County Academy Kingston & District School L.E. Shaw Elementary School Lawrencetown Consolidated School Lawrencetown Education Centre Middleton Bus Garage Middleton Regional High School New Minas Elementary School Northeast Kings Education Centre Operations Centre Pine Ridge Middle School Port Williams Elementary School Somerset & District Elementary School St. Mary's Elementary School Three Mile Plains District School West Hants Middle School West Kings District High School Windsor Elementary School Windsor Forks District School Wolfville School Adult High School (Ashby) Baddeck Academy Boularderie Elementary School Bras d'Or Elementary School Breton Education Centre Brookland Elementary School Cabot Education Centre Cape Smokey Elementary School Coxheath Elementary School Cusack School Donkin School Dr. T.L. Sullivan School Ferrisview Elementary School Glace Bay Elementary School Glace Bay High School Greenfield Elementary School Harbourside Elementary School John Bernard Croak V.C. Memorial School Jubilee Elementary School Malcolm Munroe Middle School Marion Bridge School Memorial Composite High School Middle River Consolidated Mountainview Elementary School North Highlands Elementary School Oceanview Education Centre Rankin School of the Narrows Riverside School Riverview High School Robin Foote Elementary School Sherwood Park Education Centre Shipyard Elementary School St. Anne's Elementary School Sydney Academy Sydney Mines Middle School Sydney River Elementary School Tompkins Memorial Elementary School Whitney Pier Memorial Middle School A. G. Baillie Memorial School Advocate District School Amherst Regional High School Bible Hill Consolidated Bible Hill Junior High School Brookfield Elementary School Central Colchester Junior High School Chiganois Elementary School Cobequid Consolidated Elementary School Cobequid District Elementary School Cobequid Educational Centre Cumberland North Academy Cyrus Eaton Elementary School Debert Elementary School Dr. W.A. MacLeod Consolidated School E. B. Chandler Junior High School Elmsdale District Elementary School Enfield District Elementary F.H. MacDonald Academy G.R. Saunders Elementary School Great Village Elementary School Hants East Rural High School Hants North Rural High School Harmony Heights Elementary School Hilden Elementary School Junction Road Elementary School Kennetcook District Elementary School Maple Ridge Elementary School McCulloch Education Centre New Glasgow Academy North Nova Education Centre North River Elementary School Northport Consolidated Elementary School Northumberland Regional High School Oxford Regional Education Centre Parrsboro Regional Elementary School Parrsboro Regional High School Pictou Academy Pugwash District High School Rawdon District Elementary School Redcliff Middle School River Hebert District School Riverside Education Centre Salt Springs Elementary School Scotsburn Elementary School Shubenacadie District Elementary School South Colchester Academy Spring Street Academy Springhill Jr/Sr High School Tatamagouche Regional Academy Thorburn Consolidated School Trenton Elementary School Trenton Middle School Truro Elementary School Truro Junior High School Uniacke District School Upper Stewiacke Elementary School Valley Elementary School Wallace Consolidated Elementary School Walter Duggan Consolidated School West Colchester Consolidated School West End Memorial School West Highlands Elementary School West Pictou Consolidated School Winding River Consolidated School École Beaubassin École Bois-Joli École Pubnico-Ouest École Rose-des-Vents École acadienne de Pomquet École acadienne de Truro École Beau-Port École des Beaux-Marais École Belleville École du Carrefour École secondaire du Sommet École secondaire de Clare Centre scolaire Étoile de l'Acadie École du Grand Portage École Joseph-Dugas École Mer et Monde École NDA Centre scolaire de la Rive-Sud École secondaire de Par-en-Bas École Stella-Maris Centre le Tournesol École Wedgeport A. J. Smeltzer Junior High School Admiral Westphal Elementary School Alderney Elementary School Ash Lee Jefferson Elementary School Astral Drive Elementary School Astral Drive Junior High School Atl. Mem. Elem- Shad Bay Atlantic Memorial-Terence Bay Elementary School Atlantic View Elementary School Auburn Drive High School Basinview Drive Community School Beaver Bank-Kinsac Elementary School Beaver Bank-Monarch Drive Elementary School Bedford South School Beechville-Lakeside-Timberlea Junior Elementary School Beechville-Lakeside-Timberlea Senior Elementary School Bel Ayr Elementary School Bell Park Academic Centre BFEC (Dartmouth) Bicentennial School Brookhouse Elementary School Brookside Junior High School Burton Ettinger Elementary School Caldwell Road Elementary School Caledonia Junior High School Caudle Park Elementary School Cavalier Drive School Central Spryfield Elementary School Charles P. Allen High School Chebucto Heights Elementary School Citadel High School Clayton Park Junior High School Colby Village Elementary School Cole Harbour District High School Colonel John Stuart Elementary School Crichton Park Elementary School Cunard Junior High School Dartmouth High School Dartmouth South Academy Duc d'Anville Elementary School Dutch Settlement Elementary School East St. Margaret's Elementary School Eastern Passage Education Centre Eastern Shore District High School Elizabeth Sutherland School Ellenvale Junior High School Eric Graves Memorial Junior High School Fairview Heights Elementary School Fairview Heights Elementary School - Annex Fairview Junior High School Five Bridges Junior High School Fleming Tower School Gaetz Brook Junior High School George Bissett Elementary School Georges P. Vanier Junior High School Gorsebrook Junior High School Graham Creighton Junior High School Grosvenor-Wentworth Park Elementary School Halifax Central Junior High School Halifax West High School Hammonds Plains Consolidated School Harbour View Elementary School Harold T. Barrett Junior High School Harrietsfield Elementary School Harry R. Hamilton Elementary School Hawthorn Elementary School Herring Cove Junior High School Highland Park Junior High School Hillside Park Elementary School Holland Road Elementary School Humber Park Elementary School Ian Forsyth Elementary School Inglis Street Elementary School Island View High School J. L. Ilsley High School John MacNeil Elementary School John Martin Junior High School John W. MacLeod Elementary School Joseph Giles Elementary School Joseph Howe Elementary School Kingswood Elementary School LeMarchant-St. Thomas Elementary School Leslie Thomas Junior High School Lockview High School Madeline Symonds Middle School Marine Drive Academy Michael Wallace Elementary School Millwood Elementary School Millwood High School Mount Edward Elementary School Musquodoboit Rural High School Musquodoboit Valley Education Centre Nelson Whynder Elementary School Ocean View Elementary School O'Connell Drive Elementary School Oldfield Consolidated School Oxford School Oyster Pond Academy Park West School Porters Lake Elementary School Portland Estates Elementary School Prince Andrew High School Prospect Road Elementary School Ridgecliff Middle School Robert K. Turner Elementary School Rockingham Elementary School Rockingstone Heights School Rocky Lake Elementary Rocky Lake Junior High School Ross Road School Sackviile Heights Junior High School Sackville Heights Elementary School Sackville Heights Junior High School Sackville High School Saint Mary's Elementary School Sambro Elementary School Seaside Elementary School Shannon Park Elementary School Shatford Memorial Elementary School Sir Charles Tupper Elementary School Sir John A. Macdonald High School Sir Robert Borden Junior High School Smokey Drive Elementary School South Woodside School Southdale - North Woodside Elementary School Springvale Elementary School St. Agnes Junior High School St. Catherine's Elementary School St. Joseph's-Alexander McKay Elementary School St. Margaret's Bay Elementary School St. Stephen's Elementary School Sunnyside - Eaglewood Drive Sunnyside - Fort Sackville Sycamore Lane Elementary School Tallahassee Community School Tantallon Junior Elementary School Tantallon Senior Elementary School Upper Musquodoboit Consolidated Elementary School Waverley Memorial Elementary School Westmount Elementary School William King Elementary School Antigonish Education Centre Bayview Education Centre Cape Breton Highlands Education Centre/Academy Chedabucto Education Centre/ Guysborough Academy Dalbrae Academy Dr John Hugh Gillis Regional High School East Antigonish Education Centre/Academy East Richmond Education Centre Fanning Education Centre/ Canso Academy Felix Marchand Education Centre H.M. MacDonald Elementary School Inverness Education Centre/Academy Pleasant Bay School Richmond Education Centre/Academy St. Andrew Junior School St. Andrew's Consolidated School St. Mary's Education Centre/Academy Strait Area Education and Recreation Centre Tamarac Education Centre Whycocomagh Education Centre Aspotogan Consolidated Elementary School Bayview Community School Big Tancook Elementary School Bluenose Academy Bridgewater Elementary School Bridgewater Junior High School Chester Area Middle School Chester District Elementary School Dr. John C. Wickwire Academy Forest Heights Community School Hebbville Academy Liverpool Regional High School New Germany Elementary School New Germany Rural High School New Ross Consolidated School Newcombville Elementary School North Queens Community School Park View Education Centre Pentz Elementary School Petite Rivière Elementary School South Queens Middle School Verge House West Northfield Elementary School Barrington Municipal High School Carleton Consolidated School Clark's Harbour Elementary School Digby Elementary School Digby Neck Consolidated School Digby Regional High School Drumlin Heights Consolidated School Evelyn Richardson Memorial Elementary School Forest Ridge Academy Hillcrest Academy Islands Consolidated School Lockeport Elementary School Lockeport Regional High School Maple Grove Education Centre Meadowfields Community School Plymouth School Port Maitland Consolidated School Shelburne Regional High School St. Mary's Bay Academy Weymouth Consolidated School Yarmouth Consolidated Memorial High School Yarmouth Elementary School
# quick description for the entire data set
df.describe()
Lead (Pb) mg/L | Copper (Cu) mg/L | |
---|---|---|
count | 13603.000000 | 13603.000000 |
mean | 0.022222 | 0.416194 |
std | 0.235173 | 1.053212 |
min | 0.000000 | 0.000000 |
25% | 0.000700 | 0.081000 |
50% | 0.002200 | 0.180000 |
75% | 0.008200 | 0.410000 |
max | 20.000000 | 53.000000 |
# percentile list
perc =[.20, .40, .60, .80]
# list of dtypes to include
include =['float', 'int']
df.dropna().describe(percentiles=perc, include=include)
Lead (Pb) mg/L | Copper (Cu) mg/L | |
---|---|---|
count | 13530.000000 | 13530.000000 |
mean | 0.021983 | 0.414757 |
std | 0.235502 | 1.053373 |
min | 0.000000 | 0.000000 |
20% | 0.000400 | 0.064000 |
40% | 0.001300 | 0.133220 |
50% | 0.002200 | 0.180000 |
60% | 0.003700 | 0.240000 |
80% | 0.011000 | 0.506700 |
max | 20.000000 | 53.000000 |
# making boolean series for a region
filter = df["Region"]=="srce"
# filtered data description
df.where(filter, inplace=False).describe()
Lead (Pb) mg/L | Copper (Cu) mg/L | |
---|---|---|
count | 312.000000 | 312.000000 |
mean | 0.006145 | 0.277786 |
std | 0.036554 | 0.542652 |
min | 0.000000 | 0.001800 |
25% | 0.000000 | 0.066000 |
50% | 0.000700 | 0.130000 |
75% | 0.002100 | 0.250000 |
max | 0.450000 | 4.300000 |
# for more than one region
# making boolean series for a region
filter = df["Region"].isin(["srce","csap_en"])
# filtered data description
df.where(filter, inplace=False).describe()
Lead (Pb) mg/L | Copper (Cu) mg/L | |
---|---|---|
count | 782.000000 | 782.000000 |
mean | 0.009734 | 0.558019 |
std | 0.039038 | 1.389291 |
min | 0.000000 | 0.000000 |
25% | 0.000100 | 0.082525 |
50% | 0.001200 | 0.160800 |
75% | 0.005175 | 0.394650 |
max | 0.597300 | 15.280000 |
# maximum Lead
df[df['Lead (Pb) mg/L'] == df['Lead (Pb) mg/L'].max()]
School Name | Location | Lead (Pb) mg/L | Lead Result | Copper (Cu) mg/L | Copper Result | Region | |
---|---|---|---|---|---|---|---|
Sample ID | |||||||
C0G5324/NAG927 | Central Kings Rural High School | C120 - Gym Office - 100022422 | 20.0 | Exceeds Limit | 42.0 | Exceeds Limit | avrce |
# maximum copper
df[df['Copper (Cu) mg/L'] == df['Copper (Cu) mg/L'].max()]
School Name | Location | Lead (Pb) mg/L | Lead Result | Copper (Cu) mg/L | Copper Result | Region | |
---|---|---|---|---|---|---|---|
Sample ID | |||||||
MYR801 | John Martin Junior High School | Boys WR tap 1 by learning centre | 2.2 | Exceeds Limit | 53.0 | Exceeds Limit | hrce |