![]() |
![]() |
![]() |
---|---|---|
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
This notebook demonstrates the steps involved in developing a lightweight prototype to bring together structured data on infrastructure projects and contracting processes and explores the added value of using OCDS data in infrastructure monitoring.
The notebook is divided into the following sections:
Projects
An export of structured data is not available from the front end of the CoST Ukraine portal, therefore a scraper script was written to extract projects data and store this in structured format.
Currently only data from the State Highways Agency is available on the CoST Ukraine portal.
Contracting processes
Structured data is available from the Prozorro platform via the following methods:
The most recent bulk export is dated 2018-04-20.
Currently the only public search interface available to the contracting data is in the Prozorro front end and there is no search API or interface to the bulk OCDS data to provide programmatic search of the data.
To reduce the number of manual steps in searching for and retrieving OCDS data from Prozorro the latest fortnightly bulk export was downloaded into an instance of the ocdsdata ETL tool so that the full dataset could be queried using PostgreSQL.
Using a sample of project titles from the CoST Ukraine portal and searching the tender/title
and tender/description
fields in the OCDS yielded the following results (number of relevant results shown in brackets):
Search term | Project 1 | Project 2 | Project 3 | Project 4 | Project 5 |
---|---|---|---|---|---|
Project title | 1 (1) | 1 (1) | 1 (1) | 3 (3) | 0 (0) |
Highway name | 14 (6) | 5 (3) | 3 (2) | 18 (8) | 73 (n/a) |
Highway name and km marker | 6 (6) | 4 (3) | 2 (2) | 9 (8) | n/a |
Based on this sample, it seems that an automated search for the full project title is not sufficient to discover all related contracting processes and that searching for the highway name and km marker results in the best coverage and accuracy. As such two manual steps are required:
We also noted that:
Suggesting that the list of projects in the CoST Ukraine portal may have been generated from Prozorro (e.g. by looking for tenders from the buyer with an item classification relating to construction).
Projects
Projects in the CoST Portal are not assigned an identifier. For the purposes of this analysis identifiers were constructed from the agency, region and foreign key extracted from the URL for the project, e.g.
portal.costukraine.org/UAD/POL/PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_TECH_DETAILS_handler&fk0=222*&master_viewmode=0
Contracting processes
Contracting processes in Prozorro are assigned an identifier (the ocid
) however identifiers are not provided for the project a contracting process is associated with.
Some projects in the CoST Ukraine portal include a single ocid
, which appears to be for the main construction contract associated with the project (check this).
The CoST Portal appears to list projects at the level of individual construction contracts.
When searching for related contracting processes we noted examples where there were several different construction contracts relating to a single highway, suggesting that there may be overarching projects of which the individual construction contracts form a part.
The OCDS Releases and Records model is designed to support disclosure of both individual updates about a contracting process (releases) and a summary of the latest state of the contracting process (records).
Currently, both the openprocurement API and the bulk OCDS export from Prozorro provide the latest state of the contracting process rather than a history of changes.
Prozorro functionality to publish individual releases to enable tracking change over the life of a process is in development.
There are two possible approaches to tracking change with the existing functionality, however there are obstacles to both of these:
Scrape the openprocurement API daily for specific contracting processes
Download and search historic bulk exports
We weren't able to determine whether the release identifier is updated when the contracting process changes in either the API or bulk exports, so in either case it may be neccessary to compare all fields in the release to determine whether anything has changed.
For each project:
contracts_df.pivot_table(values='startDate', index=['project', 'projectStartDate'], aggfunc=np.min)
contracts_df.pivot_table(values='endDate', index=['project', 'projectEndDate'], aggfunc=np.max)
For each project we were able to identify design, construction and monitoring contracts and use the data from Prozorro to understand the project timeline, e.g.
For some projects the total value of contracts in Prozorro was greater than the total cost in the CoST Portal and for some projects is was less:
pd.options.display.float_format = '{:,}'.format
contracts_df.pivot_table(values='amount', index=['project','projectValue'], aggfunc=np.sum)
Using the data from Prozorro we were able to understand the split of the total project cost across design, construction and monitoring contracts, e.g.
OCDS data can be used to:
The demonstrator suggests there is additional value in using OCDS data for infrastructure monitoring, for example, we found:
Navigate to the CoST Ukraine Portal
Note: Use Google Chrome with auto-translation enabled, unless you speak Ukrainian
Currently only projects of Ukravtodor, the State Highways Agency, are listed, so click the Ukravtodor logo.
Use the map to choose a region to see the projects list for, e.g. the Sumy region
Auto-translate doesn't work on the projects list page, so click "ТАБЛИЦІ" in the grey header bar to get a view which can be translated.
Note: You might need to open the "ТАБЛИЦІ" link in a new tab to get it to load
Choose the project which you want to scrape from the list, e.g. Reconstruction of the bridge crossing on the highway N-12 Sumy-Poltava km 70 + 838
Note: Use of Prozorro for above and below threshold procurement has only been mandatory since 1st August 2016, so try and find a project which starts after this date to give the best chance of finding related contracts
Set the following variables based on the URL of the project you selected:
#update with url segment for the public entity, e.g. UAD for State Highways Agency of Ukraine
publicEntity = "UAD"
#update with region for project, e.g. SUM for region, note for MFO projects the url construction is slightly different (see commented out html = line below)
region = "SUM"
#update with value of &fk0 parameter in URL of project you want to scrape - this identifies the project
foreignKey = "173"
#construct an identifier for saving data
projectID = publicEntity + "-" + region + "-" + foreignKey
from requests import get
from bs4 import BeautifulSoup
import pprint
import json
import ipywidgets as widgets
#function to scrape content of main table
def scrape(url,output):
#get html and convert to nice object
html = get(url,stream=True).content
html = BeautifulSoup(html, "html.parser")
#get name of section we are scraping and create an object for it
section = html.body["id"]
output[section] = {}
main_table = html.find("div", class_="well")
if main_table != None:
for td in main_table.select("td"):
if "data-column-name" in td.attrs:
output[section][td["data-column-name"]] = td.text
return output
#get html of first page and convert to nice object
html = get("http://portal.costukraine.org/uad_mfo/PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_TECH_DETAILS_handler&fk0=" + foreignKey + "&master_viewmode=0",stream=True).content #use this line for MFO projects
#html = get("http://89.185.0.248:8888/"+publicEntity+"/"+region+"/PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_TECH_DETAILS_handler&fk0=" + foreignKey + "&master_viewmode=0",stream=True).content #use this line for all other projects
html = BeautifulSoup(html, "html.parser")
#set up array to store urls for each view of project
urls = []
#get urls of each page
navigation = html.find("ul", class_="nav nav-tabs grid-details-tabs")
for li in navigation.select("li"):
urls.append(li.a["href"])
#put amendments URL in separate variable (no data found for this page yet, so we don't do anything with this)
amendmentsURL = urls.pop()
#set up object for scraped data
project = {}
#scrape summary table (appears on each page, so only do this once)
project["summary"] = {}
summary_table = html.find("div", class_="grid grid-table grid-master js-grid")
for th in summary_table.select("th"):
project["summary"][th["data-name"]] = ""
for td in summary_table.select("td"):
if "data-column-name" in td.attrs:
project["summary"][td["data-column-name"]] = td.text
#scrape main table on each page
for url in urls:
print("scraping " + url)
project = scrape("http://portal.costukraine.org/uad_mfo/" + url, project) #use this line for MFO projects
#project = scrape("http://89.185.0.248:8888/"+publicEntity+"/"+region+"/" + url, project) #use this line for all other projects
print("done scraping")
data = {}
data = {"project": project}
import copy
import os
from google.cloud import translate
#Set google cloud API credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './ocds-f08bd860a3be.json'
#create a copy of the project from the Ukraine portal to translate
data["project_translated"] = copy.deepcopy(project)
# Instantiate translation client
translate_client = translate.Client()
target = 'en'
#translate project
project_translated = data["project_translated"]
for section in project_translated:
for key in project_translated[section]:
if type(project_translated[section][key]) == str:
text = project_translated[section][key]
translation = translate_client.translate(text, target_language = target)
if text != translation:
project_translated[section][key] = translation['translatedText']
import os
import json
if not os.path.exists("data"):
os.makedirs("data")
with open("data/" + projectID + ".json","w") as export:
json.dump(data,export,indent=2)
import getpass
import psycopg2
dbpassword = getpass.getpass("Enter database user password: ") #see /.pgpass file on server
# db connection config
conn = psycopg2.connect(
database = 'ocdsdata',
user = 'ocdsdata',
password = dbpassword,
host = '195.201.163.242',
port = '5432',
)
# clear db user password
dbpassword = ''
# create db cursor
cur = conn.cursor()
import json
import copy
import os
from google.cloud import translate
from IPython.display import display
#Set google cloud API credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/ddewhurst/open_data_services/ocds/repositories/ocinfra.ocinfra_ukraine_demonstrator/ocds-f08bd860a3be.json'
# Instantiate translation client
translate_client = translate.Client()
target = 'en'
#open project data
with open("data/"+projectID+".json") as file:
data = json.load(file)
#check if related contracting processes already exist and confirm overwrite
if "tenders" in data:
overwrite = input("Tenders data already exists for this project, overwrite? (y/n) ")
if overwrite == "y":
del(data["tenders"])
print("\n")
if "tenders" not in data:
#display project name in UK and EN and choose search terms
print("Project Title:")
print(data["project"]["summary"]["name"],"\n")
print("Project Title (translated):")
print(data["project_translated"]["summary"]["name"],"\n")
#prompt for search terms
searchInput = input("Enter pipe (|) delimited list of search terms from Ukrainian project title: ")
searchTerms = searchInput.split("|")
#construct SQL query
searchString = """
SELECT
data
FROM
data
WHERE
data ->> 'ocid' LIKE 'ocds-be6bcu%'
"""
titleSearchString = """ AND (("""
descSearchString = """) OR ("""
for term in searchTerms:
titleSearchString = titleSearchString + "data -> 'tender' ->> 'title' LIKE '%" + term + "%'"
descSearchString = descSearchString + "data -> 'tender' ->> 'description' LIKE '%" + term + "%'"
if searchTerms.index(term) < len(searchTerms) - 1:
titleSearchString = titleSearchString + " AND "
descSearchString = descSearchString + " AND "
else:
descSearchString = descSearchString + "))"
searchString = searchString + titleSearchString + descSearchString
#search database and return results
print("\n","Searching...")
cur.execute("rollback")
cur.execute(searchString)
allTenders = []
relatedTenders = []
for result in cur.fetchall():
allTenders.append(result[0])
print("Found ",len(allTenders)," contracting process(es)\n")
#translate tender title & description
for tender in allTenders:
print("OCID: ",tender["ocid"])
if "title" in tender["tender"]:
text = tender["tender"]["title"]
translation = translate_client.translate(text, target_language = "en")
tender["tender"]["title_en"] = translation["translatedText"]
print("Title: ",tender["tender"]["title_en"])
if "description" in tender["tender"]:
text = tender["tender"]["description"]
translation = translate_client.translate(text, target_language = "en")
tender["tender"]["description_en"] = translation["translatedText"]
print("Description",tender["tender"]["description_en"],"\n")
#prompt to confirm tender is related and categorise as planning, contstruction or monitoring
if input("Is this tender related to the project? (y/n): ") == "y":
relatedTenders.append(tender)
tenderType = input("Is this a (p)lanning, (c)onstruction or (m)onitoring tender?: ")
tender["ocinfra"] = {}
if tenderType == "p":
tender["ocinfra"]["type"] = "planning"
elif tenderType == "c":
tender["ocinfra"]["type"] = "construction"
elif tenderType == "m":
tender["ocinfra"]["type"] = "monitoring"
else:
print("ERROR")
print("\n")
#save data
data["tenders"] = relatedTenders
with open("data/"+projectID+".json","w") as file:
json.dump(data,file,indent=2)
print("done")
import numpy as np
import pandas as pd
import json
import glob
paths = glob.glob("data/*.json")
contracts = []
for path in paths:
if "translated" not in path:
with open(path) as file:
data = json.load(file)
for tender in data["tenders"]:
if "contracts" in tender:
for contract in tender["contracts"]:
contracts.append({
"project": path.split("/")[-1].split(".")[-2],
"projectStartDate": data["project"]["summary"]["term_start"],
"projectEndDate": data["project"]["summary"]["term_end"],
"projectValue": data["project"]["summary"]["total_cost"],
"ocid": tender["ocid"],
"title_en": tender["tender"]["title_en"],
"type": tender["ocinfra"]["type"],
"contractID": contract["contractID"],
"status": contract["status"],
"startDate": contract["period"]["startDate"],
"endDate": contract["period"]["endDate"],
"amount": contract["value"]["amount"]
})
if contract["value"]["currency"] != "UAH":
print("Warning: Multiple currencies")
contracts_df = pd.DataFrame(contracts)
contracts_df = contracts_df[["project","projectStartDate","projectEndDate","projectValue","ocid","type","title_en","contractID","status","startDate","endDate","amount"]]
contracts_df['projectValue'] = contracts_df['projectValue'].str.replace(' ', '')
contracts_df['projectValue'] = pd.to_numeric(contracts_df['projectValue'])
contracts_df.pivot_table(values='startDate', index=['project', 'projectStartDate'], aggfunc=np.min)
contracts_df.pivot_table(values='endDate', index=['project', 'projectEndDate'], aggfunc=np.max)
import plotly
import plotly.figure_factory as ff
from copy import copy
plotly.offline.init_notebook_mode()
# generate gantt chart for each project
gantt_charts = {}
for project in contracts_df["project"].unique():
# create dataframe with correct structure for gantt chart
gantt_charts[project] = {}
gantt_df = contracts_df.rename(index=str, columns={"title_en":"Task", "startDate":"Start","endDate":"Finish"})
gantt_df = gantt_df.loc[gantt_df["project"] == project]
gantt_charts[project]["data"] = copy(gantt_df)
# set colours for contract types
colors= {}
if "planning" in gantt_df["type"].unique():
colors["planning"] = "#d6e100"
if "construction" in gantt_df["type"].unique():
colors["construction"] = "#fb6045"
if "monitoring" in gantt_df["type"].unique():
colors["monitoring"] = "#6c75e1"
gantt_df = gantt_df.sort_values(by="Start",ascending=False)
# create gantt chart
gantt_charts[project]["chart"] = ff.create_gantt(
gantt_df,
colors=colors,
index_col='type',
#reverse_colors=True,
show_colorbar=True,
showgrid_x=True,
showgrid_y=True,
title="Related Contracts"
)
# plot example gantt chart and save image for display in summary
plotly.offline.iplot(gantt_charts[projectID]["chart"], image="png", filename="relatedContracts")