CoST OCDS ODS
In [27]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Out[27]:
The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

OCDS for Infrastructure - Ukraine demonstrator

This notebook demonstrates the steps involved in developing a lightweight prototype to bring together structured data on infrastructure projects and contracting processes and explores the added value of using OCDS data in infrastructure monitoring.

The notebook is divided into the following sections:

  • Summary of findings
  • Part 1: Scraping projects data from the CoST Ukraine portal
  • Part 2: Finding related contracting processes in the Prozorro OCDS export
  • Part 3: Comparing data from CoST and Prozorro

Summary of findings

Availability of data

Projects

An export of structured data is not available from the front end of the CoST Ukraine portal, therefore a scraper script was written to extract projects data and store this in structured format.

Currently only data from the State Highways Agency is available on the CoST Ukraine portal.

Contracting processes

Structured data is available from the Prozorro platform via the following methods:

The most recent bulk export is dated 2018-04-20.

Currently the only public search interface available to the contracting data is in the Prozorro front end and there is no search API or interface to the bulk OCDS data to provide programmatic search of the data.

To reduce the number of manual steps in searching for and retrieving OCDS data from Prozorro the latest fortnightly bulk export was downloaded into an instance of the ocdsdata ETL tool so that the full dataset could be queried using PostgreSQL.

Using a sample of project titles from the CoST Ukraine portal and searching the tender/title and tender/description fields in the OCDS yielded the following results (number of relevant results shown in brackets):

Search term Project 1 Project 2 Project 3 Project 4 Project 5
Project title 1 (1) 1 (1) 1 (1) 3 (3) 0 (0)
Highway name 14 (6) 5 (3) 3 (2) 18 (8) 73 (n/a)
Highway name and km marker 6 (6) 4 (3) 2 (2) 9 (8) n/a

Based on this sample, it seems that an automated search for the full project title is not sufficient to discover all related contracting processes and that searching for the highway name and km marker results in the best coverage and accuracy. As such two manual steps are required:

  1. Extract highway name and km marker from project title to use as search term (per project)
  2. Review tender title and description to determine whether the contracting process is related (per search result)

We also noted that:

  • For 3 of the 5 projects, a Prozorro "announcement number" (ocid) was recorded in the CoST Portal
  • For 4 of the 5 projects, the project title in the CoST Portal was an exact match for a construction contract in Prozorro

Suggesting that the list of projects in the CoST Ukraine portal may have been generated from Prozorro (e.g. by looking for tenders from the buyer with an item classification relating to construction).

Identifiers

Projects

Projects in the CoST Portal are not assigned an identifier. For the purposes of this analysis identifiers were constructed from the agency, region and foreign key extracted from the URL for the project, e.g.

portal.costukraine.org/UAD/POL/PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_TECH_DETAILS_handler&fk0=222*&master_viewmode=0

Contracting processes

Contracting processes in Prozorro are assigned an identifier (the ocid) however identifiers are not provided for the project a contracting process is associated with.

Some projects in the CoST Ukraine portal include a single ocid, which appears to be for the main construction contract associated with the project (check this).

Defining a project

The CoST Portal appears to list projects at the level of individual construction contracts.

When searching for related contracting processes we noted examples where there were several different construction contracts relating to a single highway, suggesting that there may be overarching projects of which the individual construction contracts form a part.

Tracking change

The OCDS Releases and Records model is designed to support disclosure of both individual updates about a contracting process (releases) and a summary of the latest state of the contracting process (records).

Currently, both the openprocurement API and the bulk OCDS export from Prozorro provide the latest state of the contracting process rather than a history of changes.

Prozorro functionality to publish individual releases to enable tracking change over the life of a process is in development.

There are two possible approaches to tracking change with the existing functionality, however there are obstacles to both of these:

Scrape the openprocurement API daily for specific contracting processes

  • No search API is available, so requires first using either the Prozorro front end or ocdsdata ETL to find the relevant contracting processes.
  • Results in non OCDS data.

Download and search historic bulk exports

  • Contracting processes do not neccessarily appear in the same segment across exports, so the full export for each fortnight must be downloaded (c. 46GB per export)
  • Exports are fortnightly so changes which occur more frequently than this may be missed

We weren't able to determine whether the release identifier is updated when the contracting process changes in either the API or bulk exports, so in either case it may be neccessary to compare all fields in the release to determine whether anything has changed.

Comparing project and contracting data

Comparing start and end dates

For each project:

  • The end date listed in the CoST portal for the project matched the latest end date of the related contracts found in Prozorro.
  • There was a related contract in Prozorro with a start date earlier than the start date listed in the project.
In [4]:
contracts_df.pivot_table(values='startDate', index=['project', 'projectStartDate'], aggfunc=np.min)
Out[4]:
startDate
project projectStartDate
UAD-DNE-40 18.04.2017 2017-02-09T00:00:00+02:00
UAD-DNE-44 21.07.2017 2017-05-11T00:00:00+03:00
UAD-POL-222 16.06.2017 2017-06-08T00:00:00+03:00
UAD-SUM-173 12.10.2017 2016-09-21T18:00:00+00:00
In [5]:
contracts_df.pivot_table(values='endDate', index=['project', 'projectEndDate'], aggfunc=np.max)
Out[5]:
endDate
project projectEndDate
UAD-DNE-40 31.12.2018 2018-12-31T00:00:00+02:00
UAD-DNE-44 31.12.2017 2017-12-31T00:00:00+02:00
UAD-POL-222 31.12.2018 2018-12-31T00:00:00+02:00
UAD-SUM-173 31.12.2018 2018-12-31T00:00:00+02:00

Understanding the project timeline

For each project we were able to identify design, construction and monitoring contracts and use the data from Prozorro to understand the project timeline, e.g.

Related Contracts

Comparing project value

For some projects the total value of contracts in Prozorro was greater than the total cost in the CoST Portal and for some projects is was less:

In [8]:
pd.options.display.float_format = '{:,}'.format
contracts_df.pivot_table(values='amount', index=['project','projectValue'], aggfunc=np.sum)
Out[8]:
amount
project projectValue
UAD-DNE-40 968,055,500.0 957,555,315.79
UAD-DNE-44 203,382,619.0 204,569,982.24
UAD-POL-222 575,109,925.0 576,803,727.92
UAD-SUM-173 99,202,651.2 102,590,531.98

Value Differences

Understanding the split of project costs

Using the data from Prozorro we were able to understand the split of the total project cost across design, construction and monitoring contracts, e.g.

Cost Split

Conclusions

OCDS data can be used to:

  • Discover contracts related to a project
  • Understand the project timeline
  • Understand the total cost of a project

The demonstrator suggests there is additional value in using OCDS data for infrastructure monitoring, for example, we found:

  • Multiple design contracts for a single project, including intial design contracts for 'new construction' followed by further design contracts for 'reconstruction'
  • Variances in the project values reported in the CoST portal and the total value of contracts in Prozorro
  • Variances in the split of the total project value between design, construction and monitoring contracts

Part 1: Scraping data from CoST Ukraine portal

Navigate to the CoST Ukraine Portal

Note: Use Google Chrome with auto-translation enabled, unless you speak Ukrainian

Currently only projects of Ukravtodor, the State Highways Agency, are listed, so click the Ukravtodor logo.

Use the map to choose a region to see the projects list for, e.g. the Sumy region

Auto-translate doesn't work on the projects list page, so click "ТАБЛИЦІ" in the grey header bar to get a view which can be translated.

Note: You might need to open the "ТАБЛИЦІ" link in a new tab to get it to load

Choose the project which you want to scrape from the list, e.g. Reconstruction of the bridge crossing on the highway N-12 Sumy-Poltava km 70 + 838

Note: Use of Prozorro for above and below threshold procurement has only been mandatory since 1st August 2016, so try and find a project which starts after this date to give the best chance of finding related contracts

Set the following variables based on the URL of the project you selected:

In [7]:
#update with url segment for the public entity, e.g. UAD for State Highways Agency of Ukraine
publicEntity = "UAD"

#update with region for project, e.g. SUM for  region, note for MFO projects the url construction is slightly different (see commented out html = line below)
region = "SUM"

#update with value of &fk0 parameter in URL of project you want to scrape - this identifies the project
foreignKey = "173"

#construct an identifier for saving data
projectID = publicEntity + "-" + region + "-" + foreignKey

Scrape data from CoST Portal

In [46]:
from requests import get
from bs4 import BeautifulSoup
import pprint
import json
import ipywidgets as widgets

#function to scrape content of main table
def scrape(url,output):
    
    #get html and convert to nice object
    html = get(url,stream=True).content
    html = BeautifulSoup(html, "html.parser")
    
    #get name of section we are scraping and create an object for it
    section = html.body["id"]
    output[section] = {} 
    
    main_table = html.find("div", class_="well")
    
    if main_table != None:
        for td in main_table.select("td"):
            if "data-column-name" in td.attrs:
                output[section][td["data-column-name"]] = td.text
    
    return output

#get html of first page and convert to nice object
html = get("http://portal.costukraine.org/uad_mfo/PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_TECH_DETAILS_handler&fk0=" + foreignKey + "&master_viewmode=0",stream=True).content #use this line for MFO projects
#html = get("http://89.185.0.248:8888/"+publicEntity+"/"+region+"/PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_TECH_DETAILS_handler&fk0=" + foreignKey + "&master_viewmode=0",stream=True).content #use this line for all other projects
html = BeautifulSoup(html, "html.parser")

#set up array to store urls for each view of project
urls = []

#get urls of each page
navigation = html.find("ul", class_="nav nav-tabs grid-details-tabs")
for li in navigation.select("li"):
    urls.append(li.a["href"])
    
#put amendments URL in separate variable (no data found for this page yet, so we don't do anything with this)
amendmentsURL = urls.pop()

#set up object for scraped data
project = {}

#scrape summary table (appears on each page, so only do this once)
project["summary"] = {}

summary_table = html.find("div", class_="grid grid-table grid-master js-grid")

for th in summary_table.select("th"):
        project["summary"][th["data-name"]] = ""
        
for td in summary_table.select("td"):
    if "data-column-name" in td.attrs:
        project["summary"][td["data-column-name"]] = td.text

#scrape main table on each page
for url in urls:
    print("scraping " + url)
    project = scrape("http://portal.costukraine.org/uad_mfo/" + url, project) #use this line for MFO projects
    #project = scrape("http://89.185.0.248:8888/"+publicEntity+"/"+region+"/" + url, project) #use this line for all other projects

print("done scraping")

data = {}
data = {"project": project}
scraping PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_TECH_DETAILS_handler&fk0=3&master_viewmode=0
scraping PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_SUBJECTS_handler&fk0=3&master_viewmode=0
scraping PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_CUSTOMER_handler&fk0=3&master_viewmode=0
scraping PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_FINANCING_handler&fk0=3&master_viewmode=0
scraping PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_PROJECT_ORGANIZATION_handler&fk0=3&master_viewmode=0
scraping PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_CONTRACTOR_handler&fk0=3&master_viewmode=0
scraping PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_ENG_SUPERVIZORY_handler&fk0=3&master_viewmode=0
scraping PROJECTS.php?hname=dbo_PROJECTS_dbo_PROJECTS_TECH_SUPERVIZORY_handler&fk0=3&master_viewmode=0
done scraping

Translate scraped data

In [47]:
import copy
import os
from google.cloud import translate

#Set google cloud API credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './ocds-f08bd860a3be.json'

#create a copy of the project from the Ukraine portal to translate
data["project_translated"] = copy.deepcopy(project)

# Instantiate translation client
translate_client = translate.Client()
target = 'en'

#translate project
project_translated = data["project_translated"]
for section in project_translated:
    for key in project_translated[section]:
        if type(project_translated[section][key]) == str:
            text = project_translated[section][key]
            translation = translate_client.translate(text, target_language = target)
            if text != translation:
                project_translated[section][key] = translation['translatedText']

Save data

In [48]:
import os
import json

if not os.path.exists("data"):
    os.makedirs("data")

with open("data/" + projectID + ".json","w") as export:
    json.dump(data,export,indent=2)    

Connect to ocdsdata instance

In [25]:
import getpass
import psycopg2

dbpassword = getpass.getpass("Enter database user password: ") #see /.pgpass file on server

# db connection config
conn = psycopg2.connect(
    database = 'ocdsdata',
    user = 'ocdsdata',
    password = dbpassword,
    host = '195.201.163.242',
    port = '5432',
)

# clear db user password
dbpassword = ''

# create db cursor
cur = conn.cursor()
/home/ddewhurst/open_data_services/ocds/repositories/ocinfra.ocinfra_ukraine_demonstrator/.ve/lib/python3.5/site-packages/psycopg2/__init__.py:144: UserWarning:

The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.

Enter database user password: ········
In [52]:
import json
import copy
import os
from google.cloud import translate
from IPython.display import display

#Set google cloud API credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/ddewhurst/open_data_services/ocds/repositories/ocinfra.ocinfra_ukraine_demonstrator/ocds-f08bd860a3be.json'

# Instantiate translation client
translate_client = translate.Client()
target = 'en'

#open project data
with open("data/"+projectID+".json") as file:
    data = json.load(file)

#check if related contracting processes already exist and confirm overwrite
if "tenders" in data:
    overwrite = input("Tenders data already exists for this project, overwrite? (y/n) ")
    if overwrite == "y":
            del(data["tenders"])
    print("\n")
    

if "tenders" not in data:
    
    #display project name in UK and EN and choose search terms
    print("Project Title:")
    print(data["project"]["summary"]["name"],"\n")
    print("Project Title (translated):")
    print(data["project_translated"]["summary"]["name"],"\n")
    
    #prompt for search terms
    searchInput = input("Enter pipe (|) delimited list of search terms from Ukrainian project title: ")
    searchTerms = searchInput.split("|")
    
    #construct SQL query
    searchString = """
        SELECT
            data
        FROM
            data
        WHERE
            data ->> 'ocid' LIKE 'ocds-be6bcu%'
    """
    
    titleSearchString = """ AND (("""
    descSearchString = """) OR ("""
    
    for term in searchTerms:
        titleSearchString = titleSearchString + "data -> 'tender' ->> 'title' LIKE '%" + term + "%'"
        descSearchString = descSearchString + "data -> 'tender' ->> 'description' LIKE '%" + term + "%'"
        if searchTerms.index(term) < len(searchTerms) - 1:
            titleSearchString = titleSearchString + " AND "
            descSearchString = descSearchString + " AND "
        else:
            descSearchString = descSearchString + "))"

    searchString = searchString + titleSearchString + descSearchString

    #search database and return results
    print("\n","Searching...")
    cur.execute("rollback")
    cur.execute(searchString)
    
    allTenders = []
    relatedTenders = []
    
    for result in cur.fetchall():
        allTenders.append(result[0])
    
    print("Found ",len(allTenders)," contracting process(es)\n")
    
    #translate tender title & description
    for tender in allTenders:
        
        print("OCID: ",tender["ocid"])
        
        if "title" in tender["tender"]:
            text = tender["tender"]["title"]
            translation = translate_client.translate(text, target_language = "en")
            tender["tender"]["title_en"] = translation["translatedText"]
            print("Title: ",tender["tender"]["title_en"])
        
        if "description" in tender["tender"]:
            text = tender["tender"]["description"]
            translation = translate_client.translate(text, target_language = "en")
            tender["tender"]["description_en"] = translation["translatedText"]
            print("Description",tender["tender"]["description_en"],"\n")
        
        #prompt to confirm tender is related and categorise as planning, contstruction or monitoring
        if input("Is this tender related to the project? (y/n): ") == "y":
            relatedTenders.append(tender)
            tenderType = input("Is this a (p)lanning, (c)onstruction or (m)onitoring tender?: ")
            tender["ocinfra"] = {}
            if tenderType == "p":
                tender["ocinfra"]["type"] = "planning"
            elif tenderType == "c":
                tender["ocinfra"]["type"] = "construction"
            elif tenderType == "m":
                tender["ocinfra"]["type"] = "monitoring"
            else:
                print("ERROR")
        print("\n")
    
    #save data
    data["tenders"] = relatedTenders
    
    with open("data/"+projectID+".json","w") as file:
        json.dump(data,file,indent=2)
    
    print("done")
    
Tenders data already exists for this project, overwrite? (y/n) y


Project Title:
Будівництво обходів трьох населених пунктів на автомобільній дорозі М-03 Київ-Харків-Довжанський: с.Покровська Багачка, с.Красногорівка, м.Полтава (Ічерга). 

Project Title (translated):
Construction of detours of three settlements on the motorway M-03 Kiev-Kharkiv-Dovzhansky: p.Pokrovskaya Bagachka, s.Krasnogorovka, Poltava (Icherga). 

Enter pipe (|) delimited list of search terms from Ukrainian project title: М-03 Київ-Харків-Довжанський|Будівництво обходів

 Searching...
Found  0  contracting process(es)

done

Part 3: Compare data in Prozorro to data in CoST Ukraine Portal

Load data on contracts

In [3]:
import numpy as np
import pandas as pd
import json
import glob

paths = glob.glob("data/*.json")

contracts = []

for path in paths:
    if "translated" not in path:
        with open(path) as file:
            data = json.load(file)

        for tender in data["tenders"]:
            if "contracts" in tender:
                for contract in tender["contracts"]:
                    contracts.append({
                        "project": path.split("/")[-1].split(".")[-2],
                        "projectStartDate": data["project"]["summary"]["term_start"],
                        "projectEndDate": data["project"]["summary"]["term_end"],
                        "projectValue": data["project"]["summary"]["total_cost"],
                        "ocid": tender["ocid"],
                        "title_en": tender["tender"]["title_en"],
                        "type": tender["ocinfra"]["type"],
                        "contractID": contract["contractID"],
                        "status": contract["status"],
                        "startDate": contract["period"]["startDate"],
                        "endDate": contract["period"]["endDate"],
                        "amount": contract["value"]["amount"]
                    })
                    if contract["value"]["currency"] != "UAH":
                        print("Warning: Multiple currencies")

contracts_df = pd.DataFrame(contracts)

contracts_df = contracts_df[["project","projectStartDate","projectEndDate","projectValue","ocid","type","title_en","contractID","status","startDate","endDate","amount"]]

contracts_df['projectValue'] = contracts_df['projectValue'].str.replace(' ', '')
contracts_df['projectValue'] = pd.to_numeric(contracts_df['projectValue'])

Compare start dates

In [3]:
contracts_df.pivot_table(values='startDate', index=['project', 'projectStartDate'], aggfunc=np.min)
Out[3]:
startDate
project projectStartDate
UAD-DNE-40 18.04.2017 2017-02-09T00:00:00+02:00
UAD-DNE-44 21.07.2017 2017-05-11T00:00:00+03:00
UAD-POL-222 16.06.2017 2017-06-08T00:00:00+03:00
UAD-POL-222_translated 16.06.2017 2017-06-08T00:00:00+03:00
UAD-SUM-173 12.10.2017 2016-09-21T18:00:00+00:00

Compare end dates

In [4]:
contracts_df.pivot_table(values='endDate', index=['project', 'projectEndDate'], aggfunc=np.max)
Out[4]:
endDate
project projectEndDate
UAD-DNE-40 31.12.2018 2018-12-31T00:00:00+02:00
UAD-DNE-44 31.12.2017 2017-12-31T00:00:00+02:00
UAD-POL-222 31.12.2018 2018-12-31T00:00:00+02:00
UAD-POL-222_translated 31.12.2018 2018-12-31T00:00:00+02:00
UAD-SUM-173 31.12.2018 2018-12-31T00:00:00+02:00

Visualise start and end dates

In [19]:
import plotly
import plotly.figure_factory as ff
from copy import copy

plotly.offline.init_notebook_mode()

# generate gantt chart for each project
gantt_charts = {}

for project in contracts_df["project"].unique():
    
    # create dataframe with correct structure for gantt chart
    gantt_charts[project] = {}
    gantt_df = contracts_df.rename(index=str, columns={"title_en":"Task", "startDate":"Start","endDate":"Finish"})
    gantt_df = gantt_df.loc[gantt_df["project"] == project]
    gantt_charts[project]["data"] = copy(gantt_df)
    
    # set colours for contract types
    colors= {}
    
    if "planning" in gantt_df["type"].unique():
        colors["planning"] = "#d6e100"
    if "construction" in gantt_df["type"].unique():
        colors["construction"] = "#fb6045"
    if "monitoring" in gantt_df["type"].unique():
        colors["monitoring"] = "#6c75e1"
        
    gantt_df = gantt_df.sort_values(by="Start",ascending=False)
    
    # create gantt chart
    gantt_charts[project]["chart"] = ff.create_gantt(
                                            gantt_df,
                                            colors=colors,
                                            index_col='type',
                                            #reverse_colors=True,
                                            show_colorbar=True,
                                            showgrid_x=True,
                                            showgrid_y=True,
                                            title="Related Contracts"
                                            )
# plot example gantt chart and save image for display in summary
plotly.offline.iplot(gantt_charts[projectID]["chart"], image="png", filename="relatedContracts")