Measure calculations with Pandas¶

This notebook demonstrates how to calculate measure ratios and percentiles, using the Methotrexate measure for CCGs as an example. It is straightforward to perform calculations for practices instead, and we will note how to do this below.

Calculations are performed using data in the hscic.normalised_prescribing_standard table in BigQuery.

Setup¶

We need pandas for the computation, and requests to validate the computation against the OpenPrescribing implementation, via the OpenPrescribing API.

In [1]:

import pandas as pd
import requests

Getting the data¶

We'll make two queries against BigQuery, one for the numerator values and the other for the denominator values.

Some notes:

The WHERE conditions on the bnf_code column come from the measure definition.
In order to match the OpenPrescibing implementation exactly, we need to restrict prescriptions to those prescribed by GP practices (setting = 4) in CCGs (org_type = 'CCG').
To calculate ratios and percentiles for practices instead of CCGs, replace pct with practice in the SELECT and GROUP BY clauses.
The CONCAT/CAST/EXTRACT/LPAD dance converts the dates stored in BigQuery to strings of the form YYYY_MM. This is not necessary, but makes the data easier to work with.
Here we're only retrieving data from 2018_06 onwards, but there is data going back to 2010_08.

In [2]:

project_name = 'ebmdatalab'

In [3]:

numerator_query = '''
SELECT
    pct,
    CONCAT(
        CAST(EXTRACT(YEAR FROM month) AS STRING),
        "_",
        LPAD(CAST(EXTRACT(MONTH FROM month) AS STRING), 2, "0")
    ) AS month,
    SUM(items) AS value
FROM
    hscic.normalised_prescribing_standard AS prescriptions
INNER JOIN hscic.practices
    ON prescriptions.practice = practices.code
INNER JOIN hscic.ccgs
    ON prescriptions.pct = ccgs.code
WHERE
    bnf_code LIKE '1001030U0%AC'
    AND setting = 4
    AND org_type = 'CCG'
    AND month >= TIMESTAMP('2018-06-01')
GROUP BY pct, month
'''

denominator_query = '''
SELECT
    pct,
    CONCAT(
        CAST(EXTRACT(YEAR FROM prescriptions.month) AS STRING),
        "_",
        LPAD(CAST(EXTRACT(MONTH FROM prescriptions.month) AS STRING), 2, "0")
    ) AS month,
    SUM(items) AS items,
    SUM(total_list_size) AS population,
    SUM(CAST(JSON_EXTRACT(star_pu, '$.oral_antibacterials_item') AS FLOAT64)) AS star_pu
FROM
    hscic.normalised_prescribing_standard AS prescriptions
INNER JOIN hscic.practices
    ON prescriptions.practice = practices.code
INNER JOIN hscic.practice_statistics_all_years stats
    ON prescriptions.practice = stats.practice
    AND prescriptions.month = stats.month
INNER JOIN hscic.ccgs
    ON prescriptions.pct = ccgs.code
WHERE
    (bnf_code LIKE '1001030U0%AB' OR bnf_code LIKE '1001030U0%AC')
    AND setting = 4
    AND org_type = 'CCG'
    AND prescriptions.month >= TIMESTAMP('2018-06-01')
GROUP BY pct, month
'''

In [4]:

numerators_raw = pd.read_gbq(numerator_query, project_name, dialect='standard')
numerators_raw.head()

Requesting query... ok.
Job ID: 3f017fdd-2a4a-4d93-976c-cc69211526ff
Query running...
Query done.
Processed: 38.5 GB Billed: 38.5 GB
Standard price: $0.19 USD

Retrieving results...
Got 1159 rows.

Total time taken 3.08 s.
Finished at 2019-02-25 12:22:11.

Out[4]:

	pct	month	value
0	06P	2018_08	3
1	10Q	2018_10	5
2	09E	2018_08	9
3	01K	2018_11	3
4	06V	2018_11	1

In [5]:

denominators_raw = pd.read_gbq(denominator_query, project_name, dialect='standard')
denominators_raw.head()

Requesting query... ok.
Job ID: 110b5c54-f744-4b56-8d90-40ed3ac50ded
Query running...
Query done.
Cache hit.

Retrieving results...
Got 1365 rows.

Total time taken 0.8 s.
Finished at 2019-02-25 12:22:13.

Out[5]:

	pct	month	items	population	star_pu
0	06D	2018_09	403	124796	75059.487761
1	01Y	2018_09	549	278827	157100.837373
2	04Q	2018_12	444	141740	84531.203842
3	05H	2018_10	380	177051	102371.390365
4	01J	2018_06	426	165352	92706.343987

Select desired denominator¶

In [9]:

# refer to https://docs.google.com/spreadsheets/d/1F7a92URkQgX244LPFvZxl6tEmWdJbVELm3R1BfKpspw/edit#gid=187146618

denominators_select = denominators_raw.copy()

denominators_select["value"] = denominators_select["items"] # or "population" or "star_pu"
    
denominators_select = denominators_select[["pct","month","value"]]
denominators_select.head()

Out[9]:

	pct	month	value
0	06D	2018_09	403
1	01Y	2018_09	549
2	04Q	2018_12	444
3	05H	2018_10	380
4	01J	2018_06	426

Reshaping the data¶

Querying BigQuery gives us a DataFrame with one row per CCG per month. Instead, we want a DataFrame with one row per CCG and one column per month.

We can achieve this with set_index() and unstack().

Some notes:

If there is no row for a given CCG and month in the raw table, then the corresponding value in the unstacked table will be NaN. We'll resolve this when calculating the ratios below.
To calculate ratios and percentiles for practices instead of CCGs, replace pct with practice.

In [10]:

numerators = numerators_raw.set_index(['pct', 'month']).unstack()['value']
numerators.head()

Out[10]:

month	2018_06	2018_07	2018_08	2018_09	2018_10	2018_11	2018_12
pct
00D	NaN	NaN	NaN	NaN	NaN	2.0	1.0
00J	NaN	NaN	NaN	NaN	1.0	NaN	NaN
00K	2.0	NaN	NaN	NaN	NaN	NaN	1.0
00L	1.0	1.0	1.0	3.0	1.0	3.0	3.0
00M	3.0	3.0	5.0	2.0	3.0	3.0	5.0

In [11]:

denominators = denominators_select.set_index(['pct', 'month']).unstack()['value']
denominators.head()

Out[11]:

month	2018_06	2018_07	2018_08	2018_09	2018_10	2018_11	2018_12
pct
00C	324	347	343	325	364	346	355
00D	856	869	883	829	897	828	851
00J	790	800	833	763	803	780	790
00K	544	586	551	531	549	549	532
00L	1032	1086	1122	1001	1098	1072	1078

Doing the calculations¶

With the data in the right form, we can now do the calculations.

Ratios¶

To match the OpenPrescribing implementation, if either the numerator or denominator is missing for a given CCG and month, we set the ratio to zero. This is what fillna() is doing.

In [12]:

ratios = (numerators / denominators).fillna(0)
ratios.head()

Out[12]:

month	2018_06	2018_07	2018_08	2018_09	2018_10	2018_11	2018_12
pct
00C	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
00D	0.000000	0.000000	0.000000	0.000000	0.000000	0.002415	0.001175
00J	0.000000	0.000000	0.000000	0.000000	0.001245	0.000000	0.000000
00K	0.003676	0.000000	0.000000	0.000000	0.000000	0.000000	0.001880
00L	0.000969	0.000921	0.000891	0.002997	0.000911	0.002799	0.002783

Percentile ranks¶

The simpler ratios.rank(method='min', pct=True) * 100 doesn't produce quite the same results as the OpenPrescribing implementation.

In [13]:

percentile_ranks = (ratios.rank(method='min') - 1) / (ratios.count() - 1) * 100
percentile_ranks.head()

Out[13]:

month	2018_06	2018_07	2018_08	2018_09	2018_10	2018_11	2018_12
pct
00C	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
00D	0.000000	0.000000	0.000000	0.000000	0.000000	25.773196	17.525773
00J	0.000000	0.000000	0.000000	0.000000	20.618557	0.000000	0.000000
00K	33.505155	0.000000	0.000000	0.000000	0.000000	0.000000	23.195876
00L	15.979381	18.556701	17.010309	27.835052	16.494845	27.835052	29.896907

Deciles¶

We use strings for the index keys, as they are easier to work with than floats.

In [14]:

deciles = pd.DataFrame(
    [ratios.quantile(i * 0.1) for i in range(11)],
    index=[str(i * 10) for i in range(11)]
)
deciles

Out[14]:

month	2018_06	2018_07	2018_08	2018_09	2018_10	2018_11	2018_12
0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
10	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
20	0.001537	0.001064	0.001243	0.001538	0.001215	0.001436	0.001590
30	0.003179	0.003416	0.002815	0.003464	0.002763	0.003132	0.002795
40	0.005211	0.004675	0.005813	0.005738	0.004805	0.005750	0.005322
50	0.010710	0.008696	0.009756	0.008830	0.009443	0.009195	0.009009
60	0.016117	0.014645	0.014059	0.013013	0.013431	0.013701	0.012082
70	0.022892	0.022638	0.020071	0.022348	0.023063	0.022093	0.020681
80	0.049535	0.052413	0.049464	0.046032	0.043519	0.046047	0.043762
90	0.104575	0.100952	0.097624	0.093164	0.090902	0.093508	0.088868
100	0.338286	0.287554	0.319460	0.310615	0.291085	0.307860	0.297619

Verifying the calculations¶

We can compare our calculations against the OpenPrescribing implementation by querying the OpenPrescribing API for all deciles, and for a handful of CCGs.

In [15]:

url = 'https://openprescribing.net/api/1.0/measure/'
params = {
    'format': 'json',
    'measure': 'methotrexate',
}
rsp = requests.get(url, params)

for record in rsp.json()['measures'][0]['data']:
    month = record['date'][:4] + '_' + record['date'][5:7]
    if month < '2018_06':
        continue
    for k in record['percentiles']['ccg']:
        if abs(record['percentiles']['ccg'][k] - deciles[month][k]) > 0.001:
            print('Decile', k, 'differs in month', month)

In [16]:

url = 'https://openprescribing.net/api/1.0/measure_by_ccg/'

for ccg_id in ratios.index.to_series().sample(4):
    params = {
        'format': 'json',
        'measure': 'methotrexate',
        'org': ccg_id,
    }
    rsp = requests.get(url, params)

    for record in rsp.json()['measures'][0]['data']:
        month = record['date'][:4] + '_' + record['date'][5:7]
        if month < '2018_06':
            continue
        if abs(record['calc_value'] - ratios[month][ccg_id]) > 0.001:
            print('Ratio differs for CCG', ccg_id, 'in month', month)
        if abs(record['percentile'] - percentile_ranks[month][ccg_id]) > 0.001:
            print('Percentile differs for CCG', ccg_id, 'in month', month)

Percentile differs for CCG 00R in month 2018_08

In [ ]:

month	2018_06	2018_07	2018_08	2018_09	2018_10	2018_11	2018_12
pct
00D	NaN	NaN	NaN	NaN	NaN	2.0	1.0
00J	NaN	NaN	NaN	NaN	1.0	NaN	NaN
00K	2.0	NaN	NaN	NaN	NaN	NaN	1.0
00L	1.0	1.0	1.0	3.0	1.0	3.0	3.0
00M	3.0	3.0	5.0	2.0	3.0	3.0	5.0

month	2018_06	2018_07	2018_08	2018_09	2018_10	2018_11	2018_12
pct
00D	NaN	NaN	NaN	NaN	NaN	2.0	1.0
00J	NaN	NaN	NaN	NaN	1.0	NaN	NaN
00K	2.0	NaN	NaN	NaN	NaN	NaN	1.0
00L	1.0	1.0	1.0	3.0	1.0	3.0	3.0
00M	3.0	3.0	5.0	2.0	3.0	3.0	5.0

month	2018_06	2018_07	2018_08	2018_09	2018_10	2018_11	2018_12
pct
00D	NaN	NaN	NaN	NaN	NaN	2.0	1.0
00J	NaN	NaN	NaN	NaN	1.0	NaN	NaN
00K	2.0	NaN	NaN	NaN	NaN	NaN	1.0
00L	1.0	1.0	1.0	3.0	1.0	3.0	3.0
00M	3.0	3.0	5.0	2.0	3.0	3.0	5.0