This notebook demonstrates how to calculate measure ratios and percentiles, using the Methotrexate measure for CCGs as an example. It is straightforward to perform calculations for practices instead, and we will note how to do this below.
Calculations are performed using data in the hscic.normalised_prescribing_standard
table in BigQuery.
We need pandas
for the computation, and requests
to validate the computation against the OpenPrescribing implementation, via the OpenPrescribing API.
import pandas as pd
import requests
We'll make two queries against BigQuery, one for the numerator values and the other for the denominator values.
Some notes:
WHERE
conditions on the bnf_code
column come from the measure definition.setting = 4
) in CCGs (org_type = 'CCG'
).pct
with practice
in the SELECT
and GROUP BY
clauses.CONCAT/CAST/EXTRACT/LPAD
dance converts the dates stored in BigQuery to strings of the form YYYY_MM
. This is not necessary, but makes the data easier to work with.project_name = 'ebmdatalab'
numerator_query = '''
SELECT
pct,
CONCAT(
CAST(EXTRACT(YEAR FROM month) AS STRING),
"_",
LPAD(CAST(EXTRACT(MONTH FROM month) AS STRING), 2, "0")
) AS month,
SUM(items) AS value
FROM
hscic.normalised_prescribing_standard AS prescriptions
INNER JOIN hscic.practices
ON prescriptions.practice = practices.code
INNER JOIN hscic.ccgs
ON prescriptions.pct = ccgs.code
WHERE
bnf_code LIKE '1001030U0%AC'
AND setting = 4
AND org_type = 'CCG'
AND month >= TIMESTAMP('2018-06-01')
GROUP BY pct, month
'''
denominator_query = '''
SELECT
pct,
CONCAT(
CAST(EXTRACT(YEAR FROM prescriptions.month) AS STRING),
"_",
LPAD(CAST(EXTRACT(MONTH FROM prescriptions.month) AS STRING), 2, "0")
) AS month,
SUM(items) AS items,
SUM(total_list_size) AS population,
SUM(CAST(JSON_EXTRACT(star_pu, '$.oral_antibacterials_item') AS FLOAT64)) AS star_pu
FROM
hscic.normalised_prescribing_standard AS prescriptions
INNER JOIN hscic.practices
ON prescriptions.practice = practices.code
INNER JOIN hscic.practice_statistics_all_years stats
ON prescriptions.practice = stats.practice
AND prescriptions.month = stats.month
INNER JOIN hscic.ccgs
ON prescriptions.pct = ccgs.code
WHERE
(bnf_code LIKE '1001030U0%AB' OR bnf_code LIKE '1001030U0%AC')
AND setting = 4
AND org_type = 'CCG'
AND prescriptions.month >= TIMESTAMP('2018-06-01')
GROUP BY pct, month
'''
numerators_raw = pd.read_gbq(numerator_query, project_name, dialect='standard')
numerators_raw.head()
Requesting query... ok. Job ID: 3f017fdd-2a4a-4d93-976c-cc69211526ff Query running... Query done. Processed: 38.5 GB Billed: 38.5 GB Standard price: $0.19 USD Retrieving results... Got 1159 rows. Total time taken 3.08 s. Finished at 2019-02-25 12:22:11.
pct | month | value | |
---|---|---|---|
0 | 06P | 2018_08 | 3 |
1 | 10Q | 2018_10 | 5 |
2 | 09E | 2018_08 | 9 |
3 | 01K | 2018_11 | 3 |
4 | 06V | 2018_11 | 1 |
denominators_raw = pd.read_gbq(denominator_query, project_name, dialect='standard')
denominators_raw.head()
Requesting query... ok. Job ID: 110b5c54-f744-4b56-8d90-40ed3ac50ded Query running... Query done. Cache hit. Retrieving results... Got 1365 rows. Total time taken 0.8 s. Finished at 2019-02-25 12:22:13.
pct | month | items | population | star_pu | |
---|---|---|---|---|---|
0 | 06D | 2018_09 | 403 | 124796 | 75059.487761 |
1 | 01Y | 2018_09 | 549 | 278827 | 157100.837373 |
2 | 04Q | 2018_12 | 444 | 141740 | 84531.203842 |
3 | 05H | 2018_10 | 380 | 177051 | 102371.390365 |
4 | 01J | 2018_06 | 426 | 165352 | 92706.343987 |
# refer to https://docs.google.com/spreadsheets/d/1F7a92URkQgX244LPFvZxl6tEmWdJbVELm3R1BfKpspw/edit#gid=187146618
denominators_select = denominators_raw.copy()
denominators_select["value"] = denominators_select["items"] # or "population" or "star_pu"
denominators_select = denominators_select[["pct","month","value"]]
denominators_select.head()
pct | month | value | |
---|---|---|---|
0 | 06D | 2018_09 | 403 |
1 | 01Y | 2018_09 | 549 |
2 | 04Q | 2018_12 | 444 |
3 | 05H | 2018_10 | 380 |
4 | 01J | 2018_06 | 426 |
Querying BigQuery gives us a DataFrame
with one row per CCG per month. Instead, we want a DataFrame
with one row per CCG and one column per month.
We can achieve this with set_index()
and unstack()
.
Some notes:
NaN
. We'll resolve this when calculating the ratios below.pct
with practice
.numerators = numerators_raw.set_index(['pct', 'month']).unstack()['value']
numerators.head()
month | 2018_06 | 2018_07 | 2018_08 | 2018_09 | 2018_10 | 2018_11 | 2018_12 |
---|---|---|---|---|---|---|---|
pct | |||||||
00D | NaN | NaN | NaN | NaN | NaN | 2.0 | 1.0 |
00J | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
00K | 2.0 | NaN | NaN | NaN | NaN | NaN | 1.0 |
00L | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 3.0 | 3.0 |
00M | 3.0 | 3.0 | 5.0 | 2.0 | 3.0 | 3.0 | 5.0 |
denominators = denominators_select.set_index(['pct', 'month']).unstack()['value']
denominators.head()
month | 2018_06 | 2018_07 | 2018_08 | 2018_09 | 2018_10 | 2018_11 | 2018_12 |
---|---|---|---|---|---|---|---|
pct | |||||||
00C | 324 | 347 | 343 | 325 | 364 | 346 | 355 |
00D | 856 | 869 | 883 | 829 | 897 | 828 | 851 |
00J | 790 | 800 | 833 | 763 | 803 | 780 | 790 |
00K | 544 | 586 | 551 | 531 | 549 | 549 | 532 |
00L | 1032 | 1086 | 1122 | 1001 | 1098 | 1072 | 1078 |
With the data in the right form, we can now do the calculations.
To match the OpenPrescribing implementation, if either the numerator or denominator is missing for a given CCG and month, we set the ratio to zero. This is what fillna()
is doing.
ratios = (numerators / denominators).fillna(0)
ratios.head()
month | 2018_06 | 2018_07 | 2018_08 | 2018_09 | 2018_10 | 2018_11 | 2018_12 |
---|---|---|---|---|---|---|---|
pct | |||||||
00C | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
00D | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.002415 | 0.001175 |
00J | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001245 | 0.000000 | 0.000000 |
00K | 0.003676 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001880 |
00L | 0.000969 | 0.000921 | 0.000891 | 0.002997 | 0.000911 | 0.002799 | 0.002783 |
The simpler ratios.rank(method='min', pct=True) * 100
doesn't produce quite the same results as the OpenPrescribing implementation.
percentile_ranks = (ratios.rank(method='min') - 1) / (ratios.count() - 1) * 100
percentile_ranks.head()
month | 2018_06 | 2018_07 | 2018_08 | 2018_09 | 2018_10 | 2018_11 | 2018_12 |
---|---|---|---|---|---|---|---|
pct | |||||||
00C | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
00D | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 25.773196 | 17.525773 |
00J | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 20.618557 | 0.000000 | 0.000000 |
00K | 33.505155 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 23.195876 |
00L | 15.979381 | 18.556701 | 17.010309 | 27.835052 | 16.494845 | 27.835052 | 29.896907 |
We use strings for the index keys, as they are easier to work with than floats.
deciles = pd.DataFrame(
[ratios.quantile(i * 0.1) for i in range(11)],
index=[str(i * 10) for i in range(11)]
)
deciles
month | 2018_06 | 2018_07 | 2018_08 | 2018_09 | 2018_10 | 2018_11 | 2018_12 |
---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
10 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
20 | 0.001537 | 0.001064 | 0.001243 | 0.001538 | 0.001215 | 0.001436 | 0.001590 |
30 | 0.003179 | 0.003416 | 0.002815 | 0.003464 | 0.002763 | 0.003132 | 0.002795 |
40 | 0.005211 | 0.004675 | 0.005813 | 0.005738 | 0.004805 | 0.005750 | 0.005322 |
50 | 0.010710 | 0.008696 | 0.009756 | 0.008830 | 0.009443 | 0.009195 | 0.009009 |
60 | 0.016117 | 0.014645 | 0.014059 | 0.013013 | 0.013431 | 0.013701 | 0.012082 |
70 | 0.022892 | 0.022638 | 0.020071 | 0.022348 | 0.023063 | 0.022093 | 0.020681 |
80 | 0.049535 | 0.052413 | 0.049464 | 0.046032 | 0.043519 | 0.046047 | 0.043762 |
90 | 0.104575 | 0.100952 | 0.097624 | 0.093164 | 0.090902 | 0.093508 | 0.088868 |
100 | 0.338286 | 0.287554 | 0.319460 | 0.310615 | 0.291085 | 0.307860 | 0.297619 |
We can compare our calculations against the OpenPrescribing implementation by querying the OpenPrescribing API for all deciles, and for a handful of CCGs.
url = 'https://openprescribing.net/api/1.0/measure/'
params = {
'format': 'json',
'measure': 'methotrexate',
}
rsp = requests.get(url, params)
for record in rsp.json()['measures'][0]['data']:
month = record['date'][:4] + '_' + record['date'][5:7]
if month < '2018_06':
continue
for k in record['percentiles']['ccg']:
if abs(record['percentiles']['ccg'][k] - deciles[month][k]) > 0.001:
print('Decile', k, 'differs in month', month)
url = 'https://openprescribing.net/api/1.0/measure_by_ccg/'
for ccg_id in ratios.index.to_series().sample(4):
params = {
'format': 'json',
'measure': 'methotrexate',
'org': ccg_id,
}
rsp = requests.get(url, params)
for record in rsp.json()['measures'][0]['data']:
month = record['date'][:4] + '_' + record['date'][5:7]
if month < '2018_06':
continue
if abs(record['calc_value'] - ratios[month][ccg_id]) > 0.001:
print('Ratio differs for CCG', ccg_id, 'in month', month)
if abs(record['percentile'] - percentile_ranks[month][ccg_id]) > 0.001:
print('Percentile differs for CCG', ccg_id, 'in month', month)
Percentile differs for CCG 00R in month 2018_08