In June 2018, Michigan approved the nation's toughest lead regulations. The state's law came in the wake the Flint lead crisis and set a strict timeline for lead service line removals, requires utilities to pay for service line removals, and increases public education requirements on lead poisoning.
The new law also changed how utilities with lead service lines are required to sample and test homes for lead. The federal Lead and Copper Rule requires utilities to test the first liter of water drawn from the tap, a requirement that likely misses the highest levels of lead in a home. The Michigan Lead and Copper Rule requires two samples, the first and the fifth liter. Utilities with lead service lines are required to test both and use the higher of the two in calculating their 90th-percentile lead level.
Last year, 2019, was the first year that utilities were required to use first and fifth liter sampling. APM Reports requested 10 years of water system 90th-percentile lead levels and four years of individual sample data and found that the inclusion of just one extra liter more than doubled, on average, the amount of lead utilities reported to the state and EPA.
Water utilities are required to keep the amount of lead in their system below a certain amount, called the action level. When a utility exceeds the action level, they are required to take a series of steps related to public health and lead mitigation. Both Michigan's and the Federal lead action level is 15 ppb, though Michigan is lowering their action level to 12 ppb in 2025.
The 90th-percentile lead level is what states and the Federal Government use to determine compliance with the Lead and Copper Rule. The 90th-percentile means that 10% of samples are above the 90th-percentile value and 90% of samples are below. So if the 90th-percentile value is 10 ppb, that means that 10% of lead samples were above 10 ppb.
APM Reports requested the 90th-percentile lead levels for every water system in Michigan from 2010-2019 in order to see how the state's reported lead levels changed after they updated their lead regulations. We also requested the samples used to calculate each system's lead levels, but only received sampling data for 2016-2019.
import os
import numpy as np
import pandas as pd
import altair as alt
alt.data_transformers.disable_max_rows()
data_dir = os.path.join(os.getcwd(), 'data/source')
data_out = os.path.join(os.getcwd(), 'data/processed')
mi_data = os.path.join(data_dir, 'mi_lead_data')
# This dataset identifies the utilities that included a 5th liter in 2019.
columns_2019 = {
'Public Water\nSupply ID': 'id',
'System Name': 'system_name',
'County': 'county',
'Population': 'system_population',
'Last Monitoring\nPeriod End': 'last_monitoring_period_end',
'Lead 90th\nPercentile (ppb)': '90th_percentile',
'Includes\n5th liter?': 'fifth_liter_included',
'Sampling Next Due\n(subject to change)': 'next_sample_date'
}
water_system_lead_tests_csv = os.path.join(mi_data, 'Public_Water_Supply_90th_Percentiles.csv')
water_system_lead_tests = pd.read_csv(water_system_lead_tests_csv)
water_system_lead_tests = water_system_lead_tests.rename(columns=columns_2019)
water_system_lead_tests.head()
id | system_name | county | system_population | last_monitoring_period_end | 90th_percentile | fifth_liter_included | next_sample_date | |
---|---|---|---|---|---|---|---|---|
0 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | GRAND TRAVERSE | 128 | 12/31/2018 | 0 | N | 09/30/2020 |
1 | MI0000012 | ADA TOWNSHIP | KENT | 6523 | 12/31/2019 | 0 | N | 09/30/2020 |
2 | MI0000040 | ADRIAN | LENAWEE | 21133 | 12/31/2017 | 0 | N | 09/30/2020 |
3 | MI0000072 | AKRON TOWNSHIP | TUSCOLA | 219 | 06/30/2019 | 0 | N | 12/31/2019 |
4 | MI0000082 | ALABASTER TOWNSHIP | IOSCO | 86 | 06/30/2019 | 0 | N | 09/30/2022 |
# This helps us identify every system that included a 5th liter in 2019.
def system_has_5th_liter(df):
fifth_liter_ids = water_system_lead_tests[water_system_lead_tests['fifth_liter_included'] == 'Y'].id.to_list()
system_id = df['id']
sample_year = df['monitoring_period_year']
if (system_id in fifth_liter_ids) and (sample_year == 2019):
return 1
else:
return 0
columns = {
'PWS ID': 'id',
'System Name': 'system_name',
'County': 'county',
'Retail Population': 'system_population',
'Prd End': 'last_monitoring_period_end',
'Lead 90th\nPercentile\n(mg/L or ppm)': '90th_percentile_ppm',
'Lead 90th\nPercentile\n(ug/L or ppb)': '90th_percentile'
}
lead_tests_all_years_xlsx = os.path.join(mi_data, 'Pb90_2010-2019.xlsx')
lead_tests_all_years = pd.read_excel(lead_tests_all_years_xlsx)
lead_tests_all_years = lead_tests_all_years.rename(columns=columns)
#The ID column has a ton of trailing space
lead_tests_all_years['id'] = lead_tests_all_years.apply(lambda x: x['id'].strip(), axis=1)
lead_tests_all_years['monitoring_period_year'] = pd.to_datetime(lead_tests_all_years['last_monitoring_period_end']).dt.year
lead_tests_all_years['fifth_liter_present'] = lead_tests_all_years.apply(system_has_5th_liter, axis=1)
lead_tests_all_years.head()
id | system_name | county | system_population | last_monitoring_period_end | 90th_percentile_ppm | 90th_percentile | monitoring_period_year | fifth_liter_present | |
---|---|---|---|---|---|---|---|---|---|
0 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | GRAND TRAVERSE | 128 | 2011-12-31 | 0.00 | 0 | 2011 | 0 |
1 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | GRAND TRAVERSE | 128 | 2014-12-31 | 0.01 | 10 | 2014 | 0 |
2 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | GRAND TRAVERSE | 128 | 2018-12-31 | 0.00 | 0 | 2018 | 0 |
3 | MI0000012 | ADA TOWNSHIP | KENT | 6523 | 2010-12-31 | 0.00 | 0 | 2010 | 0 |
4 | MI0000012 | ADA TOWNSHIP | KENT | 6523 | 2013-12-31 | 0.00 | 0 | 2013 | 0 |
Michigan changed the way they monitor by including the 5th liter in their sampling protocol. The number of systems exceeding the action level went from 16 to 31. On Jan 1, 2025, the state is set to lower the action level further, from 15 ppb to 12 ppb.
If the action level were 12 ppb, the number of systems above the action level would be 46, or 6% of the water systems in the state.
Though 6% of water systems might seem like a small number, if the MI action level were 12 ppb, that means that 993,351 people -- 15% of the state's population -- would live in a water system with levels of lead exceeding the action level.
A note on the code below: we use 15.5 and 12.5 because for compliance purposes, EPA rounds numbers so systems need to test at 15.5 or above to exceed the action level.
# Two action-levels to consider, higher and lower
lead_tests_2019 = lead_tests_all_years[lead_tests_all_years['monitoring_period_year']==2019]
systems_above_action_level = lead_tests_2019[lead_tests_2019['90th_percentile'] >= 15.5].copy()
systems_above_new_action_level = lead_tests_2019[lead_tests_2019['90th_percentile'] >= 12.5].copy()
print(f'''
Year: 2019
Systems above 15 ppb: {len(systems_above_action_level)}, {round(100*(len(systems_above_action_level)/len(lead_tests_2019)),2)}% of systems
Systems above 12 ppb: {len(systems_above_new_action_level)}, {round(100*(len(systems_above_new_action_level)/len(lead_tests_2019)),2)}% of systems
Total Systems: {len(lead_tests_2019)}
''')
print(f'''
Percentage of the population, systems above 15 ppb: \
{round(systems_above_action_level.system_population.sum()/lead_tests_2019.system_population.sum() * 100,2)}% \
({systems_above_action_level.system_population.sum()} people)
Percentage of the population, systems above 12 ppb: \
{round(systems_above_new_action_level.system_population.sum()/lead_tests_2019.system_population.sum() * 100,2)}% \
({systems_above_new_action_level.system_population.sum()} people)
''')
Year: 2019 Systems above 15 ppb: 31, 4.18% of systems Systems above 12 ppb: 46, 6.2% of systems Total Systems: 742 Percentage of the population, systems above 15 ppb: 5.73% (390996 people) Percentage of the population, systems above 12 ppb: 14.56% (993351 people)
print('Counties with highest average 90th percentile lead levels')
lead_tests_2019.groupby(['county'])['90th_percentile'].mean().reset_index().sort_values(by='90th_percentile', ascending=False)[0:10]
Counties with highest average 90th percentile lead levels
county | 90th_percentile | |
---|---|---|
17 | CLARE | 27.000000 |
38 | KALAMAZOO | 19.454545 |
80 | WAYNE | 16.288889 |
10 | BERRIEN | 13.894737 |
53 | MENOMINEE | 13.000000 |
66 | OSCODA | 11.200000 |
33 | IONIA | 8.250000 |
58 | MONTMORENCY | 7.666667 |
37 | JACKSON | 7.090909 |
62 | OCEANA | 6.500000 |
print('Counties with average lead levels above 15 ppb')
pd.pivot_table(lead_tests_2019,
values=['id','system_population', '90th_percentile'],
index=['county'],
aggfunc={
'id': 'count',
'system_population': [np.sum],
'90th_percentile': [np.mean]
},
).sort_values(by=[('system_population','sum')], ascending=False)[0:10]
Counties with average lead levels above 15 ppb
90th_percentile | id | system_population | |
---|---|---|---|
mean | count | sum | |
county | |||
WAYNE | 16.288889 | 45 | 1851780 |
OAKLAND | 4.493151 | 73 | 958664 |
MACOMB | 3.772727 | 22 | 811261 |
GENESEE | 1.486486 | 37 | 509385 |
KENT | 1.772727 | 22 | 498691 |
KALAMAZOO | 19.454545 | 11 | 395834 |
OTTAWA | 1.300000 | 20 | 196762 |
BAY | 3.583333 | 36 | 194958 |
SAGINAW | 2.650000 | 20 | 169194 |
INGHAM | 2.272727 | 11 | 144293 |
Data on lead testing from 2010 to 2019 in MI shows that the addition of a 5th liter increases the amount of lead that utilities detect in the system.
We analyzed this in a few ways. The graph below shows the spread of all 90th-percentile values in the dataset. The year is the last year of the monitoring period, which is why some are in 2020 and 2021.
It appears that systems with a fifth liter have a higher 90th-percentile value.
base = alt.Chart(width=500).mark_circle().encode(
x=alt.X('monitoring_period_year:O'),
y=alt.Y('90th_percentile'),
color=alt.Color('fifth_liter_present:O', scale=alt.Scale(domain=[0,1], range=['blue', 'red']))
)
horiline = alt.Chart().mark_rule(color='red').encode(
y='a:Q'
)
alt.layer(
base, horiline,
data=lead_tests_all_years
).transform_calculate(
a='15'
).interactive()
The average 90th-percentile lead level of systems with 5th-liter sampling is more than three times higher than systems without 5th-liter sampling.
print("All systems, avg 90th percentile lead level of systems with and without 5th liter")
lead_tests_all_years.groupby('fifth_liter_present')['90th_percentile'].mean()
All systems, avg 90th percentile lead level of systems with and without 5th liter
fifth_liter_present 0 2.727039 1 10.519737 Name: 90th_percentile, dtype: float64
Not all systems are required to use the higher of the first and fifth liter though. Utilities without lead service lines are only required to use first liter sampling.
APM Reports analyzed the average lead levels only for systems that added the fifth liter in 2019. This allows us to see the impact of adding an extra liter from further into the plumbing.
The average lead level for systems with lead service lines was 3.15 ppb before they added the fifth liter. In 2019, when sampling changed, the average lead level was 10.5 ppb.
mask = lead_tests_all_years['id'].isin(water_system_lead_tests[water_system_lead_tests['fifth_liter_included'] == 'Y'].id.to_list())
all_systems_with_5th_sample = lead_tests_all_years[mask].copy()
print("Average 90th percentile lead level, only systems that included 5th liter in 2019")
all_systems_with_5th_sample.groupby('fifth_liter_present')['90th_percentile'].mean()
Average 90th percentile lead level, only systems that included 5th liter in 2019
fifth_liter_present 0 3.157773 1 10.519737 Name: 90th_percentile, dtype: float64
The average 90th-percentile lead level for these systems hovered between 2 ppb and 4 ppb per year for nearly a decade. Once testing changed, the average lead level more than doubled.
data = all_systems_with_5th_sample.groupby(['fifth_liter_present','monitoring_period_year' ])['90th_percentile'].mean().reset_index()
display(data)
chart = alt.Chart(data, width=800, title='Avg 90th-percentile lead levels per year, systems that added 5th liter in 2019').mark_bar().encode(
x=alt.X('monitoring_period_year:O'),
y=alt.Y('90th_percentile'),
color=alt.Color('fifth_liter_present:O')
)
display(chart)
fifth_liter_present | monitoring_period_year | 90th_percentile | |
---|---|---|---|
0 | 0 | 2010 | 4.111111 |
1 | 0 | 2011 | 2.289157 |
2 | 0 | 2012 | 2.600000 |
3 | 0 | 2013 | 3.424242 |
4 | 0 | 2014 | 1.621951 |
5 | 0 | 2015 | 4.090909 |
6 | 0 | 2016 | 3.057143 |
7 | 0 | 2017 | 4.294118 |
8 | 0 | 2018 | 4.794872 |
9 | 1 | 2019 | 10.519737 |
def add_system_pop(df):
# 1) This takes the system name from lead_levels_by_year
# 2) looks up the rows from lead_tests_all_years
# 3) grabs the first row from the subsequent table, and pulls the system population by isolating a single row and pulling the value
name = df['system_name']
pop = lead_tests_all_years[lead_tests_all_years['system_name'] == name].iloc[0]['system_population']
return pop
lead_levels_by_year = pd.pivot_table(all_systems_with_5th_sample,
index='system_name',
columns=['monitoring_period_year'],
values=['90th_percentile']).sort_values(by=('90th_percentile', 2019), ascending=False)['90th_percentile'].reset_index()
lead_levels_by_year['system_population'] = lead_levels_by_year.apply(add_system_pop, axis=1)
lead_levels_by_year.head()
monitoring_period_year | system_name | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | system_population |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MELVINDALE | NaN | 6.0 | NaN | NaN | 1.0 | NaN | NaN | 3.0 | NaN | 370.0 | 10715 |
1 | PARCHMENT | NaN | 15.0 | NaN | NaN | 5.0 | NaN | NaN | 10.0 | 16.0 | 73.5 | 3174 |
2 | HIGHLAND PARK | 10.0 | NaN | NaN | 4.0 | NaN | NaN | 5.0 | NaN | NaN | 57.0 | 11398 |
3 | DEARBORN HEIGHTS | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | 5.0 | NaN | 34.0 | 57774 |
4 | BENTON HARBOR | NaN | 11.0 | 5.0 | NaN | NaN | 12.0 | NaN | NaN | 22.0 | 29.5 | 9970 |
lead_levels_by_year.sort_values(by=2019, ascending=False).head()
monitoring_period_year | system_name | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | system_population |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MELVINDALE | NaN | 6.0 | NaN | NaN | 1.0 | NaN | NaN | 3.0 | NaN | 370.0 | 10715 |
1 | PARCHMENT | NaN | 15.0 | NaN | NaN | 5.0 | NaN | NaN | 10.0 | 16.0 | 73.5 | 3174 |
2 | HIGHLAND PARK | 10.0 | NaN | NaN | 4.0 | NaN | NaN | 5.0 | NaN | NaN | 57.0 | 11398 |
3 | DEARBORN HEIGHTS | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | 5.0 | NaN | 34.0 | 57774 |
4 | BENTON HARBOR | NaN | 11.0 | 5.0 | NaN | NaN | 12.0 | NaN | NaN | 22.0 | 29.5 | 9970 |
The point of lead sampling is to give utilities information about how well their lead mitigation strategies are working. The action level is not a measure of health, but a measure of how their other strategies, such as corrosion control, are working.
APM Reports analyzed how the revisions to Michigan's Lead and Copper Rule impacted the number of systems over the action level. Over the last decade, the number of systems over the action level fluctuated from 4 to a max of 16. In 2019, the number of systems over the action level was 31, double the previous year's number.
systems_above_al = pd.pivot_table(lead_tests_all_years,
index='monitoring_period_year',
values=['90th_percentile'],
aggfunc = {'90th_percentile': lambda x: (x>=15.5).sum()})['90th_percentile'].reset_index().rename(columns={'90th_percentile':'systems_over_al'})
display(systems_above_al)
alt.Chart(systems_above_al[systems_above_al.monitoring_period_year < 2020], width=500).mark_bar().encode(
x=alt.X('monitoring_period_year:O', title='Year'),
y=alt.Y('systems_over_al:Q', title='Number of systems above AL')
)
monitoring_period_year | systems_over_al | |
---|---|---|
0 | 2010 | 5 |
1 | 2011 | 9 |
2 | 2012 | 6 |
3 | 2013 | 4 |
4 | 2014 | 8 |
5 | 2015 | 4 |
6 | 2016 | 5 |
7 | 2017 | 15 |
8 | 2018 | 16 |
9 | 2019 | 31 |
10 | 2020 | 0 |
11 | 2021 | 0 |
Michigan is lowering the action level to 12 ppb in 2025. Though this change is still five years away, we can get an idea for what might happen when the level is lowered by looking at the number of systems over 12 ppb over the last decade.
The overall pattern is the same as what we see with the current action level. The number of systems with high lead levels flucuates year to year but more than doubles once fifth liter sampling is implemented.
systems_above_new_al = pd.pivot_table(lead_tests_all_years,
index='monitoring_period_year',
values=['90th_percentile'],
aggfunc = {'90th_percentile': lambda x: (x>=12.5).sum()})['90th_percentile'].reset_index().rename(columns={'90th_percentile':'systems_over_new_al'})
display(systems_above_new_al)
alt.Chart(systems_above_new_al[systems_above_new_al.monitoring_period_year < 2020], width=500).mark_bar().encode(
x=alt.X('monitoring_period_year:O', title='Year'),
y=alt.Y('systems_over_new_al:Q', title='Number of systems above New AL')
)
monitoring_period_year | systems_over_new_al | |
---|---|---|
0 | 2010 | 9 |
1 | 2011 | 17 |
2 | 2012 | 12 |
3 | 2013 | 9 |
4 | 2014 | 10 |
5 | 2015 | 9 |
6 | 2016 | 15 |
7 | 2017 | 21 |
8 | 2018 | 20 |
9 | 2019 | 46 |
10 | 2020 | 0 |
11 | 2021 | 0 |
The action level is found by taking lead samples at high-risk homes throughout a water system. The number of samples required varies based on the system and their previous lead test results.
APM Reports received the individual lead testing results that were used for calculating the 90th-percentile values for all systems, from 2016 to 2019.
columns = {
'PWS ID': 'id',
'System': 'system_name',
'SiteCode': 'sitecode',
'Lab ID': 'lab_id',
'For\nCompliance?': 'for_compliance',
'Collect Date': 'collect_date',
'Address': 'address',
'Liter 1 or 5': 'first_or_fifth',
'Analyte': 'rule',
'Result': 'lead_level',
'Unit': 'unit',
'Accepted/\nRejected?': 'accepted_rejected',
'Reject\nReason': 'reject_reason'
}
lead_sampling_xlsx = os.path.join(mi_data, 'Pb_Individual_Results_2016-2019.xlsx')
lead_sampling = pd.read_excel(lead_sampling_xlsx)
lead_sampling = lead_sampling.rename(columns=columns)
def convert_units(df):
unit = df['unit'].strip()
if unit == 'MG/L':
corrected = df['lead_level'] * 1000
elif unit == 'UG/L':
corrected = df['lead_level']
else:
print(df)
return corrected
def correct_first_fifth(df):
if df['first_or_fifth'] == 'L5':
return 1
else:
return 0
# clean addresses
lead_sampling['address'] = lead_sampling.apply(lambda x: x.address.strip(), axis=1)
lead_sampling['lead_level_ppb'] = lead_sampling.apply(convert_units, axis=1)
lead_sampling['fifth_liter'] = lead_sampling.apply(correct_first_fifth, axis=1)
lead_sampling['collect_year'] = pd.to_datetime(lead_sampling['collect_date']).dt.year
lead_sampling.head()
id | system_name | sitecode | lab_id | for_compliance | collect_date | address | first_or_fifth | rule | lead_level | unit | accepted_rejected | reject_reason | lead_level_ppb | fifth_liter | collect_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | DIST | 183572-2 | Y | 2018-07-12 | 109 | LEAD | 0.0 | MG/L | A | 0.0 | 0 | 2018 | ||
1 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | DIST | 183572-4 | Y | 2018-07-12 | 1115 | LEAD | 0.0 | MG/L | A | 0.0 | 0 | 2018 | ||
2 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | DIST | 183572-5 | Y | 2018-07-12 | 1124 | LEAD | 0.0 | MG/L | A | 0.0 | 0 | 2018 | ||
3 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | DIST | 183572-1 | Y | 2018-07-13 | 105 | LEAD | 0.0 | MG/L | A | 0.0 | 0 | 2018 | ||
4 | MI0000011 | ACME TOWNSHIP - HOPE VILLAGE | DIST | 183572-3 | Y | 2018-07-17 | 201 | LEAD | 0.0 | MG/L | A | 0.0 | 0 | 2018 |
The average lead level of all first liter samples is 2 pbb, while the average lead level of all fifth liter samples is 4.8 ppb. This is consistent with an APM Reports analysis of lead levels from Chicago that found lead levels were higher from deeper in the plumbing.
The 90th-percentile values of all first liter samples was 3.4 ppb, while the 90th-percentile of all fifth liter samples was 10.8 ppb.
fifth_liters = pd.pivot_table(lead_sampling,
index=['fifth_liter'],
values='lead_level_ppb', aggfunc=[np.mean,lambda x: np.percentile(x,90)])
display(fifth_liters)
alt.Chart(fifth_liters['mean'].reset_index(),
title='Average lead concentration, 1st vs 5th liter').mark_bar().encode(
x=alt.X('lead_level_ppb'),
y=alt.Y('fifth_liter:O')
)
mean | <lambda> | |
---|---|---|
lead_level_ppb | lead_level_ppb | |
fifth_liter | ||
0 | 1.996813 | 3.400 |
1 | 4.810809 | 10.889 |
Lead levels are notoriously variable. Lead levels will be different based on a large number of different factors. The average lead level for fifth liter samples was 4.8 ppb, but the standard deviation was 26.
pd.pivot_table(lead_sampling,
index='fifth_liter',
values=['lead_level_ppb'],
aggfunc=['mean', 'std'])
mean | std | |
---|---|---|
lead_level_ppb | lead_level_ppb | |
fifth_liter | ||
0 | 1.996813 | 18.029060 |
1 | 4.810809 | 26.119056 |
On average, the fifth liter from each house was .8 ppb higher than the first liter at the same house.
df = pd.pivot_table(lead_sampling[lead_sampling.collect_year == 2019],
index='address',
columns=['first_or_fifth'],
values=['lead_level_ppb'],
aggfunc=['mean'])
first_fifth = df[(df[('mean', 'lead_level_ppb', 'L1')] >= 0) & (df[('mean', 'lead_level_ppb', 'L5')] >= 0)]
flat_first_fifth = first_fifth[('mean', 'lead_level_ppb')].reset_index()[['address', 'L1', 'L5']]
flat_first_fifth['1st_5th_diff'] = flat_first_fifth['L5'] - flat_first_fifth['L1']
print(f'''
Average difference between first and fifth liter samples, at the same home: {round(flat_first_fifth['1st_5th_diff'].mean(),2)} ppb
''')
Average difference between first and fifth liter samples, at the same home: 0.84 ppb
# The average lead concentration of individual first liter samples at sytems with lead lines in 2019, compared to the avg of 5th liter
systems_with_5th_liter = water_system_lead_tests[water_system_lead_tests['fifth_liter_included'] == 'Y'].id.to_list()
fifth_mask = (lead_sampling.id.isin(systems_with_5th_liter)) & (lead_sampling.collect_year == 2019)
pd.pivot_table(lead_sampling[fifth_mask],
index='fifth_liter',
values=['lead_level_ppb'],
aggfunc=[np.mean,lambda x: np.percentile(x,90)])
mean | <lambda> | |
---|---|---|
lead_level_ppb | lead_level_ppb | |
fifth_liter | ||
0 | 3.393858 | 5.2 |
1 | 4.926820 | 11.0 |
lead_tests_all_years_out = os.path.join(data_out, 'mi_system_90th_percentiles_2010-2019.csv')
lead_sampling_out = os.path.join(data_out, 'mi_system_samples_2016-2019.csv')
lead_tests_all_years.to_csv(lead_tests_all_years_out, index=False)
lead_sampling.to_csv(lead_sampling_out, index=False)