Table of Contents

River Arno & Lake Bilancino

This notebook is part of a tripartite which consists of one notebook about the Petrignano aquifer, one notebook about the water spring Madonna di Canneto, and this one combining the methodology and ML algo's for the water bodies river Arno and the artificial lake Bilancino.

Can you help preserve "blue gold" using data to predict water availability?

Competition info

  • TITLE: Acea Smart Water Analytics
  • Start Date: December 10, 2020
  • Entry Deadline and End Date (Final Submission Deadline): February 17th, 2021 11:59 PM UTC
  • COMPETITION SPONSOR: Injenia SRL on behalf of Acea Spa

This competition uses nine different datasets, completely independent and not linked to each other. Each dataset can represent a different kind of waterbody. As each waterbody is different from the other, the related features as well are different from each other. So, if for instance we consider a water spring we notice that its features are different from the lake’s one. The Acea Group deals with four different type of waterbodies: water spring (for which three datasets are provided), lake (for which a dataset is provided), river (for which a dataset is provided) and aquifers (for which four datasets are provided). Thus there are 9 datasets, including springs, aquifers, a river and a lake.

Some datasets had 20+ columns, while other had 3 or 4. Regardless the numbers of columns, there were some long gaps in data series due to numerous nan's. So, after performing an initial EDA, I chose these 2 competition topics: river Arno & lake Bilancino.
A nice reason for this decision is that they are hydrologically connected: the lake is a water storage on the river (La) Sieve, which tributes to the Arno. As such, to study one topic is half the work of the other.

However, there were many obstacles on route to find a reasonable methodology to obtain decent results for this endeavor. As the competition team provided interesting datasets at first glance, it became clear that some crucial info was not provided, and that the datasets had some NaN's holes. For instance, a hole in the river level data for > 6 months, which sadly included a big flooding event. Considering that many rivers in Central Italy are torrential in their core, and a catastrophic flood of the Arno occurs every 150-200 years, I would have provided some more data as competition instigator.

Background info
There is some justification to bring up for this way of omitting data on purpose: most employees working for the administrations or public co. that manage these surface waters, don't have a science background nor higher level degree. But still they have to be able to make decisions based on what they experience in real time while on site. In line of this you can find some hydrological studies which provide understandable decision making tools for a hayman. This might be worthwhile for predicting water levels for most of the storm events, but cannot handle fat tail events.

So, I took the freedom to collect scientifical info and data I could find online, and tried out if this added info provided better results for the model. Also, as the given datasets are so small, at least the ones I chose, I believe that running a training test for more than 5 minutes is waste of energy resources; and we're dealing here with natural resources, right?

What follows is a more detailed analysis on specific topics I spotted to be interesting to work on. This lead to some corrections to the original dataset(s), and also to some new features. Applied corrections are tested by algo's like random forest or other regressors. Most new features did not stand the tests however.

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 57
pd.options.display.max_columns = 38
pd.set_option('precision', 2)

from IPython.display import Image, display
%config InlineBackend.figure_format = 'retina';
from IPython.core.display import display, HTML
display(HTML("<style>.container {width:72% !important;}</style>"))

from datetime import datetime, timedelta, timezone
import os
In [5]:
Im= Image( url="", format='jpg', embed=True, width=430, height=400, retina=False,)
display(Im ,)#data='bytes',

The location of the lake Bilancino and the rivers Sieve and Arno.

In [2]:
#!pip install openpyxl
Floods =pd.read_excel(r"",sheet_name="Floods", ); # header=0,sep=";",
#floods["Date"] = pd.to_datetime( River_Arno.Date, format='%d/%m/%Y' ) # euro dates
#floods.set_index("Date", inplace=True)
Requirement already satisfied: openpyxl in c:\program files\python38\lib\site-packages (3.0.3)
Requirement already satisfied: et-xmlfile in c:\program files\python38\lib\site-packages (from openpyxl) (1.0.1)
Requirement already satisfied: jdcal in c:\program files\python38\lib\site-packages (from openpyxl) (1.4.1)
MEDIUM LARGE EXCEPTIONAL ALL Periodicy Count MeanPeriod DayPeriodicy
0 1261.0 1177.0 1333.0 1177 NaN 8.32e-07 14.35 NaN
1 1303.0 1269.0 1547.0 1261 84.0 NaN NaN NaN
2 1305.0 1282.0 1557.0 1269 8.0 NaN NaN NaN
3 1362.0 1284.0 1589.0 1282 13.0 NaN NaN NaN
4 1368.0 1288.0 1740.0 1284 2.0 NaN NaN NaN
5 1378.0 1334.0 1758.0 1288 4.0 NaN NaN NaN
6 1406.0 1345.0 1844.0 1303 15.0 NaN NaN NaN
7 1434.0 1380.0 1966.0 1305 2.0 NaN NaN NaN
8 1490.0 1456.0 NaN 1333 28.0 NaN NaN NaN
9 1491.0 1465.0 NaN 1334 1.0 NaN NaN NaN
Florence during the flood of November 4 1966

Prologue: Historical floods in Florence from 1177 to the present

The source of the first 3 columns of the Floods table:

  • "The Arno River Floods", Caporali E., Rinaldi M., Casagli N. ; Giornale di Geologia Applicata 1 (2005) 177 –192, doi: 10.1474/GGA.2005-01.0-18.0018

Note: you may skip this intro about floods, as it is not essential for the methodology.

In [3]:
fig, ax= plt.subplots(1,3, figsize=(16,4), sharex=True, sharey=True) ; FloodsALL= Floods.ALL; 
sns.histplot( data=Floods.MEDIUM, bins=20, ax= ax[0] ); 
sns.histplot( data=Floods.LARGE, bins=20, ax= ax[1],color="darkorange" ); 
sns.histplot( data=Floods.EXCEPTIONAL, bins=20, ax= ax[2],color="crimson" ); 

Rainfall flood oct. 1992 data


  • "Sensitivity of meteorological high-resolution numerical simulations of the biggest floods occurred over the Arno river basin, Italy, in the 20th century". (2003) Francesco Meneguzzoa, Massimiliano Pasquia, Giovanni Mendunib,Gianni Messeric, Bernardo Gozzinia, Daniele Grifonia, Matteo Rossic,Giampiero Maracchia
In [4]:
floods =pd.read_csv(r"",header=0, engine='python', encoding="UTF-8",parse_dates=True, ); 
floods["count"] = 1 ; #
floods["year"] =pd.to_datetime( floods.Year, format='%Y',infer_datetime_format=True, errors = 'coerce' ) # 
floodsm= pd.pivot_table(data=floods ,index="Year",columns="Floodclass",values="count",fill_value=0 )#pivot_table(floods, id_vars=['Year'], value_vars=['Floodclass']) 
1688 0 1 0
1695 0 0 1
1698 0 0 1
1705 0 1 0
1709 0 1 0
1714 0 1 0
1715 0 0 1
1719 0 1 0
1740 1 0 0
1745 0 0 1
1758 1 0 0
1761 0 0 1
1844 1 0 0
1966 1 0 0
1992 0 0 1
In [5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57 entries, 1177 to 1992
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   EXCEPTIONAL  57 non-null     int64
 1   LARGE        57 non-null     int64
 2   MEDIUM       57 non-null     int64
dtypes: int64(3)
memory usage: 1.8 KB
In [6]:
floodsm.plot( colormap="magma", figsize=(20,2),); #x=floodsm.Yeary="Floodclass",

River Arno

In [7]:
River_Arno =pd.read_csv(r"",header=0, sep=",", engine='python', encoding="UTF-8",parse_dates=True  ); 
River_Arno["Date"] = pd.to_datetime( River_Arno.Date, format='%d/%m/%Y' ) # euro dates
River_Arno.set_index("Date", inplace=True)
Rainfall_Le_Croci Rainfall_Cavallina Rainfall_S_Agata Rainfall_Mangona Rainfall_S_Piero Rainfall_Vernio Rainfall_Stia Rainfall_Consuma Rainfall_Incisa Rainfall_Montevarchi Rainfall_S_Savino Rainfall_Laterina Rainfall_Bibbiena Rainfall_Camaldoli Temperature_Firenze Hydrometry_Nave_di_Rosano
1998-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.82
1998-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.94
1998-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.11
1998-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.84
1998-01-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.88

There are some gaps in the temperature data, which we correct by interpolation. And there are also some errors -zero's - in the river level data.

In [8]:
fig, ax = plt.subplots(1,1, figsize=(20,4),); River_Arno.loc['2007']["Temperature_Firenze"].plot(); 
In [9]:
River_Arno["Temperature_Firenze"]=River_Arno["Temperature_Firenze"].interpolate(method="linear")#"spline", order=2
In [10]:
River_Arno.loc['Mar 2007']["Temperature_Firenze"]
2007-03-01    14.75
2007-03-02    15.45
2007-03-03    15.25
2007-03-04    16.80
2007-03-05    14.55
2007-03-06    13.25
2007-03-07    11.95
2007-03-08    13.30
2007-03-09    14.65
2007-03-10    13.65
2007-03-11    12.65
2007-03-12    14.45
2007-03-13    13.30
2007-03-14    13.30
2007-03-15    16.60
2007-03-16    14.85
2007-03-17    13.15
2007-03-18    13.20
2007-03-19    10.70
2007-03-20     5.65
2007-03-21     7.95
2007-03-22     8.90
2007-03-23     9.15
2007-03-24     8.70
2007-03-25     8.90
2007-03-26     9.30
2007-03-27    11.00
2007-03-28    12.10
2007-03-29    10.30
2007-03-30    11.00
2007-03-31    11.30
Name: Temperature_Firenze, dtype: float64

Potential evapotranspiration

Soil evaporation and plant transpiration are two important components of the soil water balance. Evapotranspiration is the sum of these two evaporative losses (one from the soil and the other from the plant stomates). There is another term, the potential evapotranspiration (PET), which refers to the potential evaporative losses, typically by an reference crop of idealized conditions (e.g. green grass of 8–15 cm height) actively growing under non limiting soil moisture conditions and under the influence of the local weather. This value is used as reference of the maximum water loss from the system due to soil evaporation and plant transpiration.

Knowing the potential evaporation is useful to characterize climate regimes, usually the ratio of annual precipitation to annual PET. Another common use of PET, is in combination with empirical coefficients that can be used to weigh the PET of the reference crop to another crop.
Because potential evapotranspiration represents potential evaporative losses, the atmospheric vapor pressure deficit is a common term in many models. The vapor pressure deficit is computed by subtracting the actual vapor pressure from the saturation vapor pressure at the same ambient temperature. Vapor pressure deficit can be easily approximated using air temperature and relative humidity using Tetens equation for the saturation vapor pressure.

In [11]:
def tetens(T):
    Tetens= 0.6108*np.exp(17.27*T/(T+ 237.3)) 
    return Tetens
In [12]:
River_Arno["Tetens"]= River_Arno["Temperature_Firenze"].apply( tetens)


In [13]:
fig, ax = plt.subplots(1,1, figsize=(20,4),); River_Arno["Hydrometry_Nave_di_Rosano"].plot(); 
In [14]:
River_Arno.loc["25 Dec 2008":"5 Jan 2009"]["Hydrometry_Nave_di_Rosano"]
2008-12-25    0.00
2008-12-26    0.00
2008-12-27    0.00
2008-12-28    0.00
2008-12-29    0.00
2008-12-30    0.00
2008-12-31    0.00
2009-01-01    1.58
2009-01-02    1.84
2009-01-03    1.84
2009-01-04    1.76
2009-01-05    1.54
Name: Hydrometry_Nave_di_Rosano, dtype: float64

Missing flow rate data for the last half of 2008, which will cause bias by a 'simple' replacement. Let's try fill the gap with averages deduced from the other complete years...

In [15]:
averageHyd= River_Arno.groupby('{:%m-%d}'.format).mean()
In [16]:
averageHydro = averageHyd["07-02":"12-31"].Hydrometry_Nave_di_Rosano
In [17]:
averageHydro2008= pd.to_datetime( averageHydro.index, format='%m-%d') 
In [18]:
Timestamp('1900-07-03 00:00:00')

as grouping over the years has stripped away the year, I have to offset 1900 to 2008

In [19]:
from datetime import timedelta
#today = pd.to_datetime('today')
averageHyd2008 = averageHydro2008 + timedelta(days=108*365) #averageHydro2008 
In [20]:
averageHyd2008= averageHyd2008 + timedelta(days=27)
In [21]:
Timestamp('2008-07-03 00:00:00')
In [22]:
averageHydro.index= averageHyd2008
In [ ]:
In [23]:
River_Arno.update(averageHydro ) # insert into original frame
In [24]:
fig, ax = plt.subplots(1,1, figsize=(20,4),); River_Arno.loc['2006']["Hydrometry_Nave_di_Rosano"].plot(); 

We could replace some zero's with the average, or use a better technique: interpolation.

In [25]:
River_Arno["Hydrometry_Nave_di_Rosano"]=River_Arno["Hydrometry_Nave_di_Rosano"].interpolate(method="linear", )#limit_direction="both"
River_Arno["Hydrometry_Nave_di_Rosano"]= River_Arno["Hydrometry_Nave_di_Rosano"].replace(0, 1.52) # np.nan ?
In [26]:
fig, ax= plt.subplots(1,1, figsize=(20,4),); River_Arno.loc['2008']["Hydrometry_Nave_di_Rosano"].plot( lw=0.75); 

s = "2016/12/07 17:45"
d = datetime.strptime(s, '%Y/%m/%d %H:%M').date(); type(d) #

Rainfall in Valdisieve and Valdarno

In [27]:
River_Arno_Rainfall=River_Arno.Rainfall_Le_Croci+ River_Arno.Rainfall_Cavallina+ River_Arno.Rainfall_Mangona+River_Arno.Rainfall_S_Piero +River_Arno.Rainfall_S_Agata 
River_Arno["Rainf365"]= River_Arno_Rainfall.rolling(365, ).sum().fillna(method='bfill', limit=365)

Column River_Arno["Rainfall"] was used once, but later abandonned.

We split the rainfall of the North area up because their location has different impact on the water level at Rosano. After inspection topography maps, I noticed that run off water is not passing the lake for these 2 locations: S. Piero & S.Agata.
And this pitfall is a fact I discovered studying the topography of the area: the rainfall in Vernio seems to have shorter run off times than the run offs of the lake.
Rainfall in Casentino and Bibbiena seems also to have longer time lags.

In [28]:
River_Arno["RainNO_3"]=River_Arno.Rainfall_Le_Croci+ River_Arno.Rainfall_Cavallina+ River_Arno.Rainfall_Mangona 
River_Arno["RainNO_2"]=River_Arno.Rainfall_S_Piero +River_Arno.Rainfall_S_Agata
# River_Arno.Rainfall_Vernio  River_Arno.Rainfall_Incisa River_Arno.Rainfall_Bibbiena 

Vernio rainfall runoff water is flowing via the Bisenzio stream into Firenze. The water might hinder the runoff of rainwater higher up, and influence the flow rate upstream at Nave di Rosano a bit.

Laterina is nowadays called Pergine Laterina .

Obstacles for decent basin-wide predictions are:

  • for 9 stations of the 14, there is only rainfall data for period 2004-> mid 2007 available.
  • Rainwater fallen on the area's Le Croci, Cavallina and Mangona must pass through the lake reservoir.
  • there has been mentioned that the lake has 2 regimes, buffer water in winter, and feed the Arno in summer, but the actual times vary from year to year.
  • 'Laterina' is data for upper level subbasin

Let's use the info which was provided about the lake:

  • There has been mentioned that the lake storage has 2 regimes, buffering water in the winter, and feeding the Arno in summer. It is guessing how the run off is managed in a certain regime: no start or end times were given.
In [29]:
River_Arno["Rainfall_Bibbiena"] = River_Arno.Rainfall_Stia +River_Arno.Rainfall_Bibbiena + River_Arno.Rainfall_Camaldoli # River_Arno.Rainfall_Consuma 

making some estimated shifts in time, so no longer a general shift.

In [30]:
River_Arno["Rainfall_X"]=River_Arno.Rainfall_Incisa+ River_Arno.Rainfall_Montevarchi+River_Arno.Rainfall_S_Savino#+River_Arno.Rainfall_Laterina 
#.shift(1) River_Arno.Rainfall_Consuma + # medium  upper Arno basin

storage=[6,7,8,9,10] if River_Arno.loc[River_Arno.month.isin(storage) ] : River_Arno["Rain_TOT"]= River_Arno["RainNO_2"] +River_Arno["Rainfall_X"]+River_Arno["Rainfall_Bibbiena"] else: River_Arno["Rain_TOT"]= River_Arno["Rainfall"] +River_Arno["Rainfall_X"]+River_Arno["Rainfall_Bibbiena"]

In [31]:
#df= River_Arno ; 
River_Arno['month']= River_Arno.index.month
River_Arno['year']= River_Arno.index.year
River_Arno['week']= River_Arno.index.isocalendar().week

Rainfall regrouping based on the relative month of storage. I had to make a guess at what point storing water begins and ends. Later we return to this issue...

In [ ]:
#River_Arno.loc[River_Arno['month'].isin(storage), "Rain_TOT"] = River_Arno["RainNO_2"] +River_Arno["Rainfall_X"]+River_Arno["Rainfall_Bibbiena"]+River_Arno.Rainfall_Laterina +River_Arno.Rainfall_Consuma +  River_Arno.Rainfall_Incisa
#River_Arno.loc[ ~River_Arno['month'].isin(storage), "Rain_TOT"] = River_Arno["Rainfall"] +River_Arno["Rainfall_X"]+River_Arno["Rainfall_Bibbiena"]+River_Arno.Rainfall_Laterina +River_Arno.Rainfall_Consuma +  River_Arno.Rainfall_Incisa 
In [32]:
River_Arno.loc[River_Arno['month'].isin(storage), "Rainfall"] = River_Arno["RainNO_2"]  +River_Arno["RainNO_3"] # summer
River_Arno.loc[ ~River_Arno['month'].isin(storage), "Rainfall"] = River_Arno["RainNO_2"]  # need some winter overflow

We need still some volume to count for the winter overflow.

In [33]:
River_Arno["winteroverflow"] = River_Arno["RainNO_3"] -(River_Arno["RainNO_3"].mean()+ 2*River_Arno["RainNO_3"].std());  
fig, ax = plt.subplots(1,1, figsize=(20,4),); River_Arno["2005":"2019"]["winteroverflow"].plot();