Developed by: Georgii Bocharov
E-Mail: bocharovgeorgii@gmail.com
pyextremes: https://github.com/georgebv/pyextremes
This notebook demonstrates multiple common methods used to assist when selecting optimal threshold value for the peak-over-threshold extreme value extraction method.
First, we need to setup the environment. Let's import the pyextremes
library:
import pyextremes
print("pyextremes", pyextremes.__version__)
pyextremes 2.0.1
Now, let's import other packages required for this notebook:
import numpy as np
import pandas as pd
Data used in this tutorial was obtained from the New York City the Battery tidal station. Data for this station was extracted using the coastlib library which provides an interface to the NOAA CO-OPS API. Let's load it and pre-process (discussed in more detail in this tutorial).
data = (
pd
.read_csv("../../data/battery_wl.csv", index_col=0, parse_dates=True, squeeze=True)
.sort_index(ascending=True)
.astype(float)
.dropna()
.loc[pd.to_datetime("1925"):]
)
data = data - (data.index.array - pd.to_datetime("1992")) / pd.to_timedelta("365.2425D") * 2.87e-3
data.head()
Date-Time (GMT) 1926-11-20 05:00:00 -0.411120 1926-11-20 06:00:00 -0.777120 1926-11-20 07:00:00 -1.051120 1926-11-20 08:00:00 -1.051121 1926-11-20 09:00:00 -0.808121 Name: Water Elevation [m NAVD88], dtype: float64
The mean residual life plot should be approximately linear above a threshold for which the Generalized Pareto Distribution model is valid. The strategy is to select the smallest (largest for extremes_type="low"
) threshold value immediately above (below for extremes_type="low"
) which the plot is approximately linear.
help(pyextremes.plot_mean_residual_life)
Help on function plot_mean_residual_life in module pyextremes.tuning.threshold_selection: plot_mean_residual_life(ts: pandas.core.series.Series, thresholds=None, extremes_type: str = 'high', alpha: float = 0.95, figsize: tuple = (8, 5)) -> tuple Plot mean residual life for given threshold values. The mean residual life plot should be approximately linear above a threshold for which the Generalized Pareto Distribution model is valid. The strategy is to select the smallest (largest for extremes_type='low') threshold value immediately above (below for extremes_type='low') which the plot is approximately linear. Parameters ---------- ts : pandas.Series Time series of the signal. thresholds : array-like, optional An array of thresholds for which the mean residual life plot is plotted. If None (default), plots mean residual life for 100 equally-spaced thresholds between 90th (10th if extremes_type='low') percentile and 10th largest (smallest if extremes_type='low') value in the series. extremes_type : str, optional high (default) - extreme high values low - extreme low values alpha : float, optional Confidence interval width in the range (0, 1), by default it is 0.95. If None, then confidence interval is not shown. figsize : tuple, optional Figure size in inches in format (width, height). By default it is (8, 5). Returns ------- figure : matplotlib.figure.Figure Figure object. axes : matplotlib.axes._axes.Axes Axes object.
pyextremes.plot_mean_residual_life(ts=data)
(<Figure size 768x480 with 1 Axes>, <AxesSubplot:xlabel='Threshold', ylabel='Mean excess'>)
Mean residual life plots are notoriously hard to interpret. Threshold selection should never be based soleley on this plot and considerable amount of experience is required to develop intuition required to read such plots. The plot above indicates that the region of thresholds where the linearity condition is approximately met appears to lie in the 1.2 to 1.6 range.
The parameter stability plot shows shape and modified scale parameters of the Generalized Pareto Distribution (GPD). Both shape and modified scale parameters should be approximately constant above a threshold for which the GPD model is valid. The strategy is to select the smallest (largest for extremes_type="low"
) threshold value immediately above (below for extremes_type="low"
) which the GPD parameters are approximately constant.
help(pyextremes.plot_parameter_stability)
Help on function plot_parameter_stability in module pyextremes.tuning.threshold_selection: plot_parameter_stability(ts: pandas.core.series.Series, thresholds=None, r: Union[str, pandas._libs.tslibs.timedeltas.Timedelta] = '24H', extremes_type: str = 'high', alpha: Union[float, NoneType] = None, n_samples: int = 100, figsize: tuple = (8, 5)) -> tuple Plot parameter stability plot for given threshold values. The parameter stability plot shows shape and modified scale parameters of the Generalized Pareto Distribution (GPD). Both shape and modified scale parameters should be approximately constant above a threshold for which the GPD model is valid. The strategy is to select the smallest (largest for extremes_type='low') threshold value immediately above (below for extremes_type='low') which the GPD parameters are approximately constant. Parameters ---------- ts : pandas.Series Time series of the signal. thresholds : array-like, optional An array of thresholds for which the mean residual life plot is plotted. If None (default), plots mean residual life for 100 equally-spaced thresholds between 90th (10th if extremes_type='low') percentile and 10th largest (smallest if extremes_type='low') value in the series. r : str or pandas.Timedelta, optional Duration of window used to decluster the exceedances. By default r='24H' (24 hours). extremes_type : str, optional high (default) - extreme high values low - extreme low values alpha : float, optional Confidence interval width in the range (0, 1). If None (default), then confidence interval is not shown. n_samples : int, optional Number of bootstrap samples used to estimate confidence interval bounds (default=100). Ignored if `alpha` is None. figsize : tuple, optional Figure size in inches in format (width, height). By default it is (8, 5). Returns ------- figure : matplotlib.figure.Figure Figure object. axes : matplotlib.axes._axes.Axes Axes object.
pyextremes.plot_parameter_stability(ts=data)
(<Figure size 768x480 with 2 Axes>, (<AxesSubplot:ylabel='Shape, $\\xi$'>, <AxesSubplot:xlabel='Threshold', ylabel='Modified scale, $\\sigma^*$'>))
The parameter stability plot shown above indicates the distribution gets unstable after threshold value of 1.6, which indicates that threshold higher than 1.6 do not provide sufficient number of exceedances. The region of stability appears to be confined in the 1.2 to 1.6 range of threshold values.
The return value stability plot shows return values for given return period for given thresholds. The purpose of this plot is to investigate statibility and sensitivity of the Generalized Pareto Distribution model to threshold value. Threshold value selection should still be guided by the mean residual life plot and the parameter stability plot. This plot should be used as additional check.
See https://docs.scipy.org/doc/scipy/reference/stats.html for a full list of supported distributions.
help(pyextremes.plot_return_value_stability)
Help on function plot_return_value_stability in module pyextremes.tuning.threshold_selection: plot_return_value_stability(ts: pandas.core.series.Series, return_period, return_period_size: Union[str, pandas._libs.tslibs.timedeltas.Timedelta] = '1Y', thresholds=None, r: Union[str, pandas._libs.tslibs.timedeltas.Timedelta] = '24H', extremes_type: str = 'high', distributions: List[Union[str, scipy.stats._distn_infrastructure.rv_continuous]] = ['genpareto', 'expon'], alpha: Union[float, NoneType] = None, n_samples: int = 100, figsize: tuple = (8, 5)) -> tuple Plot return value stability plot for given threshold values. The return value stability plot shows return values for given return period for given thresholds. The purpose of this plot is to investigate statibility and sensitivity of the Generalized Pareto Distribution model to threshold value. Threshold value selection should still be guided by the mean residual life plot and the parameter stability plot. This plot should be used as additional check. Parameters ---------- ts : pandas.Series Time series of the signal. return_period : number Return period. Given as a multiple of `return_period_size`. return_period_size : str or pandas.Timedelta, optional Size of return period (default='1Y'). If set to '30D', then a return period of 12 would be roughly equivalent to a 1 year return period (360 days). thresholds : array-like, optional An array of thresholds for which the mean residual life plot is plotted. If None (default), plots mean residual life for 100 equally-spaced thresholds between 90th (10th if extremes_type='low') percentile and 10th largest (smallest if extremes_type='low') value in the series. r : str or pandas.Timedelta, optional Duration of window used to decluster the exceedances. By default r='24H' (24 hours). extremes_type : str, optional high (default) - extreme high values low - extreme low values distributions : list, optional List of distributions for which the return value curves are plotted. By default these are "genpareto" and "expon". A distribution must be either a name of distribution from with scipy.stats or a subclass of scipy.stats.rv_continuous. See https://docs.scipy.org/doc/scipy/reference/stats.html alpha : float, optional Confidence interval width in the range (0, 1). If None (default), then confidence interval is not shown. n_samples : int, optional Number of bootstrap samples used to estimate confidence interval bounds (default=100). Ignored if `alpha` is None. figsize : tuple, optional Figure size in inches in format (width, height). By default it is (8, 5). Returns ------- figure : matplotlib.figure.Figure Figure object. axes : matplotlib.axes._axes.Axes Axes object.
pyextremes.plot_return_value_stability(
ts=data,
return_period=100,
return_period_size="365.2425D",
thresholds=np.linspace(1.2, 1.6, 50),
r="24H",
extremes_type="high",
distributions=["genpareto", "expon"],
alpha=0.95, # Set to None to get faster results
n_samples=100,
)
(<Figure size 768x480 with 1 Axes>, <AxesSubplot:xlabel='Threshold', ylabel='Return value'>)
The plot above indicates that extreme values above the threshold value of 1.45 appear to be constant for both distributions.
CAUTION: while it may be tempting to select a threshold value which gives return value desired for a particular analysis outcome, it is not a sound analysis procedure and some may call it (for a good reason) "cooking the books".