(c) 2019
Mehmet Süzen
Sylvain Arlot, Alain Celisse pdf
y∈Rn, vector of length n
ti∈Rn where yi=y(ti) and i∈Z+
where tn>tn−1>...>t1
the ordered training dataset reads, (yi,ti)
As a continuation of y is defined as out-of-sample set, w∈Rp, vector of length p, p>n,
where wj=w(tj) and j∈Z+, where j>p
where tm+p>tp+n−1>...>tn+1
the ordered training dataset reads, (wj,tj)
Disjoint k sets of partitions of y are y1,y2,...,yk each set having randomly assigned yi, and partitions are approximately equal in size |y1|≈|y2|≈...≈|yk|.
A training-fold is defined as a union of all partitions except the removed partion, Ym=k⋃l=1,l≠myl,
Due to ordered nature of the series, the standard CV approach would not be used in different folds which yields to an absurd situation of predicting past using the future values. To overcome this, a reconstruction of full training series y is denoted by Rj. Using each training-fold Yj via a secondary model M2. A technique could be interpoloation or more advanced filtering approaches, resulting Rm=Ym+^ym.
The secondary model could retain the given points on the training-fold Yj in this approach. ^ym is the reconstructed portion.
The total error due to reconstructed model M2 expresssed as gr(y,^y), here for example we write down as a mean absolute percent error, different metrics can be used,
gr=1kk∑m=1|(ym−^ym)|/ym.The primary model, M1, is build on each Rm and predictions on out-of-sample set w is applied. This results in set of predictions ^wm, the error is expressed as gp(w,^w)
gp=1kk∑q=1(w−^wm).The total error in rCV is computed as follows. grCV=gr∗gp. The lower the number better the generalisation, however both gr and gp should be judge seperately to detect any anomalies.
Given (yi,ti) ordered pairs as time-series, we aim at inferring, i.e., reconstructing, missing values at time-points tj. The missing values yj can be identified via Bayesian interpretation of kernel regularisation,
yj=KjiL−1yi
where L=Kii+I. The kernel matrices Kji,Kii is build via kernel exp(−Dm/2.0) where Dm is the distance matrix over time-points.
On can generate Ornstein-Uhlenbeck
process drawing numbers from multivariate Gaussian with specific covariance
structure. you(ti)≈N(μ,Σ).
Taking μ as all at 2.0, and Σ build via kernel exp(−Dm/2.0) where Dm is the distance matrix over time-points.
We generated Ornstein-Uhlenbeck (OU)
time-series for a regular 1000 time points with a spacing of Δt=0.1 with different length scales, mean values μ, and additional 250 time points for testing.
Time-series learning curves are not that common due to fact that data sample sizes are limited. However with reconstruction approach we provided above, one can build a learning curve L, based on number of folds
L(k)=gkrCVThe reason behind this, see paper.
%load_ext lab_black
%matplotlib inline
import numpy as np
import sys
import matplotlib
import matplotlib.pyplot as plt
"numpy version:", np.__version__, "matplotlib :", matplotlib.__version__, "Python version:", sys.version
The lab_black extension is already loaded. To reload it, use: %reload_ext lab_black
('numpy version:', '1.17.2', 'matplotlib :', '3.1.1', 'Python version:', '3.7.3 (default, Mar 27 2019, 16:54:48) \n[Clang 4.0.1 (tags/RELEASE_401/final)]')
import numpy as np
def ornstein_uhlenbeck_series_1d(t_vec, seed=424242, mu=np.array([])):
"""
Generate Ornstein-Uhlenbeck process in 1-dimension.
Sigma assumed to be a distance matrix over all t_vec.
t_vec : Ordered values time points
seed : random seed, defaults to 424242
mu : mean vector, defaults to all zeros via []
Returns
y(t) : 1-d Ornstein-Uhlenbeck process on the given time-points
"""
Dm = np.array([np.abs(t - t_vec) for t in t_vec])
sigma = np.exp(-Dm / 2.0)
n = mu.shape[0]
m = t_vec.shape[0]
if n == 0:
mu = np.zeros(m)
np.random.seed(seed)
return np.random.multivariate_normal(mu, sigma, 1)[0]
def ornstein_uhlenbeck_series_1d_oos(ty, nsample, mu_shift=5.0, oosp=0.25, seed=424242):
"""
Generate Ornstein-Uhlenbeck process in 1-dimension and its out-of-sample set.
Sigma assumed to be a distance matrix over time points.
ty : End time point for the training series that fold will be generated. [0,ty].
Out-of-sample portion will be 25% of nsample by default, See oosp.
where t_w > t_y (t_y, tw]
nsample : Number of samples to use between [0, t_y]. See oosp too.
mu_shift : A real number to shift mean value of OU from zeros, defaults to 5.0.
oosp : Defaults to 0.25. Ratio of nsample to be number of
out-of-sample time points.
Returns
OU data with time points (t_y, y, t_w, w)
"""
t_y = np.linspace(0.0, ty, nsample + 1)
mu = np.ones(t_y.shape[0]) + mu_shift
y = ornstein_uhlenbeck_series_1d(t_y, mu=mu, seed=seed)
t_w = np.linspace(ty + ty / nsample, ty + ty * oosp, nsample * oosp + 1)
mu = np.ones(t_w.shape[0]) + 5.0
w = ornstein_uhlenbeck_series_1d(t_w, mu=mu, seed=seed)
return (t_y, y, t_w, w)
def generate_reconstruction_folds(y, t_y, npartition, seed=424242):
"""
Generate folds and missing points
y : y value
t_y : corresponding t values
npartition: Number of folds to generate, k-fold's k
Return
rfolds : List of tuple of (y_m, t_m, t_r)
y_m : fold values
t_m : corresponding time
t_r : missing time points
"""
np.random.seed(seed)
nsample = y.shape[0]
# Each partition upper index: k-fold
inx_part = np.int64(np.round(np.linspace(0.0, 1.0, npartition + 1) * nsample))
# Partition ranges
inx_pairs_part = [(x[0], x[1]) for x in zip(inx_part[:-1], inx_part[1:])]
# Shuffled indices
sample_inx = np.arange(0, nsample)
np.random.shuffle(sample_inx)
rfolds = []
for ps in inx_pairs_part:
missing_inx = sample_inx[ps[0] : ps[1]]
t_r = np.sort(t_y[missing_inx])
mask = np.ones(t_y.shape[0], dtype=bool)
mask[ps[0] : ps[1]] = False
m_inx = sample_inx[mask]
y_m = y[m_inx]
t_m = t_y[m_inx]
ix = np.argsort(t_m)
t_m = t_m[ix]
y_m = y_m[ix]
rfolds.append((y_m, t_m, t_r))
return rfolds
def generate_reconstruction_folds_inx(nsample, npartition, seed=424242):
"""
Generate folds and missing points indices
nsample : Number of points in the time-series
npartitions : Number of folds to consider
seed : random number seed, defaults to 424242
Return
rfolds : List of tuple of (m_inx, r_inx)
m_inx : fold indices
r_inx : missing indices
"""
np.random.seed(seed)
# Each partition upper index: k-fold : Cut [0,1] to npartitions and map to nsample
inx_part = np.int64(np.round(np.linspace(0.0, 1.0, npartition + 1) * nsample))
# Partition ranges
inx_pairs_part = [(x[0], x[1]) for x in zip(inx_part[:-1], inx_part[1:])]
# Shuffled indices
sample_inx = np.arange(0, nsample)
np.random.shuffle(sample_inx)
rfolds = []
for ps in inx_pairs_part:
r_inx = sample_inx[ps[0] : ps[1]]
m_inx = np.setdiff1d(sample_inx, r_inx)
rfolds.append((m_inx, r_inx))
return rfolds
def gp_ou_1d(y_i, t_i, t_j):
"""
Solve for Gaussian process for unknown y_j
Kernel matrices from Ornstein-Uhlenbeck process
with unit penalty term, on identy matrix.
y_i : Ordered values time points
t_i : Time values
t_j : Time values to estimate
Returns
y_j_bar : 1-d Ornstein-Uhlenbeck process estimation on t_j time-points
"""
D_ii = np.array([np.abs(t - t_i) for t in t_i])
K_ii = np.exp(-D_ii / 2.0)
D_ij = np.array([np.abs(t - t_i) for t in t_j])
K_ij = np.exp(-D_ij / 2.0)
L = K_ii + np.identity(K_ii.shape[0])
y_j_bar = np.dot(np.matmul(K_ij, np.linalg.pinv(L)), y_i)
return y_j_bar
def gp_ou_1d_inx(y, t, m_inx, r_inx):
"""
Solve for Gaussian process for unknown y_j
Kernel matrices from Ornstein-Uhlenbeck process
with unit penalty term, on identy matrix.
y : Ordered values time points
t : Time values
m_inx : fold indices
r_inx : missing indices (points to reconstruct)
Returns
y_j_bar : 1-d Ornstein-Uhlenbeck process estimation on t_j time-points
"""
t_i = t[m_inx]
t_j = t[r_inx]
y_i = y[m_inx]
y_j = y[r_inx]
y_j_bar = gp_ou_1d(y_i, t_i, t_j)
return y_j_bar
def gp_ou_1d_reconstruct_folds(y, t, rfolds):
"""
Solve for Gaussian process for given fold indices,
i.e, reconstruction.
Kernel matrices from Ornstein-Uhlenbeck process
with unit penalty term, on identy matrix.
y : values
t : corresponding time points
rfolds : List of tuple of (m_inx, r_inx)
m_inx : fold indices
r_inx : missing indices
Returns
rcon : List of (R_m, g_r)
R_m reconstructed y and correponding MAPE g_r on each
fold
"""
rcon = []
for r in rfolds:
m_inx, r_inx = r
y_i = y[m_inx]
t_i = t[m_inx]
t_j = t[r_inx]
y_j = y[r_inx]
y_j_hat = gp_ou_1d(y_i, t_i, t_j)
g_r = np.sum(np.abs(y_j - y_j_hat) / y_j) / y_j.shape[0]
t_org = np.append(t_i, t_j)
y_hat = np.append(y_i, y_j_hat)
ixx = np.argsort(t_org)
rcon.append((y_hat[ixx], g_r))
return rcon
def gp_ou_1d_folds_predict(rcon, t_y, w, t_w):
"""
Given folds compute predictions.
rcon : List of (R_m, g_r)
R_m reconstructed y and correponding MAPE g_r on each fold
t_y : time points in recunstructed folds
w : future values
t_w : future time-points
Returns
wcon : List of (R_m, g_r)
w_hat predicted w and correponding MAPE g_p on each fold
"""
wcon = []
for rc in rcon:
R_m, _ = rc
w_hat = gp_ou_1d(R_m, t_y, t_w)
g_p = np.sum((w - w_hat) / w) / w.shape[0] # MAPE
wcon.append((w_hat, g_p))
return wcon
import logging
def rCV_ou(ty, nsample, npartition, mu_shift=5.0, oosp=0.25, seed=424242):
"""
Given time values, number of samples, and number of partitions:
Generate Ornstein-Uhlenbeck data, covariance is a distance matrix,
and perform rCV on it.
ty : End time point for the training series that fold will be generated. [0,t_y]/
End time point for the out-of-sample test portion, where t_w > t_y (t_y, tw]
Out-of-sample portion will be 25% of nsample by default, See oosp.
nsample : Number of samples to use between [0, t_y]. See oosp too.
npartitions : k for k-fold.
mu_shift : A real number to shift mean value of OU from zeros, defaults to 5.0.
oosp : Defaults to 0.25. Ratio of nsample to be number of
out-of-sample time points.
seed : Seed to use for randomization, defaults to 424242
Returns:
rCV_data
A dictionary {
'ou_data':ou_data,
'fold_reconstructions':rcon,
'predictions':wcon,
"g_rCV": g_rCV,
}
ou_data : A tuple (t_y, y, t_w, w) OU data for training and
outof sample.
rcon : List of (R_m, g_r)
R_m reconstructed y and correponding MAPE g_r on each fold.
wcon : List of (R_m, g_r)
w_hat predicted w and correponding MAPE g_p on each fold
g_rCV : Reconstructive-CV error
gr : Mean Reconstruction error
gw : Mean Prediction error
"""
t_y, y, t_w, w = ornstein_uhlenbeck_series_1d_oos(ty, nsample, seed=seed)
ou_data = (t_y, y, t_w, w)
rfolds = generate_reconstruction_folds_inx(nsample + 1, npartition, seed=seed)
rcon = gp_ou_1d_reconstruct_folds(y, t_y, rfolds)
wcon = gp_ou_1d_folds_predict(rcon, t_y, w, t_w)
gr = np.mean([r[1] for r in rcon])
gw = np.mean([w[1] for w in wcon])
g_rCV = gr * gw
rCV_data = {
"ou_data": ou_data,
"fold_reconstructions": rcon,
"predictions": wcon,
"g_rCV": g_rCV,
"gr": gr,
"gw": gw,
}
return rCV_data
t_y = np.linspace(0.0, 10, 1001)
mu = np.ones(t_y.shape[0]) + 5.0
y = ornstein_uhlenbeck_series_1d(t_y, mu=mu)
t_w = np.linspace(10.1, 12.5, 251)
mu = np.ones(t_w.shape[0]) + 5.0
w = ornstein_uhlenbeck_series_1d(t_w, mu=mu)
# plt.plot(t_y, y, t_w, w, "-", "--")
ty = 10.0
nsample = 1000
(t_y, y, t_w, w) = ornstein_uhlenbeck_series_1d_oos(ty, nsample)
# plt.plot(t_y, y, t_w, w, "-", "--")
nsample = 1001 # samples
npartition = 10 # fold
t_y = np.linspace(0.0, 10, nsample)
mu = np.ones(t_y.shape[0]) + 2.0
y = ornstein_uhlenbeck_series_1d(t_y, mu=mu)
t_w = np.linspace(10.1, 10.5, 41)
mu = np.ones(t_w.shape[0]) + 2.0
w = ornstein_uhlenbeck_series_1d(t_w, mu=mu)
rfolds = generate_reconstruction_folds_inx(nsample, npartition)
rcon = gp_ou_1d_reconstruct_folds(y, t_y, rfolds)
rcon[0]
(array([3.70006446, 3.838436 , 3.80567831, ..., 1.41902056, 1.6361343 , 1.65034412]), 0.08964514140227492)
wcon = gp_ou_1d_folds_predict(rcon, t_y, w, t_w)
m_inx, r_inx = rfolds[0]
y_i = y[m_inx]
t_i = t_y[m_inx]
t_j = t_y[r_inx]
y_j_hat = gp_ou_1d(y_i, t_i, t_j)
y_j_hat_inx = gp_ou_1d_inx(y, t_y, m_inx, r_inx)
y_org = np.append(y[m_inx], y[r_inx])
y_hat = np.append(y[m_inx], y_j_hat)
t_org = np.append(t_i, t_j)
ixx = np.argsort(t_org)
# plt.plot(t_org[ixx], y_org[ixx], t_org[ixx], y_hat[ixx], "-", "--")
w_hat = gp_ou_1d(y, t_y, t_w)
yw_org = np.append(y, w)
yw_hat = np.append(y, w_hat)
t_yw = np.append(t_y, t_w)
np.sum((w - w_hat) / w) / w.shape[0] # MAPE
0.07653746287282844
ty = 10.0
nsample = 1000
npartition = 10
rCV_data = rCV_ou(ty, nsample, npartition)
rCV_data.keys()
dict_keys(['ou_data', 'fold_reconstructions', 'predictions', 'g_rCV', 'gr', 'gw'])
# g_r erros
[rr[1] for rr in rCV_data["fold_reconstructions"]]
[0.028486701249059415, 0.03203318558792611, 0.03066018781182881, 0.028455409391295778, 0.02670547724709341, 0.030581226711995195, 0.026295584174107694, 0.028457964097179756, 0.028106758965915028, 0.026584472342276965]
t_y, y, t_w, w = rCV_data["ou_data"]
%matplotlib inline
font = {"family": "normal", "weight": "bold", "size": 14}
plt.rc("font", **font)
plt.plot(t_y, y, "-", label="y : Historic for rCV ")
plt.plot(t_w, w, "--", label="w : Out-of-Sample (OOS)")
plt.legend(loc="lower left")
plt.ylim([0, 7.0])
plt.xlabel("Time (arbitrary units)", **font)
plt.ylabel("y, w (arbitrary units)", **font)
plt.title("Ornstein-Uhlenbeck Process ", **font)
plt.savefig("plots/ou_data.eps", format="eps", dpi=1000, bbox_inches="tight")
plt.cla()
plt.clf()
plt.gca()
plt.gcf()
plt.close()
findfont: Font family ['normal'] not found. Falling back to DejaVu Sans. The PostScript backend does not support transparency; partially transparent artists will be rendered opaque. The PostScript backend does not support transparency; partially transparent artists will be rendered opaque. The PostScript backend does not support transparency; partially transparent artists will be rendered opaque. The PostScript backend does not support transparency; partially transparent artists will be rendered opaque.
y_folds = [yy[0] for yy in rCV_data["fold_reconstructions"]]
k = 1
dy_mean = []
for y_f in y_folds:
dy = np.abs(y - y_f)
dy_mean.append(dy.mean())
plt.subplot(10, 1, k)
plt.plot(t_y, dy, "-")
plt.ylabel("1")
plt.axis("off")
k = k + 1
plt.suptitle("Reconstructed series absolute errors", **font)
plt.savefig(
"plots/ou_reconstructed_diff.eps", format="eps", dpi=1000, bbox_inches="tight"
)
plt.cla()
plt.clf()
plt.gca()
plt.gcf()
plt.close()
np.mean(dy_mean)
0.014222398558916067
# Fold construction errors MAPE
[r[1] for r in rCV_data["fold_reconstructions"]]
[0.028486701249059415, 0.03203318558792611, 0.03066018781182881, 0.028455409391295778, 0.02670547724709341, 0.030581226711995195, 0.026295584174107694, 0.028457964097179756, 0.028106758965915028, 0.026584472342276965]
# Prediction errors MAPE
[w[1] for w in rCV_data["predictions"]]
[0.4661134459119515, 0.4701922271506014, 0.4669851635984956, 0.4671724893188641, 0.4693673160492926, 0.47414297989464843, 0.46736160215005484, 0.4678060255919754, 0.4674375313291001, 0.4672283786360845]
print("Mean construction error")
mr = np.mean([r[1] for r in rCV_data["fold_reconstructions"]])
print(mr)
print("Mean prediction error")
mp = np.mean([w[1] for w in rCV_data["predictions"]])
print(mp)
print("g_rCV")
rCV_data["g_rCV"]
Mean construction error 0.02863669675786782 Mean prediction error 0.4683807159631068 g_rCV
0.01341287653026851
np.mean([r[1] for r in rcon]) * np.mean([w[1] for w in wcon])
0.0075447474228444375
We generate the learning curve defined above and report based on different perfromances.
# Run different npartitions to get performance over
ty = 10.0
nsample = 1000
npartitions = np.linspace(2, 30, 29, dtype=np.int64) # k-folds
gr_ = []
gw_ = []
grw_ = []
for k in npartitions:
rCV_data = rCV_ou(ty, nsample, k)
grw_.append(rCV_data["g_rCV"])
gr_.append(rCV_data["gr"])
gw_.append(rCV_data["gw"])
plt.rc("font", **font)
plt.plot(npartitions, np.array(gr_) * 100, linestyle="--", marker="o")
plt.title("Learning curve with reconstruction error")
plt.ylabel("Reconstruction MAPE (%)")
plt.xlabel("Number of partitions in reconstruction")
plt.savefig("plots/learning_curve_gr.eps", format="eps", dpi=1000, bbox_inches="tight")
plt.cla()
plt.clf()
plt.gca()
plt.gcf()
plt.close()
findfont: Font family ['normal'] not found. Falling back to DejaVu Sans. findfont: Font family ['normal'] not found. Falling back to DejaVu Sans.
plt.rc("font", **font)
plt.plot(npartitions, np.array(gw_) * 100, linestyle="--", marker="o")
plt.title("Learning curve with prediction error")
plt.ylabel("Prediction MAPE (%)")
plt.xlabel("Number of partitions in reconstruction")
plt.savefig("plots/learning_curve_gw.eps", format="eps", dpi=1000, bbox_inches="tight")
plt.cla()
plt.clf()
plt.gca()
plt.gcf()
plt.close()
plt.rc("font", **font)
plt.plot(npartitions, np.array(grw_), linestyle="--", marker="o")
plt.title("Learning curve with rCV error")
plt.ylabel("rCV error")
plt.xlabel("Number of partitions in reconstruction")
plt.savefig("plots/learning_curve_grw.eps", format="eps", dpi=1000, bbox_inches="tight")
plt.cla()
plt.clf()
plt.gca()
plt.gcf()
plt.close()