import numpy as np
import pandas as pd
import pylab as pl
%pylab inline --no-import-all
Populating the interactive namespace from numpy and matplotlib
According to most recent works, this type of time-series normalization is the best-known transformation of a raw time series that preserves its original features. Nevertheless, the article by Lin et al. explains the reasons and remedies for some cases in which zero-mean, unit-standard-deviation normalization fails. For example, if a signal is constant over most of the time span with minor noise in short intervals, this normalization will amplify the noise to the maximal amplitude. The same problem occurs if the time series contains only a single value, so its standard deviation is zero and the normalization is undefined.
def znormalization(ts):
    """
    ts - each column of ts is a time series (np.ndarray or pd.DataFrame)
    """
    mus = ts.mean(axis = 0)
    stds = ts.std(axis = 0)
    return (ts - mus) / stds
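As a quick sanity check of the failure mode described above, a minimal sketch (plain NumPy, with the normalization inlined; the signal is my own toy example, not from the original data):

```python
import numpy as np

# A signal that is constant except for one tiny noise spike:
flat = np.zeros(100)
flat[50] = 1e-6  # negligible noise

# z-normalization as defined above: (x - mean) / std
zflat = (flat - flat.mean()) / flat.std()

# The negligible 1e-6 spike now dominates the normalized series,
# illustrating the over-amplification noted by Lin et al.
print(zflat.min(), zflat.max())  # the maximum is close to 10
```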
ts1 = np.asarray([2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34])
ts2 = np.asarray([0.50, 1.29, 2.58, 3.83, 3.25, 4.25, 3.83, 5.63, 6.44, 6.25, 8.75, 8.83, 3.25, 0.75, 0.72])
ts = pd.DataFrame({"ts1": ts1, "ts2": ts2})
ts.plot(style = "-+")
zts = znormalization(ts)
zts.plot(style = "-+")
plt.hlines(0, 0, 14, colors = 'r', linestyles='--')
<matplotlib.collections.LineCollection at 0x4977150>
The Euclidean distance between time series after PAA (scaled by $\sqrt{n/M}$) lower-bounds the Euclidean distance between the original time series.
def paa_transform(ts, n_pieces):
    """
    ts: the columns of which are time series, represented by e.g. np.array
    n_pieces: number of equally sized pieces into which the original ts is split
    """
    splitted = np.array_split(ts, n_pieces)  ## along columns, as we want
    return np.asarray([chunk.mean(axis = 0) for chunk in splitted])
paa5 = paa_transform(zts, 5)
paa5_ext = np.repeat(paa5, 3, axis = 0)  # stretch each segment mean back to 3 points
for i in [0, 1]:
    pl.figure()
    pl.plot(zts.iloc[:, i], '-+', label = "ts%i" % (i + 1))
    pl.plot(paa5_ext[:, i], label = "paa%i" % (i + 1))
    pl.legend(loc = "upper left")
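The lower bounding claim above can be checked numerically. A sketch (variable names and random data are my own) using the standard PAA distance $\sqrt{n/M}\,\lVert \mathrm{paa}(Q) - \mathrm{paa}(C) \rVert$:

```python
import numpy as np

def paa(x, n_pieces):
    # mean of each of the n_pieces equally sized segments
    return np.asarray([c.mean() for c in np.array_split(x, n_pieces)])

rng = np.random.RandomState(0)
n, m = 120, 8  # series length and number of PAA segments (n divisible by m)
q, c = rng.randn(n), rng.randn(n)

d_true = np.sqrt(np.sum((q - c) ** 2))
d_paa = np.sqrt(n / m) * np.sqrt(np.sum((paa(q, m) - paa(c, m)) ** 2))

# The scaled PAA distance never exceeds the true Euclidean distance:
print(d_paa <= d_true)  # True
```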
Symbolic Aggregate approXimation (SAX) transforms original time-series data into symbolic strings.
Symbolic Aggregate approXimation was proposed by Lin et al. and extends the PAA-based approach, inheriting its simplicity and low computational complexity while providing satisfactory sensitivity and selectivity in range-query processing. Moreover, the use of a symbolic representation opens the door to the existing wealth of data structures and string-manipulation algorithms in computer science, such as hashing, regular-expression pattern matching, suffix trees, etc.
SAX transforms a time series $X$ of length $n$ into a string of arbitrary length $\omega$, where typically $\omega \ll n$, using an alphabet $A$ of size $a > 2$. The SAX algorithm consists of two steps: in the first, it transforms the original time series into a PAA representation; in the second, it converts this intermediate representation into a string. Using PAA in the first step brings the advantage of simple and efficient dimensionality reduction while providing the important lower bounding property shown in the previous section. The second step, the actual conversion of PAA coefficients into letters, is also computationally efficient, and the contractive property of the symbolic distance was proven by Lin et al.
Discretization of the PAA representation of a time series into SAX is implemented so that each symbol occurs with equal probability. The extensive and rigorous analysis of various time-series datasets by the original authors has shown that time series normalized to zero mean and unit standard deviation follow the Normal distribution. Using properties of the Gaussian distribution, it is easy to pick equal-probability areas under the Normal curve, looking up the coordinates of the cut lines that slice the area under the Gaussian curve.
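For instance, for a four-letter alphabet the breakpoints are the quartiles of the standard Normal distribution. A minimal sketch, mirroring the lookup used in the implementation below:

```python
import numpy as np
from scipy.stats import norm

a = 4  # alphabet size
# Cut points dividing the area under N(0, 1) into `a` equal-probability parts:
breakpoints = norm.ppf(np.linspace(1.0 / a, 1.0 - 1.0 / a, a - 1))
print(breakpoints)  # approximately [-0.6745, 0.0, 0.6745]
```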
def sax_transform(ts, n_pieces, alphabet):
    """
    ts: columns of which are time series represented by np.array
    n_pieces: number of segments in the PAA transformation
    alphabet: the letters to translate to, e.g. "abcd", "ab"
    returns an np.array with ts's SAX transformation
    Steps:
    1. z-normalize
    2. PAA
    3. find Normal-distribution breakpoints with scipy.stats
    4. convert the PAA transformation into strings
    """
    from scipy.stats import norm
    alphabet_sz = len(alphabet)
    thrholds = norm.ppf(np.linspace(1. / alphabet_sz,
                                    1 - 1. / alphabet_sz,
                                    alphabet_sz - 1))
    def translate(ts_values):
        # map each PAA coefficient to the letter of the interval it falls into
        return np.asarray([(alphabet[0] if ts_value < thrholds[0]
                            else (alphabet[-1] if ts_value > thrholds[-1]
                                  else alphabet[np.where(thrholds <= ts_value)[0][-1] + 1]))
                           for ts_value in ts_values])
    paa_ts = paa_transform(znormalization(ts), n_pieces)
    return np.apply_along_axis(translate, 0, paa_ts)
sax_transform(ts, 9, "abcd")
array([['a', 'a'],
       ['c', 'b'],
       ['d', 'b'],
       ['d', 'c'],
       ['c', 'd'],
       ['b', 'd'],
       ['a', 'b'],
       ['a', 'a'],
       ['a', 'a']], dtype='|S1')