import numpy as np
import pandas as pd
import pylab as pl
%pylab inline --no-import-all
Populating the interactive namespace from numpy and matplotlib
According to most recent works, this type of time-series normalization is the best-known transformation of a raw time series that preserves its original features. Nevertheless, the article by Lin et al. explains the reasons and remedies for some cases in which zero-mean, unit-standard-deviation normalization fails. For example, if a signal is constant over most of the time span with minor noise in short intervals, this normalization will amplify the noise to the maximal amplitude. The same problem occurs if the time series contains only a single value, so its standard deviation is zero and the normalization is undefined.
def znormalization(ts):
    """
    ts - each column of ts is a time series (np.ndarray or pd.DataFrame)
    """
    mus = ts.mean(axis = 0)
    stds = ts.std(axis = 0)
    return (ts - mus) / stds
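As a quick sanity check of the failure mode described above, a minimal sketch (plain NumPy, with the normalization inlined; the signal is my own toy example, not from the original data):

```python
import numpy as np

# A signal that is constant except for one tiny noise spike:
flat = np.zeros(100)
flat[50] = 1e-6  # negligible noise

# z-normalization as defined above: (x - mean) / std
zflat = (flat - flat.mean()) / flat.std()

# The negligible 1e-6 spike now dominates the normalized series,
# illustrating the over-amplification noted by Lin et al.
print(zflat.min(), zflat.max())  # the maximum is close to 10
```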
ts1 = np.asarray([2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34])
ts2 = np.asarray([0.50, 1.29, 2.58, 3.83, 3.25, 4.25, 3.83, 5.63, 6.44, 6.25, 8.75, 8.83, 3.25, 0.75, 0.72])
ts = pd.DataFrame({"ts1": ts1, "ts2": ts2})
ts.plot(style = "-+")
zts = znormalization(ts)
zts.plot(style = "-+")
plt.hlines(0, 0, 14, colors = 'r', linestyles='--')
<matplotlib.collections.LineCollection at 0x4977150>
The Euclidean distance between time series after PAA (scaled by $\sqrt{n/M}$) lower-bounds the Euclidean distance between the original time series.
def paa_transform(ts, n_pieces):
    """
    ts: the columns of which are time series, represented by e.g. np.array
    n_pieces: number of equally sized pieces into which the original ts is split
    """
    splitted = np.array_split(ts, n_pieces)  ## along columns, as we want
    return np.asarray([chunk.mean(axis = 0) for chunk in splitted])
paa5 = paa_transform(zts, 5)
paa5_ext = np.repeat(paa5, 3, axis = 0)  # stretch each segment mean back to 3 points
for i in [0, 1]:
    pl.figure()
    pl.plot(zts.iloc[:, i], '-+', label = "ts%i" % (i + 1))
    pl.plot(paa5_ext[:, i], label = "paa%i" % (i + 1))
    pl.legend(loc = "upper left")
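The lower bounding claim above can be checked numerically. A sketch (variable names and random data are my own) using the standard PAA distance $\sqrt{n/M}\,\lVert \mathrm{paa}(Q) - \mathrm{paa}(C) \rVert$:

```python
import numpy as np

def paa(x, n_pieces):
    # mean of each of the n_pieces equally sized segments
    return np.asarray([c.mean() for c in np.array_split(x, n_pieces)])

rng = np.random.RandomState(0)
n, m = 120, 8  # series length and number of PAA segments (n divisible by m)
q, c = rng.randn(n), rng.randn(n)

d_true = np.sqrt(np.sum((q - c) ** 2))
d_paa = np.sqrt(n / m) * np.sqrt(np.sum((paa(q, m) - paa(c, m)) ** 2))

# The scaled PAA distance never exceeds the true Euclidean distance:
print(d_paa <= d_true)  # True
```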
Symbolic Aggregate approXimation (SAX) transforms original time-series data into symbolic strings.
Symbolic Aggregate approXimation was proposed by Lin et al. and extends the PAA-based approach, inheriting its simplicity and low computational complexity while providing satisfactory sensitivity and selectivity in range-query processing. Moreover, the use of a symbolic representation opens the door to the existing wealth of data structures and string-manipulation algorithms in computer science, such as hashing, regular-expression pattern matching, suffix trees, etc.
SAX transforms a time series $X$ of length $n$ into a string of arbitrary length $\omega$, where typically $\omega \ll n$, using an alphabet $A$ of size $a > 2$. The SAX algorithm consists of two steps: in the first, it transforms the original time series into a PAA representation; in the second, it converts this intermediate representation into a string. Using PAA in the first step brings the advantage of simple and efficient dimensionality reduction while providing the important lower bounding property shown in the previous section. The second step, the actual conversion of PAA coefficients into letters, is also computationally efficient, and the contractive property of the symbolic distance was proven by Lin et al.
Discretization of the PAA representation of a time series into SAX is implemented so that each symbol occurs with equal probability. The extensive and rigorous analysis of various time-series datasets by the original authors has shown that time series normalized to zero mean and unit standard deviation follow the Normal distribution. Using properties of the Gaussian distribution, it is easy to pick equal-probability areas under the Normal curve, looking up the coordinates of the cut lines that slice the area under the Gaussian curve.
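For instance, for a four-letter alphabet the breakpoints are the quartiles of the standard Normal distribution. A minimal sketch, mirroring the lookup used in the implementation below:

```python
import numpy as np
from scipy.stats import norm

a = 4  # alphabet size
# Cut points dividing the area under N(0, 1) into `a` equal-probability parts:
breakpoints = norm.ppf(np.linspace(1.0 / a, 1.0 - 1.0 / a, a - 1))
print(breakpoints)  # approximately [-0.6745, 0.0, 0.6745]
```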
def sax_transform(ts, n_pieces, alphabet):
    """
    ts: columns of which are time series represented by np.array
    n_pieces: number of segments in the PAA transformation
    alphabet: the letters to translate to, e.g. "abcd", "ab"
    returns an np.array with ts's SAX transformation
    Steps:
    1. z-normalize
    2. PAA
    3. find Normal-distribution breakpoints with scipy.stats
    4. convert the PAA transformation into strings
    """
    from scipy.stats import norm
    alphabet_sz = len(alphabet)
    thrholds = norm.ppf(np.linspace(1. / alphabet_sz,
                                    1 - 1. / alphabet_sz,
                                    alphabet_sz - 1))
    def translate(ts_values):
        # map each PAA coefficient to the letter of the interval it falls into
        return np.asarray([(alphabet[0] if ts_value < thrholds[0]
                            else (alphabet[-1] if ts_value > thrholds[-1]
                                  else alphabet[np.where(thrholds <= ts_value)[0][-1] + 1]))
                           for ts_value in ts_values])
    paa_ts = paa_transform(znormalization(ts), n_pieces)
    return np.apply_along_axis(translate, 0, paa_ts)
sax_transform(ts, 9, "abcd")
array([['a', 'a'],
       ['c', 'b'],
       ['d', 'b'],
       ['d', 'c'],
       ['c', 'd'],
       ['b', 'd'],
       ['a', 'b'],
       ['a', 'a'],
       ['a', 'a']], dtype='|S1')