*how do we measure the similarity between two time series*?

The Euclidean distance between two time series $Q$ and $C$ of length $n$ is defined as

$$d(Q,C) = \sqrt{\sum^n_{i=1}[Q(i)-C(i)]^2}$$

At first glance, it seems like simply calculating the Euclidean distance between two time series would give us a good idea of the similarity between them. After all, the Euclidean distance between identical time series is zero and the Euclidean distance between very different time series is large. However, before we settle on Euclidean distance as a similarity measure, we should clearly state our desired criteria for determining the similarity between two time series.

In [34]:

```
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
x=np.linspace(0,50,100)
ts1=pd.Series(3.1*np.sin(x/1.5)+3.5)
ts2=pd.Series(2.2*np.sin(x/3.5+2.4)+3.2)
ts3=pd.Series(0.04*x+3.0)
ts1.plot()
ts2.plot()
ts3.plot()
plt.ylim(-2,10)
plt.legend(['ts1','ts2','ts3'])
plt.show()
```

In [35]:

```
def euclid_dist(t1,t2):
    # Square root of the sum of squared pointwise differences
    return np.sqrt(sum((t1-t2)**2))
```

Let's now find the Euclidean distance between $ts1$ and $ts2$

In [36]:

```
print(euclid_dist(ts1,ts2))
```

and the Euclidean distance between $ts1$ and $ts3$

In [37]:

```
print(euclid_dist(ts1,ts3))
```

In [ ]:

```
def DTWDistance(s1, s2):
    DTW={}
    # Boundary cells: nothing can be aligned with an empty prefix
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist= (s1[i]-s2[j])**2
            # Extend the cheapest of the three neighbouring warping paths
            DTW[(i, j)] = dist + min(DTW[(i-1, j)],DTW[(i, j-1)], DTW[(i-1, j-1)])
    return np.sqrt(DTW[len(s1)-1, len(s2)-1])
```

Now let's compute the distance between $ts1$ and $ts2$ using dynamic time warping.

In [227]:

```
print(DTWDistance(ts1,ts2))
```

and now the dynamic time warping distance between $ts1$ and $ts3$

In [228]:

```
print(DTWDistance(ts1,ts3))
```
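As a quick sanity check of the idea, the sketch below compares the two measures on a pair of short, hand-made sequences that contain the same bump shifted by one step. The function `dtw_distance` and the lists `a` and `b` are illustrative names that are not part of the notebook; the recurrence is the same as in `DTWDistance`, written with a NumPy array instead of a dictionary.

```python
import numpy as np

def dtw_distance(s1, s2):
    """Unconstrained DTW with squared pointwise cost (illustrative sketch)."""
    n, m = len(s1), len(s2)
    # cost[i, j] holds the best cumulative cost of aligning s1[:i] with s2[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (s1[i - 1] - s2[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

# Same bump, shifted by one time step (made-up data)
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 1, 2, 1, 0]

euclid = float(np.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))
print(euclid)              # 2.0: the pointwise comparison penalises the shift
print(dtw_distance(a, b))  # 0.0: warping absorbs the shift
```

The Euclidean distance treats the shifted bump as a mismatch at four positions, while the warping path can repeat the flat samples at either end and line the bumps up exactly.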

In [1]:

```
def DTWDistance(s1, s2,w):
    DTW={}
    # The window must be at least as wide as the difference in lengths
    w = max(w, abs(len(s1)-len(s2)))
    for i in range(-1,len(s1)):
        for j in range(-1,len(s2)):
            DTW[(i, j)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        # Only visit cells with |i-j| <= w (the +1 makes j = i+w inclusive)
        for j in range(max(0, i-w), min(len(s2), i+w+1)):
            dist= (s1[i]-s2[j])**2
            DTW[(i, j)] = dist + min(DTW[(i-1, j)],DTW[(i, j-1)], DTW[(i-1, j-1)])
    return np.sqrt(DTW[len(s1)-1, len(s2)-1])
```

Let's test this faster version.

In [196]:

```
print(DTWDistance(ts1,ts2,10))
```

In [197]:

```
print(DTWDistance(ts1,ts3,10))
```
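To see what the window parameter does, it may help to run a windowed variant on a tiny example. The sketch below assumes the common convention that only cells with $|i-j| \leq w$ are visited; `dtw_window`, `a` and `b` are illustrative names and made-up data, not part of the notebook.

```python
import numpy as np

def dtw_window(s1, s2, w):
    """DTW restricted to cells with |i - j| <= w (illustrative sketch)."""
    n, m = len(s1), len(s2)
    w = max(w, abs(n - m))  # the band must at least cover the length difference
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (s1[i - 1] - s2[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

# Same bump, shifted by one time step (made-up data)
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 1, 2, 1, 0]

print(dtw_window(a, b, 0))  # 2.0: only the diagonal is allowed, i.e. the Euclidean distance
print(dtw_window(a, b, 1))  # 0.0: a one-step shift can be warped away
```

With `w = 0` the measure collapses to the Euclidean distance for equal-length series, and widening the window trades speed for more freedom to warp.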

*LB Keogh* lower bound of dynamic time warping. It is defined as

$$LBKeogh(Q,C)=\sqrt{\sum_{i=1}^n \left[(c_i-U_i)^2I(c_i > U_i)+(c_i-L_i)^2I(c_i < L_i)\right]}$$

where $U_i$ and $L_i$ are upper and lower bounds for time series $Q$, defined as $U_i=\max(q_{i-r}:q_{i+r})$ and $L_i=\min(q_{i-r}:q_{i+r})$ for a reach $r$, and $I(\cdot)$ is the indicator function. It can be implemented with the following function.

In [4]:

```
def LB_Keogh(s1,s2,r):
    LB_sum=0
    for ind,i in enumerate(s1):
        # Envelope of s2 over the window [ind-r, ind+r], clipped at the start
        window=s2[(ind-r if ind-r>=0 else 0):(ind+r+1)]
        lower_bound=min(window)
        upper_bound=max(window)
        # Only points outside the envelope contribute to the bound
        if i>upper_bound:
            LB_sum=LB_sum+(i-upper_bound)**2
        elif i<lower_bound:
            LB_sum=LB_sum+(i-lower_bound)**2
    return np.sqrt(LB_sum)
```

Let's now test on $ts1$ and $ts2$

In [229]:

```
print(LB_Keogh(ts1,ts2,20))
```

and now $ts1$ and $ts3$.

In [230]:

```
print(LB_Keogh(ts1,ts3,20))
```
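The envelope in the definition is easier to picture on a small example. The sketch below computes $U$ and $L$ for a five-point series with reach $r=1$ and then measures how far a second series falls outside that envelope. The helper names `envelope` and `lb_keogh` and the series `q` and `c` are made up for illustration, and the window is taken to include both end points $i-r$ and $i+r$.

```python
import numpy as np

def envelope(q, r):
    """Upper and lower envelopes U, L of q for reach r (illustrative sketch)."""
    q = np.asarray(q, dtype=float)
    n = len(q)
    upper = np.empty(n)
    lower = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)  # clip the window at the edges
        upper[i] = q[lo:hi].max()
        lower[i] = q[lo:hi].min()
    return upper, lower

def lb_keogh(c, q, r):
    """Distance from c to the envelope of q; zero where c lies inside it."""
    c = np.asarray(c, dtype=float)
    upper, lower = envelope(q, r)
    above = np.clip(c - upper, 0, None)  # positive only where c_i > U_i
    below = np.clip(lower - c, 0, None)  # positive only where c_i < L_i
    return float(np.sqrt(np.sum(above ** 2 + below ** 2)))

q = [0, 1, 2, 1, 0]
c = [0, 0, 0, 3, 3]
U, L = envelope(q, 1)
print(U)  # [1. 2. 2. 2. 1.]
print(L)  # [0. 0. 1. 0. 0.]
print(lb_keogh(c, q, 1))  # sqrt(1 + 1 + 4) = sqrt(6)
```

Only the last three points of `c` leave the envelope (by 1, 1 and 2 respectively), so the bound is $\sqrt{6}$; a series that stays inside the envelope everywhere gets a bound of zero.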

The *LB Keogh* lower bound method is linear whereas dynamic time warping is quadratic in complexity, which makes it very advantageous for searching over large sets of time series.

*LB Keogh* lower bound. Computing *LB Keogh* is much less expensive than performing dynamic time warping. And since $LB Keogh(Q,C) \leq DTW(Q,C)$, we can eliminate time series that cannot possibly be more similar than the current most similar time series. In this way we are eliminating many unnecessary dynamic time warping computations.

In [2]:

```
from sklearn.metrics import classification_report
def knn(train,test,w):
    preds=[]
    for ind,i in enumerate(test):
        min_dist=float('inf')
        closest_seq=[]
        #print ind
        for j in train:
            # Run the expensive DTW only if the cheap lower bound cannot rule j out
            if LB_Keogh(i[:-1],j[:-1],5)<min_dist:
                dist=DTWDistance(i[:-1],j[:-1],w)
                if dist<min_dist:
                    min_dist=dist
                    closest_seq=j
        # The last column of each row holds the class label
        preds.append(closest_seq[-1])
    return classification_report(test[:,-1],preds)
```

*LB Keogh* bound and the dynamic time warping locality constraint, it may still take a few minutes to run.

In [6]:

```
train = np.genfromtxt('datasets/train.csv', delimiter='\t')
test = np.genfromtxt('datasets/test.csv', delimiter='\t')
print(knn(train,test,4))
```
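The effect of the pruning test can be checked on synthetic data by counting how many full dynamic time warping computations survive it. The sketch below is self-contained and uses hypothetical helpers (`dtw`, `lb_keogh`, `nearest`) and made-up sine-wave candidates rather than the notebook's functions and data. It assumes a band of $|i-j| \leq w$ and uses the same value for the lower-bound reach, so the bound is valid for the windowed distance.

```python
import numpy as np

def dtw(s1, s2, w):
    """Windowed DTW over cells with |i - j| <= w (illustrative sketch)."""
    n, m = len(s1), len(s2)
    w = max(w, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (s1[i - 1] - s2[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

def lb_keogh(c, q, r):
    """Lower bound: distance from c to the reach-r envelope of q."""
    total = 0.0
    for i, ci in enumerate(c):
        window = q[max(0, i - r):i + r + 1]
        hi, lo = window.max(), window.min()
        if ci > hi:
            total += (ci - hi) ** 2
        elif ci < lo:
            total += (ci - lo) ** 2
    return float(np.sqrt(total))

def nearest(query, candidates, w):
    """Return (index, distance, number of DTW calls) for the closest candidate."""
    best, best_idx, calls = float('inf'), -1, 0
    for idx, cand in enumerate(candidates):
        if lb_keogh(query, cand, w) < best:  # the bound cannot rule this one out
            calls += 1
            dist = dtw(query, cand, w)
            if dist < best:
                best, best_idx = dist, idx
    return best_idx, best, calls

t = np.linspace(0, 2 * np.pi, 40)
# Synthetic candidates: the same sine wave at eight vertical offsets (made-up data)
candidates = [np.sin(t) + off for off in range(8)]
query = np.sin(t + 0.2)

idx, dist, calls = nearest(query, candidates, 3)
# Brute-force search without pruning, for comparison
brute = min(range(len(candidates)), key=lambda k: dtw(query, candidates[k], 3))
print(idx, brute)              # both searches pick the same candidate
print(calls, len(candidates))  # DTW calls actually made vs. number of candidates
```

Because the bound never exceeds the true distance, pruning does not change which candidate is returned; it only reduces the number of expensive computations.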

*a priori* and similar time series are clustered together.

In [7]:

```
import random
def k_means_clust(data,num_clust,num_iter,w=5):
    # Pick the initial centroids at random from the data
    centroids=random.sample(list(data),num_clust)
    counter=0
    for n in range(num_iter):
        counter+=1
        print(counter)
        assignments={}
        #assign data points to clusters
        for ind,i in enumerate(data):
            min_dist=float('inf')
            closest_clust=None
            for c_ind,j in enumerate(centroids):
                if LB_Keogh(i,j,5)<min_dist:
                    cur_dist=DTWDistance(i,j,w)
                    if cur_dist<min_dist:
                        min_dist=cur_dist
                        closest_clust=c_ind
            # Record the point, creating the cluster's list on first use
            assignments.setdefault(closest_clust,[]).append(ind)
        #recalculate centroids of clusters
        for key in assignments:
            clust_sum=0
            for k in assignments[key]:
                clust_sum=clust_sum+data[k]
            centroids[key]=[m/len(assignments[key]) for m in clust_sum]
    return centroids
```

Let's test it on the entire data set (i.e. the training set and the test set stacked together).

In [8]:

```
train = np.genfromtxt('datasets/train.csv', delimiter='\t')
test = np.genfromtxt('datasets/test.csv', delimiter='\t')
data=np.vstack((train[:,:-1],test[:,:-1]))
import matplotlib.pylab as plt
centroids=k_means_clust(data,4,10,4)
for i in centroids:
    plt.plot(i)
plt.show()
```
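The centroid update inside `k_means_clust` is a pointwise average of the series assigned to each cluster. A minimal sketch of that one step on made-up data is shown below; the toy `data` array and `assignments` dictionary are illustrative values, and `np.mean` over the assigned rows stands in for the explicit summation loop.

```python
import numpy as np

# Three toy series of length 3 (made-up data)
data = np.array([[0.0, 1.0, 2.0],
                 [2.0, 3.0, 4.0],
                 [10.0, 10.0, 10.0]])
# Cluster index -> indices of the series assigned to it
assignments = {0: [0, 1], 1: [2]}

# Each centroid is the pointwise mean of its members
centroids = {key: data[members].mean(axis=0) for key, members in assignments.items()}
print(centroids[0])  # [1. 2. 3.]
print(centroids[1])  # [10. 10. 10.]
```

Note that a plain pointwise mean ignores any warping between the members, so it is only one of several possible ways to summarise a cluster of time series.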