So Martin and I (OK, mainly me. I'm "special".) thought it would be a wonderful idea to go through some concepts in time-series analysis.

1: Determining the correlation strength between two 1D timeseries using cross-correlation.

2: Assessing the degree to which two 2D timeseries are related using EOF (or SVD) decomposition.

Applying these ideas, we hope to show how to quantify the strength of a correlation and how to investigate possible mechanistic causes of similar patterns. These methods are ideal for model validation, and we will touch on both 1D and 2D examples, since these need to be treated quite differently. In a nutshell, what we hope you can take away from this little introduction is that accessible methods exist which can boost the impact of your work.

We've chosen to use Python to present these ideas, but they are transferable to other languages such as Octave, R, MATLAB or Mathematica.

In [4]:

```
import numpy as np
import matplotlib.pyplot as pyp
import pandas as pnd
from IPython.display import Image
import scipy.sparse.linalg as sp
import h5py
from IPython.display import Math
```

In [290]:

```
# Two sinusoidal timeseries sharing a common component, plus differing noisy components
a = np.sin(np.arange(1, 100, 0.1)/np.pi*3.5) + np.sin(np.arange(1, 100, 0.1)/2.5)*np.random.rand(990)*1.7
b = np.sin(np.arange(1, 100, 0.1)/np.pi*3.5) + np.sin(np.arange(1, 100, 0.1)/7)*np.random.rand(990)*1.5
```

In [291]:

```
pyp.plot(a, 'r', label='a')
pyp.plot(b, 'k', label='b')
pyp.legend()
pyp.title('Two timeseries')
pyp.xlabel('Time')
```

Out[291]:

Now that we have some data, let's look at the correlation between the two:

In [292]:

```
a = pnd.Series(a)  # Turn the timeseries into a "Series" to use the pandas library
b = pnd.Series(b)
correlation = a.corr(b)
print("The correlation between a and b is:", correlation)
```

However, the value of the correlation alone does not tell us how significant the link between the two timeseries is. Reporting only the correlation coefficient is therefore fairly meaningless without also quoting its level of significance. This is because both timeseries could be coupled tightly to a third process, which swamps the independent variability the two timeseries actually have in common. An analogy here is a solar cycle, easily imagined given the sinusoidal nature of timeseries a and b: the "sun" drowns out the independent variability in each timeseries, so it is difficult to say whether a and b themselves correlate significantly.

To assess this independence, we determine the "degrees of freedom". This is a measure of the number of independent datapoints, or points that are free to vary. We assess this by looking at the cross-correlation (xcorr), which tells us how similar the two timeseries are as a function of the time-lag between them:

In [293]:

```
k = pyp.xcorr(a, b, maxlags=200)
pyp.title('xcorr of timeseries a and b')
pyp.xlabel('Lag')
pyp.ylabel('Normalised correlation strength')
```

Out[293]:

In [294]:

```
lags, corr = k[0], k[1]
# Lags at which the normalised cross-correlation has dropped to (near) zero
zero_lags = lags[np.abs(corr) < 0.03]
# The smallest such |lag| gives the decorrelation timescale
df = zero_lags[np.abs(zero_lags) == np.abs(zero_lags).min()]
print("The degrees of freedom are:", df)
```
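Given a correlation coefficient and an estimate of the degrees of freedom, one common way to assess significance is a t-test. The sketch below is illustrative, not the exact method behind the figure that follows: the values of `r` and `df` are placeholders standing in for `correlation` and the degrees of freedom computed above.

```python
import numpy as np
from scipy import stats

# Illustrative values: in the notebook, r would be `correlation`
# and df the degrees-of-freedom estimate printed above.
r = 0.6
df = 30

# t-statistic for testing the null hypothesis r = 0
t_stat = r * np.sqrt(df - 2) / np.sqrt(1.0 - r**2)
# two-sided p-value from the t distribution with df - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df - 2)

print("t =", t_stat, "p =", p_value)
```

A small p-value (conventionally below 0.05) would indicate that a correlation this strong is unlikely to arise by chance given the number of independent datapoints.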

In [123]:

```
Image(filename='/noc/users/mjsp106/SignifOfCorrelations.png')
```

Out[123]:

A lot of the time we don't have just one timeseries. Specifically, we may suspect that patterns in our simulations and in real-world data are "similar", but quantifying the similarity is complicated by:

1: The spatial patterns not overlapping perfectly

2: Modes of operation of the natural system outside the research question

3: Determining the degrees of freedom: how many truly independent datapoints exist?

In situations where you have a lot of data, with a huge amount going on in it, the idea is to reduce the "complexity": we want to look at only certain modes of variability.

This type of problem is so common among such a diverse range of fields that you find this type of analysis under other names as well, such as: EOFs, principal components, proper orthogonal decomposition, singular vectors, Karhunen-Loève functions, optimals, etc. These are all basically the same thing; we'll refer to them as EOFs, and use the most general approach, a Singular Value Decomposition (SVD), to get at them.

EOF/SVD analysis is not limited to timeseries analysis, but rather is a hugely versatile tool. Here we present a simple example of its usage, but it is worth looking into further!
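Before turning to the ocean data, here is a minimal synthetic sketch of the idea (the data and variable names are illustrative, not part of the analysis below): an SVD of a small space-by-time matrix recovers a dominant spatial mode and tells us how much of the variance it explains.

```python
import numpy as np

# Synthetic example: 50 spatial points, 100 time steps.
# One dominant spatial pattern modulated in time, plus weak noise.
rng = np.random.RandomState(0)
space = np.linspace(0, 2 * np.pi, 50)
pattern = np.sin(space)                             # the "true" spatial mode
amplitude = np.cos(np.linspace(0, 4 * np.pi, 100))  # its time evolution
M = np.outer(pattern, amplitude) + 0.05 * rng.randn(50, 100)

# Columns of U are the EOFs (spatial patterns), s the singular values,
# rows of Vt the corresponding principal components in time.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# With one dominant mode, the leading singular value should dominate.
variance_fraction = s[0]**2 / np.sum(s**2)
print("Fraction of variance in EOF1:", variance_fraction)
```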

In [5]:

```
f = h5py.File('/noc/msm/scratch/valor2/maike/beringDump/concatenateData/N1/ORCA1_S_T_SSH_SST_SSS.nc')
SSH = f['sossheig']
ssh = np.squeeze(SSH[:])
SST = f['sosstsst']
sst = np.squeeze(SST[:])
# Some land data, just for plotting. Oceanographers tend to politely ignore the land.
land = h5py.File("/noc/msm/scratch/valor2/maike/beringDump/land_N1_global.h5", 'r')
land1 = land['land']
```

In [58]:

```
cclim = np.array([-2, 2])
pyp.imshow(np.flipud(ssh[0, ...]), cmap=pyp.cm.coolwarm)
pyp.clim(cclim)
pyp.colorbar()
pyp.imshow(np.flipud(land1[:]))
pyp.title('Sea Surface Height (m) January 1978')
```

Out[58]:

Suppose we can write our sea surface height data matrix as:

In [310]:

```
Math(r'\mathrm{M} = \mathrm{U} \Sigma \mathrm{V}^{*}')
```

Out[310]:

This factorization is called a singular value decomposition, and has many nifty properties! We'll explore some now...

First off, we need to make life easier for ourselves and collapse the latitude and longitude information into a single vector. This is just a mathematical trick, and does not cost us the spatial information. We then determine the U, Σ and V matrices using the sparse methods in the ARPACK eigensolver (see "Further Notes" for a more detailed explanation and validation of the sparse approach). The matrices U, Σ and V are:

U: An (m × m) unitary matrix, where the m columns are the left-singular vectors of M.

Σ: An (m × n) rectangular diagonal matrix with non-negative real numbers on the diagonal. The diagonal entries of Σ are the singular values of M.

V: An (n × n) unitary matrix, where the n columns are the right-singular vectors of M.
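As a quick sanity check of both the reshape trick and the factorisation, here is a small sketch using a random array as a stand-in for the SSH data (the toy dimensions are illustrative, much smaller than the real (360, 292, 362) array):

```python
import numpy as np

rng = np.random.RandomState(1)
# Toy stand-in for the SSH array: (time, lat, lon) = (10, 4, 6)
field = rng.randn(10, 4, 6)

# Collapse lat/lon into one spatial dimension...
flat = field.reshape(10, 4 * 6)
# ...and recover the original grid exactly: the reshape is lossless.
assert np.array_equal(flat.reshape(10, 4, 6), field)

# SVD of the (space x time) matrix, as in the analysis below
M = flat.T                      # 24 spatial points by 10 time steps
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# U has orthonormal columns, and U @ diag(s) @ Vt reconstructs M
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
assert np.allclose(U @ np.diag(s) @ Vt, M)
print("max reconstruction error:", np.max(np.abs(U @ np.diag(s) @ Vt - M)))
```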

In [6]:

```
# Collapse (lat, lon) into one dimension and keep the k=300 largest singular triplets.
# Note: svds returns the singular values in ascending order.
u_ssh, s_ssh, v_ssh = sp.svds(np.transpose(ssh.reshape((360, 292*362))), k=300)
```

In [61]:

```
# svds returns modes in ascending order, so the last column is the leading EOF
eof1ssh = u_ssh[:, -1].reshape(292, 362)
eof2ssh = u_ssh[:, -2].reshape(292, 362)
eof3ssh = u_ssh[:, -3].reshape(292, 362)
```

In [62]:

```
cclim = np.array([-0.01, 0.01])
pyp.imshow(np.flipud(eof1ssh), cmap=pyp.cm.coolwarm)
pyp.clim(cclim)
pyp.colorbar()
pyp.imshow(np.flipud(land1[:]))
pyp.title('Sea Surface Height EOF1')
pyp.figure()
pyp.imshow(np.flipud(eof2ssh), cmap=pyp.cm.coolwarm)
pyp.clim(cclim)
pyp.colorbar()
pyp.imshow(np.flipud(land1[:]))
pyp.title('Sea Surface Height EOF2')
pyp.figure()
pyp.imshow(np.flipud(eof3ssh), cmap=pyp.cm.coolwarm)
pyp.clim(cclim)
pyp.colorbar()
pyp.imshow(np.flipud(land1[:]))
pyp.title('Sea Surface Height EOF3')
```

Out[62]:

We can now look at what percentage of the variability is explained by the different EOFs:

In [7]:

```
# s_ssh is in ascending order; [:-1] drops the largest (leading) mode,
# and squaring gives the variance associated with each remaining EOF
s2_ssh = s_ssh[:-1]**2
percent_ssh = 100 * s2_ssh / np.sum(s2_ssh)
sumvar_ssh = np.sum(percent_ssh)
```

In [8]:

```
pyp.plot(np.flipud(percent_ssh[259:]))
pyp.plot(np.flipud(percent_ssh[259:]), 'k*')
pyp.title('The first 40 EOFs SSH')
pyp.xlabel('EOF number')
pyp.ylabel('Percentage of variance (%)')
print(sumvar_ssh)
```

In [69]:

```
cclim = np.array([-1, 31])
pyp.imshow(np.flipud(sst[0, ...]), cmap=pyp.cm.coolwarm)
pyp.clim(cclim)
pyp.colorbar()
pyp.imshow(np.flipud(land1[:]))
pyp.title('Sea Surface Temperature (degC) January 1978')
```

Out[69]:

We repeat the SSH analysis with the SST:

In [9]:

```
u_sst, s_sst, v_sst = sp.svds(np.transpose(sst.reshape((360, 292*362))), k=300)
```

In [72]:

```
# As before, the last columns of u_sst are the leading EOFs
eof1sst = u_sst[:, -1].reshape(292, 362)
eof2sst = u_sst[:, -2].reshape(292, 362)
eof3sst = u_sst[:, -3].reshape(292, 362)
```

In [73]:

```
cclim = np.array([-0.01, 0.01])
pyp.imshow(np.flipud(eof1sst), cmap=pyp.cm.coolwarm)
pyp.clim(cclim)
pyp.colorbar()
pyp.imshow(np.flipud(land1[:]))
pyp.title('Sea Surface Temperature EOF1')
pyp.figure()
pyp.imshow(np.flipud(eof2sst), cmap=pyp.cm.coolwarm)
pyp.clim(cclim)
pyp.colorbar()
pyp.imshow(np.flipud(land1[:]))
pyp.title('Sea Surface Temperature EOF2')
pyp.figure()
pyp.imshow(np.flipud(eof3sst), cmap=pyp.cm.coolwarm)
pyp.clim(cclim)
pyp.colorbar()
pyp.imshow(np.flipud(land1[:]))
pyp.title('Sea Surface Temperature EOF3')
```

Out[73]:

In [10]:

```
# Same variance calculation as for SSH
s2_sst = s_sst[:-1]**2
percent_sst = 100 * s2_sst / np.sum(s2_sst)
sumvar_sst = np.sum(percent_sst)
```

In [11]:

```
pyp.plot(np.flipud(percent_sst[259:]))
pyp.plot(np.flipud(percent_sst[259:]), 'k*')
pyp.title('The first 40 EOFs SST')
pyp.xlabel('EOF number')
pyp.ylabel('Percentage of variance (%)')
print(sumvar_sst)
```

In [12]:

```
pyp.plot(np.flipud(percent_sst[259:]), label='SST')
pyp.plot(np.flipud(percent_sst[259:]), 'k*')
pyp.plot(np.flipud(percent_ssh[259:]), label='SSH')
pyp.plot(np.flipud(percent_ssh[259:]), 'k*')
pyp.legend()
pyp.title('The first 40 EOFs SST and SSH')
pyp.xlabel('EOF number')
pyp.ylabel('Percentage of variance (%)')
```

Out[12]:


Due to the numerics involved, EOF calculations via the singular value decomposition can be hugely memory intensive, and thus take a very long time to compute. This becomes problematic when larger datasets need to be handled.

To illustrate, the "gesdd" function the LAPACK library implements uses a workspace O(N^2), assuming M=N for simplicity. However, in this practical we have employed the sparse methods from the ARPACK library. This may seem curious, since we have illustrated that our ssh timeseries is most definitely not sparse. The covariance matrix does looks quite different however, and below we illustrate the reconstruction of our ssh field and it's fit to the original. This is motivated in part by the workspace of ARPACK scaling O(ncv), where ncv are the number of basisvectors used for the Eigenvalue Algorithm. The image below illustrates the substantial difference in memory usage here.

(Thanks to Ian Bush, Edward Smyth (NAG) and Peter Challenor (NOCS). Image from: http://fa.bianp.net/blog/2012/singular-value-decomposition-in-scipy/)

In [395]:

```
Image(filename='/noc/users/mjsp106/svd_memory.png')
```

Out[395]: