Prediction of future temperature increase

Time series data, like

  • the annual average temperature over a range of years
  • the daily stock exchange rate,

can be predicted by any supervised machine learning algorithm for regression. In this example the increase of the annual average temperature is predicted using a Support Vector Machine for Regression (SVR).

In [1]:
import csv
import numpy as np
from matplotlib import pyplot as plt
from sklearn import svm
np.set_printoptions(precision=3,suppress=True)

Read time series data from file

The temperature data file contains the average annual temperature of all German counties from year 1811 to 2011. Each column in the data file, except the first, refers to a single county, each row refers to a year. The first column contains the years. The first row (header row) contains the county names.

In [2]:
# read data from file########################################################
yearlist=[]
count=0
columnlist=[1]
columnlist.extend(range(3,20))
with open('./Res/temperaturEntwicklungDeutschland.csv', 'r') as csvfile:
    reader=csv.reader(csvfile, delimiter=';')
    for row in reader:
        if count==0:
            header=row
        else:
            yearlist.extend(row) 
        count+=1
yearArray=np.array(yearlist).reshape((-1,18)).astype(float)
print header
print yearArray
###################################################################################################################      
['', 'Hamburg', 'Bremen', 'Berlin', 'Schleswig-Holstein', 'Niedersachsen', 'Nordrhein-Westfalen', 'Rheinland-Pfalz', 'Saarland', 'Baden-Wuerttemberg', 'Hessen', 'Bayern', 'Mecklenburg-Vorpommern', 'Brandenburg', 'Sachen-Anhalt', 'Sachsen', 'Thueringen', 'Deutschland']
[[ 1881.        7.342     7.699 ...,     6.754     6.675     7.335]
 [ 1882.        8.9       9.167 ...,     8.162     7.778     8.366]
 [ 1883.        8.388     8.579 ...,     7.5       7.312     7.91 ]
 ..., 
 [ 2009.        9.712     9.969 ...,     8.911     8.581     9.185]
 [ 2010.        8.127     8.304 ...,     7.455     7.192     7.854]
 [ 2011.        9.9      10.1   ...,     9.4       9.1       9.6  ]]

Next, the county for which the temperature shall be analysed is defined, and the column index of this county is determined. The average annual temperature of the selected county is plotted in the figure below.

In [3]:
county="Baden-Wuerttemberg"   # select the county for which the annual temperature shall be analyzed
idx= header.index(county)
plt.figure(figsize=(12, 10))
plt.plot(yearArray[:,0],yearArray[:,idx],label="annual Temp.")
plt.title("Annual temperature in %s"%county)
plt.hold(True)

Smooth time series data

Since the annual variation of the temperatur is quite large, the entire time series data is first smoothed, using a Gaussian filter of configurable length and variance. The filter operation is implemented in the following function.

In [4]:
def smoothGaussian(list,degree=5,s2=2): 
    """this function returns the gaussian smoothed timeseries data
    degree:    length of filter (in one direction)
    s2:        variance of the gaussian function  
    """ 
    window=degree*2-1  
    weight=np.ones(window)
    gaussweights=np.array([1.0/np.sqrt(2*np.pi*s2)*np.exp(-0.5/s2*x**2) for x in np.arange(-degree+1,degree)])
    #norm=np.sum(gaussweights)
    smoothed=np.array([np.sum(gaussweights*list[i:i+window]) for i in range(list.shape[0]-window)])    
    return smoothed 

Next, the implemented Gaussian filter is applied to the temperature data of the selected county. The smoothed temperature is plotted in the figure below.

In [5]:
smoothDeg=5 # length of Gaussian smoothing function in one direction
smoothedData=smoothGaussian(yearArray[:,idx],smoothDeg)
plt.figure(figsize=(12, 10))
plt.plot(yearArray[:,0],yearArray[:,idx],'b',label="annual Temp.")
plt.plot(yearArray[smoothDeg-1:-smoothDeg,0],smoothedData,'g',label="smoothed")
plt.legend(loc=2)
plt.grid(True)

Create Feature Matrix

The task is to predict future temperature from the temperature of previous years. First the number of previous years that shall be taken into account for the prediction must be fixed. In this example we choose this required history-length to be $IL=10$. Then the serial data must be rearranged such that each sequence of temperatures of $10$ successive years constitutes one feature vector and the temperature of the first year after the last year in the feature vector is the target value. This construction of the feature matrix and the corresponding target values is implemented in the following function.

In [6]:
def createFeatureMatrix(data,influenceLength=5):
    """This function creates the feature-matrix from the timeseries data.
    Each row in the calculated matrix contains influenceLength number of
    successive temperatures. From these temperatures in one row the temperature of 
    the following year is estimated
    """
    numSamples=data.shape[0]-influenceLength
    features=np.zeros((numSamples,influenceLength))
    targets=np.zeros(numSamples)
    for row in range(numSamples):
        features[row,:]=data[row:row+influenceLength]
        targets[row]=data[row+influenceLength]
    return features,targets

The function for the generation of the feature matrix and the target vector is invoked and the returned data structures are printed.

In [7]:
IL=10        # Influence length (=number of features), i.e. the number
            # of previous years used to estimate the temperature of next year
X,y=createFeatureMatrix(smoothedData,IL)
numS=X.shape[0]
print '-'*10+'Feature Matrix'+'-'*10
print X.round(2)
print '-'*10+'Target values'+'-'*10
print y.round(2)
----------Feature Matrix----------
[[ 7.83  7.57  7.26 ...,  7.59  7.74  7.73]
 [ 7.57  7.26  7.07 ...,  7.74  7.73  7.68]
 [ 7.26  7.07  7.04 ...,  7.73  7.68  7.77]
 ..., 
 [ 8.91  8.7   8.53 ...,  9.31  9.28  9.17]
 [ 8.7   8.53  8.6  ...,  9.28  9.17  9.02]
 [ 8.53  8.6   8.85 ...,  9.17  9.02  8.97]]
----------Target values----------
[ 7.68  7.77  8.    8.21  8.24  8.1   7.92  7.88  7.97  8.04  8.    7.89
  7.74  7.64  7.7   7.92  8.1   8.13  8.09  8.05  8.01  7.94  7.87  7.88
  7.98  8.11  8.13  8.01  7.9   7.91  8.03  8.17  8.23  8.18  8.07  7.94
  7.82  7.81  7.95  8.14  8.24  8.25  8.18  7.99  7.69  7.44  7.45  7.71
  8.03  8.25  8.39  8.51  8.62  8.69  8.67  8.56  8.4   8.22  7.99  7.72
  7.53  7.57  7.84  8.17  8.37  8.37  8.13  7.79  7.62  7.69  7.88  8.05
  8.08  7.98  7.86  7.81  7.82  7.88  8.04  8.24  8.37  8.35  8.2   8.01
  7.9   7.94  8.12  8.27  8.22  8.    7.83  7.88  8.15  8.5   8.75  8.81
  8.8   8.85  8.94  8.91  8.7   8.53  8.6   8.85  9.1   9.25  9.31  9.28
  9.17  9.02  8.97  9.05]

Applying a SVR for prediction of future temperature

From the generated feature matrix and target vector the first 110 instances are used for training a SVR with RBF-kernel.

In [8]:
#Select subset of samples for training ####################
numTrain=110
Xtrain=X[:numTrain,:]
ytrain=y[:numTrain]
##########################################################

A SVR object is instanciated and trained. In a first step the output of the trained model when applied to the feature vectors of the training data set is calculated. Then the trained model is applied to predict the temperature of the future 9 years. Note how the feature vector is constructed for this prediction: The prediction for the first future year is used as feature for predicting the temperature in the second year. The predictions of the first 2 future years are used as features in the input for the predictionn of the third year and so on.

In [9]:
# create SVR-Object, train it and apply it on training data ##################
regressor=svm.SVR(C=10,coef0=1.0)
#regressor.fit(X,y)
regressor.fit(Xtrain,ytrain)
p=regressor.predict(Xtrain)

plt.figure(figsize=(12, 10))
plt.plot(yearArray[:,0],yearArray[:,idx],'b',label="annual Temp.")
plt.plot(yearArray[smoothDeg-1:-smoothDeg,0],smoothedData,'g',label="smoothed")
plt.plot(yearArray[smoothDeg-1+IL:smoothDeg-1+IL+numTrain,0],p,'or',label="prediction training")
mad=np.sum(np.abs(p-ytrain))/ytrain.shape[0]
print "Mean Absolute Difference on training data = %2.2f"%mad
#############################################################################

# apply trained SVR on new data #############################################
testin=X[numTrain,:]
print "\nPrediction of average temperature for future years"
preds=[]
for s in range(numS-numTrain+smoothDeg):
    pred=regressor.predict(testin)
    preds.extend(pred)
    print "Input feature vector: %s   prediction: %2.2f"%(np.array(testin).round(2),pred)
    testin=[testin[i+1] for i in range(0,IL-1)]
    #print testin
    testin.extend(pred)
#print y[numTrain:]
#print preds
testSamp=len(preds)
mad=np.sum(np.abs(preds-y[-testSamp:]))/testSamp
print "Mean Absolute Difference on test data = %2.2f"%mad
plt.plot(yearArray[smoothDeg-1+IL+numTrain:,0],preds,'m',label="predicted test")
plt.legend(loc=2)
Mean Absolute Difference on training data = 0.06

Prediction of average temperature for future years
Input feature vector: [ 8.7   8.53  8.6   8.85  9.1   9.25  9.31  9.28  9.17  9.02]   prediction: 9.00
Input feature vector: [ 8.53  8.6   8.85  9.1   9.25  9.31  9.28  9.17  9.02  9.  ]   prediction: 9.04
Input feature vector: [ 8.6   8.85  9.1   9.25  9.31  9.28  9.17  9.02  9.    9.04]   prediction: 9.11
Input feature vector: [ 8.85  9.1   9.25  9.31  9.28  9.17  9.02  9.    9.04  9.11]   prediction: 9.16
Input feature vector: [ 9.1   9.25  9.31  9.28  9.17  9.02  9.    9.04  9.11  9.16]   prediction: 9.17
Input feature vector: [ 9.25  9.31  9.28  9.17  9.02  9.    9.04  9.11  9.16  9.17]   prediction: 9.17
Input feature vector: [ 9.31  9.28  9.17  9.02  9.    9.04  9.11  9.16  9.17  9.17]   prediction: 9.18
Mean Absolute Difference on test data = 0.17
Out[9]:
<matplotlib.legend.Legend at 0xc8efab0>