In this lesson, you will learn about the following:
from IPython.display import HTML

# Placeholder anchor for the Ajenti admin panel; the script below fills in
# the real address on the client side.
admin_link_markup = """
<a id="admin_link" target="_blank" href="#">Ajenti Administration Interface</a>
<p>User: root<br> Password: admin</p>
"""
# Rewrites the anchor's href in the browser, pointing it at port 8000 of
# whatever hostname is serving this notebook.
href_rewrite_script = """
<script type="text/Javascript">
document.getElementById('admin_link').href = "https://" + window.location.hostname + ":8000"
</script>
"""
HTML(admin_link_markup + href_rewrite_script)
User: root
Password: admin
In classification, we train a predictive model to produce a class or a category. Regression is used when you need a predictive model that produces numeric values instead of classes. For example, you would use a classification algorithm to predict the probability of rain, but use a regression algorithm to predict the temperature.
We will be using data from the Climatic Research Unit (CRU) of the University of East Anglia (UEA).
You can find the dataset on their website:
http://www.cru.uea.ac.uk/cru/data/temperature/
http://www.cru.uea.ac.uk/cru/data/temperature/CRUTEM4-gl.dat
This data was processed using OpenOffice Calc (Similar to Excel).
This file has fixed-width columns, so we use fixed-width columns when opening/importing this data.
Column names are added to row number 1
We add two new columns:
This was calculated using this formula:
This was calculated using this formula:
You should filter out all rows with a value of 1 in that column. This is what you end up with:
Copy your data and paste it into a new file. Then, remove the Odd/Even column and save your file as a CSV file.
The final file is available on your system in this relative path "data/temp_data.csv"
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
# Load the preprocessed CRUTEM4 data (Year, 12 month columns, Average, TSI, CO2, CH4).
# NOTE(review): the prose above says "data/temp_data.csv" -- confirm the
# "_features" file is the intended one.
csv_data = pd.read_csv("data/temp_data_features.csv")
# Scatter + connecting line of the annual average anomaly, colour-coded by value.
plt.figure(figsize=(20,5))
plt.scatter(x=csv_data["Year"], y=csv_data["Average"], marker="o", s=50, c=csv_data["Average"])
plt.plot(csv_data["Year"], csv_data["Average"], label="Annual Global Average Anomaly", alpha=0.4, linewidth=2, c="grey")
# Horizontal line at 0 marks the anomaly baseline.
plt.hlines(0,min(csv_data["Year"])-3,max(csv_data["Year"])+5)
plt.legend(loc="best")
plt.xlim(min(csv_data["Year"])-3, max(csv_data["Year"])+5)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
plt.colorbar()
plt.grid()
plt.show()
# Bar version of the annual series: red bars above zero, blue below.
plt.figure(figsize=(20,5))
plt.bar(
csv_data["Year"],
csv_data["Average"],
width=0.7,
edgecolor="none",
color=(csv_data["Average"]>0).map({True: 'r', False: 'b'}),
label="Annual Average Global Anomaly",
)
# Zero baseline for reference.
plt.hlines(0,min(csv_data["Year"])-3,max(csv_data["Year"])+5)
plt.legend(loc="best")
plt.xlim(min(csv_data["Year"])-3, max(csv_data["Year"])+5)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
plt.grid()
plt.show()
# Final Record is not complete so average of the last year is not reliable
# (the output table below shows 2014 with NaN from May onward, so its
# "Average" covers only four months).
csv_data[-1:]
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Average | TSI | CO2 | CH4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
163 | 2014 | 0.95 | 0.408 | 0.929 | 1.048 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.83375 | NaN | NaN | NaN |
1 rows × 17 columns
# Prepare monthly data: drop the non-month columns in one call (clearer than
# five chained positional-axis drops), flatten row-major so the months stay
# chronological, and drop the NaN placeholders of the incomplete final year.
monthly_temp = csv_data.drop(["Year", "Average", "TSI", "CO2", "CH4"], axis=1)
monthly_temp = pd.Series(np.ravel(monthly_temp)).dropna()
# Fractional-year x positions: 12 samples per year, starting at 1851.
month_index = list((monthly_temp.index/12.) + 1851)
# Scatter plot of every monthly anomaly, colour-coded by value.
plt.figure(figsize=(15,8))
plt.scatter(
x=month_index,
y=monthly_temp,
marker="o",
label="Monthly Average Global Anomaly",
c=monthly_temp,
alpha=0.6
)
plt.colorbar()
plt.legend(loc="lower right")
plt.xlim(min(month_index)-3,max(month_index)+5)
plt.ylim(min(monthly_temp),max(monthly_temp))
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
plt.grid()
plt.show()
# Bar version of the monthly series: red above the zero baseline, blue below.
plt.figure(figsize=(20,5))
plt.bar(month_index, monthly_temp, width=0.1, edgecolor="none", color=(monthly_temp>0).map({True: 'r', False: 'b'}),
label="Monthly Average Global Anomaly")
plt.hlines(0,min(month_index)-1,max(month_index)+1)
plt.legend(loc="best")
plt.xlim(min(month_index)-1, max(month_index)+1)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
plt.grid()
plt.show()
# Annual regression target and features.
annual_temp = csv_data["Average"]
annual_index = list(csv_data["Year"].values)            # scalar years, for plotting
annual_index_feature = list(csv_data[["Year"]].values)  # 2-D [[year], ...] rows as sklearn expects
# Prediction range: every year from the first observation to 10 years past the
# last. Use the scalar year list here -- min()/max() over the 2-D feature rows
# would hand 1-element numpy arrays to range().
prediction_annual_index = [[year] for year in range(int(min(annual_index)), int(max(annual_index)) + 10)]
# Code source: Jaques Grobler
# License: BSD 3 clause
from sklearn import linear_model
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(annual_index_feature, annual_temp)
# The coefficients -- %-formatted print() so the line runs under Python 2 and
# Python 3 alike, consistent with the print() calls below.
print('Coefficients: %s' % regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(annual_index_feature) - annual_temp) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(annual_index_feature, annual_temp))
# Plot outputs: anomaly bars (red above zero, blue below) with the fitted line
# extended 10 years past the data.
plt.figure(figsize=(20,5))
plt.bar(annual_index, annual_temp, width=0.7, edgecolor="none", color=(annual_temp>0).map({True: 'r', False: 'b'}),
        label="Annual Average Global Anomaly", alpha=0.3)
plt.plot(prediction_annual_index[:], regr.predict(prediction_annual_index[:]), color='green',
         linewidth=3, alpha=1.0, label="Linear Regression")
plt.grid()
plt.xlim(np.min(annual_index_feature), np.max(annual_index_feature)+5)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
plt.legend(loc="best")
plt.show()
Coefficients: [ 0.00651672] Residual sum of squares: 0.06 Variance score: 0.60
# Monthly features as 2-D rows; prediction range is the same timestamps
# shifted 5 years forward.
month_index_feature = [[item] for item in month_index]
prediction_month_index = [[item[0] + 5] for item in month_index_feature]
# Code source: Jaques Grobler
# License: BSD 3 clause
from sklearn import linear_model
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(month_index_feature, monthly_temp)
# The coefficients -- %-formatted print() so the line also runs under
# Python 3, consistent with the print() calls below.
print('Coefficients: %s' % regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(month_index_feature) - monthly_temp) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(month_index_feature, monthly_temp))
# Plot outputs
plt.figure(figsize=(20,5))
plt.bar(month_index, monthly_temp, width=0.1, edgecolor="none", color=(monthly_temp>0).map({True: 'r', False: 'b'}),
        label="Monthly Average Global Anomaly", alpha=0.1)
plt.plot(month_index, regr.predict(month_index_feature), color='black',
         linewidth=3, alpha=0.5, label="Linear Regression")
# Show the last 5 years of the shifted range as the out-of-sample prediction.
plt.plot(prediction_month_index[-5*12:], regr.predict(prediction_month_index[-5*12:]), color='green',
         linewidth=3, alpha=1.0, label="Linear Regression Prediction")
plt.grid()
plt.xlim(np.min(month_index_feature), np.max(month_index_feature)+5)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
plt.legend(loc="best")
plt.show()
Coefficients: [ 0.00645693] Residual sum of squares: 0.16 Variance score: 0.37
from sklearn.svm import SVR
# Compare a linear-kernel SVR against three RBF-kernel configurations with
# different C / gamma / epsilon trade-offs, all fitted to the annual series.
regr_linear = SVR(kernel="linear")
regr_rbf_1 = SVR(kernel="rbf", C=100.0, gamma=0.004, epsilon=0.01)
regr_rbf_2 = SVR(kernel="rbf", C=10.0, gamma=0.0001, epsilon=0.01)
regr_rbf_3 = SVR(kernel="rbf", C=1.0, gamma=0.0002, epsilon=0.1)
# Train the model using the training sets
regr_linear.fit(annual_index_feature, annual_temp)
regr_rbf_1.fit(annual_index_feature, annual_temp)
regr_rbf_2.fit(annual_index_feature, annual_temp)
regr_rbf_3.fit(annual_index_feature, annual_temp)
# The coefficients
#print 'Coefficients:', regr.coef_
# The mean square error (reported for the first RBF model only)
print("Residual sum of squares: %.2f"
% np.mean((regr_rbf_1.predict(annual_index_feature) - annual_temp) ** 2))
# Explained variance score: 1 is perfect prediction
print('score1: %.2f' % regr_rbf_1.score(annual_index_feature, annual_temp))
print('score2: %.2f' % regr_rbf_2.score(annual_index_feature, annual_temp))
print('score3: %.2f' % regr_rbf_3.score(annual_index_feature, annual_temp))
# Plot outputs: all four fits over the anomaly bars.
plt.figure(figsize=(20,5))
plt.bar(annual_index, annual_temp, width=0.7, edgecolor="none", color=(annual_temp>0).map({True: 'r', False: 'b'}),
label="Annual Average Global Anomaly", alpha=0.3)
plt.plot(prediction_annual_index[:], regr_linear.predict(prediction_annual_index[:]), color='green',
linewidth=3, alpha=0.5, label="Linear Prediction")
plt.plot(prediction_annual_index[:], regr_rbf_1.predict(prediction_annual_index[:]), color='blue',
linewidth=3, alpha=0.5, label="RBF1 Prediction")
plt.plot(prediction_annual_index[:], regr_rbf_2.predict(prediction_annual_index[:]), color='orange',
linewidth=3, alpha=0.5, label="RBF2 Prediction")
plt.plot(prediction_annual_index[:], regr_rbf_3.predict(prediction_annual_index[:]), color='red',
linewidth=3, alpha=0.5, label="RBF3 Prediction")
plt.grid()
plt.xlim(np.min(annual_index_feature), np.max(annual_index_feature)+10)
plt.xticks(np.arange(np.min(annual_index_feature), np.max(annual_index_feature)+10, 10))
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
plt.legend(loc="best")
plt.show()
Residual sum of squares: 0.02 score1: 0.87 score2: 0.82 score3: 0.82
from sklearn.svm import SVR
# NOTE(review): sklearn.grid_search is the pre-0.18 module path; newer
# releases use sklearn.model_selection.
from sklearn.grid_search import GridSearchCV
# Coarse grid search over C and gamma for an RBF SVR, scored by R^2.
regr_rbf = SVR(kernel="rbf")
C = [100, 10, 1]
gamma = [0.005, 0.004, 0.003, 0.002, 0.001]
epsilon=[0.01]
parameters = {"C":C, "gamma":gamma, "epsilon":epsilon}
gs = GridSearchCV(regr_rbf, parameters, scoring="r2")
gs.fit(annual_index_feature, annual_temp)
# %-formatted print() runs under both Python 2 and 3 (the bare print
# statement used before is Python-2-only).
print("Best Estimator:\n%s" % gs.best_estimator_)
Best Estimator: SVR(C=1, cache_size=200, coef0=0.0, degree=3, epsilon=0.01, gamma=0.001, kernel=rbf, max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
from sklearn.grid_search import GridSearchCV
# Refinement pass: search +/-10% around the best C and gamma found above,
# stepping in 1% increments of each value.
regr_rbf = SVR(kernel="rbf")
C = np.arange(gs.best_estimator_.C * 0.9, gs.best_estimator_.C * 1.1, gs.best_estimator_.C * 0.01)
gamma = np.arange(gs.best_estimator_.gamma * 0.9, gs.best_estimator_.gamma * 1.1, gs.best_estimator_.gamma * 0.01)
parameters = {"C":C, "gamma":gamma}
gs = GridSearchCV(regr_rbf, parameters, scoring="r2")
gs.fit(annual_index_feature, annual_temp)
# Portable print() form (the original bare print statement is Python-2-only).
print("Best Estimator:\n%s" % gs.best_estimator_)
Best Estimator: SVR(C=0.93, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.0009, kernel=rbf, max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
from sklearn.svm import SVR
# Evaluate and plot the tuned estimator from the refinement grid search above.
regr_rbf = gs.best_estimator_
# The coefficients
#print 'Coefficients:', regr.coef_
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr_rbf.predict(annual_index_feature) - annual_temp) ** 2))
# Explained variance score: 1 is perfect prediction
print('score: %.2f' % regr_rbf.score(annual_index_feature, annual_temp))
# Plot outputs
plt.figure(figsize=(20,5))
plt.bar(annual_index, annual_temp, width=0.7, edgecolor="none", color=(annual_temp>0).map({True: 'r', False: 'b'}),
label="Annual Average Global Anomaly", alpha=0.3)
plt.plot(prediction_annual_index[:], regr_rbf.predict(prediction_annual_index[:]), color='black',
linewidth=3, alpha=0.7, label="Best RBF Prediction")
plt.grid()
# Title shows the winning hyper-parameters (repr of the estimator).
plt.title(regr_rbf)
plt.xlim(np.min(annual_index_feature)+1, np.max(annual_index_feature)+10)
plt.xticks(np.arange(np.min(annual_index_feature)-1, np.max(annual_index_feature)+10, 10))
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
plt.legend(loc="best")
plt.show()
Residual sum of squares: 0.02 score: 0.84
There are three features that we can add to the temperature data to build a better predictive model. We will look at greenhouse gases (CO2 and CH4) and solar activity. So we will use the following extra features:
This data is from the following sources:
Dr. Pieter Tans, NOAA/ESRL (www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/).
The Solar Radiation and Climate Experiment (SORCE) is a NASA-sponsored satellite mission that is providing state-of-the-art measurements of incoming x-ray, ultraviolet, visible, near-infrared, and total solar radiation. The measurements provided by SORCE specifically address long-term climate change, natural variability and enhanced climate prediction, and atmospheric ozone and UV-B radiation. These measurements are critical to studies of the Sun; its effect on our Earth system; and its influence on humankind.
import math

def average(x):
    """Return the arithmetic mean of a non-empty sequence as a float.

    Raises ValueError on an empty sequence. (The original used ``assert``,
    which is silently stripped under ``python -O``.)
    """
    if len(x) == 0:
        raise ValueError("average() of an empty sequence")
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    """Return the Pearson correlation coefficient of paired samples x and y.

    Accepts any equal-length indexable sequences (lists or numpy arrays --
    the callers below pass ``.values`` arrays). Raises ValueError on length
    mismatch or empty input instead of using strippable asserts.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n == 0:
        raise ValueError("x and y must be non-empty")
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff
    # Textbook formula: covariance over the product of standard deviations.
    # A zero-variance input divides by zero here, as in the original.
    return diffprod / math.sqrt(xdiff2 * ydiff2)
import scipy as sp
# Overlay the TSI reconstruction on the annual anomaly bars using twin y axes.
plt.figure(figsize=(20,5))
plt.bar(annual_index, annual_temp, width=0.7, edgecolor="none", color=(annual_temp>0).map({True: 'r', False: 'b'}),
        label="Annual Average Global Anomaly", alpha=0.2)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
tsi_ax = plt.twinx()
tsi_ax.plot(csv_data["Year"], csv_data["TSI"], linewidth=3, c="orange", alpha=1.0)
plt.ylabel(u"TSI Reconstruction from IPCC AR5")
plt.legend(loc="best")
plt.grid()
plt.xlim(np.min(annual_index_feature)+1, np.max(annual_index_feature)+10)
plt.show()
# Pearson correlation as a percentage, rounded to 0.1%. Rows where either
# series is NaN are dropped first. print() form runs under Python 2 and 3.
print("Correlation between TSI and Temperature: %s%%" % (round(1000*pearson_def(
    csv_data[["Average","TSI"]].dropna()["Average"].values,
    csv_data[["Average","TSI"]].dropna()["TSI"].values))/10))
/usr/lib/pymodules/python2.7/matplotlib/axes.py:4747: UserWarning: No labeled objects found. Use label='...' kwarg on individual plots. warnings.warn("No labeled objects found. "
Correlation between TSI and Temperature: 35.2%
# Overlay the CO2 series on the anomaly bars (restricted to years where CO2
# data exists) using twin y axes.
plt.figure(figsize=(20,5))
plt.bar(csv_data[["Year", "Average","CO2"]].dropna()["Year"],
        csv_data[["Year", "Average","CO2"]].dropna()["Average"],
        width=0.7, edgecolor="none",
        color=(csv_data[["Year", "Average","CO2"]].dropna()["Average"]>0).map({True: 'r', False: 'b'}),
        label="Annual Average Global Anomaly", alpha=0.2)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
co2_ax = plt.twinx()
co2_ax.plot(csv_data["Year"], csv_data["CO2"], linewidth=3, c="g", alpha=.8)
plt.ylabel(u"CO2 CCGG (In Situ) ppm")
plt.legend(loc="best")
plt.grid()
plt.xlim(np.min(csv_data[["Year", "Average","CO2"]].dropna()["Year"]),
         np.max(csv_data[["Year", "Average","CO2"]].dropna()["Year"]))
plt.show()
# Fixed the copy-pasted label: this correlation is CO2 vs temperature, not TSI.
print("Correlation between CO2 and Temperature: %s%%" % (round(1000*pearson_def(
    csv_data[["Average","CO2"]].dropna()["Average"].values,
    csv_data[["Average","CO2"]].dropna()["CO2"].values))/10))
Correlation between TSI and Temperature: 91.0%
# Overlay the CH4 series on the anomaly bars (restricted to years where CH4
# data exists) using twin y axes.
plt.figure(figsize=(20,5))
plt.bar(csv_data[["Year", "Average","CH4"]].dropna()["Year"],
        csv_data[["Year", "Average","CH4"]].dropna()["Average"],
        width=0.7, edgecolor="none",
        color=(csv_data[["Year", "Average","CH4"]].dropna()["Average"]>0).map({True: 'r', False: 'b'}),
        label="Annual Average Global Anomaly", alpha=0.3)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
ch4_ax = plt.twinx()
ch4_ax.plot(csv_data["Year"], csv_data["CH4"], linewidth=3, c="b", alpha=.8)
plt.ylabel(u"CH4 CCGG (Individual Flasks) ppb")
plt.legend(loc="best")
plt.grid()
plt.xlim(np.min(csv_data[["Year", "Average","CH4"]].dropna()["Year"]),
         np.max(csv_data[["Year", "Average","CH4"]].dropna()["Year"]))
plt.show()
# Fixed the copy-pasted label: this correlation is CH4 vs temperature, not TSI.
print("Correlation between CH4 and Temperature: %s%%" % (round(1000*pearson_def(
    csv_data[["Average","CH4"]].dropna()["Average"].values,
    csv_data[["Average","CH4"]].dropna()["CH4"].values))/10))
Correlation between TSI and Temperature: 80.6%
# Fit two RBF trend models to the TSI series: gamma_1 is large so the fit
# follows the faster oscillation; gamma_2 is small so it captures only the
# long-term trend.
regr_rbf = SVR(kernel="rbf")
C = [30]
gamma_1 = [0.015]
gamma_2 = [0.005, 0.004, 0.003, 0.002, 0.001, 0.0009, 0.0008, 0.0007, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001]
epsilon=[0.01, 0.001]
parameters_1 = {"C":C, "gamma":gamma_1, "epsilon":epsilon}
parameters_2 = {"C":C, "gamma":gamma_2, "epsilon":epsilon}
gs_1 = GridSearchCV(regr_rbf, parameters_1, scoring="r2")
gs_2 = GridSearchCV(regr_rbf, parameters_2, scoring="r2")
gs_1.fit(csv_data[["Year","TSI"]].dropna()[["Year"]], csv_data[["Year","TSI"]].dropna()["TSI"])
gs_2.fit(csv_data[["Year","TSI"]].dropna()[["Year"]], csv_data[["Year","TSI"]].dropna()["TSI"])
# %-formatted print() runs under both Python 2 and 3.
print("Best Estimator:\n%s" % gs_1.best_estimator_)
print("Best Estimator:\n%s" % gs_2.best_estimator_)
Best Estimator: SVR(C=30, cache_size=200, coef0=0.0, degree=3, epsilon=0.01, gamma=0.015, kernel=rbf, max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False) Best Estimator: SVR(C=30, cache_size=200, coef0=0.0, degree=3, epsilon=0.01, gamma=0.0001, kernel=rbf, max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
# Rebuild the feature range over the TSI years, extended 10 years past the
# data. NOTE: this rebinds annual_index_feature, shadowing the earlier
# temperature-years version.
annual_index_feature = np.arange(np.min(csv_data[["Year","TSI"]].dropna()["Year"]),
np.max(csv_data[["Year","TSI"]].dropna()["Year"])+10)
annual_index_feature = [[item] for item in annual_index_feature]
plt.figure(figsize=(20,5))
plt.bar(annual_index, annual_temp, width=0.7, edgecolor="none", color=(annual_temp>0).map({True: 'r', False: 'b'}),
label="Annual Average Global Anomaly", alpha=0.3)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
# Second y axis: raw TSI plus the two fitted trends from the grid searches above.
tsi_ax = plt.twinx()
tsi_ax.plot(csv_data["Year"], csv_data["TSI"], "--", linewidth=3, c="gray", label="TSI")
tsi_ax.plot(annual_index_feature,
gs_1.best_estimator_.predict(annual_index_feature),
linewidth=3, c="green", label="TSI Short-term Trend", alpha=0.4)
tsi_ax.plot(annual_index_feature,
gs_2.best_estimator_.predict(annual_index_feature),
linewidth=3, c="blue", label="TSI Long-term Trend", alpha=0.4)
plt.ylabel(u"TSI Reconstruction from IPCC AR5")
plt.legend(loc="upper left")
plt.xlim(np.min(annual_index_feature)-1, np.max(annual_index_feature))
plt.title("Total Solar Irradiance (TSI) with a short term and long term predictions")
plt.xticks(np.arange(np.min(annual_index_feature)-1, np.max(annual_index_feature), 10))
plt.grid()
plt.show()
st_prediction = gs_1.best_estimator_.predict(annual_index_feature)
lt_prediction = gs_2.best_estimator_.predict(annual_index_feature)
# Wavelength estimate: twice the year gap between the predicted extremes.
# print() form runs under both Python 2 and 3.
print(u"Long-Term Wave Length \u2248 (%s - %s) * 2 \u2248 %s" % (annual_index_feature[np.argmax(lt_prediction)][0],
      annual_index_feature[np.argmin(lt_prediction)][0],
      (annual_index_feature[np.argmax(lt_prediction)][0]-annual_index_feature[np.argmin(lt_prediction)][0])*2
))
# NOTE(review): argmin/argmax over "Year" itself is monotone, so these just
# pick the endpoints of rows 5..12 (1856 and 1863). Presumably the extremes
# of st_prediction were intended -- confirm before trusting this figure.
st_min = csv_data["Year"][np.argmin(csv_data["Year"][5:13])]
st_max = csv_data["Year"][np.argmax(csv_data["Year"][5:13])]
print(u"Short-Term Wave Length \u2248 (%s - %s) * 2 \u2248 %s" % (st_max,
      st_min,
      (st_max-st_min)*2
))
Long-Term Wave Length ≈ (1973 - 1882) * 2 ≈ 182 Short-Term Wave Length ≈ (1863 - 1856) * 2 ≈ 14
# Grid search an RBF trend model for the CO2 series over a wide C/gamma range.
regr_rbf = SVR(kernel="rbf")
C = [1,10,20,30,50,100,1000]
gamma_2 = [0.01, 0.005, 0.004, 0.003, 0.002, 0.001, 0.0009, 0.0008, 0.0007, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001,
           0.00009,0.00008,0.00007,0.00006,0.00005,0.00004,0.00003,0.00002,0.00001]
epsilon=[0.01, 0.001]
parameters_2 = {"C":C, "gamma":gamma_2, "epsilon":epsilon}
gs_2 = GridSearchCV(regr_rbf, parameters_2, scoring="r2")
gs_2.fit(csv_data[["Year","CO2"]].dropna()[["Year"]], csv_data[["Year","CO2"]].dropna()["CO2"])
# %-formatted print() runs under both Python 2 and 3.
print("Best Estimator:\n%s" % gs_2.best_estimator_)
Best Estimator: SVR(C=1000, cache_size=200, coef0=0.0, degree=3, epsilon=0.001, gamma=5e-05, kernel=rbf, max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
# Feature range over the CO2 years, extended 10 years past the data
# (rebinds annual_index_feature again).
annual_index_feature = np.arange(np.min(csv_data[["Year","CO2"]].dropna()["Year"]),
np.max(csv_data[["Year","CO2"]].dropna()["Year"])+10)
annual_index_feature = [[item] for item in annual_index_feature]
plt.figure(figsize=(20,5))
plt.bar(annual_index, annual_temp, width=0.7, edgecolor="none", color=(annual_temp>0).map({True: 'r', False: 'b'}),
label="Annual Average Global Anomaly", alpha=0.3)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
# Second y axis: raw CO2 plus the fitted trend from the grid search above.
tsi_ax = plt.twinx()
tsi_ax.plot(csv_data["Year"], csv_data["CO2"], "--", linewidth=3, c="gray", label="CO2")
tsi_ax.plot(annual_index_feature,
gs_2.best_estimator_.predict(annual_index_feature),
linewidth=3, c="blue", label="CO2 Trend", alpha=0.4)
plt.ylabel(u"CO2 CCGG (In Situ) ppm")
plt.legend(loc="upper left")
plt.xlim(np.min(annual_index_feature)-1, np.max(annual_index_feature))
plt.title("CO2 with trend")
plt.xticks(np.arange(np.min(annual_index_feature)-1, np.max(annual_index_feature), 10))
plt.grid()
plt.show()
# Grid search an RBF trend model for the CH4 series.
regr_rbf = SVR(kernel="rbf")
C = [1,10,20,30,40,50,60,70,80,90,100,1000]
gamma_2 = [0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09, 0.005, 0.004, 0.003, 0.002, 0.001]
epsilon=[0.01, 0.001, 0.0001, 0.00001]
parameters_2 = {"C":C, "gamma":gamma_2, "epsilon":epsilon}
gs_2 = GridSearchCV(regr_rbf, parameters_2, scoring="r2")
gs_2.fit(csv_data[["Year","CH4"]].dropna()[["Year"]], csv_data[["Year","CH4"]].dropna()["CH4"])
# %-formatted print() runs under both Python 2 and 3.
print("Best Estimator:\n%s" % gs_2.best_estimator_)
Best Estimator: SVR(C=1000, cache_size=200, coef0=0.0, degree=3, epsilon=0.0001, gamma=0.004, kernel=rbf, max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
# Feature range over the CH4 years, extended 10 years past the data
# (rebinds annual_index_feature again).
annual_index_feature = np.arange(np.min(csv_data[["Year","CH4"]].dropna()["Year"]),
np.max(csv_data[["Year","CH4"]].dropna()["Year"])+10)
annual_index_feature = [[item] for item in annual_index_feature]
plt.figure(figsize=(20,5))
plt.bar(annual_index, annual_temp, width=0.7, edgecolor="none", color=(annual_temp>0).map({True: 'r', False: 'b'}),
label="Annual Average Global Anomaly", alpha=0.3)
plt.ylabel(u"CRUTEM4 Temperature Anomaly (\u00B0C)")
# Second y axis: raw CH4 plus the fitted trend from the grid search above.
tsi_ax = plt.twinx()
tsi_ax.plot(csv_data["Year"], csv_data["CH4"], "--", linewidth=3, c="gray", label="CH4")
tsi_ax.plot(annual_index_feature,
gs_2.best_estimator_.predict(annual_index_feature),
linewidth=3, c="blue", label="CH4 Trend", alpha=0.4)
plt.ylabel(u"CH4 CCGG (Individual Flasks) ppb")
plt.legend(loc="upper left")
plt.xlim(np.min(annual_index_feature)-1, np.max(annual_index_feature))
plt.title("Methane (CH4) with trend")
plt.xticks(np.arange(np.min(annual_index_feature)-1, np.max(annual_index_feature), 10))
plt.grid()
plt.show()
For questions please leave them on:
Previous Lesson - Introduction to Machine Learning
In the next lesson: