Machine Learning: Curve Fitting Basics - Featuring ggplot

This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.

Visit my webpage for more.

Email me: [email protected]

I'd love for you to share if you liked this post.

In [51]:
submit to reddit

This notebook covers or includes:

  • Test data creation
  • ggplot plotting
  • Polynomial curve fitting and prediction
  • Complex curve fitting with scipy

Data Sources:


Getting Started:

The goal of any curve fitting is to determine the underlying trend in the data in the presence of variance. This tutorial gets us going with the basics.

In [3]:
import scipy as sp
import numpy as np
import pandas as pd

from ggplot import *
# Enable inline plotting
%matplotlib inline

Create some fake data

In [4]:
x = sp.linspace(-5, 5,100) # 1000 values from 0 - 10
# Quadratic function
y = 2*x**2 + 5*x + 0

Load data into pandas data frame

In [693]:
df = pd.DataFrame(zip(x,y), columns = ['x','y'])

Lets plot it quick to take a look

What is this amazing ggplot library you are using?

In [694]:
ggplot(df, aes("x", "y"))+geom_line()+geom_point()
<ggplot: (279439225)>

Next we will add some error (noise) to make the data more "life-like", and then plot it without the line.

In [668]:
noisy_y = y + np.random.randn(y.size) # Random normal noise
In [696]:
ggplot(df, aes("x", noisy_y))+geom_point()
<repr(<ggplot.ggplot.ggplot at 0x10a8b3250>) failed: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()>

Fitting the data

To do this we can utilize SciPy's polyfit ( ) function. We will then fit several polynomial shapes (starting with a straight line) and aim to find the order of the polynomial that minimizes the error function defined above.

First we will define an error function, which will guide us in choosing the correct model. The error is simply the squared distance of the model's prediction to the observed data above.

In [697]:
def error(f,x,y):
    return sp.sum((f(x)-y)**2)

Where f will be our fitted model and x / y contain our original data.

First Order Polynomial

The most basic model fit would be a straight line. This is actually "linear regression" but it is also a special case of a more general function: a first-order polynomial. First-Order means that the highest exponent of x in the equation is 1.

In [698]:
pf1, residuals, rank, sv, rcond = sp.polyfit(x, noisy_y, 1, full =True)

By setting the full parameter to True we get additional output coefficients from the fitting process, but we only really care about the residuals, which is the error of the estimation.

In [700]:
print ("Model parameters: %s" % pf1)

print "\n", residuals
print pf1
Model parameters: [  4.88882894  16.42093795]

[ 28290.55679858]
[  4.88882894  16.42093795]

Therefore the minimized straight line fits the following function:

In [701]:
print "f(x) = {} * x^2 + {}".format(pf1[0], pf1[1])
f(x) = 4.88882893728 * x^2 + 16.4209379538

Now we use poly1d ( ) to create a model function object from the model parameters derived above.

In [702]:
f1 = sp.poly1d(pf1)
print(error(f1, x, noisy_y))

Lets take a look at what we've fit.

In [704]:
ggplot(df, aes("x", noisy_y))+geom_point() + \
geom_line(aes(x, y = f1(x)), color = "red")
<repr(<ggplot.ggplot.ggplot at 0x10a8d6650>) failed: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()>

This clearly is not the best we can do.

Lets do the same thing, but increase the variance.

In [705]:
noisy_y = y + np.random.randn(y.size)*10

pf1 = sp.polyfit(x, noisy_y, 1)

f1 = sp.poly1d(pf1)
In [707]:
ggplot(df, aes(x, noisy_y))+geom_point() + \
geom_line(aes(x, y = f1(x)), color = "red") + \
geom_line(aes(x, 2*x**2 + 5*x + 0), color = "blue", size = 2)
<repr(<ggplot.ggplot.ggplot at 0x10a44a690>) failed: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()>
  • There are some small bugs with the ggplot python module, which cuts off the legends.

  • The dotted blue line is the actual data behind the noise

  • The red line is our current estimation.

Second Order Polynomial

We can see that we have are capturing some of the trend with the red line, but not all of it. Lets try higher-order polynomials.

Recall the error term from our first order polynomial model:

In [708]:
print error(f1, x, noisy_y)

Alone this error means nothing, but by comparing competing models, we can use the reported error to judge the most appropriate model.

In [711]:
# Second order Polynomial
pf2 = sp.polyfit(x, noisy_y, 2)
f2 = sp.poly1d(pf2)
print f2
1.909 x + 5.027 x + 1.714
In [713]:
print error(f2, x, noisy_y)
In [714]:
print "The error was reduced from {} to {}, a difference of {}".format(
error(f1, x, noisy_y), error(f2, x, noisy_y), error(f1, x, noisy_y)-error(f2, x, noisy_y))
The error was reduced from 29972.3791058 to 8901.42029545, a difference of 21070.9588103
In [715]:
ggplot(df, aes("x", noisy_y))+geom_point(size = 4) + \
geom_line(aes(x, y = f1(x)), color = "red") + \
geom_line(aes(x, 2*x**2 + 5*x + 0), color = "blue", linetype = "dashed") + \
geom_line(aes(x, y = f2(x)), color = "green", size=2) 
<repr(<ggplot.ggplot.ggplot at 0x10af98c50>) failed: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()>

Here we go, we can clearly see that the green line (second-order polynomial) is almost exactly the actually underlying trend.

How about we test a few polynomial errors at once to be a bit more efficient.

In [910]:
def poly_test(range, x, y):
    func_l = []
    error_l = []
    i = 1
    for order in range:
        pf = sp.polyfit(x, y, order)
        func = sp.poly1d(pf)
        err = error(func, x, y)
        print "Order: {}, Error = {}".format(i, err)
        error_l.append([i, err])
        i += 1
    error_l = pd.DataFrame((error_l), columns = ["Order", "Error"])
    return error_l, func_l
In [717]:
error_tests, poly_tests = poly_test(xrange(1,20), x, noisy_y)
Order: 1, Error = 29972.3791058
Order: 2, Error = 8901.42029545
Order: 3, Error = 8889.43020048
Order: 4, Error = 8827.05812909
Order: 5, Error = 8785.33129955
Order: 6, Error = 8733.05890881
Order: 7, Error = 8498.3744727
Order: 8, Error = 8407.08751693
Order: 9, Error = 8384.44871846
Order: 10, Error = 8074.70017756
Order: 11, Error = 8035.84966396
Order: 12, Error = 7942.28166492
Order: 13, Error = 7698.98413081
Order: 14, Error = 7612.94832123
Order: 15, Error = 7602.49546217
Order: 16, Error = 7567.91985168
Order: 17, Error = 7368.93508242
Order: 18, Error = 7089.18637868
Order: 19, Error = 6979.14519774

Plot the best fit

In [734]:
ggplot(df, aes("x", noisy_y))+geom_point(size = 4) + \
geom_line(aes(x, 2*x**2 + 5*x + 0), color = "blue", linetype = "dashed") + \
geom_line(aes(x, y = poly_tests[18](x)), color = "green")
<repr(<ggplot.ggplot.ggplot at 0x10b7f9750>) failed: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()>

Woah! Hang on... Are we actually capturing the underlying process in this data? By using this higher order polynomial (9), we are actually capturing the underlying process and the noise. This is called overfitting

A simplistic way to tell if we are overfitting is to examine the error improvement decay with increasing polynomial levels. In these data, we can see that there isnt really much error reduction after polynomial 2, therefore this would be our chosen model fit. There are more sophisticated ways of doing this, which will be discussed later.

In [733]:
breaks = xrange(0,21)
ggplot(error_tests, aes(x='Order', y='Error'))+geom_line() + \
scale_x_continuous(breaks = breaks, labels = breaks)
<ggplot: (280480369)>


Ok, so say we have chosen a second-order polynomial to fit our model. Lets utilize our ability to estimate the underlying trend and predict future data. Generally more sophisticated statistics would be used here to analyze the variance when predicting farther and farther into the future or unknown.

To find the desired unknown y value, we subtract y at desired position i from the polynomial. We then use fsolve from SciPy's optimize module to simply solve the polynomial function we created. Say we want to know what x will be when y = 150.

In [746]:
print f2

from scipy.optimize import fsolve

converged = fsolve(f2 - 150, 0)
print ("y = 150 is expected when x = {}". format(converged[0]))
1.909 x + 5.027 x + 1.714
y = 150 is expected when x = 7.59426937669

Power Law Functions

We can take advantage of the fact that a power law is linear in log-space.

Lets make some new data.

In [930]:
t = sp.linspace(0.1,5, 1000)
# Power function
a = 1.5
b = 2
z = a * t**b

noisy_y = z + sp.random.randn(z.size)**1/2

df = pd.DataFrame(zip(t,noisy_y), columns = ["x", "y"])

# Plot data points and actual data as a line
ggplot(df, aes("x", "y"))+geom_point(size = 4)+ \
geom_line(df,aes(x="x", y=z), color = "blue")
<ggplot: (279250721)>

First we need to compute the error for a polynomial fit on untransformed data

In [960]:
error_tests, poly_tests = poly_test(xrange(1,10), t, noisy_y)
Order: 1, Error = 7582.21726058
Order: 2, Error = 239.360725199
Order: 3, Error = 239.229365939
Order: 4, Error = 238.469422373
Order: 5, Error = 238.424914846
Order: 6, Error = 238.396983769
Order: 7, Error = 238.3955276
Order: 8, Error = 237.987503728
Order: 9, Error = 237.975969779

Again, a second-order polynomial is most appropriate.

Next lets try linearizing the data with a natural log

In [939]:
# Change the variables using a natural log
x = np.log(t)
y = np.log(noisy_y) # Add error

df = pd.DataFrame(zip(x, y), columns = ["x", "y"])

ggplot(df, aes("x", "y"))+geom_point(size = 4)+ \
geom_line(aes(x="x", y=np.log(a * t**b)), color = "blue")
<ggplot: (279196401)>

The blue line is actually our observed trend, without noise. We can see that power laws are very sensitive to variation and noise. Some data points are not even plotted here because they became NaN after we logged them. We could either:

  • Transform the data to positive numbers before we log.
  • Just remove the NaN

To make this simpler, we will just remove these values

In [980]:
# First make sure we didnt create any NaN values by logging a negative number
print "# of missing values in y: {}".format(len(y[pd.isnull(y)]))
# of missing values in y: 39

To remove these values, we will just subset the data which are not NaN. We do this by applying a boolean mask. There are several ways to do this, but an easy one is to use pandas isnull ( ), which will check for all different types of NaN values, such as NA, NAN, NULL, NaN.

In [949]:
bad = pd.isnull(y)
array([False,  True, False, False, False], dtype=bool)

Once we have all the index's where noisy_y = NaN, we pass it to the array as an index array. We use - before the mask because we want everything that is not bad.

In [950]:
clean_y = y[-bad]
clean_x = x[-bad]

Now finally we can fit our first-order polynomial.

In [984]:
power_law_fit = sp.polyfit(clean_x, clean_y, 1)
power_law_func = sp.poly1d(power_law_fit)

#report error
print "Error improvement is {}".format(error_tests[1:2]['Error'].values - \
                                       (error(power_law_func, clean_x, clean_y)))

fitted_y = sp.polyval(power_law_fit, clean_x)
Error improvement is [ 38.86933803]
In [989]:
df = pd.DataFrame(zip(clean_x, clean_y), columns = ["x", "y"])
expected_y = np.log(a* t**b)
# Red line is new fit
# Blue line is actual data
ggplot(df, aes("x", "y"))+geom_point(size = 4)+ \
geom_line(aes(x="x", y=fitted_y, color = "red")) + \
geom_line(aes(x=np.log(t), y = np.log(z), color = "blue"))
<ggplot: (279260945)>

By combining the results of the error and the plot we can see that this new fit (red line) is slightly more accurate compared to our regular second-order polynomial fitting.

However, to achieve this result, I had to reduce the error quite significantly.

What about more complex curves?

Let's create some new fake data.

In [1034]:
x = np.linspace(0,2*np.pi, 50)

def sinfun(x,a,b):
    return a*np.sin(x-b)

y = sinfun(x,1,0)

df = pd.DataFrame(zip(x,y), columns = ['x','y'])

ggplot(df, aes(x,y))+geom_line(color="blue")+geom_point()
<repr(<ggplot.ggplot.ggplot at 0x10b7f9f90>) failed: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()>

Lets add some error, just like before.

In [1035]:
noisy_y = y + np.random.randn(y.size)

ggplot(df, aes("x", noisy_y))+geom_point()+geom_line(aes("x", "y", color = "blue"))
<repr(<ggplot.ggplot.ggplot at 0x10a4de350>) failed: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()>

Sometimes it's best to let SciPy handle more complex curves using the curve_fit ( ) function from scipy.optimize. The function returns the optimized fitted parameters and a covariance matrix.

The diagonals of a covariance matrix are the variances and we take the root of them to get the standard deviation.

In [1039]:
from scipy.optimize import curve_fit

fitpars, covm = curve_fit(sinfun, x, noisy_y)

variance = np.sqrt(covm.diagonal())
print fitpars, variance
[ 1.49563761 -0.01659767] [ 0.18290973  0.11987486]
In [1046]:
ggplot(df, aes("x", noisy_y))+geom_point()+ \
geom_line(aes("x", "y", color="blue")) + \
geom_line(aes("x", sinfun(x, *fitpars), color = 'red'))
<repr(<ggplot.ggplot.ggplot at 0x10a404c10>) failed: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()>

This is a decent fit, but you can see that it is catching quite a bit of the noise still. Granted, there is a lot of error present in this graph.

In [1]:
from IPython.core.display import HTML

def css_styling():
    styles = open("/users/ryankelly/desktop/custom_notebook.css", "r").read()

    return HTML(styles)
In [50]:
def social():
    code = """
    <a style='float:left; margin-right:5px;' href="" class="twitter-share-button" data-text="Check this out" data-via="Ryanmdk">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+'://';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;' href="" class="twitter-follow-button" data-show-count="false">Follow @Ryanmdk</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+'://';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;'target='_parent' href="" onclick="window.location = '' + encodeURIComponent(window.location); return false"> <img src="" alt="submit to reddit" border="0" /> </a>
<script src="//" type="text/javascript">
  lang: en_US
<script type="IN/Share"></script>
    return HTML(code)