This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.

Email me: [email protected]

I'd love for you to share if you liked this post.

In [51]:

```
social()
```

Out[51]:

- Test data creation
- ggplot plotting
- Polynomial curve fitting and prediction
- Complex curve fitting with scipy

- Goodness of fit tests
- Other suggestions? Email me:
__[email protected]__

The goal of any curve fitting is to determine the underlying trend in the data in the presence of variance. This tutorial gets us going with the basics.

In [3]:

```
import scipy as sp
import numpy as np
import pandas as pd
from ggplot import *
# Enable inline plotting
%matplotlib inline
```

Create some fake data

In [4]:

```
x = sp.linspace(-5, 5,100) # 1000 values from 0 - 10
# Quadratic function
y = 2*x**2 + 5*x + 0
```

Load data into pandas data frame

In [693]:

```
df = pd.DataFrame(zip(x,y), columns = ['x','y'])
```

Lets plot it quick to take a look

In [694]:

```
ggplot(df, aes("x", "y"))+geom_line()+geom_point()
```

Out[694]:

Next we will add some error (noise) to make the data more "life-like", and then plot it without the line.

In [668]:

```
noisy_y = y + np.random.randn(y.size) # Random normal noise
```

In [696]:

```
ggplot(df, aes("x", noisy_y))+geom_point()
```

Out[696]:

To do this we can utilize SciPy's **polyfit ( )** function. We will then fit several polynomial shapes (starting with a straight line) and aim to find the order of the polynomial that minimizes the error function defined above.

First we will define an error function, which will guide us in choosing the correct model. The error is simply the squared distance of the model's prediction to the observed data above.

In [697]:

```
def error(f,x,y):
return sp.sum((f(x)-y)**2)
```

Where f will be our fitted model and x / y contain our original data.

The most basic model fit would be a straight line. This is actually "linear regression" but it is also a special case of a more general function: a first-order polynomial. **First-Order** means that the highest exponent of **x** in the equation is 1.

In [698]:

```
pf1, residuals, rank, sv, rcond = sp.polyfit(x, noisy_y, 1, full =True)
```

By setting the **full** parameter to **True** we get additional output coefficients from the fitting process, but we only really care about the residuals, which is the error of the estimation.

In [700]:

```
print ("Model parameters: %s" % pf1)
print "\n", residuals
print pf1
```

Therefore the minimized straight line fits the following function:

In [701]:

```
print "f(x) = {} * x^2 + {}".format(pf1[0], pf1[1])
```

Now we use **poly1d ( )** to create a model function object from the model parameters derived above.

In [702]:

```
f1 = sp.poly1d(pf1)
print(error(f1, x, noisy_y))
```

Lets take a look at what we've fit.

In [704]:

```
ggplot(df, aes("x", noisy_y))+geom_point() + \
geom_line(aes(x, y = f1(x)), color = "red")
```

Out[704]:

This clearly is not the best we can do.

Lets do the same thing, but increase the variance.

In [705]:

```
noisy_y = y + np.random.randn(y.size)*10
pf1 = sp.polyfit(x, noisy_y, 1)
f1 = sp.poly1d(pf1)
```

In [707]:

```
ggplot(df, aes(x, noisy_y))+geom_point() + \
geom_line(aes(x, y = f1(x)), color = "red") + \
geom_line(aes(x, 2*x**2 + 5*x + 0), color = "blue", size = 2)
```

Out[707]:

There are some small bugs with the ggplot python module, which cuts off the legends.

The dotted

**blue**line is the*actual*data behind the noise- The
**red**line is our current estimation.

We can see that we have are capturing some of the trend with the red line, but not all of it. Lets try higher-order polynomials.

Recall the error term from our first order polynomial model:

In [708]:

```
print error(f1, x, noisy_y)
```

Alone this error means nothing, but by comparing competing models, we can use the reported error to judge the most appropriate model.

In [711]:

```
# Second order Polynomial
pf2 = sp.polyfit(x, noisy_y, 2)
f2 = sp.poly1d(pf2)
print f2
```

In [713]:

```
print error(f2, x, noisy_y)
```

In [714]:

```
print "The error was reduced from {} to {}, a difference of {}".format(
error(f1, x, noisy_y), error(f2, x, noisy_y), error(f1, x, noisy_y)-error(f2, x, noisy_y))
```

In [715]:

```
ggplot(df, aes("x", noisy_y))+geom_point(size = 4) + \
geom_line(aes(x, y = f1(x)), color = "red") + \
geom_line(aes(x, 2*x**2 + 5*x + 0), color = "blue", linetype = "dashed") + \
geom_line(aes(x, y = f2(x)), color = "green", size=2)
```

Out[715]:

Here we go, we can clearly see that the **green line** (second-order polynomial) is almost exactly the actually underlying trend.

How about we test a few polynomial errors at once to be a bit more efficient.

In [910]:

```
def poly_test(range, x, y):
func_l = []
error_l = []
i = 1
for order in range:
pf = sp.polyfit(x, y, order)
func = sp.poly1d(pf)
err = error(func, x, y)
print "Order: {}, Error = {}".format(i, err)
func_l.append(func)
error_l.append([i, err])
i += 1
error_l = pd.DataFrame((error_l), columns = ["Order", "Error"])
return error_l, func_l
```

In [717]:

```
error_tests, poly_tests = poly_test(xrange(1,20), x, noisy_y)
```

Plot the best fit

In [734]:

```
ggplot(df, aes("x", noisy_y))+geom_point(size = 4) + \
geom_line(aes(x, 2*x**2 + 5*x + 0), color = "blue", linetype = "dashed") + \
geom_line(aes(x, y = poly_tests[18](x)), color = "green")
```

Out[734]:

Woah! Hang on... Are we actually capturing the underlying process in this data? By using this higher order polynomial (9), we are actually capturing the underlying process **and** the noise. This is called **overfitting**

A simplistic way to tell if we are overfitting is to examine the error improvement decay with increasing polynomial levels. In these data, we can see that there isnt really much error reduction after polynomial 2, therefore this would be our chosen model fit. There are more sophisticated ways of doing this, which will be discussed later.

In [733]:

```
breaks = xrange(0,21)
ggplot(error_tests, aes(x='Order', y='Error'))+geom_line() + \
scale_x_continuous(breaks = breaks, labels = breaks)
```

Out[733]:

Ok, so say we have chosen a second-order polynomial to fit our model. Lets utilize our ability to estimate the underlying trend and predict future data. Generally more sophisticated statistics would be used here to analyze the variance when predicting farther and farther into the future or unknown.

To find the desired unknown **y** value, we subtract **y** at *desired position i* from the polynomial. We then use **fsolve** from SciPy's **optimize** module to simply solve the polynomial function we created. Say we want to know what **x** will be when **y** = 150.

In [746]:

```
print f2
from scipy.optimize import fsolve
converged = fsolve(f2 - 150, 0)
print ("y = 150 is expected when x = {}". format(converged[0]))
```

We can take advantage of the fact that a power law is linear in log-space.

Lets make some new data.

In [930]:

```
t = sp.linspace(0.1,5, 1000)
# Power function
a = 1.5
b = 2
z = a * t**b
noisy_y = z + sp.random.randn(z.size)**1/2
df = pd.DataFrame(zip(t,noisy_y), columns = ["x", "y"])
# Plot data points and actual data as a line
ggplot(df, aes("x", "y"))+geom_point(size = 4)+ \
geom_line(df,aes(x="x", y=z), color = "blue")
```

Out[930]:

First we need to compute the error for a polynomial fit on untransformed data

In [960]:

```
error_tests, poly_tests = poly_test(xrange(1,10), t, noisy_y)
```

Again, a second-order polynomial is most appropriate.

Next lets try linearizing the data with a natural log

In [939]:

```
# Change the variables using a natural log
x = np.log(t)
y = np.log(noisy_y) # Add error
df = pd.DataFrame(zip(x, y), columns = ["x", "y"])
ggplot(df, aes("x", "y"))+geom_point(size = 4)+ \
geom_line(aes(x="x", y=np.log(a * t**b)), color = "blue")
```

Out[939]:

The **blue** line is actually our observed trend, without noise. We can see that power laws are very sensitive to variation and noise. Some data points are not even plotted here because they became NaN after we logged them. We could either:

- Transform the data to positive numbers before we log.
- Just remove the NaN

To make this simpler, we will just remove these values

In [980]:

```
# First make sure we didnt create any NaN values by logging a negative number
print "# of missing values in y: {}".format(len(y[pd.isnull(y)]))
```

To remove these values, we will just subset the data which are not NaN. We do this by applying a boolean mask. There are several ways to do this, but an easy one is to use pandas **isnull ( )**, which will check for all different types of NaN values, such as NA, NAN, NULL, NaN.

In [949]:

```
bad = pd.isnull(y)
bad[0:5]
```

Out[949]:

Once we have all the index's where **noisy_y** = NaN, we pass it to the array as an index array. We use **-** before the mask because we want everything that **is not** bad.

In [950]:

```
clean_y = y[-bad]
clean_x = x[-bad]
```

Now finally we can fit our first-order polynomial.

In [984]:

```
power_law_fit = sp.polyfit(clean_x, clean_y, 1)
power_law_func = sp.poly1d(power_law_fit)
#report error
print "Error improvement is {}".format(error_tests[1:2]['Error'].values - \
(error(power_law_func, clean_x, clean_y)))
fitted_y = sp.polyval(power_law_fit, clean_x)
```

In [989]:

```
df = pd.DataFrame(zip(clean_x, clean_y), columns = ["x", "y"])
expected_y = np.log(a* t**b)
# Red line is new fit
# Blue line is actual data
ggplot(df, aes("x", "y"))+geom_point(size = 4)+ \
geom_line(aes(x="x", y=fitted_y, color = "red")) + \
geom_line(aes(x=np.log(t), y = np.log(z), color = "blue"))
```

Out[989]:

By combining the results of the error and the plot we can see that this new fit (**red line**) is slightly more accurate compared to our regular second-order polynomial fitting.

However, to achieve this result, I had to reduce the error quite significantly.

Let's create some new fake data.

In [1034]:

```
x = np.linspace(0,2*np.pi, 50)
def sinfun(x,a,b):
return a*np.sin(x-b)
y = sinfun(x,1,0)
df = pd.DataFrame(zip(x,y), columns = ['x','y'])
ggplot(df, aes(x,y))+geom_line(color="blue")+geom_point()
```

Out[1034]:

Lets add some error, just like before.

In [1035]:

```
noisy_y = y + np.random.randn(y.size)
ggplot(df, aes("x", noisy_y))+geom_point()+geom_line(aes("x", "y", color = "blue"))
```

Out[1035]:

Sometimes it's best to let SciPy handle more complex curves using the **curve_fit ( )** function from **scipy.optimize**. The function returns the optimized fitted parameters and a covariance matrix.

The diagonals of a covariance matrix are the variances and we take the root of them to get the standard deviation.

In [1039]:

```
from scipy.optimize import curve_fit
fitpars, covm = curve_fit(sinfun, x, noisy_y)
variance = np.sqrt(covm.diagonal())
print fitpars, variance
```

In [1046]:

```
ggplot(df, aes("x", noisy_y))+geom_point()+ \
geom_line(aes("x", "y", color="blue")) + \
geom_line(aes("x", sinfun(x, *fitpars), color = 'red'))
```

Out[1046]:

This is a decent fit, but you can see that it is catching quite a bit of the noise still. Granted, there is a lot of error present in this graph.

Visit my webpage for more. Email me: __[email protected]__

In [1]:

```
from IPython.core.display import HTML
def css_styling():
styles = open("/users/ryankelly/desktop/custom_notebook.css", "r").read()
return HTML(styles)
css_styling()
```

Out[1]:

In [50]:

```
def social():
code = """
<a style='float:left; margin-right:5px;' href="https://twitter.com/share" class="twitter-share-button" data-text="Check this out" data-via="Ryanmdk">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<a style='float:left; margin-right:5px;' href="https://twitter.com/Ryanmdk" class="twitter-follow-button" data-show-count="false">Follow @Ryanmdk</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<a style='float:left; margin-right:5px;'target='_parent' href="http://www.reddit.com/submit" onclick="window.location = 'http://www.reddit.com/submit?url=' + encodeURIComponent(window.location); return false"> <img src="http://www.reddit.com/static/spreddit7.gif" alt="submit to reddit" border="0" /> </a>
<script src="//platform.linkedin.com/in.js" type="text/javascript">
lang: en_US
</script>
<script type="IN/Share"></script>
"""
return HTML(code)
```