Data generation for polyglot data science example

In order to run the full Polyglot Data Science with IPython notebook, you will need to install Julia, and then the following (assuming a conda-based deployment that will automatically pull in R, otherwise you also need ot install R):

conda install jupyter cython pandas matplotlib seaborn
conda install rpy2
pip install julia fortran-magic
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

We generate synthetic data according to

$$ y(x) = a x + b x^2 + c \sin(x^2) + \cal{N}(0, \epsilon) $$
In [2]:
npts = 300
eps = 0.2  # noise
a, b, c  = 1, -0.2, 1 # model coefficients

np.random.seed(1234)
x = np.linspace(0, 2*np.pi, npts)
y = a*x + b*x**2 + c*np.sin(x**2) + np.random.normal(scale=eps, size=npts)
plt.plot(x, y, 'o');

Write it to a CSV file for convenient retrieval in a "typical" workflow, Pandas does the job nicely:

In [3]:
data = pd.DataFrame({'x':x, 'y':y})
data.head(3)
Out[3]:
x y
0 0.000000 0.094287
1 0.021014 -0.216828
2 0.042028 0.329982
In [4]:
data.to_csv('data.csv', index=False)
!head -3 data.csv
x,y
0.0,0.09428703274649862
0.02101399768287487,-0.21682787079387694

Sanity check

In [5]:
data2 = pd.read_csv('data.csv')
data2.head(3)
Out[5]:
x y
0 0.000000 0.094287
1 0.021014 -0.216828
2 0.042028 0.329982
In [6]:
(data2-data).abs().sum()
Out[6]:
x    5.894937e-14
y    1.431537e-14
dtype: float64