Notebook

Numerical Data Analysis¶

Dr Matteo Degiacomi Department of Physics, Durham University matteo.t.degiacomi@durham.ac.uk

This workshop is centered on using Python packages to load, manipulate, display and analyse numerical data. You will learn:

how to use Numpy library to manipulate multi-dimensional numerical data
- using inbuild functions to generate 1D and 2D arrays and check their properties;
- to load arrays to/from a file;
- to access parts of arrays by slicing 1D 2D arrays;
- to perform mathematical operations on arrays;
- to analyse arrays using inbuilt functions, e.g. finding minimum and maximum values, sorting the array, calculating mean and median] sandard deviation;
- to carry out Boolean operations, such as boolean tests and indexing;
how to plot your data with Matplotlib
- choosing styles of lines and colours;
- labels;
- subplots;
- saving plots for use in your report/thesis/paper.
how to analyse numerical data with Numpy as Scipy:
- fitting noisy data with linear and non-linear models;
- integrating differential equations;
- calculating the derivative of a signal and area under the curve

Acknowledgements: thanks to Dr Erastova, University of Edinburgh (www.erastova.xyz), for having shared teaching materials and ideas.

Table of Contents:¶

Part 1 - Arrays with Numpy ¶

Part 3 - Data Analysis with Numpy and Scipy ¶

Part 3.1 - fitting data with models ¶

3.1.1 - polynomial fitting
3.1.2 - non-linear curve fit

Part 3.2 - integrating differential equations ¶

Part 3.3 - derivative and integration ¶

Part 1 - Arrays with Numpy ¶

NumPy, which stands for Numerical Python, is a library consisting of multidimensional array objects and a collection of routines for processing those arrays.
Using NumPy, mathematical and logical operations on arrays can be performed. As it works on arrays, it is compatible with pandas.

Like above, we import the numpy library as an alias which is standard for python.

In [2]:

import numpy as np

When we use a numpy function it is always prefaced with np.

1.1 - Defining Arrays ¶

Numpy arrays are homogeneous in nature, i.e. they comprise one data type (integer, float, double, etc.) unlike lists that can be a mix of data types.

1.1.1 - Creating an array from a list ¶

Let's create an array of integers (single numbers like 1, 2, 3, 4, 5):

In [25]:

#Creating a 1D numpy array from a list
a = [1, 2, 3, 4, 5]
my_array = np.array(a)

#confirm we have created a numpy array
print(my_array, type(my_array))

[1 2 3 4 5] <class 'numpy.ndarray'>

If you have multiple dimensions, array can be defined from a "list of lists". For a 2D array for instance, the larger list represents lines, the inner lists represent all the elements of a single line (its columns).

In [26]:

#Creating a 2D numpy array (list of lists)
aa = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
my_array2 = np.array(aa)
print(my_array2)

[[ 0  1  2  3]
 [10 11 12 13]
 [20 21 22 23]]

Arrays have specific properties. Let's examine those of our 2D array, as shown in the image below.

In [30]:

# get array properties
print("dimensions:", my_array2.ndim)
print("shape:", my_array2.shape)
print("size:", my_array2.size)
print("dtype:", my_array2.dtype)

dimensions: 2
shape: (3, 4)
size: 12
dtype: int32

1.1.2 - Generating an array ¶

Besides defining array by hand, you can also use methods build in numpy to generate them.

Using the methods zeros and ones you can initialise arrays having a desired shape, filled with 0s or 1s, respectively. This is useful if you are planning to fill an array with specific values at specific positions in a second step.

In [50]:

#creating an 1D array containing 10 zeros
z = np.zeros(10)
print("my array of zeros", z, "is of a type", z.dtype)

my array of zeros [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.] is of a type float64

To create an array that has a given sequence, instead of having to type:

sequence = np.array( [ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20] )

you can use functions such as:

np.arange(start, finish, stepsize)

TASK 1.1.2b : Try to go from 0 to 20 in steps of 2, using np.arrange

In [ ]:

# your solution here!

SOLUTION:

np.arange(0,21,2)

Q: What number did you have to go stop until to get 20 as a last number? Why?

Answer:

Since python counts from 0, if we declare <code>np.arange(0,20,2) </code> the count will stop at 18

TASK 1.1.2b : try to generate the same array using np.linspace(start, finish, no. of steps) How is it different?

In [ ]:

# your solution here!

SOLUTION:

np.linspace(0, 20, 11)

1.1.3 - Loading an array from/to a file ¶

We can also load 1D or 2D arrays from a plain text file.

In [52]:

#read the file in
data = np.loadtxt('data/slice_me.txt')
print(data.shape) #shape of the data

(6, 8)

Lots of options available for loading the file.

For example, if your file has a header you would like to skip (e.g. the first 5 rows of a file and also to ignore lines commented out with #, you can use:

numpy.loadtxt('array.txt', comments='#', skiprows=5)

To save the array into the file, use savetxt function enables saving arrays in a file:

numpy.savetxt('array.txt', data)

In [ ]:

#write an array out
np.savetxt('slice_me_copy.txt', data) #you saved a copy of an array

1.2 - Slicing an array ¶

slicing means cutting out from the array only data we are interested about.

1.2.1. - Slicing a 1D array ¶

We can select only regions of a 1-dimensional array as follows.

If no information is provided about start, end or step size, numpy uses default values, e.g.:

a[:2] means a[0:2]
a[1:] means a[0:len(a)]
a[::4] means a[0:len(a):4]

TASK 1.2 : generate a 1D array of 20 elements and fill it with random numbers, using package np.random.random then pick every 3rd value within first 10.

In [ ]:

# your solution here!

SOLUTION:

random = np.random.random(20)
picked = random[0:10:3]
    ```



</details>

1.2.2. - Slicing a 2D array ¶

Slicing a 2D array is similar to a 1D array. The regions of interest for every axis are separated by commas.

Additional dimensions are accessed in the same way. For a 3D array a, the first entry in the array will be at position a[0, 0, 0], for a 4D array it will be at a[0, 0, 0, 0], and so on.

Now, let's see what is within the file we have loaded. In the same way as above we can do so by slicing it.

In [53]:

print('column one', data[:, 0]) #select first column
print('row four', data[3, :]) #select 4th line
print('selected area', data[0:2, 3:7]) #select area
print('samples', data[1:5:2, 3:10:3]) #select samples in a given space

column one [  0.  10.  20.  30.  40.  50.]
row four [ 30.  31.  32.  33.  34.  35.  36.  37.]
selected area [[  3.   4.   5.   6.]
 [ 13.  14.  15.  16.]]
samples [[ 13.  16.]
 [ 33.  36.]]

1.3 - Mathematical operations with arrays ¶

1.3.1 - Mathematical operations ¶

All mathematical operations between arrays act element by element. Operations with scalars (= a single number) act on every element of the array.

You can think of the array as a vector, and perform math operations in a compact way (no need for loops or complex notations here):

1.2 - Mathematical operations on arrays ¶

In [45]:

print(my_array)
print(my_array+3)

[1 2 3 4 5]
[4 5 6 7 8]

You can also do mathematical operations between two arrays. Note they have to be same dimensions

In [47]:

a = np.array([1, 2, 3])
b = np.array([0, 1, 2])
print('multiplication a x b = ', a*b)
print('substraction a - b = ', a-b)
print('addition a + b = ', a+b)

multiplication a x b =  [0 2 6]
substraction a - b =  [1 1 1]
addition a + b =  [1 3 5]

TASK 1.3.1 : calculate the square of each value in my_array

Hint : ** is the _to the power of_ operator, i.e. $x^2$ will be x**2 in the code

In [ ]:

# your solution here!

SOLUTION:

my_array**2

1.3.2. Axis of operations ¶

All mathematical operations between arrays act element by element, we can also chose an axis of operation, i.e. a direction in which to carry out a specific operation.

TASK 1.3.2a : Calculate sum of all the elements in the file **data/slice_me.txt**, as well as vertical and horizontal sums

In [ ]:

# your solution here!

SOLUTION:

a = np.loadtxt('slice_me.txt')

print(np.sum(a))
print("vertical sum", np.sum(a, axis=0))
print("horizontal sum", np.sum(a, axis=1))

TASK 1.3.2b : In the folder you will find the file data/ms.txt. It contains 2 columns of numbers - m/z and intensity. Read the file, extract one every 10th line (i.e. subsample), save only the intensities column into a separate file.

In [ ]:

# your solution here!

SOLUTION:

# read file
data = np.loadtxt('ms.txt')

#extract every 10th line for the second column only
subdata = data [::10, 1] 

#check the shapes of the data sets to check you subsampled correctly
print (data.shape)
print (subdata.shape)

#save the file
np.savetxt('sub_intensities.txt', subdata)

Advanced TASK : can you do the same without numpy?

In [ ]:

# your solution here!

SOLUTION:

#read in a file 
fin = open("data/ms.txt", "r")

#create a file to write into
fout = open("sub_intensities2.txt", "w")

#start counter   
cnt=0 

#for each line in the file
for line in fin:

    #if the line number is divisable by 10
    if cnt %10 ==0:

        #split into columns, assigning the second one to intensity
        columns = line.split()
        intensity = columns[1]

        #write into the file
        fout.write(intensity)

        #return to next line 
        fout.write("\n")

    #increment the count
    cnt += 1

fin.close()
fout.close()

1.3.3. - Analysing and processing arrays ¶

You can now benefit from a collection of methods within numpy to process and analyse your data

numpy.min(a) find min value in the array
numpy.argmin(a) find position (AKA index) of the min value in the array
numpy.max(a) find max value in the array
numpy.argmax(a) find position (AKA index) of the max value in the array
numpy.unique(a) selects a subset of unique elements
numpy.sort(a) sorts the array max to min
numpy.mean(a) and numpy.std(a) compute mean and standard deviation of array values
numpy.median(a)
numpy.sum(a) sum the elements of an array

Let's try these on the data/ms.txt data file we have previously loaded.

In [55]:

#example of using methods for analysing MS data
data = np.loadtxt('data/ms.txt') # just in case, let's reload it again...

print('min', np.min(data)) # find smallest value
print('index of min', np.argmin(data)) # find index of the minimum
print('unique elements', np.unique(data)) # selects unique elements
print('mean', np.mean(data)) # compute mean of array values 
print('median', np.median(data)) # compute median of array values 
print('std', np.std(data)) # compute standard deviation of array values 
print('sum', np.sum(data)) # sum the elements of an array

min 0.0
index of min 1
unique elements [  0.00000000e+00   1.00000000e+00   2.00000000e+00 ...,   1.27830000e+04
   1.27840000e+04   1.27850000e+04]
mean 3757.48236359
median 1990.5
std 4201.62508877
sum 564802204.0

TASK 1.3.3 : Load the data file **data/ms.txt** provided and find the highest intensity peak, return its m/z value.

In [ ]:

# your solution here!

SOLUTION:

data = np.loadtxt("data/ms.txt")
maxval = np.max(data[:, 1])
idx = np.argmax(data[:,1])
mz=data[idx,0]

print ("peak", maxval, "at m/z", mz)

Part 1.4 - Boolean Indexing ¶

George Boole was a 19th century self-taught English mathematician, philosopher and logician. He is known for Boolean algebra, that is based on variables being True or False, denoted as 1 and 0 respectively.

The operations in Boolean algebra are and denoted as $\wedge$, or denoted as $\vee$ , and not denoted as $\neg$.

Venn diagram

Often there are a few ways to do the same thing. Here we show two types of expressions Bitwise and NumPy logical. At the end both do the same.

a | b ~= np.logical_or (A, B)
a & b ~= np.logical_and (A, B)
a ^ b ~= np.logical_xor (A, B)
a ~ b ~= np.logical_not (A, B)

1.4.1 - Boolean Tests ¶

Boolean tests on an array produce an array of booleans:

The function numpy.any(test) tests for existence of at least a True value in a boolean array

In [ ]:

#declare an array
a = np.array([32, 2, 65, 29, 7, 14, 57, 81, 27, 0, 56])

#declare tests
c = a>15
d = a<0

print ("condition c = a>15 ", np.any(c))
print ("condition d = a<0 ", np.any(d))

While the functions numpy.logical_or(test1, test2) and numpy.logical_and(test1, test2) apply element-wise boolean operations between two tests.

Can a value satisfy both conditon c and condition d?

In [ ]:

#Boolean indexing
print(np.logical_and(c, d))

indeed, but it can be either condition c or d...

In [ ]:

#Boolean indexing
print(np.logical_or(c, d))

1.4.2 - Boolean Indexing ¶

We can also use an array of booleans to index another array, i.e. only elements coresponding to True are extracted from the indexed array.

In [ ]:

#print the values that satisfy either c or d
a_slice = a[np.logical_or(c, d)]
print(a_slice)

TASK 1.4.2 : Load the file containing a mass spectrum (data/ms.txt), find m/z values in the region between m/z 6400 and 6600.

In [ ]:

# your solution here!

SOLUTION:

# your solution here!

data = np.loadtxt("data/ms.txt")

#test and slicing 
test = np.logical_and(data[:,0]>6400, data[:,0]<6600)
sliced_array = data[test, :]

maxval = np.max(sliced_array[:,1])
idx = np.argmax(sliced_array[:,1])
mz=sliced_array[idx,0]

print ("peak", maxval, "at m/z", mz)

Part 2 - Plotting data with Matplotlib ¶

matplotlib is a Python package helping us to plot neatly the data we are analysing.

Let's generate our first plot! As always, we load the module via import and call it plt. Then, to create a plot, we use command plt.plot()

In [15]:

#importing matplotlib
import matplotlib.pyplot as plt

# some values
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

#plot x against y
plt.plot(x, y)
plt.show()

2.1.1 - Customise the line style ¶

Now we can customise the line style with additional parameters, for example plt.plot(x, y, 'o') will change the line to set of points, as the third optional parameter o defines line style. It can also be x : - -- ...

We can (and should) always Google to learn more and see examples:

scrolling down the page you will see examples, then information on the input Parameters, where x, y is our data that is array-like or scalar, while other parameters are optional.

**kwargs set the properties of the plot and it is a big list. Ones I find using often are: -color defines the line’s colour and accepts several kinds of parameters:

r k b g y m c for colors defined by one letter
RGB codes (hexadecimal code) in a format #RRGGBB
named colours - Click here for the list
it can be can be abreviated to c = 'lightsalmon' meaning the colour is a chosen named colour.

When deciding what colours to use, remember that colours that seem clear to you may appear indistinguishable to others, or when printed. Matplotlib offers a wide selection of color palettes, including perceptually uniform sequential colour palettes, like viridis (now default). These are colour blind friendly and printer-safe. Find out more here.

linewidth changes line thickness, e.g. linewidth='2.0' changes it to 2 pixels
label name of the data used on the line
marker marker for the points
markersize marker size in pixel
alpha level of transparency from 0.0 to 1.0 o the marker (will not work for lines)

2.1.2 - Labeling the plot and the data ¶

It is always a good practice to label the plots.

Use the following commands to add the labels to your plot:

xlabel()
ylabel()
title()

Hint : To neatly write sub- and superscripts on th plots, like $x_2$ or $x^2$ in the example above, use the $LaTeX$ notationin the code - $x_2$ and $x^2$ respectively. For more examples see here.

TASK 2.1 : generate a 1D array, x, and plot $x^2$ using non-default line types and colours, label the plot

In [ ]:

#your plot here

SOLUTION:

#generating an array
x = np.linspace(-10, 10, 21) 
y = x**2

#plotting with X in a named colour, connected by a dotted line of a declared width
plt.plot(x, y, 'x:', color='tomato', linewidth='1.5') 

#adding labens
plt.xlabel('x')
plt.ylabel('y')
plt.title('my plot $y=x^2$')

plt.show()

Often you will find yourself needing to plot a few data sets on one graph. In this situation it is essential to label your data sets.

For example, let's generate some arrays and plot them in variouss colours and styles:

In [ ]:

#generate array x
x = np.arange(10, 17, 0.5) 

#generate array y of equivalent shape 
y = np.linspace(-11, 28, x.shape[0])

#let's check what we have made
print(x.shape, x)
print(y.shape, y)

In [ ]:

#let's do some math on these arrays
a = x + 3
b = x ** 2
c = x * y
d = np.sqrt((x*50)+y**4)

# to print numbers we got out, uncomment lines below
#print (a)
#print (b)
#print (c)
#print (d)

In [ ]:

#plot these 
plt.plot(x, a, '.--', c='k', label='$x+3$')
plt.plot(x, b, 'x:', c='coral', label='$x^2$')
plt.plot(x, d, ':',  c= '#7E317B', linewidth=3, label='$\sqrt{50x+y^4}$')

#set one label to use a few times
l='$xy$'
plt.plot(x, c, '-', c='teal', label=l)
plt.plot(x, c, marker='o', alpha=0.2, markersize=10, label=l)

#set plot labels, show title and leend
plt.xlabel('x')
plt.ylabel('y')
plt.title('functions')
plt.legend(loc="upper left")
plt.show()

2.1.3 - Creating sub-plots ¶

The above plot is quite overwhelming, instead, we can create a few sub-plots

2.1.4 - Saving your plots ¶

Also, don't forget to save your beautiful plot for latter use in lab report, paper, thesis.

this cab be done with plt.savefig('name.extention')

while you ned to name your plot and decide on the format there may be other parameters you want to use, e.g.

>python
plt.savefig('myfigure.png', dpi=300)

where dpi is dots per inch, i.e. resolution of your figure. Normally 300 will get you a good quality figure for a paper. There are other paranmeters you may want to use, like transparency, metadata, etc. read documentation here.

In [ ]:

# create a figure
plt.figure()

#create subplots
plt.subplot(2, 1, 1) #2 plots per line, 1 per column, plot 1            
plt.plot(x, a, '.--', c='k')
plt.xlabel('x')
plt.ylabel('y')
plt.title('function $a=x+3$')

plt.subplot(2, 1, 2) #2 plots per line, 1 per column, plot 2
plt.plot(x, b, 'x:', c='coral')
plt.xlabel('x')
plt.ylabel('y')
plt.title('function $b=x^2$')

#adjust the spacing between subplots, so they do not overlap
plt.subplots_adjust(hspace=1.)

#create another figure
palatinate='#7E317B'
plt.figure(2)
plt.plot(x, c, '.-', c='teal', label='$xy$' )
plt.plot(x, d, ':',  c= palatinate, linewidth='3', label='$\sqrt{50x+y^4}$')
plt.legend()

plt.xlabel('x')
plt.ylabel('y')
plt.title('functions')

#save image as .png (or .eps or .pdf or many other)
plt.savefig("myfigure.png") 

#plt.show should be called just ONCE
plt.show() 

There are many styles for plots that matplotlib can offer.

Today we just covered the very basics to get you started. Let's apply what we have learned for some data analysis!

TASK 2.2 : load the mass spectrum data **data/ms.txt** and plot it. Add axes labels, a legend, and make sure it looks nice! Advanced: can you find the highest peak using numpy?

In [21]:

# your solution here!

Fun fact: The mass spectrum you are plotting looks more complicated than you expected? This is because it is the measure of a polydisperse protein, studied by native mass spectrometry.

Part 3 - Data Analysis with Numpy and Scipy ¶

3.1. Fitting Models to Data ¶

Linear least squares is a convex problem. As long as the number of parameters in the model to be fitted is greater than the number of datapoints (i.e. the problem is overdetermined) an analytical solution exists. A problem is linear only if model's parameters are related to eachother via sums.

A typical example of linear least squares is that of polynomials fitting. The functions polyfit and polyval, provided within numpy, are dedicated to polynomial fitting.

In [112]:

def fct(x, a, b, c):
    return a*x**2 + b*x + c

# generate a noisy signal
coeffs= [1, 1, 1]
x = np.linspace(0, 10, 10)
y_clean = fct(x, *coeffs)
noise = np.random.normal(0, 20, len(x))
y = y_clean + noise

# polynomial fit of noisy signal
poly = np.poly1d(coeffs)
x2 = np.linspace(0, 10, 100)
y2 = poly(x2)

# plot noisy signal and polynomial fit
fig = plt.figure()
plt.plot(x, y, 'ko', label="data")
plt.plot(x2, y2, 'r-', label="model")
plt.xlabel("x")
plt.ylabel("y")

#if you want to display lines representing residuals (toggle replacing "True" with "False")
if True:
    y_predict=poly(x)
    for i, y_tmp in enumerate(y_predict):
        if i == 0:
            plt.plot([x[i], x[i]], [y[i], y_tmp], 'r--', alpha=0.5, label="residuals")
        else:        
            plt.plot([x[i], x[i]], [y[i], y_tmp], 'r--', alpha=0.5)

    
plt.legend(frameon=False, loc="upper left")
plt.savefig("simple_fit.png")

plt.show()

How do we know a fit is good? We can study the residuals, i.e. the difference between values in input data, and what the model predicts. Residuals should be Gaussian distributed, and feature no distinguishable trends. Let's see an example with some more complicated data.

In [113]:

# polynomial function
def fct(x, a, b, c, d):
    return d*x**3 + c*x**2 - b*x + a

# create some noisy data
x = np.linspace(-10, 10, 101)
y_clean = fct(x, 1, 2, 1, 0.5)
noise = np.random.normal(0, 100, len(x))
y = y_clean + noise

# calculate its third-order polynomial fit
coef, pcov = np.polyfit(x, y, 3, cov=True)
poly = np.poly1d(coef)

# residuals (difference between data and model)
residuals = y-poly(x)

# plot noisy data with fitted model
fig = plt.figure()
x2 = np.linspace(-14, 14, 501)
plt.plot(x, y, 'ko')
plt.plot(x2, poly(x2), 'r-')
plt.xlim([x[0], x[-1]])
plt.ylim([-600, 600])
plt.savefig("data_noise_fit.png")

#plot the model's residuals (left plot) and their histogram (right plot)
fig = plt.figure(dpi=80, figsize=(8, 4))
ax = fig.add_subplot(1, 2, 1)
ax.plot(x, residuals, 'o', color='k') # residuals
ax.hlines(0, -10, 10)
ax.set_ylabel("residual")
ax.set_xlabel("x")

ax = fig.add_subplot(1, 2, 2)
ax.hist(residuals, orientation="horizontal",  color = "skyblue", ec="black") # residuals
ax.set_yticks([])
ax.set_yticklabels([])
fig.subplots_adjust(wspace=0, hspace=0)
ax.set_xlabel("count (#)")

plt.show()
plt.savefig("data_residuals_all.png")

<matplotlib.figure.Figure at 0x1d68ee51390>

In [ ]:

# SOME EXTRA INFORMATION
# you can calculate the properties of your fit
# e.g. analysing its covariance matrix and calculating its coefficient of determination
# To see those analysis, substitute "False" with "True" below"

if False:
    print("polynome coefficients: ", coef)
    pcov[pcov < 0.000000001]= 0
    print("covariance matrix: ", pcov)
    
    # Calculate coefficient of determination (R2)
    ybar = np.sum(y)/len(y)
    ssreg = np.sum((poly(x)- ybar)**2)
    sstot = np.sum((y - ybar)**2)
    det = ssreg / sstot
    print("coeff. of determination: ", det)

    # an alternative way of calculating R2:
    ssres = np.sum((y - poly(x))**2)
    sstot = np.sum((y - ybar)**2)
    det = 1- ssres / sstot
    print("coeff. of determination (method 2): ", det)

What happens if we fit the data with higher order polynomials? We may overfit the data! There are many ways of determing whether a model is overfitting, and thus if we should use a simpler model instead. For instance, for polynomial fits one can observe:

the sum of squared residuals for incrementally higher order polynomers. A good model is usually located right after an inflection point.
the highest polynomial parameter. The best model is usually the one with the smallest parameters.

In [120]:

# fit polynomials of increasingly high order (from 1 to 20)
p = []
order = range(1, 21)
res = []
maxv = []
for i in order:
    vals, resid, rank, sv, rcond = np.polyfit(x, y, i, full=True)
    poly = np.poly1d(vals)
    p.append(poly)
    res.append(resid) 
    maxv.append(np.max(np.abs(vals)))

# plot noisy data with polynomials of increasing order (a subset of all fits previously generated)
fig = plt.figure()
plt.plot(x, y, 'ko')

x2 = np.linspace(-14, 14, 501)
toplot = [0, 1, 2, 5, 8, 18]
for i in toplot:
    plt.plot(x2, p[i](x2), '-', label="%s"%order[i])
    
plt.xlim([x2[0], x2[-1]])
plt.ylim([-1000, 1000])
plt.legend(loc="lower center", ncol=2, frameon=False)
plt.savefig("data_lots_of_poly.png")

#plot sum of squared residuals for every fit
fig = plt.figure()
plt.plot(order, res, 'ko')
plt.xlabel("polynome order")
plt.ylabel("sum of squared residuals")
plt.xlim([0.5, 20.5])
plt.savefig("data_residuals.png")

# plot highest parameter value for every polynomial fit
best = np.argmin(maxv)
fig = plt.figure()
plt.plot(order, maxv, 'ko')
plt.plot([order[best]], [maxv[best]], 'ro')
plt.xlabel("polynome order")
plt.ylabel("max polynome coeff.")
plt.xlim([0.5, 20.5])

plt.show()
plt.savefig("data_coefficients.png")

<matplotlib.figure.Figure at 0x1d68eb46dd8>

A sufficiently high polynomial can even fit pure noise!

TASK 3.1 : What happens if you fit data with a polynome of an order equal to the number of datapoints? Does the result make sense? Plot your fit and make it look nice using matplotlib!

In [ ]:

# this is some random data
N = 20 # number of datapoints
x = np.linspace(-10, 10, N)
y = np.random.normal(0, 10, len(x))

# here, write your code to fit a polynomial of the same order of data (N)
# ...

# here, plot the noisy data and the fitted model
# ...

# ADVANCED: calculate and plot residuals
# ...

SOLUTION:

```python x = np.linspace(-10, 10, 20) y = np.random.normal(0, 10, len(x))

vals_20, resid_20, _, _, _ = np.polyfit(x, y, 20, full=True) poly_20 = np.poly1d(vals_20)

plt.plot(x, y, 'ko')

x2 = np.linspace(-12, 12, 501) plt.plot(x2, poly_20(x2), 'r-')

plt.xlim([x2[0], x2[-1]]) plt.ylim([-30, 30])

plt.show() plt.savefig("fit_random_noise.png")


</details>

To fit non-linear models, we cannot use linear least squares. We will then use a general curve fitting algorithm provided by another Python package: scipy.

The curve_fit function available in scipy’s optimize module provides methods for fitting models to data (default is Levenberg-Marquardt method).

In the example below, we generate a noisy Gaussian signal, and fit with a Gaussian model. We then observe its residuals to make sure the fit is ok.

In [110]:

from scipy.optimize import curve_fit

# define a gaussian function
def gaussian(x, ampl, center, width):
    return ampl*np.exp(-(x-center)**2/width)

# create gaussian-distributed noisy data
x = np.linspace(-10, 10, 101)
y = gaussian(x, 2.3, 0.2, 1.5) + np.random.normal(0, 0.2, len(x)) 

# find parameters ideally fitting a gaussian to noisy data
init = [1, 0, 1]
vals, covar = curve_fit(gaussian, x, y, p0=init)

# generate fitted curve profile
yfit = gaussian(x, vals[0], vals[1], vals[2])

# plt noisy data and gaussian fit
plt.figure()
plt.plot(x, y, 'o', color='k') # original data
plt.plot(x, yfit, '-', color = 'r', linewidth=3) # fit
plt.xlabel("x")
plt.ylabel("y")

#plot the model's residuals (left plot) and their histogram (right plot)
residuals = y-yfit

fig = plt.figure(dpi=80, figsize=(8, 4))
ax = fig.add_subplot(1, 2, 1)
ax.plot(x, residuals, 'o', color='k') # residuals
ax.hlines(0, -10, 10)
ax.set_ylabel("residual")
ax.set_xlabel("x")

ax = fig.add_subplot(1, 2, 2)
ax.hist(residuals, orientation="horizontal",  color="skyblue", ec="black") # residuals
ax.set_yticks([])
ax.set_yticklabels([])
fig.subplots_adjust(wspace=0, hspace=0)
ax.set_xlabel("count (#)")

plt.savefig("gaussian_fit.png")
plt.show()

Residuals are are distributed normally around 0, and feature no sytematic trends. This indicates the fit is good.

3.2. Integrating Differential Equations ¶

Scipy offers the integrate module, that allows integrating ordinary differential equations. Here we will use the method odeint.

The prey-predator (a.k.a. Lotka-Volterra) equations are a famous example of coupled differential equations. They describe the trend in the population of two species. Predators diminish when there are not enough preys, prey diminish when there are lots of predators. Let's integrate them!

In [81]:

from scipy.integrate import odeint

def dA_dt(A, t=0): # function to integrate
    prey = (b-p*A[1])*A[0]
    predator = (r*A[0]-d)*A[1]
    return np.array([prey, predator])

t = np.linspace(0, 5, 1000) #time

# define constants and initial parameters
b = 5.0
d = 5.0
p = 1.0
r = 1.0
A_init = np.array([10, 10])

# integrate the differential equation
A_int = odeint(dA_dt, A_init, t)

# plot time versus populations
plt.plot(t, A_int[:, 0], "b", label="prey")
plt.plot(t, A_int[:, 1], "r", label="predator")
plt.legend()

plt.xlabel("time (a.u.)")
plt.ylabel("population (#)")

plt.show()
plt.savefig("prey_predator.png")

<matplotlib.figure.Figure at 0x1d68eba6390>

An example of irreversible process is that of radioactive decay, where an element converts into another through a series of a series of irreversible steps. In the example below, an element A decays into C via an intermediate B.

TASK 3.1 : Integrate the system of three coupled differential equations above, describing radioactive decay.

In [ ]:

# your solution here

SOLUTION:

```python def dt(vals, t=0): # function to integrate dA = -k1*vals[0] dB = k1*vals[0] - k2*vals[1] dC = k2*vals[1] return np.array([dA, dB, dC])

t1 = np.linspace(0, 15, 100) #time k1 = 1.5 k2 = 0.5 pop_init = np.array([1, 0, 0]) pop_int1 = odeint(dt, pop_init, t1)

plot the time evolution of the three species¶

plt.figure(dpi=80, figsize=(12, 4)) plt.subplot(1,1,1) plt.plot(t1, pop_int1[:, 0], "b", label="A") plt.plot(t1, pop_int1[:, 1], "r", label="B") plt.plot(t1, pop_int1[:, 2], "k", label="C") plt.legend(loc="upper right")

plt.show()

plt.savefig("reactions_all.png")```

3.3. Differentiating and Integrating ¶

Given a signal, common procedures to analyse this data include calculating its derivative and area under the curve. Calculating the derivative is common practive to find edges in images, or identifying maxima/minima. Here are two possible ways to calculate it.

Calculating the area under the curve (i.e., for infinitely small bins, the integral) can also be calculated in several ways:

In [108]:

def gaussian(x, mu, sig):
    return 1./(np.sqrt(2.*np.pi)*sig)*np.exp(-np.power((x - mu)/sig, 2.)/2)

step = 0.1
x = np.arange(-10, 11, step)
y = gaussian(x, 0, 2)

# differentiation
y_grad = np.gradient(y, step)
y_diff = np.diff(y)/step

#integration (calculate are under the curve)
print("Area under the curve (integration)):")
print("- sum: ", np.sum(y)*step)
print("- trapezoid:", np.trapz(y, dx=step))

# plot original data
plt.figure(figsize=(5, 8), dpi=80)
plt.subplot(2, 1, 1)
plt.plot(x, y, '-', color='k')
plt.xlim([-10, 10])

# plot differentiation
plt.subplot(2, 1, 2)
plt.hlines(0, -10, 10)
plt.xlim([-10, 10])
plt.plot(x, y_grad, '.-', color='tomato', label="grad")
plt.plot(x[:-1], y_diff, 'x-', color='navy', label="diff")
plt.legend(frameon=False)
plt.show()

Area under the curve (integration)):
- sum:  0.999999727164
- trapezoid: 0.999999686456

Thank you for joining! back to top

Numerical Data Analysis¶

Table of Contents:¶

Part 1 - Arrays with Numpy¶

Part 1.1 - Defining Arrays¶

Part 1.2 - Slicing an Array¶

Part 1.3 - Mathematical operations on Arrays¶

Part 1.4 - Boolean Indexing¶

Part 2 - Plotting data with Matplotlib¶

Part 3 - Data Analysis with Numpy and Scipy¶

Part 3.1 - fitting data with models¶

Part 3.2 - integrating differential equations¶

Part 3.3 - derivative and integration¶

Part 1 - Arrays with Numpy ¶

1.1 - Defining Arrays ¶

1.1.1 - Creating an array from a list ¶

1.1.2 - Generating an array ¶

1.1.3 - Loading an array from/to a file ¶

1.2 - Slicing an array ¶

1.2.1. - Slicing a 1D array ¶

1.2.2. - Slicing a 2D array ¶

1.3 - Mathematical operations with arrays ¶

1.3.1 - Mathematical operations ¶

1.2 - Mathematical operations on arrays ¶

1.3.2. Axis of operations ¶

1.3.3. - Analysing and processing arrays ¶

Part 1.4 - Boolean Indexing ¶

1.4.1 - Boolean Tests ¶

1.4.2 - Boolean Indexing ¶

Part 2 - Plotting data with Matplotlib ¶

2.1.1 - Customise the line style ¶

2.1.2 - Labeling the plot and the data ¶

2.1.3 - Creating sub-plots ¶

2.1.4 - Saving your plots ¶

Part 3 - Data Analysis with Numpy and Scipy ¶

3.1. Fitting Models to Data ¶

3.2. Integrating Differential Equations¶

plot the time evolution of the three species¶

3.3. Differentiating and Integrating ¶

Part 1 - Arrays with Numpy ¶

Part 1.1 - Defining Arrays ¶

Part 1.2 - Slicing an Array ¶

Part 1.3 - Mathematical operations on Arrays ¶

Part 1.4 - Boolean Indexing ¶

Part 2 - Plotting data with Matplotlib ¶

Part 3 - Data Analysis with Numpy and Scipy ¶

Part 3.1 - fitting data with models ¶

Part 3.2 - integrating differential equations ¶

Part 3.3 - derivative and integration ¶

3.2. Integrating Differential Equations ¶