#!/usr/bin/env python
# coding: utf-8
# # Basic statistics
#
#
# This notebook is an element of the free [risk-engineering.org courseware](https://risk-engineering.org/). It can be distributed under the terms of the [Creative Commons Attribution-ShareAlike licence](https://creativecommons.org/licenses/by-sa/4.0/).
#
# Author: Eric Marsden .
#
# ---
#
# This notebook contains an introduction to use of Python and the NumPy library for basic statistical calculations.
# See the [associated course materials](https://risk-engineering.org/statistical-modelling/) for background information and to download this content as a Jupyter notebook.
#
# We start by importing the numpy library, which makes it possible to use functions and variables from the library, prefixed by `numpy`.
# In[1]:
import numpy
# We can use Python as simple interactive calculator:
# In[2]:
2 + 3 + 4
# Here we call the `sqrt` function from the numpy library.
# In[3]:
numpy.sqrt(2 + 2)
# Some useful constants are predefined.
# In[4]:
numpy.pi
# In[5]:
numpy.sin(numpy.pi)
# The notation `e-16` above means $10^{-16}$; the number above is very very small (it’s a numerical approximation to the mathematical answer of zero).
# We can generate a random number from a uniform distribution between 20 and 30. If you evaluate this several times (in most Jupyter interfaces, press `Shift-Enter` or press on the `Run` button in the toolbar above), it will generate a different random number each time.
# In[6]:
numpy.random.uniform(20, 30)
# In[7]:
numpy.random.uniform(20, 30)
# We can generate an **array** of random numbers by passing a third argument to the `numpy.random.uniform` function, saying how many random numbers we want. We store the array in a *variable* named `obs`.
# In[8]:
obs = numpy.random.uniform(20, 30, 10)
obs
# The builtin function `len` in Python tells us the length of an array or a list.
# In[9]:
len(obs)
# We can do arithmetic on arrays, adding them together or subtracting a constant from each element.
# In[10]:
obs + obs
# In[11]:
obs - 25
# We can apply a numpy function to all the elements of an array.
# In[12]:
numpy.sqrt(obs)
# The array has *methods*, a kind of function that acts on the array.
# In[13]:
obs.mean()
# In[14]:
obs.sum()
# In[15]:
obs.min()
# There are similar functions in the `numpy` library that take an array as argument:
# In[16]:
numpy.mean(obs)
# In[17]:
numpy.sum(obs)
# In[18]:
numpy.min(obs)
# ## Simple plotting
# The matplotlib library allows you to generate many types of plots and statistical graphs in a convenient way. The [online gallery](https://matplotlib.org/gallery.html) shows the variety of plots available, and the [documentation](https://matplotlib.org/contents.html) is also available online. We import the `pyplot` component of matplotlib and give it an alias `plt`.
# In[19]:
import matplotlib.pyplot as plt
plt.style.use("bmh") # this affects the style (colors etc.) of plots
get_ipython().run_line_magic('config', 'InlineBackend.figure_formats=["svg"]')
# In[20]:
X = numpy.random.uniform(20, 30, 10)
Y = numpy.random.uniform(50, 100, 10)
plt.scatter(X, Y);
# In[21]:
x = numpy.linspace(-2, 10, 100)
plt.plot(x, numpy.sin(x));
# We can add two vectors together, assuming that all their dimensions are identical. Our array $x$ has one dimension of size 100. We can add another random vector of size 100 to it, containing numbers drawn from a uniform probability distribution between -0.1 and 0.1 (these represent some random “noise” which is added to our sine curve).
# In[22]:
x = numpy.linspace(0, 10, 100)
obs = numpy.sin(x) + numpy.random.uniform(-0.1, 0.1, 100)
plt.plot(x, obs);
# The **central limit theorem** states that the sum of a number of independent random variables tends toward a normal distribution even if the original variables themselves are not normally distributed. We illustrate this result by examining the distribution of the sums of 1000 realizations of a uniformly distributed random variable, plotting the distribution as a histogram.
# In[23]:
N = 10_000
sim = numpy.zeros(N)
for i in range(N):
sim[i] = numpy.random.uniform(30, 40, 100).sum()
plt.hist(sim, bins=40, alpha=0.5, density=True);
# In[ ]: