Notebook

In [48]:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image

Lecture 15:¶

Learn some basic statisics - samples versus populations and empirical versus theorectical distributions.
Learn to calculate central tendencies, spreads.
Learn about significant figures and more about formatting output.
Learn some useful functions in NumPy and SciPy for simulating distributions and calculating statistics.

Basic statistical concepts¶

Much of this lecture has been cribbed from the open source paper by Olea (2008) available in its entirety here: https://pubs.usgs.gov/of/2008/1017/ofr2008-1017_rev.pdf which in turn was cribbed from Davis, J. (2002: Statistical and Data Analysis in Geology, 3rd Ed, Wiley, Hoboken).

While this is not a class in statistics, many Earth Scientists write programs to calculate statistics, evaluate significance, and estimate averages and their uncertainties.

So, what is (are?) statistics? Statistics is the way we analyze, interpret and model data. To do it properly we need to understand a few concepts:

accuracy versus precision: accuracy is how close your data are to the "truth" while precision is the reproducibility of your data.
population versus sample: the population is the set of all possible outcomes of a given measurement (if you had an infinite number of data points), while the sample is what you have - a finite number of data points.
probability: Probability is the measure of how likely it is for a particular event to occur. If something is impossible, it has a probability ($P$) of 0. If it is a certainty, it has a probability $P$ of 1.
Theoretical versus empirical distributions: Empirical distributions are measured data. Theoretical distributions are analytical probabililty functions (mathematical equations) that can be described with an equation. These can be applied to data, allowing interpretations about the likelihood of observing a given measurement, whether sets of data are "different" from theoretical or other empirical data sets and other powerful statistical tests.

Now we will go through each of these concepts in a bit more detail.

Accuracy versus precision¶

In [49]:

Image(filename='Figures/accuracy_precision.png',width=700)

Out[49]:

Figure from https://pubs.usgs.gov/of/2008/1017/ofr2008-1017_rev.pdf

Populations versus samples¶

The population is what you would have if you had all possible outcomes - but you never do. We can describe the distribution of a population by using equations which will predict, say, the fraction of measurements (the density) expected to fall within a given range ($x$ between 0 and 1), assuming a particular theoretical distribution. This curve is called the probability density function.

There are equations describing many different types of distributions and evaluating the equations gives us a theoretical distribution. In this lecture, we will look at a few common distributions (binomial, uniform, normal, and log-normal).

Samples are finite collections of observations which may belong to a given distribution. In this lecture, it will be handy to simulate 'measurements' by drawing 'observations' from a theoretical distribution, instead of making actual measurements. This is the Monte Carlo approach (after the gambling town).

Examples of theoretical versus empirical distributions:

Binomial distribution:¶

Theoretical¶

Perhaps the most straight forward distribution is the binomial distribution which describes the probability of a particular outcome when there are only two possibilities (yes or no, heads or tails, 1 or 0). For example, in a coin toss experiment (heads or tails), if we flip the coin $n$ times, what is the probability of getting $x$ 'heads'? We assume that the probability $p$ of a head for any given coin toss is 50%; put another way $p$ = 0.5.

The binomial distribution can be described by an equation:

$$P=f(x,p,n)= \frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}, x=0,1,2,...,n.$$

We can look at this kind of distribution by evaluating the probability for getting $x$ 'heads' out of $n$ attempts using our old friend from Lecture 11, the lambda function. We'll code the equation as a lambda function, and calculate the probability $P$ of a particular outcome (e.g., $x$ heads in $n$ attempts). We can collect all the answers in a list and plot the probability ($P$) versus the number of heads ($x$) out of $n$ attempts.

Here is a lambda function called Binomial, which returns the probability of a given number of heads ($x$) out of $n$ attempts. Note that for a coin toss, $p$ is 0.5, but other yes/no questions can be investigated as well (e.g., chance of winning the lottery given purchase of $n$ tickets). Oh and we finally have a use for the np.factorial( ) function we ran accross in Lecture 11. Or you could our own reduce funtion. :)

In [50]:

Binomial=lambda x,n,p :(np.math.factorial(n)/(np.math.factorial(x)*np.math.factorial(n-x)))\
                   *(p**(x))*(1.-p)**(n-x)
    

We can use Binomial to look at the predicted likelihood of getting $x$ heads out of $n=12$ attempts (coin tosses) with a $p$ (probability) of 0.5.

In [51]:

 # calculate the probability  of x heads in n attempts with a probability of 0.5
n,p=12,0.5 # number of attempts in each trial, probability of getting a head
xs=range(n+1) # range of test values from 0,N
Probability=[] # probability of getting x heads out of n attempts
for x in xs: # step trough the test values
    Probability.append(Binomial(x,n,p))# collect the theoretical probability of getting x heads
plt.bar(xs,Probability,width=.5,color='cyan') # plot as bar plot
plt.plot(xs,Probability,'r-',linewidth=2) # plot as solid line
plt.xlabel('Number of heads out of $n$ attempts') # add labels
plt.ylabel('Probability') 
# place a note in upper left in axes coordinates with a fontsize of 14.
plt.text(1,.2, 'n = %i'%(n),  fontsize=14)
plt.title('Binomial Distribution');

Red line is the theoretical probability distribution function. Cyan bars are the same, but plotted as a bar graph.

I plotted the outcome as both a bar plot and a solid red line, to demonstrate two different ways of visualizing the results. Note that the red line is the probability density function.

What you learn from this is that the most probable outcome is 6 out of 12 heads (with a $P$ of ~23%), but other outcomes can very well occur.

Empirical¶

One great feature about computers is that we can simulate a data sample to compare to our theoretical predictions. We can use the module scipy.stats to generate examples of simulated data sets in a process called Monte Carlo simulation and get probability functions for different kinds of probability distributions. In this lecture we will discover a few more, starting with stats.binom( ) which makes an object tied to a binomial distribution.

To use all the functions in stats we must first import it.

In [52]:

from scipy import stats

To generate some data, you can either patiently do the experiment with a coin toss, or we can just use the stats.binom( ) function to simulate 'realistic' data.

stats.binom( ) requires 2 arguments, $n$ and $p$ where $n$ is the number of data points and $p$ is the probability of a given outcome (e.g., True or False).

In [53]:

n,p=12,0.5 # same as for the theoretical case
binom=stats.binom(n,p) # size = 1 by default

To get the number of heads flipped in a single trial of $n$ coin tosses, given the probablity $p=0.5$, we can use a method of our binomial distribution object binom.rvs( )

binom.rvs( ) takes an argument for how many times you want to repeat your experiment (number of trials). If this isn't supplied, it will default to 1.

Let's try it with $n=12$ and $p=0.5$. This little code block returns the number of heads out of $n$ attempts. You will get a different answer every time you run it. It also gives the probabilty of getting that result from our lambda function Binomial( ).

In [54]:

x=binom.rvs()
print (x, 'heads, with a likelihood of: ',Binomial(x,n,p))

7 heads, with a likelihood of:  0.193359375

Our binom object also handily has the probability function built into it, so we don't have to call Binomial( ) at all. Instead we can call binom.pmf(x).

In [55]:

print(x, 'heads, with a likelihood of: ', binom.pmf(x))

7 heads, with a likelihood of:  0.19335937499999992

As the number of times you repeat this 'experiment' approaches infinity, the distribution of outcomes will approach the theoretical distribution (i.e., you will get an average of 9 heads out of 12 attempts 5% of the time).

So let's compare the results simulated via Monte Carlo for some number of experiments ($Nmc$) with the theoretical distribution. To do this, we pretend that each of the say, 20, students in the class ($Nmc=20$) flips a coin $n=12$ times and reports the number of heads. We can collect the number of heads flipped by each student in a list called Simulated. [Instead of actually having students do this, we will use the stats.binom( ) simulator instead.]

In [56]:

n,p=12,0.5
Nmc=20 # number of simulated experiments each with n  attempts
binom=stats.binom(n,p)
Simulated=binom.rvs(Nmc)
plt.bar(xs,Probability,color='cyan', label='Theoretical') # theoretical curve as bar graph
plt.hist(Simulated,density=True,stacked=True, color='orange',histtype='step',linewidth=3, label='Simulated') # note the normed key word - 
    #  this normalizes the total to be unity
plt.xlabel('Number of heads out of $n$ attempts')
plt.ylabel('Fraction of simulated experiments')
plt.text(1,.3, 'n = %i'%(n),  fontsize=14)
plt.legend();

The orange line is a bar chart of the Monte Carlo results. The blue line is the theoretical probability distribution function from before.

Notice that every time you repeat this, the Monte Carlo results are a little different. And if you change $Nmc$ to be, say 10, you get more and more "weird" results. But if you set $Nmc$ to 5000 - your results would look consistently closer to the theoretical predictions.

It is worth pointing out here that it is possible to make sure that every time you repeat a "random draw" that you get the SAME answer (for reproducibility of your code, for example if you share it with others). To do this you can use the function np.random(seed) option where $seed$ is set in your program. Here is examples with and without the seed:

Also, notice the argument density in plt.hist( ). We set it to "True" in the above plot. This means that the area under the histogram will sum to 1. plt.hist( ) does this by dividing the count by the number of observations times the bin width. If stacked is also True, the sum of the histograms is normalized to 1. We've done both in the above plot.

In [57]:

import numpy.random as random
seed=2020
print ('random numbers generated with seed')
for i in range(10):
    random.seed(seed)
    print (binom.rvs(Nmc))
print ('random numbers generated without seed')
for i in range(10):
    print (binom.rvs(Nmc))

random numbers generated with seed
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
[10  8  6  5  5  5  5  5  8  4  4  7  7  5  5  7  5  6  4  5]
random numbers generated without seed
[9 4 6 9 6 7 3 6 4 6 4 8 5 5 8 7 6 7 4 8]
[6 8 7 8 3 4 7 6 4 6 6 5 6 5 5 3 6 9 7 7]
[6 7 7 6 6 3 8 8 7 8 5 4 7 2 5 7 9 8 4 4]
[ 6  4 10  4  8  2  8  8  6  7  9  6  6  5  9  5  9  6  4  6]
[5 9 5 4 5 8 8 7 8 3 3 8 7 7 6 7 8 6 6 5]
[ 7  6  5  4  7  9  5  7  5  7  5  3  5  2  4 11  8  8  5  4]
[9 7 5 8 7 3 6 6 6 7 6 3 5 7 7 5 5 8 4 7]
[ 7  6  3  7  5  9  6  8  5  5  6 10  8  4  6  6  4  6  6  6]
[ 5  4 10  2  3  7  7 11  6  4  4  8  7  5  4  5  9  8  5  6]
[4 4 5 4 5 9 6 5 7 7 4 7 7 6 7 6 5 5 6 7]

For the purposes of this lecture, we want different random numbers every time, so we will work without the seed.

There are many other distributions that we could play with. In the rest of the lecture, we will look at a few of the more common ones in Earth Science: uniform, normal and log normal distributions. We'll approach each in the same way as for the binomial distributions by starting with the theoretical distribution followed by a Monte Carlo simulation of empirical results.

Uniform distribution¶

Theoretical¶

A uniform distribution is pretty much what it sounds like. All outcomes within the bounds of $a,b$ are equally likely. Outside those bounds, the probability is zero. For example, when playing dice, what is the likelihood of getting a particular number of dots (1 in 6).

The analytical form for a uniform distribution is:

$$P=f(x,a,b)= \frac{1}{b-a}, x=a \rightarrow b$$

where $a$ and $b$ are bounds ($a \le x<b$). $P$ is the probability of getting a value of $x$. It is zero if $x$ is not between $a$ and $b$, and between $a$ and $b$ the probability is a constant. In fact the probability is $1/n$ where $n$ is $b-a$, the number of possible outcomes.

Say we have a die with 1 to 6 dots on each face. The probability of getting one dot is 1 in 6, but getting seven dots is zero.

We code up the lambda function as before:

In [58]:

Uniform=lambda a,b : (1./(b-a)) # function for calculating P from a uniform distribution.  

And calculate the probability density function as before:

In [59]:

# calculate the probability of returning a value x with a uniform distribution between a,b
xs=range(11) # range of possible number of dots from 1 to 10. 
a,b=1,7 # bounds for the uniform distribution
Probability=[] # container for the theoretical results. 
for x in xs: # step through test values
    if x not in range(a,b+1): # if x<a or x>b,  there is zero probability of rolling this number
        Probability.append(0) # save the probability
    else: # otherwise
        Probability.append(Uniform(a,b)) # get the probability from our little function
plt.bar(xs,Probability,color='cyan') # plot as a bar plot
plt.title('Uniform')
plt.xlabel('Number of dots')
plt.ylabel('Probability');

Empirical¶

And now we can look at the Monte Carlo simulation of an empirical distribution using stats.uniform( ). Note that this uses the lower bound and the width of the distribution. This means we have to use the arguments $a$ and $b-a$.

In [60]:

a,b=1,7 # keep the same bounds
Nmc=20 # number of "students" rolling the dice
uni=stats.uniform(a,b-a) #Uniform takes a and b-a
Simulated=uni.rvs(Nmc).astype(int) # get Nmc test values in one go.  :)
# the .astype(int) makes this an array of integers, as only integers are possible outcomes
plt.bar(xs,Probability, color='cyan',label='Theoretical')
# plot results as histogram normed to sum to unity and use histtype of 'step' to make it see-through
plt.hist(Simulated,density=True, stacked=True, histtype='step',color='orange',linewidth=3.,label='Simulated') 
plt.xlabel('Number of dots')
plt.ylabel('Fraction')
plt.legend();

There are two more closely related distributions which we need to discuss: the normal and the log-normal distributions. We'll do these with the same approach as before, looking first at the theoretical probability function and then as a Monte Carlo simulation.

Normal distributions¶

The so-called "normal" distribution (also known as a Gaussian distribution after the guy who thought it up) describes data, for example, measurement data, that have uncertainties associated with them - they are more or less precise. There is some 'true' answer but all the measurements have some slop. Imagine measuring the width of a sedimentary bed, or the length of a fossil thigh bone or the distance between two points. The measurement data will have some average (a.k.a. central tendency) and some degree of spread. For normal distributions, the average is the arithmetic mean $\mu$ and the spread is the standard deviation, $\sigma$.

Theoretical:¶

The analytical form for a normal distribution is:

$$P=f(x,\mu, \sigma)= \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}},-\infty < x < \infty$$

We can put this equation into a lambda function called Normal and work on it as before with one small modification. We can write the Normal lambda function such that it returns an array if $x$ is passed as an array. This is a trick called 'vectorization' and is numerically more efficient than calling the function many times and appending to a list of results, for example. Use this trick any time you can!

In [61]:

Normal=lambda x,mu,sigma : (1./(sigma*np.sqrt(2.**np.pi)))*np.e**(-(x-mu)**2/(2.*sigma**2))

And, we calculate the probability of observing a measurement $x$ from a normal distribution with $\mu$ and $\sigma$.

In [62]:

mu,sigma,incr=10,1,.2 # set the mean,  standard deviation and bin width
xs=np.arange(5,15,incr) # make an array of test values
Probability=Normal(xs,mu,sigma) # get probabilities 
plt.bar(xs,Probability,width=incr,color='cyan', edgecolor='k') # make the bar chart
plt.plot(xs,Probability,'r-',linewidth=2) # plot as a continuous probability distribution
plt.xlabel('x')
plt.ylabel('Probability')
plt.text(5,.3,'$\mu$ = '+str(mu)) # stick on some notes
plt.text(5,.27,'$\sigma$ ='+str(sigma))
plt.title('Normal');

That should look a lot like the so-called 'bell curve' - the curve used for grading for example - because that is exactly what it is.

Empirical¶

So let's compare the bell curve with some simulated data. For this, we can use stats.norm( ).

In [63]:

mu,sigma=10,1 # set the mean,  standard deviation
Nmc=20 # number of monte carlo simultions
normal=stats.norm(mu,sigma)
Simulated=normal.rvs(Nmc) # get Nmc  simulated data points from distribution
plt.bar(xs,Probability,width=incr, edgecolor='k',label='Theoretical',color='cyan') # make the bar chart
plt.hist(Simulated,density=True,stacked=True, histtype='step',color='orange',linewidth=3.,label='Simulated') # plot them
plt.legend();

We talked about the mean and standard deviation of a theoretical distribution. Now we need to learn how to calculate them for our simulated sample (in case you don't know already!). Note that for the population (as defined above), the average is $\mu$ but for our sample, it is $\bar x$:

$$ \bar{x} = \frac{\sum x_i}{N},$$

where $\sum x_i$ means the sum of all individual values of $x$ from the first to the last.

Similarly, the standard deviation of the population is $\sigma$, while for our sample, it is $s$:

$$s = \sqrt {\frac {1}{N}\sum_{i=1}^{N}(x_i-\bar x)^2 }$$

Note that the variance of the data is $s^2$ (and for the population it is $\sigma^2$).

With NumPy, we can calculate the mean and standard deviation with the methods .mean( ) and .std( ) like this:

In [64]:

print ('Mean of the simulated distribution = ',Simulated.mean()) 
print ('standard deviation of simulated distribution =',Simulated.std()) 

Mean of the simulated distribution =  10.096084002293932
standard deviation of simulated distribution = 0.881821816548661

Now we is useful to remember about formating and significant figures from a few lectures ago. For our purposes, only the first few of these trailing decimals are significant, so we should format our print string accordingly. These are floating point variables and say we want a total of up to, say, 6 characters with two after the decimal point. We would write this with a string formatting statement with the form '%XXX.XXf' %(variable).

In [65]:

print ('Mean of the simulated distribution = %6.2f'%(Simulated.mean() ))
print ('standard deviation of simulated distribution =','%6.2f'%(Simulated.std() ))

Mean of the simulated distribution =  10.10
standard deviation of simulated distribution =   0.88

We can also use the round( ) function like so:

In [66]:

print (Simulated.mean().round(2))
print (Simulated.std().round(2))

10.1
0.88

The difference here is that the formatting option returns a string, while round() returns a number with the desired precision (e.g., 2 decimal places in the above example).

Let's plot the mean and $s$ on our histogram:

In [67]:

plt.bar(xs,Probability,width=incr,color='cyan',label='Theoretical') # make the bar chart
#stderr=Simulated.std()/np.sqrt(len(Simulated))
plt.hist(Simulated,density=True,stacked=True,histtype='step',color='orange',linewidth=3.,label='Simulated') # plot them
plt.plot([Simulated.mean(),Simulated.mean()],[0,.45],'k-',linewidth=2,label='Mean')
plt.plot([Simulated.mean()-Simulated.std(),Simulated.mean()-Simulated.std()],[0,.45],
         'g--',linewidth=2,label='Standard deviation')
plt.plot([Simulated.mean()+Simulated.std(),Simulated.mean()+Simulated.std()],[0,.45],
         'g--',linewidth=2,label='_nolegend_') # notice how to suppress a legend entry!
plt.ylim(0,.6)
plt.legend(loc=2);

Notice two things:

The standard deviation includes ~67% of the data (not 95% - that would be 1.97$\sigma$, or 2-sigma, informally). The $\pm \sigma$ bounds are the dashed lines in the above plot.
The mean of our sample is generally not the same as the mean of the distribution ($\bar x \ne \mu$). In fact, the 95% confidence bounds for the MEAN is related to the 'standard error', which is:

$s_e = \frac {s}{\sqrt N}$.

The 95% confidence bounds for the mean is given by 1.97$s_e$. This means in practice that the mean will be more than 1.97$s_e$ away from true mean 5% of the time. We could test that statement with another little Monte Carlo type simulation, but I leave that to student curiosity.

Now let's look at one last distribution: the log-normal distribution.

Log normal¶

Many things in nature are not normally distributed. Many processes lead to distributions in which the log of the quantity is normally distributed instead of the quantity itself. For example, grain sizes in sedimentary sequences are often log normally distributed. These distributions are log-normal and behave differently than normal distributions.

Theoretical¶

The equation for the log-normal probability density function is this:

$$P=f(x,\mu, \sigma)= \frac{1}{x\sigma \sqrt{2\pi}} \exp {\bigl(-\frac{(\text{ln}(x)-\mu)^2}{2\sigma^2}}\bigr),0< x < \infty$$

Notice that $x$ is in the denominator so we can't evaluate the probability when $x$=0.

As before we make a lambda function to evaluate the probability density of log-normal distributions:

In [68]:

# Here is a log normal probability density function maker: 
LogNormal=lambda x,mu,sigma : (1./(x*sigma*np.sqrt(2.*np.pi)))*np.e**(-((np.log(x)-mu)**2)/(2.*sigma**2))

And we evaluate it for a range of possible values of $x$ for given $\mu, \sigma$.

In [69]:

# calculate the probability of returning a value x with a normal distribution with mu and sigma
mu,sigma,incr=0,.5,.1 # set the mean,  standard deviation and bin width
xs=np.arange(incr,4,incr) # make an array of test values
Probability=LogNormal(xs,mu,sigma) # get probabilities for the array of Xs
plt.bar(xs,Probability,width=incr,color='cyan', edgecolor='k') # make the bar chart
plt.plot(xs,Probability,'r-',linewidth=2) # plot as a continuous probability distribution
plt.title('Log Normal Distribution')
plt.xlabel('X')
plt.ylabel('Probability');

Empirical¶

To simulate data, we can use stats.lognorm( ), specifying mean and standard deviation as for the normal distribution.

Here we plot the theoretical distribution and the simulated data, along with its mean and the $\pm \sigma$ bounds calculated as we did in the last lecture for normal distributions.

In [70]:

mu,sigma=0,.5 # set the mean,  standard deviation
Nmc=100 # number of monte carlo simultions
lognorm=stats.lognorm(scale=np.exp(mu),s=sigma)
Simulated=lognorm.rvs(Nmc) # get Nmc  simulated data points from distribution
plt.bar(xs,Probability,width=incr,color='cyan',edgecolor='k',label='Theoretical') # make the bar chart
plt.hist(Simulated,density=True,stacked=True,histtype='step',color='orange',linewidth=3.,label='Simulated') # plot them
plt.plot([Simulated.mean(),Simulated.mean()],[0,1],'k-',linewidth=2,label='Mean')
plt.plot([Simulated.mean()-Simulated.std(),Simulated.mean()-Simulated.std()],[0,1],
        'g--',linewidth=2,label='lower bound')
plt.plot([Simulated.mean()+Simulated.std(),Simulated.mean()+Simulated.std()],[0,1],
        'r--',linewidth=2,label='upper bound')
plt.xlabel('x')
plt.ylabel('Frequency')
plt.legend();

Hey!¶

Why is the mean way off to the right? And those confidence bounds don't look right at all!

It turns out that the mean as defined here is only good for 'normal' distributions and shouldn't be used for other types.

First, let's look at the plot using the log normal of the data instead of the data themselves:

In [71]:

barx=np.log(Simulated).mean() # get the  mean of the logs of the  simulated data set
s=np.log(Simulated).std() # get the standard deviation of same
plt.hist(np.log(Simulated),density=True,stacked=True,histtype='step',color='orange',linewidth=3) # plot them 
plt.plot([barx,barx],\
         [0,1],'k-',linewidth=2);

It appears that a log normal distribution looks more "normal" when plotted on a log scale. But the mean doesn't behave the way we expect for normal distributions (offset to the long tail).

So what would be a sensible way of describing the central tendency (Expectation $E$) and spread (variance) for a log normal distribution?

These are expressed as follows:

$$E(x) =\exp\bigr( \mu +\frac{1}{2}\sigma^2 \bigl),$$

and

$$ \text{var}(x) = \bigr[\exp (\sigma^2)-1\bigr]*\exp(2\mu + \sigma^2)$$

where $\mu, \sigma$ are the mean and standard deviation of the logs of the population. For the sample, we will refer to them at $\bar x$ and $s$ respectively and for the expectation and variance we will use $m$ and $v$.

In scipy, we can use the .

The function stats.norm( ) returns 'moments' of data samples, the first two of which are the log normal mean ($m$) and the variance $v$.

So, we should do the following:

calculate the mean $\bar x$, and standard deviation $s$ of the LOG of the simulated dataset,
set the scale to be $e^{\bar x}$
calculate the first few moments, mean ($m$) and variance ($v$).

Now we can calculate the moments, $m$ and $v$ and compare them to the originals $\bar x$ and $s$ and the $\mu$ and $\sigma$ of the theoretical distribution used to generate the simulated data:

In [72]:

lognorm_simulated=stats.lognorm(s=s,scale=np.exp(barx))
m,v=lognorm_simulated.stats()
print ('mu  %5.2f'%(mu),' and barx %5.2f'%(barx))
print ('sigma %5.2f'%(sigma),' and s %5.2f'%(s))
print ('E %5.2f'%(np.exp(mu + 0.5*sigma**2)),' and m %5.2f'%(m))
print ('var %5.3f'%((np.e**(sigma**2)-1)*np.e**(2*mu+sigma**2)),' \
       and v  %5.3f'%(v))

mu   0.00  and barx -0.09
sigma  0.50  and s  0.54
E  1.13  and m  1.05
var 0.365         and v  0.372

There is also a NumPy function to calculate the median (np.median( )) and for the mode we can use scipy.stats.mode( ). But be careful with this because it finds the most frequent value, which may have a bunch of significant figures and so each value occurs only once. To get a reasonable estimate of the mode, we first have to round off the numbers in Simulated. We can use our friend np.round() for that.

Here's how to calculate each of these types of expected values:

In [73]:

mean=Simulated.mean() # you knew this already
mean_other_way =np.mean(Simulated) # this is a different way, but works the same.  
median=np.median(Simulated) # this is the way to do the median with NumPy
mode,count=stats.mode(np.round(Simulated,2)) # round to 2 significant digits, return mode and number in that mode
print ('mean: ',mean.round(3))
print ('mean other way: ',mean_other_way.round(3))
print ('median: ',median.round(3))
print ('mode: ',mode,' count: ',count)

mean:  1.053
mean other way:  1.053
median:  0.914
mode:  [0.7]  count:  [4]

Notice that mode and count are both lists.

So let's plot up the data and put the mean, median and the mode on it.

In [74]:

newmode=mode[0] # make this a single number, not a list with one element
plt.hist(Simulated,density=True,stacked=True,bins=20,histtype='step',color='orange',linewidth=3) # plot them 
plt.plot(xs,Probability,'c-',linewidth=2,label='Probability'); # plot the theoretical probability distribution function
plt.plot([mean,mean],[0,.75],'k-',linewidth=2,label='Mean') # put on green line, label for legend. 
plt.plot([median,median],[0,1],'g-',linewidth=2,label='Median') # black line
plt.plot([newmode,newmode],[0,1],'m-',linewidth=2,label='Mode') # magenta line
plt.legend()   
plt.ylim(0,1.1)
plt.xlabel('x'); # set the y axis limits.  
plt.ylabel('Fraction');

Assignment #5¶

1. Bar Charts and Subplots¶

Read the data file "Datasets/ClayMinerals/clay_meta_data_v1.csv" into a pandas DataFrame
These are data for the clay minerals in different types of soil from Ito & Wagai 2017 ( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5667577/ ).
Use groupby( ) to group the datasets by 'Soil order' and 'Topsoil or Subsoil' (You can group by more than one column).
Use the mean( ) method to get the mean value for each type of clay for the topsoil and subsoil for each soil type.
Make a plot with 2 columns and 4 rows.
The .groupby( ) method has a method .agg( ) which takes a dictionary of the desired column headers (in this case the minerals) as keys and what action is required (in this case 'mean') as values. .groupby also can make plots specifying the kind of plot, the ax object, the rotation of the xlabel (use rot=0 in the following) and the title (use the soil type in the following).
Choose four soil types. For each soil type, plot a bar chart of the topsoil minerals in the left hand column and a bar chart of the subsoil minerals in the right hand column. For each chart, get rid of the minerals for which the abundance is 0.
Set the plot title to the type of soil (Topsoil or Subsoil)

2. Plotting Histograms.¶

In the datafile surfaceTemperature.csv, you will find global surface temperature data that was recorded by the MODIS satellite.
Peek at the data
Import the surface temperature data into a pandas dataframe /Datasets/SurfaceTemp/surfaceTemperature.txt
Sort the data
Plot the global surface temperature across the globe as a histogram. Plot the mean, median and mode as vertical lines and add a legend. Note that the mode( ) method works a little differently than the other to and you have to turn the result into a 'float'. Do these statistics tell us much for this dataset?
Generate a cumulative distribution function of temperature across the globe.

3. Convergence of the mean¶

Create a simulated normal distribution with a mean of 0 and a standard deviation of 1.
Using a for loop, create a program that draws a value from the distribution and appends it to a list. Do this until your list is 100 values long.
At each loop of the list, calculate the mean of the values that you have drawn so far. Append the mean to another list.
Plot the mean of each draw against number of points in the draw as a line plot. Add a red horizontal line at 0.
How many points does it take for the mean to reasonably approximate an answer of 0?

In [ ]: