%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, erf
import seaborn
seaborn.set_context("talk")
seaborn.set_style("whitegrid")
import scipy.stats
Assume for a moment that $X$ is a random variable with a PDF $P(X = x)$. Next, we have $y = f(x)$ and we want to know what $P(y)$ is. For example, let's say $x$ is mass measured on a balance, $y$ is the number of moles, and $f(x) = x \,/\, \mathrm{MW}$, $\mathrm{MW}=18$. If we have 5 measurements of $x$ that are [3.5g, 3.4g, 3.3g, 3.6g, 3.2g], What is the confidence interval for $y$?
The method is to propogate standard deviation or confidence interval widths using derivatives. We use the following approximation:
$$\frac{\Delta y}{\Delta x} \approx \left.\frac{dy}{dx}\right|_{x=\hat{x}}$$which is valid when $\Delta x$ is small relative to $\frac{dy}{dx}$.
This approximation can be rewritten as:
$$\Delta y \approx \left.\frac{dy}{dx}\right|_{x=\hat{x}}\Delta x$$so a distance in $x$ can be turned into a distance in $y$ using the derivative. To solve our example, we'll get a confidence interval in $x$ and turn it into a confidence interval in $y$. Notice that we evaluate the derivative at the sample-mean (the most likely point).
data = np.array([3.5, 3.4, 3.3, 3.6, 3.2])
sigma_x = sqrt(np.var(data, ddof=1))
mean = np.mean(data)
T_hi = scipy.stats.t.ppf(0.975, len(data) - 1)
ci_width = T_hi * sigma_x / sqrt(len(data))
print(ci_width)
print(mean)
0.196324316148 3.4
So the mean in $x$ is $3.4 \pm 0.18$. Then we find the expected $y$ using the $f(x)$ equation:
$$\hat{y} = \frac{\hat{x}}{18}$$and get the confidence interval width using our new distance conversion formula
$$w_y = \frac{1}{18} w_x$$print(mean / 18.)
print(ci_width / 18.)
0.188888888889 0.0109069064527
So value for the number of moles is $0.19 \pm 0.01$
We measure the volume of a sample with known mass. We obtain the following volumes: [1.2 ml, 0.9 ml, 1.3 ml, 1.0 ml, 1.3 ml]. The mass is 1g. What is the density and what is the uncertainty?
data = np.array([1.2, 0.9, 1.3, 1.0, 1.3])
sigma_x = sqrt(np.var(data, ddof=1))
mean = np.mean(data)
T_hi = scipy.stats.t.ppf(0.975, len(data) - 1)
v_width = T_hi * sigma_x / sqrt(len(data))
print(v_width)
print(mean)
0.22555946663 1.14
rho_width = -1.0 / mean**2 * v_width
print(1.0 / mean, rho_width)
0.877192982456 -0.173560685311
Notice that the negative sign has no influence on the confidence inerval. The answer is $\rho = 0.88 \pm 0.16$
The general formula for $N$-dimensions is:
$$\Delta f(x_1, \ldots, x_N) = \sqrt{\sum_i^N \left(\left.\frac{\partial f(x_1, \ldots, x_N)}{\partial x_i}\right|_{f(\hat{x})}\right)^2 \left(\Delta x_i\right)^2}$$So, for examlpe, in 2D it would be:
$$\Delta f(x,y) = \sqrt{\left(\left.\frac{\partial f(x,y)}{\partial x}\right|_{f(\hat{x}, \hat{y})}\right)^2 \left(\Delta x\right)^2 + \left(\left.\frac{\partial f(x,y)}{\partial y}\right|_{f(\hat{x}, \hat{y})}\right)^2 \left(\Delta y\right)^2 }$$The formula for density is:
$$\rho = \frac{m}{v}$$and I have the following measurements:
What is the 95% confidence interval for density?
The partial derviatves are:
$$\frac{\partial \rho}{\partial m} = \frac{1}{v}$$$$\frac{\partial \rho}{\partial v} = -\frac{m}{v^2}$$masses = np.array([3.5, 3.4, 3.3, 3.6, 3.2])
volumes = np.array([1.2, 0.9, 1.3, 1.0, 1.3])
m_ci_width = scipy.stats.t.ppf(0.975, len(masses) - 1) * np.sqrt(np.var(masses, ddof=1) / len(masses))
v_ci_width = scipy.stats.t.ppf(0.975, len(volumes) - 1) * np.sqrt(np.var(volumes, ddof=1) / len(volumes))
print(m_ci_width, v_ci_width)
0.196324316148 0.22555946663
Now, we check if the errors are small relative to the partial derivatives
m = np.mean(masses)
v = np.mean(volumes)
dm = (1.0 / v)
dv = (m / v**2)
print(dm, m_ci_width)
print(dv, v_ci_width)
0.877192982456 0.196324316148 2.6161895968 0.22555946663
Looks good enough!
drho = sqrt( dm**2 * m_ci_width**2 + dv**2 * v_ci_width**2)
print(m / v, drho)
2.98245614035 0.6147220918203058
So the density is $$2.98 \pm 0.57 \; \textrm{g / ml}$$
Sampling bias is when your samples are NOT independnet from one another, possibly due to some hidden variable.
A common sampling bias is the accidental conflation of experimental variables. For example, let's say you're studying a protein assay and want to know if it depends on temperature. So you prepare your experiments like this:
Day | Replicate Number | Temperature |
Monday | 0 | 25$^\circ{}$C |
Monday | 1 | 25$^\circ{}$C |
Monday | 2 | 25$^\circ{}$C |
Monday | 3 | 25$^\circ{}$C |
Tuesday | 0 | 30$^\circ{}$C |
Tuesday | 1 | 30$^\circ{}$C |
Tuesday | 2 | 30$^\circ{}$C |
Tuesday | 3 | 30$^\circ{}$C |
Wednesday | 0 | 35$^\circ{}$C |
Wednesday | 1 | 35$^\circ{}$C |
Wednesday | 2 | 35$^\circ{}$C |
Wednesday | 3 | 35$^\circ{}$C |
There are two problems here. The first is that you test one temperature per day. So, if on Tuesday you're tired and make mistakes, that will look like a temperature effect. Second, you increase in time and increase in temperature. So, if you get better over time at your experiment, that might appear to be a temperature effect.
Write down all your experiments and give them numbers. Then use numpy.random.choice
to get a random ordering of the experiments. All the problems above will disappear
Over the last 10 years, the cost of UR tuition has gone up every year. Over the last 10 years, rate of murders in the United States has decreased. Bam! Correlation. Probably not a causation.
Cigarette smoking decreases your risk of dementia. Why?
It can be used to disprove causaution. If your hypothesis is that listening to Ke$ha improves your critical thinking skills, seeing no improvement after listening to Ke$ha disproves your hypothesis.
Identify a rule that applies to the construction of number triples. For example, the rule could be "the three numbers must be divisible by 2 and are increasing". Here are two triples that are in the sequence: 1 2 3
and 3 5 8
. Propose a new triple, like an experiment, and I will tell you if it satisfies the rule. Your goal is to discover the rule I'm using to validate the triples.
I won't reveal the answer here, but the point is that you should use evidence to disprove and do not try to prove a hypothesis.
All plots that come from an experiment and not an equation should have error bars. No excuses. Why? Consider this plot:
plt.figure(figsize=(10, 8))
ind = np.arange(3)
width = 0.35
plt.bar([0,1, 2], [24, 27, 40], width=width)
plt.gca().set_xticks(ind + width / 2)
plt.gca().set_xticklabels( ('Li', 'Cs', 'Na'))
plt.show()
plt.figure(figsize=(10, 8))
plt.bar([0,1, 2], [24, 27, 40], width=width, yerr=15, color='r')
plt.gca().set_xticks(ind + width / 2)
plt.gca().set_xticklabels( ('Li', 'Cs', 'Na'))
plt.show()
However, there still could be a significant difference between the ions. If the error bars are standard deviations, then the standard error could be very small because we did 10000 experiments. That's why it's important to see the error bars and know what they are.
Matched data is a billion times better than unmatched. For example, having a person try an acne treatment on the left-side of their face and on the right-side is a much better experiment than having two people try two different treatments. Matched data means you have two "treatments" with the same individual. Another example is the state of Colorado legalizing marijuana. We can look at crime before and after the decriminilization to get very good results. Comparing the crime in Colorado vs Texas would introduce many other factors.
Whenever you design an experiment, you should try to use matched data. Matched data is not always possible. For example, you can't test a drug with a disease using matched data The person would be partially cured with the drug and it would be unethical to give them the disease a second time. In general though, matced data is much better than unmatched data.
These are all separate things. Much of the work we've done in class rely on this. If samples are independent, that means practically that they count equally in any algorithms we do. If they are normal, we can apply much of the error propogation and hypothesis testing techniques we know. If they are identically distributed, that means they are from the same distribution. That is not often discussed in the class, but if you somehow mix two datasets on accident, they have different population distributions. All these properties together are called IID (ind, identical) and normal IID.
Example problem:
You ask a group of two people to take an oral survey. You ask the first person what their stance on gun-control is. Then you ask the second person. Are these two samples independent?