- Be able to plot with error bars
- Be able to choose the appropriate OLS case for data with measurement error
- Be aware of attenuation and how measurement error in the independent variable causes it

In [1]:

```
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, erf
import seaborn
seaborn.set_context("notebook")
seaborn.set_style("whitegrid")
import scipy.stats
```

Error bars are the short vertical or horizontal lines drawn through points in a graph to indicate the standard error of a measurement. They may also indicate a confidence interval or a standard deviation; that will usually be specified in the figure caption. The code to make an error bar plot with a constant y-error is shown below.

In [2]:

```
x = np.array([1,2,4,4.5,7,10])
y = 3 * x + 2
#yerr=1 means a constant standard error of 1 up and down
#fmt sets the marker/line style and color. capthick sets the thickness of the caps (the small dashes at the ends)
plt.errorbar(x,y, yerr=1, fmt='o', capthick=2)
plt.show()
```

Sometimes you have both x-error and y-error. You can plot that too:

In [3]:

```
plt.errorbar(x,y, yerr=1, xerr=0.5, fmt='o', capthick=2)
plt.show()
```

You may have a different error value at each point. Then you pass an array instead:

In [4]:

```
#THIS IS NOT PART OF PLOTTING ERROR BARS
#DO NOT COPY FOR HW/TEST
#THIS IS TO CREATE FAKE DATA FOR PLOTTING
#create some random numbers to use for my error bars
#square them to make sure they're positive
yerror = 5*np.random.random(size=len(x))**2
#END
plt.errorbar(x,y, yerr=yerror, xerr=0.5, fmt='o', capthick=2)
plt.show()
```

If you use quantiles or some other non-parametric technique, you can often end up with asymmetric error bars. Then you need to pass in a 2xN array with the distance down (below each point) in the first row and the distance up in the second row.

In [5]:

```
#THIS IS NOT PART OF PLOTTING ERROR BARS
#DO NOT COPY FOR HW/TEST
#THIS IS TO CREATE FAKE DATA FOR PLOTTING
#create some random numbers to use for my error bars
#square them to make sure they're positive
yerror_up = 2.5*np.random.random(size=len(x))**2
yerror_down = 2.5*np.random.random(size=len(x))**2
#END
#first row is the distance down, second row is the distance up
yerror = np.vstack( (yerror_down, yerror_up) )
plt.errorbar(x,y, yerr=yerror, xerr=0.5, fmt='o', capthick=2)
plt.show()
```

We're going to return to regression again. This time with error in both our independent and dependent variables. Here is the list of cases we'll consider:

- OLS with constant $y$ uncertainty in 1D
- OLS with constant $x$ uncertainty in 1D
- OLS with constant $x,y$ uncertainty in 1D
- OLS with multiple $y$ values in N-dimensions

In this case we have some constant extra uncertainty in $y$, so that when we measure $y$ we don't get the actual $y$; we get an estimate of it. For example, if I'm weighing out a powder and the balance is only accurate to $2$ mg, I don't get the true mass but instead an estimate with an uncertainty of $2$ mg. That means our measurements of $y$ do not contain the *true* value of $y$, but are instead an estimate of $y$. We're doing regression and our equation looks like this for the *true* $y$ values:

$$y + \eta = \alpha + \beta x + \epsilon$$

where $\eta$ is the extra uncertainty in $y$. We have measured $(y + \eta)$ and $x$ for our regression.

We can rearrange the equation a little bit and get:

$$y = \alpha + \beta x + \epsilon_1$$

where $\epsilon_1 = \epsilon - \eta$. The $_1$ stands for case 1. Notice that since $\eta$ and $\epsilon$ are normally distributed and centered at $0$, we don't actually get a smaller error term for $\epsilon_1$ than $\epsilon$. Since we've arrived at the same equation as the usual OLS regression with a slope and intercept, we can use the same equations, EXCEPT that our standard error of $\epsilon_1$ is slightly different. The standard error is:

$$ S^2_{\epsilon_1} = S^2_{\epsilon} + \sigma_{\eta}^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{N - 2} + \sigma_{\eta}^2 $$

where $S^2_{\epsilon}$ was our previously used standard error term, the $-2$ is for the reduction in degrees of freedom, and $\sigma_{\eta}^2$ is the squared error in our measurement. Notice that "error" here generally means an instrument's stated precision.

$$\hat{\beta} = \frac{\sigma_{xy}}{\sigma_x^2}$$

$$\hat{\alpha} = \frac{1}{N}\sum_i (y_i - \hat{\beta}x_i)$$

$$ S^2_{\epsilon_1} = \frac{SSR}{N-2} + \sigma_{\eta}^2 $$

$$SSR = \sum_i (y_i - \hat{\beta}x_i - \hat{\alpha})^2$$

$$S^2_{\alpha} = S^2_{\epsilon_1} \left[ \frac{1}{N} + \frac{\bar{x}^2}{\sum_i\left(x_i - \bar{x}\right)^2}\right]$$

$$S^2_{\beta} = \frac{S^2_{\epsilon_1}}{\sum_i \left(x_i - \bar{x}\right)^2}$$
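These estimators can be sketched directly in NumPy as a sanity check. The data, the true slope and intercept, and the measurement error `sigma_eta` below are all invented for illustration:

```python
import numpy as np

# Synthetic data (invented for illustration): true y = 2 + 0.5 x plus
# regression noise; sigma_eta is an assumed instrument error in y.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, size=len(x))
sigma_eta = 0.2

N = len(x)
# slope and intercept estimates
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = np.mean(y - beta_hat * x)
# residual sum of squares and the case 1 standard error
ssr = np.sum((y - beta_hat * x - alpha_hat) ** 2)
s2_e1 = ssr / (N - 2) + sigma_eta ** 2
# standard error of the slope
s2_beta = s2_e1 / np.sum((x - np.mean(x)) ** 2)
print(beta_hat, alpha_hat, np.sqrt(s2_beta))
```

With only 20 points and small noise, the estimates land close to the true slope and intercept, and the extra $\sigma_\eta^2$ term makes $S^2_{\epsilon_1}$ strictly larger than the plain residual variance.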

The Gibbs equation for a chemical reaction is:

$$\Delta G = \Delta H - T \Delta S$$

where $\Delta G = -RT\ln Q$ and $Q$ is the equilibrium constant. We can measure $\Delta G$ by measuring $Q$, and due to instrument precision we know that the precision (generally 1 standard deviation) of $\Delta G$ is 2 kcal/mol. What is the change in entropy, given these measurements:

$T \textrm{[K]}$ :300, 312, 325, 345, 355, 400

$\Delta G \textrm{[kcal/mol]}$: 5.2, 2.9, 0.4, -4.2, -5, -13

In [6]:

```
T = np.array([300., 312, 325, 345, 355, 400])
DG = np.array([5.2, 2.9, 0.4, -4.2,-5, -13])
plt.errorbar(T, DG, yerr=2, fmt='go', capthick=3)
plt.xlim(290, 420)
plt.xlabel('T [Kelvin]')
plt.ylabel(r'$\Delta G$ [kcal/mol]')
plt.show()
```

We can use our equations above to find the entropy, which is the negative of the slope:

In [7]:

```
cov_mat = np.cov(T, DG, ddof=1)
#slope = sigma_xy / sigma_x^2
slope = cov_mat[0,1] / cov_mat[0,0]
DS = -slope
print(DS, 'kcal/mol*K')
```

Now if we want to give a confidence interval, we need the standard error first. Let's start by checking our fit, for which we'll need the residuals.

In [8]:

```
intercept = np.mean(DG - T * slope)
print(intercept)
plt.errorbar(T, DG, yerr=2, fmt='go', capthick=3)
plt.plot(T, T * slope + intercept, '-')
plt.xlim(290, 420)
plt.xlabel('T [Kelvin]')
plt.ylabel(r'$\Delta G$ [kcal/mol]')
plt.show()
```

In [9]:

```
residuals = DG - T * slope - intercept
#SSR / (N - 2)
sig_e = np.sum(residuals**2) / (len(T) - 2)
#this is where we include the error in measurement
s2_e = sig_e + 2.0 ** 2
s2_slope = s2_e / (np.sum( (np.mean(T) - T)**2 ) )
```

Now we have the standard error of the slope, which is also the standard error of the entropy. Now we can get our confidence interval:

In [10]:

```
#97.5% t-value with N - 2 degrees of freedom; don't overwrite the T array
t_val = scipy.stats.t.ppf(0.975, len(T) - 2)
slope_ci = t_val * np.sqrt(s2_slope)
print(slope_ci)
```

Remember, the slope is the negative change in our entropy. So our final answer is

$$\Delta S = 0.18 \pm 0.06 \frac{\textrm{kcal}}{\textrm{mol}\cdot\textrm{K}}$$

Now consider a measurement error in our independent variable. Our $x$ values are just our best estimates; we don't know the true $x$ values. Our model equation is:

$$y = \alpha + \beta(x + \eta) + \epsilon$$

where our measurements are $y$ and $(x + \eta)$. Once again we can rearrange our model equation into:

$$y = \alpha + \beta x + \epsilon_2$$

where $\epsilon_2 = \beta \eta + \epsilon$. Everything is the same as before, except that our extra variance term depends on the slope. That changes our standard error equation to:

$$ S^2_{\epsilon_2} = \frac{SSR}{N - 2} + \hat{\beta}^2\sigma_{\eta}^2 $$

where $\sigma_{\eta}^2$ is again the squared error in our measurement (here, in $x$). Note that this is an approximate method.

Repeat the case 1 example, except with an error in temperature measurement of 5 K.

In [11]:

```
T = np.array([300., 312, 325, 345, 355, 400])
DG = np.array([5.2, 2.9, 0.4, -4.2,-5, -13])
plt.errorbar(T, DG, xerr=5, fmt='go', capthick=3)
plt.xlim(290, 420)
plt.xlabel('T [Kelvin]')
plt.ylabel(r'$\Delta G$ [kcal/mol]')
plt.show()
```

Our slope and intercept are unchanged, but our standard error is different.

In [12]:

```
#Now we use the independent variable measurement error
s2_e = sig_e + slope**2 * 5.0 ** 2
s2_slope = s2_e / (np.sum( (np.mean(T) - T)**2 ) )
#97.5% t-value with N - 2 degrees of freedom; don't overwrite the T array
t_val = scipy.stats.t.ppf(0.975, len(T) - 2)
slope_ci = t_val * np.sqrt(s2_slope)
print(slope_ci)
```

With the new measurement error, our new confidence interval for entropy is:

$$\Delta S = 0.18 \pm 0.03 \frac{\textrm{kcal}}{\textrm{mol}\cdot\textrm{K}}$$

There is an interesting side effect of independent measurement error. Let's look at some plots showing increasing uncertainty in $x$, but always with a true slope of 3:

In [13]:

```
plt.figure(figsize=(24, 12))
rows = 2
cols = 3
N = 1000
for i in range(rows):
    for j in range(cols):
        index = i * cols + j + 1
        plt.subplot(rows, cols, index)
        #the x measurement error grows with each panel
        err = scipy.stats.norm.rvs(loc=0, scale=index - 1, size=N)
        x = np.linspace(0, 5, N)
        y = 3 * (x + err)
        plt.plot(x, y, 'o')
        plt.xlim(-2, 7)
        plt.title(r'$\sigma_\eta = {}$'.format(index - 1))
plt.show()
```

Notice that as the error in our measurement gets larger, the slope becomes less and less clear. This is called **attenuation error**. As uncertainty in $x$ increases, our estimate of $\beta$ shrinks toward zero. We usually don't correct for this, because all our hypothesis tests become *more* conservative due to attenuation, and thus we won't accidentally conclude there is a correlation when there isn't one. But be aware that when the uncertainty in $x$ becomes similar in size to the range of our data, we will underestimate the slope.
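The attenuation effect can also be checked numerically. This sketch (synthetic data, invented for illustration) fits the slope with plain OLS while the $x$ measurement error grows; the fitted slope shrinks from the true value of 3 toward zero:

```python
import numpy as np

# True relationship: y = 3 x, with no y noise at all
rng = np.random.default_rng(0)
N = 10000
x_true = np.linspace(0, 5, N)
y = 3 * x_true

slopes = {}
for sigma_eta in [0.0, 0.5, 1.0, 2.0]:
    # we regress against a noisy measurement of x, not the true x
    x_obs = x_true + rng.normal(0, sigma_eta, size=N)
    slopes[sigma_eta] = np.cov(x_obs, y, ddof=1)[0, 1] / np.var(x_obs, ddof=1)
    print(sigma_eta, slopes[sigma_eta])
```

In the limit of many points, the fitted slope is attenuated by roughly the factor $\sigma_x^2 / (\sigma_x^2 + \sigma_\eta^2)$, where $\sigma_x^2$ is the variance of the true $x$ values.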

As you may have expected, the standard error of $\epsilon_3$, for the case of measurement error in both $x$ and $y$, is just a combination of the previous two cases:

$$S^2_{\epsilon_3} = \frac{SSR}{N - 2} + \hat{\beta}^2\sigma^2_{\eta_x} + \sigma^2_{\eta_y}$$

Repeat the case 1 example with an uncertainty in $\Delta G$ of 2 kcal/mol and in $T$ of 5 K.

In [14]:

```
temperature = np.array([300., 312, 325, 345, 355, 400])
DG = np.array([5.2, 2.9, 0.4, -4.2,-5, -13])
plt.errorbar(temperature, DG, xerr=5, yerr=2, fmt='go', capthick=3)
plt.xlim(290, 420)
plt.xlabel('T [Kelvin]')
plt.ylabel(r'$\Delta G$ [kcal/mol]')
plt.show()
```

In [15]:

```
cov_mat = np.cov(temperature, DG, ddof=1)
slope = cov_mat[0,1] / cov_mat[0,0]
DS = -slope
print(DS, 'kcal/mol*K')
intercept = np.mean(DG - temperature * slope)
print(intercept)
plt.errorbar(temperature, DG, xerr=5, yerr=2, fmt='go', capthick=3)
plt.plot(temperature, temperature * slope + intercept, '-')
plt.xlim(290, 420)
plt.xlabel('T [Kelvin]')
plt.ylabel(r'$\Delta G$ [kcal/mol]')
plt.show()
```

In [16]:

```
residuals = DG - temperature * slope - intercept
#SSR / (N - 2)
sig_e = np.sum(residuals**2) / (len(temperature) - 2)
#The only new part
#-------------------------------------
#Now we include both the dependent and the independent variable measurement error
s2_e = sig_e + slope**2 * 5.0 ** 2 + 2.0 ** 2
#-------------------------------------
s2_slope = s2_e / (np.sum( (np.mean(temperature) - temperature)**2 ) )
t_val = scipy.stats.t.ppf(0.975, len(temperature) - 2)
slope_ci = t_val * np.sqrt(s2_slope)
print(slope_ci)
```

With both measurement errors, our confidence interval for entropy is:

$$\Delta S = 0.18 \pm 0.08 \frac{\textrm{kcal}}{\textrm{mol}\cdot\textrm{K}}$$

which is a larger confidence interval than for case 1.

Sometimes you'll see people take multiple measurements at each $x$-value so that they can plot error bars. For example, let's say we have 3 $x$-values and 4 $y$-values at each $x$-value. That gives enough samples to compute a standard error for each mean $y$-value:

$$S_y = \sqrt{\frac{\sigma_y^2}{N}}$$

In [17]:

```
x = np.array([2, 4, 6])
y_1 = np.array([1.2, 1.5, 1.1, 0.9])
y_2 = np.array([2.6, 2.2, 2.1, 2.5])
y_3 = np.array([3.0, 2.9, 3.3, 5])
y = np.array([np.mean(y_1), np.mean(y_2), np.mean(y_3)])
#compute standard error: sqrt(variance / N), with N = 4 replicates
yerr = np.sqrt(np.array([np.var(y_1, ddof=1), np.var(y_2, ddof=1), np.var(y_3, ddof=1)]) / 4.0)
plt.errorbar(x, y, yerr=yerr, fmt='o', capthick=3)
plt.xlim(0, 10)
plt.show()
```

How would we do regression on this? It turns out you don't do anything special: you treat the 12 $(x, y)$ pairs as 12 equivalent data points, repeating each $x$-value once per replicate, and use the normal regression equations. I know this seems obvious, but it's a common question.
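As a minimal sketch using the same data as above, that looks like:

```python
import numpy as np

# Data from the example above: 3 x-values, 4 y replicates each
x = np.array([2, 4, 6])
y_reps = np.array([[1.2, 1.5, 1.1, 0.9],
                   [2.6, 2.2, 2.1, 2.5],
                   [3.0, 2.9, 3.3, 5.0]])

# pair each x with all of its y replicates
x_flat = np.repeat(x, y_reps.shape[1])  # [2,2,2,2,4,4,4,4,6,6,6,6]
y_flat = y_reps.ravel()

# ordinary OLS on the 12 equivalent points
slope = np.cov(x_flat, y_flat, ddof=1)[0, 1] / np.var(x_flat, ddof=1)
intercept = np.mean(y_flat - slope * x_flat)
print(slope, intercept)
```

This gives exactly the same slope as regressing the three group means against $x$, since the replicates at each $x$ are balanced.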