In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image
import pandas as pd
```

Learn how to deal with bivariate data (fitting lines, curves).

Apply line fitting to determine the age of the Universe. Cool.

Until now, we've worked with data that do not depend on other data - they are *univariate*. But there are many examples in the Earth Sciences where we are interested in the dependence of one set of data on another (*bivariate data*) For example, the distance from the ridge crest versus age gives you a spreading rate. The depth in a sediment core versus age gives you a sediment accumulation rate. The ratio of the radioactive form of carbon, $^{14}$C, to a stable form,$^{12}$C, is a function of the age of the material being dated. And we already saw that the difference in arrival times of the $P$ and $S$ seismic waves is related to distance from the source to the receiver. These examples rely on the use of bivariate statistics to get at the desired relationships.

We'll use the retreat velocity of galaxies as a function of their distance as our example data set. This is the basic data set that underlies what has come to be known as "Hubble's Law" (same Hubble as for the Hubble telescope). Hubble published his results in 1929 [Hubble, E. P. (1929) Proc. Natl. Acad. Sci., 15, 168–173.] At the time, it was unclear whether the universe was static, expanding, or collapsing. Hubble hypothesized that if the universe were expanding, then everything in it would be moving away from us. The greater the distance between the Earth and the galaxy, the faster it must be moving. So all he had to do was measure the distance and velocity of distant galaxies. Easy-peasy - right?

To measure velocity, Hubble made use of doppler shift. To understand how this works, recall that the pitch you hear as an ambulance approaches changes. During doppler shift, the ambulance's pitch changes from high (as it approaches) to low (as it recedes). The pitch changes because the relative frequency of the sound waves changes. The frequency increases as the ambulance approaches, leading to a higher pitch, and then decreases as it moves away, resulting in a lower pitch.

In [33]:

```
Image(filename='Figures/doppler.jpg',width=600)
```

Out[33]:

*Figure from https://www.physicsclassroom.com/class/waves/Lesson-3/The-Doppler-Effect*

The same principle applies to light, but rather than hear a change in frequency, we observe a shift in the wavelength (the color) emitted by the galaxy. But how can we measure this shift? Recall our class on solar elemental abundances- stars are made primarily of hydrogen. Hydrogen absorbs energy at discrete wavelengths. So if light passes through hydrogen, it will lose energy at those specific wavelengths. If a star or galaxy is moving away from us, these distinct absorption bands are shifted towards longer wavelengths - the red end of the visible spectrum. The faster the star or galaxy travels away from the observer, the greater the shift will be to the red:

In [3]:

```
Image(filename='Figures/dopp-redshift.jpg',width=600)
```

Out[3]:

*Figure from http://www.a-levelphysicstutor.com/wav-doppler.php*

So, Hubble measured the red shift of different galaxies and converted them to velocities.

He then estimated the distance to these objects, which is harder to do. There are many ways to do this and you can look them up, if you like. In any case, we can use this dataset

http://keyah.asu.edu/lessons/AgeOfUniverse/Exercises.html

which contains distances to different galaxies and their velocities.

Before we read in the data and plot it, let's learn about styling text in **matplotlib**. To create nice labels with superscripts, we can use latex styling, the same as in a markdown cell. For a superscript, first we need to encase the text in dollar signs and then use the ^ symbol to make the following text a superscript. If there is more than one number in the superscript, you must enclose what you want as the superscript in curly braces.
For example, to print $10^3$, we use:

\$ 10^3 \,$

and for for 'per second' (s$^{-1}$), I use:

s\$^{-1}\$.

For a subscript, replace the ^ symbol with an _ symbol.

Okay, now let's read in the data and take a look.

In [4]:

```
data=pd.read_csv('Datasets/redShiftDistance/d_v.csv',header=1)
print (data.head()) # take a peek
```

Now we can plot the data (using the fancy latex styling tricks we just learned).

In [36]:

```
plt.plot(data['d (10^3 Mpc)'],data['v (10^3km/s)'],'r*') # plot the points as red stars
plt.ylabel('v (10$^3$km s$^{-1}$)') # notice the latex style labels
plt.xlabel('Distance (10$^3$ Mpc)');
```

So, we have distance on the X axis in Megaparsecs and velocity on the Y axis in 10$^3$km/s.

To calculate the age of the universe, we can use Hubble's logic:

Here is "Hubble's Law": $v = H_o d$, where $v$ is velocity, $d$ is distance, and $H_o$ is "Hubble's constant".

This looks a lot like the equation for a line through the data ($y=mx + b$) where $m$ is the slope and $b$ is the y-intercept. In this case, the y-intercept should be 0 or nearly so, and $m$ is $H_o$.

We can calculate the coefficients $m$ and $b$ in **NumPy** fairly easily using a function called **np.polyfit( )**. (A line is just a first degree polynomial after all).

In [6]:

```
help(np.polyfit)
```

**np.polyfit( )** can be used to calculate best fit lines (setting the degree (**deg**) to 1), or higher order curves (setting degree to 2 or higher).

All we want is a best fit line for now, so let's see how well it does:

In [38]:

```
print (np.polyfit(data['d (10^3 Mpc)'],data['v (10^3km/s)'],1))
```

Ahah! So $H_o$, the slope of the best-fit line, is 70.3 (in some weird units).

Before going on, let's plot the best fit line on our graph.

We can assign the best fitting slope and y-intercept from **np.polyfit( )** to a variable (**m_b**) and feed that to another function **np.polyval( )** which will calculate new Y values using the model of a linear fit:

In [39]:

```
help(np.polyval)
```

In [8]:

```
# save the statistics in an array called m_b
m_b= np.polyfit(data['d (10^3 Mpc)'],data['v (10^3km/s)'],1)
print (m_b) # see if that worked
```

**m_b** seems to be an array of coefficients, where the first is the slope and the second is the y-intercept.

Let's feed **m_b** into **np.polyval( )**, along with our X array to get a new set of Y values, assuming a linear fit. Then we can plot the model data as a black line along with the original Y values.

In [9]:

```
modelYs=np.polyval(m_b,data['d (10^3 Mpc)'])
# now plot the data and the best-fit line:
plt.plot(data['d (10^3 Mpc)'],data['v (10^3km/s)'],'r*')
plt.plot(data['d (10^3 Mpc)'],modelYs,'k-') # plot as black line
plt.ylabel('v (10$^3$km s$^{-1}$)')
plt.xlabel('Distance (10$^3$ Mpc)');
```

It would be handy to also know how well our data fit the model (not everything is linear, after all).

To do that, we need a different function: **scipy.stats.linregress( )**.

In [10]:

```
from scipy.stats import stats # import the module
help(stats.linregress)
```

And use it, to get what is normally called the $R^2$ value, which when 1. represents perfect agreement.

In [17]:

```
print (stats.linregress(data['d (10^3 Mpc)'],data['v (10^3km/s)']))
# and save the statistics as:
slope,intercept, r_value, p_value, std_err = stats.linregress(data['d (10^3 Mpc)'],data['v (10^3km/s)'])
print ('\n')
print ('slope: %7.3f, intercept: %4.1f, R^2: %5.3f'%\
(slope, intercept, r_value**2))
```

Not a bad fit! The Universe is expanding.

Now let's get back to Hubble's Law and the age of the universe.

We had $v=H_o d$ as Hubble's law and we know that distance = velocity x time, or, $d=vt$. So, if we divide both sides by $v$ and we get:

1=$H_o$t.

Solving for $t$ (the age of the universe), we get

$t=1/H_o$ [in some weird units.]

In [18]:

```
print (1./slope)
t=1./slope
```

But the units are weird (not years, per Mpc s/km). To fix this, we need to know how many kilometers are in a megaparsec. As it happens, there are 3.09 x 10$^{19}$ km/Mpc. [I bet you didn't know that off-hand. You're welcome. ]
So, we can calculate the age of the universe in seconds (**Age_sec**) by converting the megaparsecs to km:

Age (s) = $t \frac{s \cdot Mpc}{km}$ x $3.09 x 10^{19} \frac {km}{Mpc}$

In [19]:

```
Age_sec=t*3.09e19
print (Age_sec)
```

That's a lot of seconds! We should convert seconds to years. Here's another fun fact: there are approximately $\pi$x 10$^7$ seconds in a year.

More precisely, there are 60(s/min) x 60(min/hr) x 24(hr/day) x 365.25 (days/yr)

In [20]:

```
s_yr=60*60*24*365.25
print ('%e'%(s_yr))
```

Ok. so not exactly $\pi \times 10^7$, but close....

In [24]:

```
Age_yrs=Age_sec/s_yr
print (Age_yrs)
```

And now in billions of years please:

In [25]:

```
print ('Age of the universe: %5.2f'%(Age_yrs*1e-9), 'Ga')
```

There are many other curve fitting methods in **scipy** and **NumPy**. I encourage you to look around when the need arises.

As an example of a curved relationship, there are new data sets available for the classic Hubble problem. I found one published by Betoule et al. in 2014 http://dx.doi.org/10.1051/0004-6361/201423413. Cosmologists plot things differently now. They use the parameters $z$ and $\mu$ which are related to the red shift velocity and distance. $z$ is the fractional shift in the spectral wavelength and $\mu$ has something to do with distance.

Here is a plot from the Betoule et al. paper:

In [48]:

```
Image(filename='Figures/betoule14.png',width=600)
```

Out[48]:

*[Figure modified from Betoule et al., 2014.] The different colors are different types of objects (galaxies, supernovae, etc.).*

Notice that they plotted the data on a log scale. (This hides some surprising things.)

To compare the old estimate of the age of the universe (~14 Ga) with that calculated from the new data set, we will need to convert $z$ and $\mu$ to distance and velocity.

According to http://hyperphysics.phy-astr.gsu.edu/hbase/Astro/hubble.html

velocity $v$ (as fraction of the speed of light, $c$) is given by

${v\over c}= \bigl({{(z+1)^2-1} \over {(z+1)^2+1}}\bigr)$

where $c=3 \times 10^8$m s$^{-1}$.

And according to the Betoule et al. (2014) paper, $\mu$ relates to distance in parsecs $d$ like this:

$\mu=5\log(d/10)$.

Let's read in the data (available from this website: http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/568/A22#sRM2.2), which are averages of the data shown in the figure above,and take a peek.

In [26]:

```
new_data=pd.read_csv('Datasets/redShiftDistance/mu_z.csv',header=1)
#new_data=pd.read_csv('Datasets/redShiftDistance/bigBang.csv')
print (new_data.columns)
```

Now we can plot it the same way as the cosmologists do, using $\mu$ and $\log z$:

In [27]:

```
plt.plot(np.log10(new_data.z),new_data.mu,'bo')
plt.xlabel('$\log_{10}$ (z) ')
plt.ylabel('$\mu$')
plt.xlim(-2,0);
```

Better yet, we can use the **plt.semilogx( )** function:

In [29]:

```
plt.semilogx(new_data.z,new_data.mu,'bo')
plt.xlabel('$z$')
plt.ylabel('$\mu$');
```

To compare the new data with the 'old' data, we must do the following:

- Transform $z$ to velocity
- Transform $\mu$ to distance using the equations provided.
- Truncate the new dataset which goes to much farther distances than the 'old' data set
- Write a nice
**lambda**function to recalculate the age of the universe

Let's begin by writing a lambda function:

In [30]:

```
# function returns age in Ga for Ho
age_from_Ho= lambda Ho : 1e-9*3.09e19/(Ho*np.pi*1e7)
```

Now we convert, truncate, and recalculate H$_o$:

In [31]:

```
# convert z to velocity in 10^3 km/s (or 10^6 m/s)
c=3e8 # speed of light in m/s
new_data['velocity']=1e-6*c* \
(((new_data.z+1.)**2-1.)/((new_data.z+1.)**2+1.)) # the formula for v from z (and c)
# convert mu to distance in 10^3 Mpc (a Gpc):
new_data['distance']=10.*(10.**((new_data['mu'])/5.))*1e-9 # convert mu to Gpc
# and filter for the closest objects
close_data=new_data[new_data.distance<0.7]
close_fit= np.polyfit(close_data['distance'],close_data['velocity'],1) # calculate the coefficients
close_modelYs=np.polyval(close_fit,close_data['distance']) # get the model values
age=age_from_Ho(close_fit[0]) # get a new age
print (age)
```

Wow - that is different!

Now we can plot it and see what happened!

In [35]:

```
# plot the old data (in red dots) with the new (in blue dots)
plt.plot(data['d (10^3 Mpc)'],data['v (10^3km/s)'],'r*',label='Old data') # as red stars
plt.plot(data['d (10^3 Mpc)'],modelYs,'r--',label='Old linear fit') # plot old fit
plt.plot(close_data.distance,close_data.velocity,'bd',label='New data') # as blue diamonds
plt.plot(close_data.distance,close_modelYs,'b-',label='New Linear fit') # plot old fit
plt.xlabel('Distance (10$^3$ Mpc)')
plt.ylabel('v (10$^3$km s$^{-1}$)')
plt.text(0.4,15,'Old Age: 13.9 Ga',color='red') # put on a note with the old age
plt.text(0.4,10,'New Age: %4.1f'%age+' Ga',color='blue') # put on a note with the new age
plt.legend(numpoints=1,loc=2); # and a legend
```

It looks like the velocity estimates in the new data are slower than the velocities estimated in the old data. The age of the universe just got a lot older! But don't get too excited yet!

Although the two data sets look superficially similar, if you read the paper, you will see there are many adjustments required to calculate the distances (Hubble was in fact WAY off, almost by a factor of 10!).

We filtered the data, so we only worked with 'near-by' object- less than .7 Mpc, but that only accounts for a small portion of the data set. Let's take a look at the entire dataset.

In [36]:

```
plt.plot(new_data.distance,new_data.velocity,'bd',label='new data')
plt.ylabel('v (10$^3$km s$^{-1}$)')
plt.xlabel('Distance (10$^3$ Mpc)');
```

That doesn't look linear at all! The cosmologists believe that the expansion of the universe is accelerating. Why? Ask your physics friends about dark energy. Or read Neal de Grasse Tyson's "Astrophysics for People in a Hurry". I loved it.

Given that the data are hardly linear, we should calculate the fit with a higher order polynomial instead of with a linear fit.

In [37]:

```
colors = ['m','m','r','g','k'] # make a list of colors for plotting
plt.plot(new_data.distance,new_data.velocity,'bd',label='new data')
# try with higher degree polynomials
for deg in range(1,5):
fit= np.polyfit(new_data['distance'],new_data['velocity'],deg) # second order
Y=np.polyval(fit,new_data['distance'])
plt.plot(new_data.distance,Y,colors[deg],label="%i degree"%deg)
plt.ylabel('v (10$^3$km s$^{-1}$)')
plt.xlabel('Distance (10$^3$ Mpc)')
plt.legend(numpoints=1,loc=2);
```

So, it appears that a 3rd degree polynomial does the trick. But this is an active area of research. Read this article in the New York times -
https://nyti.ms/2loeFoX and saved in Background/expandingUniverse.pdf. The apparent acceleration may be controlled by *dark energy* and values for H$_o$ hover between 67 (~14.7 Ga) and 73 (13.5 Ga).