In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image
import pandas as pd
```

- Learn how to filter data with Pandas.
- Write a program to calculate the great circle distance between two known points.
- Learn how to generate formatted strings for output.

Returning to our original seismogram from Lecture 8 (which we cleverly saved):

In [2]:

```
Image(filename='Figures/seismogram.png') # let's pull this back up.
```

Out[2]:

And we can pick up where we left off by reading in the data which we already converted to minutes and saved in Lecture 8.

In [3]:

```
EQ=pd.read_csv('Datasets/minutes_velocity.csv')
EQ.head()
```

Out[3]:

Considering our seismic record, you know that the P wave arrives in the first few minutes of the record. To find the exact time, we can filter the data to look for the maximum (positive or negative) velocity between two times, say 1 and 1.37 minutes, and take this as the P-wave arrival time.

To look at only the interval between two times, we first need to understand some more about Pandas data structures. Remember that **DataFrames** are made up of columns of data (with their indices) and each column is itself a **Series**.

We can access a particular series in two ways. One is to use the **Series** name as a key (like in Dictionaries).

In [4]:

```
print (EQ['Minutes'].head())
# I'm using head, because this is a really long series!
```

OR, we can use the **Series** name as an attribute of the **DataFrame**:

In [5]:

```
print (EQ.Minutes.head())
```

As you can see below, both of these are identical and are *instances* of a Pandas **Series** class:

In [6]:

```
print (type(EQ['Minutes']))
print (type(EQ.Minutes))
```

Let's take a closer look at one of these **Series**:

In [7]:

```
EQ.Minutes.head()
```

Out[7]:

The EQ.Minutes **Series** contains floating point numbers and has indices (starting with 0), so it is like an array or a list, but is not the same thing. No worries - you can turn a **Series** into an array with this command (I'm just printing the first 10 elements...):

In [8]:

```
EQ_array=EQ.Minutes.values
print (EQ_array[0:10])
```

Or we can make it a list:

In [9]:

```
EQ_list=EQ.Minutes.tolist()
print (EQ_list[0:10])
```

You can filter the EQ **DataFrame** by placing conditions (which evaluate to **True** or **False**) on one of the **Series** (Minutes or Velocity).

The simplest syntax of a basic Pandas filter is:

**DataFrame[condition]**

The "P wave arrival" is by definition, the first wave recorded that exceeds "noise". So we need to calculate what the "noise" level is by getting the maximum and minimum values in the first part of the record, say the first minute.

To find all the records where the series **EQ.Minutes** is less than 1, we need to set the condition to:

**EQ.Minutes < 1**

Let's just see what that does:

In [10]:

```
EQ.Minutes<1
```

Out[10]:

So the **EQ.Minutes < 1** condition returned another Pandas **Series**, in which each element of **EQ.Minutes** has been evaluated against the condition.

If we then use the returned set of booleans as the "key" in the DataFrame **EQ**, Pandas will return all the indices in the **DataFrame** for which the condition is **True**:

In [11]:

```
EQ[EQ.Minutes<1]
```

Out[11]:

So let us assign this new **DataFrame** to **Noise**:

In [12]:

```
Noise=EQ[EQ.Minutes<1]
print (Noise.head())
print (Noise.tail())
```

Here we created a NEW DataFrame, **Noise**, that is a subset of the original **EQ** and contains only the first minute worth of data.

There are many methods for Pandas **Series**. Two of these are **max( )** and **min( )**. To get the maximum and minimum values of the velocity **Series** in the first minute, we just apply **max( )** and **min( )** to the **Series**:

In [13]:

```
NoiseMax=Noise.Velocity.max()
print ('maximum value: ',NoiseMax)
NoiseMin=Noise.Velocity.min()
print ('minimum value: ',NoiseMin)
```

Now we can find the first velocity that exceeds (in the absolute sense) **NoiseMax** in the **EQ** DataFrame.

To do this, we can find the first peaks or troughs in the Velocity Series that are outside the bounds of **NoiseMax** and **NoiseMin** respectively. So, here are all the peaks and troughs in the data series that exceed the noise bounds:

In [14]:

```
Peaks=EQ[EQ.Velocity>NoiseMax]
print (Peaks.head())
Troughs=EQ[EQ.Velocity<NoiseMin]
print (Troughs.head())
```

We can get these both in one go by using the Pandas "or" trick, which has the general syntax of:

**DataFrame[(condition_1) | (condition_2)]**

where the '|' stands for 'or'.

While we are on the topic, there are other logical operators, including 'and' and 'not'. 'and' is represented by a '&':

**DataFrame[(condition_1) & (condition_2)]**

which requires both conditions to be **True**, and "and not" is a '& ~':

**DataFrame[(condition_1) & ~(condition_2)]**

which requires condition_1 to be **True** and condition_2 to be **False**.
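Since the '& ~' form doesn't come up again below, here is a minimal sketch of how it behaves, using a small made-up DataFrame (the values are illustrative, not from our seismogram):

```python
import pandas as pd

df = pd.DataFrame({'Minutes': [0.5, 1.5, 2.5, 3.5],
                   'Velocity': [2.0, -8.0, 9.0, 1.0]})

# '&' keeps rows where BOTH conditions are True:
both = df[(df.Minutes > 1) & (df.Velocity > 0)]

# '& ~' keeps rows where the first is True and the second is False:
and_not = df[(df.Minutes > 1) & ~(df.Velocity > 0)]

print(both)     # rows with Minutes 2.5 and 3.5
print(and_not)  # the row with Minutes 1.5 (negative Velocity)
```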

Anyway, we can get a single **DataFrame** with all the values exceeding the bounds (both negative and positive) this way:

In [15]:

```
PeaksandTroughs=EQ[(EQ.Velocity>NoiseMax)|(EQ.Velocity<NoiseMin)]
print (PeaksandTroughs.head())
```

So the first Velocity that exceeds the noise is a peak that arrives at 1.025 minutes.

We can make a variable named **PwaveArrival** by converting the Series **PeaksandTroughs.Minutes** to an array and then selecting the first value in this array.

In [16]:

```
PwaveArrival=PeaksandTroughs.Minutes.values[0]
PwaveArrival
```

Out[16]:

In [4]:

```
Image(filename='Figures/seismogram.png') # let's pull this back up.
```

Out[4]:

After the P wave arrival, the earthquake rumbles along for a few minutes, so we need to re-characterize the noise floor, say between 5 and 10 minutes. Let's repeat the noise exercise for the period between 5 and 10 minutes. In this case, we have TWO conditions: the first being Minutes > 5 and the second being Minutes < 10.

To impose TWO conditions, we can combine each (enclosed in parentheses) with '&':

In [18]:

```
Noise2=EQ[(EQ.Minutes >5) & (EQ.Minutes<10)]
print (Noise2.head()) # see, we only have data after 5 minutes
print (Noise2.tail()) # and before 10 minutes.
```
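As an aside, Pandas **Series** also have a **between( )** method that can express this two-sided filter in one call. Note that **between** is inclusive on both ends by default; the inclusive='neither' argument (available in recent Pandas versions) matches the strict comparisons above. A small sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Minutes': [4.0, 5.0, 7.5, 10.0, 12.0]})

# equivalent to df[(df.Minutes > 5) & (df.Minutes < 10)]
strict = df[df.Minutes.between(5, 10, inclusive='neither')]
print(strict)  # only the row with Minutes 7.5 survives
```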

Following the same logic as before, we can find the bounds for the noise:

In [19]:

```
Noise2Max=Noise2.Velocity.max()
Noise2Min=Noise2.Velocity.min()
print (Noise2Max,Noise2Min)
```

To find the S wave arrival, we look for the peaks and troughs that exceed the noise bounds we just calculated for the 5 to 10 minute window. First we need to filter for data after the first 10 minutes.

In [20]:

```
PostP=EQ[EQ.Minutes>10]
```

Now we can use our handy 'or' operator '|' like this:

In [21]:

```
PeaksandTroughs2=PostP[(PostP.Velocity>Noise2Max)|(PostP.Velocity<Noise2Min)]
print(PeaksandTroughs2.head())
```

The S wave arrives as a trough at about 11.7 minutes into the earthquake.

Let's retrieve that value like we did before.

In [22]:

```
SwaveArrival=PeaksandTroughs2.Minutes.values[0]
SwaveArrival
```

Out[22]:

So the time delay between P and S wave arrivals can be calculated by the difference in their arrival times:

In [23]:

```
delayTime= SwaveArrival-PwaveArrival
delayTime
```

Out[23]:

From this we see that there was a 10.7 minute delay.

If you have already taken geophysics, you probably know that seismologists have figured out the velocity of various seismic waves as a function of depth in the Earth. From this model, there is a basic look up table that specifies the delay times as a function of distance (expressed as an angle) between the source and the receiver. This model is in the file **DeltaTimeData.csv** that we already looked at, munched on and saved in Lecture 8.

Let's take another look at this file:

In [5]:

```
DeltaTimeData=pd.read_csv('Datasets/DeltaTimeData.csv') # read this back in
DeltaTimeData.head()
```

Out[5]:

So, how many degrees away from the source is the receiver, based on the DeltaTimeData from the Reference Earth Model?

Here is the plot from our reference model (again).

In [6]:

```
Image(filename='Figures/TravelTime.png')
```

Out[6]:

We can bracket the angular distance this represents by filtering the **DeltaTimeData** DataFrame for those records within, say, 0.2 minutes of our delay time:

In [26]:

```
find_delta=DeltaTimeData[
(DeltaTimeData.SP_decimal_minutes>delayTime-.2) &
(DeltaTimeData.SP_decimal_minutes<delayTime+.2)]
find_delta # find_delta has all the records within 0.2 minutes of our delay time
```

Out[26]:

So according to the model, the earthquake occurred around 89 degrees away from the receiver.

We know the location of both the earthquake and the receiver, so we could calculate the actual separation. Let's do that now.

The earthquake occurred in Chile and was recorded at Pinyon Flats (near us).

Aside: we will learn how to generate this map in the coming lectures!

In [7]:

```
Image(filename='Figures/greatCirc.png')
```

Out[7]:

First, we must calculate the angular distance (great circle distance in arc length - not kilometers) for the two points, B and C.

We'll need to use *spherical trigonometry* which is very useful in Earth Science because we live on a sphere.

In [8]:

```
Image(filename='Figures/strig.png',width=500,height=500)
```

Out[8]:

[Figure from Tauxe et al., 2010, Essentials of Paleomagnetism; https://earthref.org/MagIC/books/Tauxe/Essentials/WebBook3ap1.html#x20-211000A.3 ]

One of the laws of spherical trig is the Law of Cosines:

$\cos a = \cos b \cos c + \sin b \sin c \cos \alpha$.

In our case, $a$ is the great circle distance between points on the globe, B and C. Well, well, well, that could come in handy here. So to get $a$ we need to know $b, c$ and $\alpha$. $b$ and $c$ are the co-latitudes (90-latitude) of points B and C, respectively, and $\alpha$ is the difference in the two longitudes.
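Substituting the co-latitudes and the longitude difference (using the same symbols as above) gives the formula we will implement:

$\cos a = \cos(90^{\circ}-\mathrm{lat}_B)\cos(90^{\circ}-\mathrm{lat}_C) + \sin(90^{\circ}-\mathrm{lat}_B)\sin(90^{\circ}-\mathrm{lat}_C)\cos(\mathrm{lon}_C-\mathrm{lon}_B)$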

Now that we have a formula, we can write a function to calculate the great circle distance from the latitudes and longitudes of the two points (lat_1,lon_1,lat_2,lon_2).

In [3]:

```
import numpy as np
```

In [4]:

```
def great_circle(lat_1,lon_1,lat_2,lon_2):
    # first we have to convert the latitudes to colatitudes:
    colat_1,colat_2=90.-lat_1,90.-lat_2
    # and alpha is the difference between the two longitudes
    alpha=lon_2-lon_1
    # then let's make life easy on ourselves and convert degrees to radians
    colat_1,colat_2,alpha=np.radians(colat_1),\
        np.radians(colat_2),np.radians(alpha) # continued line from above
    # from spherical trig we know that:
    cosa=np.cos(colat_1)*np.cos(colat_2)+np.sin(colat_1)*np.sin(colat_2)*np.cos(alpha)
    # solve for a
    a=np.arccos(cosa) # take the arc cosine of cosa
    # remember to convert back to degrees!
    return np.degrees(a) # return the great circle distance in degrees
```

In [7]:

```
great_circle(33.3,-115.7,-43.42,-73.95)
```

Out[7]:

In [30]:

```
PF_lat,PF_lon=33.3,-115.7
EQ_lat,EQ_lon=-43.42,-73.95
Delta=great_circle(PF_lat,PF_lon,EQ_lat,EQ_lon)
print ('Degrees separation between source and receiver: ',Delta)
```

Earlier, we guessed that it would be about 89 degrees. So this is pretty close given our quick and dirty guesstimate.

Now for a word about formatting strings. Notice how the output of the above print statement printed out all the decimal places. We can do better!
To show only the first decimal place we can use *string formatting*.

The structure of a formatting statement is:

**'%FMT'%(DATA)**,

where **FMT** is a 'format string' and **DATA** is the variable name whose value we want to format. Here is an example in which **FMT** is:

**3.1f**.

The first number (3) is the minimum width of the output in characters (the output will be wider if needed). The second number (1) is the number of characters AFTER the decimal place. The 'f' means that **DATA** is a floating point variable.

Other format strings include: %s for a string, %i for an integer, %e for 'scientific notation'.

In [31]:

```
print ('no formatting: ',Delta) # no formatting
print ('formatted: ','%3.1f'%(Delta)) # with formatting
# or can use round(Delta,1)
print ('rounded: ',round(Delta,1))
```

Recall from Lecture 2:

In [32]:

```
number=1 # an integer
Number=1.0 # a floating point
NUMBER='1' # a string
```

we can use the formatted string trick on them like this:

In [33]:

```
print ('%i'%(number))
print ('%f'%(Number))
print ('%3.1f'%(Number))
print ('%s'%(NUMBER))
```

Or for a bigger number:

In [34]:

```
bignum=np.pi*1000
print (bignum)
print ('%6.5e'%(bignum))
print ('%3.2e'%(bignum))
```

This is the last time I'm going to remind you to do the Practice Problems. After this, it will be understood.