In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Image

Lecture 12:

  • Tricks with pandas
  • Filtering
  • Concatenating and merging DataFrames

Locating and Editing Data

Today we are going to learn about filtering pandas DataFrames to tease out useful information. We will be looking at data on Holocene Eruptions from the Smithsonian Holocene Volcano Database (https://volcano.si.edu/list_volcano_holocene.cfm). We will see how to filter these data to pull out interesting information on Holocene Eruptions. Let's read in this data and look at its length.

In [3]:
EruptionData=pd.read_excel('Datasets/GVP_Eruption_Results.xlsx',header=1)
print('Number of Eruptions:',len(EruptionData))
EruptionData.head()
Number of Eruptions: 11134
Out[3]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
0 290390 Alaid 22281 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 21.0 NaN 50.861 155.565
1 352090 Sangay 22283 Confirmed Eruption NaN 2.0 NaN NaN 2018 NaN ... Historical Observations > 2018.0 NaN 12.0 NaN 7.0 NaN -2.005 -78.341
2 345020 Rincon de la Vieja 22282 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 12.0 NaN 3.0 NaN 10.830 -85.324
3 284096 Nishinoshima 22280 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 7.0 NaN 12.0 NaN 27.247 140.874
4 353050 Negra, Sierra 22279 Confirmed Eruption Summit crater and NNW flank 2.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 23.0 NaN -0.830 -91.170

5 rows × 24 columns

Wow, that's a lot of Eruptions! However, the DataFrame has a lot of information we really aren't interested in. For example, there are many eruptions in this data that are "unconfirmed".

In Lecture 9, we learned how to filter a DataFrame by putting what we wanted in a conditional statement enclosed in square brackets. Remembering from Lecture 4 that the conditional for "equal to" is "==", we can retrieve all the rows that contain 'Confirmed Eruption' in the column 'Eruption Category' like this:

In [4]:
# notice the conditional '==', which means 'equal to' (Lecture 4)
EruptionData[EruptionData['Eruption Category']=='Confirmed Eruption'].head()
Out[4]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
0 290390 Alaid 22281 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 21.0 NaN 50.861 155.565
1 352090 Sangay 22283 Confirmed Eruption NaN 2.0 NaN NaN 2018 NaN ... Historical Observations > 2018.0 NaN 12.0 NaN 7.0 NaN -2.005 -78.341
2 345020 Rincon de la Vieja 22282 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 12.0 NaN 3.0 NaN 10.830 -85.324
3 284096 Nishinoshima 22280 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 7.0 NaN 12.0 NaN 27.247 140.874
4 353050 Negra, Sierra 22279 Confirmed Eruption Summit crater and NNW flank 2.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 23.0 NaN -0.830 -91.170

5 rows × 24 columns
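
It may help to see what the conditional produces on its own. The sketch below uses a tiny made-up DataFrame (the rows and the 'Uncertain Eruption' label are assumptions for illustration, not taken from the spreadsheet) to show that the comparison returns one True/False value per row, and that the square brackets keep only the True rows.

```python
import pandas as pd

# Made-up toy data; only the column names mirror the lecture's dataset
toy = pd.DataFrame({
    'Volcano Name': ['Alaid', 'Sangay', 'Etna'],
    'Eruption Category': ['Confirmed Eruption', 'Uncertain Eruption', 'Confirmed Eruption'],
})

# The conditional by itself evaluates to a Series of booleans, one per row
mask = toy['Eruption Category'] == 'Confirmed Eruption'
print(mask.tolist())                        # [True, False, True]

# Putting the mask in square brackets keeps only the rows where it is True
confirmed = toy[mask]
print(confirmed['Volcano Name'].tolist())   # ['Alaid', 'Etna']
```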

Pandas DataFrames also have an indexer called .loc that allows for filtering of DataFrames in a similar way to the familiar conditional above.

In [5]:
ConfirmedEruptions=EruptionData.loc[EruptionData['Eruption Category']=='Confirmed Eruption']
ConfirmedEruptions.head()
Out[5]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
0 290390 Alaid 22281 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 21.0 NaN 50.861 155.565
1 352090 Sangay 22283 Confirmed Eruption NaN 2.0 NaN NaN 2018 NaN ... Historical Observations > 2018.0 NaN 12.0 NaN 7.0 NaN -2.005 -78.341
2 345020 Rincon de la Vieja 22282 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 12.0 NaN 3.0 NaN 10.830 -85.324
3 284096 Nishinoshima 22280 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 7.0 NaN 12.0 NaN 27.247 140.874
4 353050 Negra, Sierra 22279 Confirmed Eruption Summit crater and NNW flank 2.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 23.0 NaN -0.830 -91.170

5 rows × 24 columns

This statement does exactly the same thing as the conditional. The syntax of a .loc statement might look trickier, but trust us, it will make your life easier as things get more complicated. It lets you select rows and columns in a single step and has more tricks up its sleeve, as we shall see soon. :)
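
If you would like to convince yourself that the two forms agree, a quick check along these lines should do it. The toy DataFrame is made up for illustration; .equals( ) is the pandas method that compares two DataFrames element by element.

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    'Volcano Name': ['Alaid', 'Sangay', 'Etna'],
    'Eruption Category': ['Confirmed Eruption', 'Uncertain Eruption', 'Confirmed Eruption'],
})
mask = toy['Eruption Category'] == 'Confirmed Eruption'

with_brackets = toy[mask]    # plain conditional in square brackets
with_loc = toy.loc[mask]     # the same conditional passed to .loc

# .equals() is True when shape, labels and values all match
same = with_brackets.equals(with_loc)
print(same)
```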

Now let's look at some big eruptions we might be interested in (and who wouldn't be?). Our dataset has a column called 'VEI' or 'Volcanic Explosivity Index' that depends on the amount of material erupted, and the height of the plume.

One of the most famous volcanic eruptions is the 1980 Eruption of Mount St. Helens (Washington State). To find it, let's search for Holocene Eruptions of Mount St. Helens with a VEI>4.

In [6]:
Image(filename='Figures/StHelens.jpg')
Out[6]:

Image from: Global Volcanism Program, 2013. St. Helens (321050) in Volcanoes of the World, v. 4.7.5. Venzke, E (ed.). Smithsonian Institution. Downloaded 31 Dec 2018 (https://volcano.si.edu/volcano.cfm?vn=321050)

In [12]:
ConfirmedEruptions.loc[(ConfirmedEruptions['Volcano Name']=='St. Helens')&(ConfirmedEruptions['VEI']>4.0)]
Out[12]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
1567 321050 St. Helens 20557 Confirmed Eruption Summit and north flank 5.0 NaN NaN 1980 NaN ... Historical Observations NaN 1986.0 NaN 10.0 NaN 28.0 3.0 46.2 -122.18
6166 321050 St. Helens 20544 Confirmed Eruption N flank--Goat Rocks area 5.0 NaN NaN 1800 NaN ... Dendrochronology NaN NaN NaN NaN NaN NaN NaN 46.2 -122.18
7523 321050 St. Helens 20541 Confirmed Eruption NaN 5.0 NaN NaN 1482 NaN ... Dendrochronology NaN NaN NaN NaN NaN NaN NaN 46.2 -122.18
7525 321050 St. Helens 20540 Confirmed Eruption NaN 5.0 + NaN 1480 NaN ... Dendrochronology NaN NaN NaN NaN NaN NaN NaN 46.2 -122.18
9042 321050 St. Helens 20529 Confirmed Eruption NaN 5.0 NaN ? -530 NaN ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 46.2 -122.18
9545 321050 St. Helens 20521 Confirmed Eruption NaN 5.0 NaN NaN -1770 100.0 ... Tephrochronology NaN NaN NaN NaN NaN NaN NaN 46.2 -122.18
9580 321050 St. Helens 20520 Confirmed Eruption NaN 6.0 NaN ? -1860 NaN ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 46.2 -122.18
9752 321050 St. Helens 20518 Confirmed Eruption NaN 5.0 NaN ? -2340 NaN ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 46.2 -122.18

8 rows × 24 columns

As we can see, simple conditional statements like this enable us to filter large datasets for the small amount of information we're interested in.
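
Here is a small sketch of how conditions combine, using made-up values. Pandas uses & for "and", | for "or" and ~ for "not" on boolean Series. Each comparison needs its own parentheses when written inline, because & and | bind more tightly than == and >. Naming the masks first, as below, is one way to keep things readable.

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    'Volcano Name': ['St. Helens', 'St. Helens', 'Etna', 'Tambora'],
    'VEI': [5.0, 2.0, 1.0, 7.0],
})

is_helens = toy['Volcano Name'] == 'St. Helens'
is_big = toy['VEI'] > 4.0

both = toy.loc[is_helens & is_big]       # St. Helens AND VEI > 4
either = toy.loc[is_helens | is_big]     # St. Helens OR VEI > 4
not_helens = toy.loc[~is_helens]         # everything except St. Helens

print(len(both), len(either), len(not_helens))   # 1 3 2
```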

Although the above statement would work equally well without .loc, we can add some bells and whistles. The .loc syntax allows you to pull out a particular column (Series) by putting a comma after your conditional statement, followed by the column name. Say we wanted the 'Start Year' of each St. Helens eruption. We could do this:

In [7]:
EruptionYears=ConfirmedEruptions.loc[ConfirmedEruptions['Volcano Name']=='St. Helens','Start Year']
EruptionYears.head()
Out[7]:
575     2004
1148    1990
1181    1989
1567    1980
5317    1857
Name: Start Year, dtype: int64
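
A related detail, sketched here on made-up data: a single column name after the comma gives back a Series, while a list of column names gives back a DataFrame.

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    'Volcano Name': ['St. Helens', 'Etna', 'St. Helens'],
    'Start Year': [1980, 1893, 2004],
    'VEI': [5.0, 1.0, 2.0],
})
is_helens = toy['Volcano Name'] == 'St. Helens'

# One column name -> a Series
years = toy.loc[is_helens, 'Start Year']
print(type(years).__name__, years.tolist())

# A list of column names -> a DataFrame with just those columns
subset = toy.loc[is_helens, ['Start Year', 'VEI']]
print(type(subset).__name__, list(subset.columns))
```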

But wait, there's more! The .loc syntax also allows you to take a slice through the columns list to select a specific range of column headers:

In [8]:
ColumnSlice=ConfirmedEruptions.loc[ConfirmedEruptions['Volcano Name']=='St. Helens','VEI':'End Year']
ColumnSlice.head()
Out[8]:
VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty Start Month Start Day Modifier Start Day Start Day Uncertainty Evidence Method (dating) End Year Modifier End Year
575 2.0 NaN NaN 2004 NaN 10.0 NaN 1.0 NaN Historical Observations NaN 2008.0
1148 3.0 ? NaN 1990 NaN 11.0 NaN 5.0 NaN Historical Observations NaN 1991.0
1181 2.0 NaN NaN 1989 NaN 12.0 NaN 7.0 NaN Historical Observations NaN 1990.0
1567 5.0 NaN NaN 1980 NaN 3.0 NaN 27.0 NaN Historical Observations NaN 1986.0
5317 2.0 NaN NaN 1857 NaN 4.0 NaN 0.0 NaN Historical Observations NaN NaN
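
Notice that 'End Year' itself appears in the output above: label slices with .loc include both endpoints, unlike ordinary Python slices. A minimal sketch with a made-up, four-column DataFrame shows the difference from position-based slicing:

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    'Volcano Name': ['St. Helens', 'Etna'],
    'VEI': [5.0, 1.0],
    'Start Year': [1980, 1893],
    'End Year': [1986.0, 1898.0],
})

# Label slice: both 'VEI' and 'End Year' are included
by_label = toy.loc[:, 'VEI':'End Year']
print(list(by_label.columns))      # ['VEI', 'Start Year', 'End Year']

# Position slice: the stop position (3) is excluded, as in plain Python
by_position = toy.iloc[:, 1:3]
print(list(by_position.columns))   # ['VEI', 'Start Year']
```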

Something else .loc can do is change values in place in a DataFrame. Let's say we found a historical document that told us that the 1800 Eruption at Mount St. Helens ended in 1805. We want to update the information in the DataFrame and can do it this way:

In [11]:
ConfirmedEruptions.loc[(ConfirmedEruptions['Volcano Name']=='St. Helens')&\
                       (ConfirmedEruptions['Start Year']==1800),'End Year']=1805

# and let's take a look: 
ConfirmedEruptions[(ConfirmedEruptions['Volcano Name']=='St. Helens')&(ConfirmedEruptions['Start Year']==1800)]
Out[11]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
6166 321050 St. Helens 20544 Confirmed Eruption N flank--Goat Rocks area 5.0 NaN NaN 1800 NaN ... Dendrochronology NaN 1805.0 NaN NaN NaN NaN NaN 46.2 -122.18

1 rows × 24 columns

Never mind the annoying warning message (a SettingWithCopyWarning), if you get one. It's just warning us that changing ConfirmedEruptions won't change EruptionData.
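
The message is most likely pandas' SettingWithCopyWarning. One common way to make the intent explicit, and to avoid the message, is to call .copy( ) when creating the filtered DataFrame. The sketch below assumes a made-up two-row dataset; the names raw and confirmed are hypothetical stand-ins for EruptionData and ConfirmedEruptions.

```python
import pandas as pd

# Made-up toy data; 'raw' stands in for EruptionData
raw = pd.DataFrame({
    'Volcano Name': ['St. Helens', 'Etna'],
    'Eruption Category': ['Confirmed Eruption', 'Confirmed Eruption'],
    'End Year': [float('nan'), 1898.0],
})

# .copy() makes an independent DataFrame, so edits to it never touch 'raw'
confirmed = raw.loc[raw['Eruption Category'] == 'Confirmed Eruption'].copy()
confirmed.loc[confirmed['Volcano Name'] == 'St. Helens', 'End Year'] = 1805

print(confirmed.loc[0, 'End Year'])    # 1805.0
print(pd.isna(raw.loc[0, 'End Year'])) # True: the original is unchanged
```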

As we can see, the syntax for this can get complicated quickly, but we can retrieve lots of data using a few lines of code.

Sorting and Indexing

What if we wanted to sort this dataset so the most explosive eruptions come first? Pandas DataFrames have a method for this called sort_values. Normally, this will sort from lowest to highest (an "ascending" sort), but we can use the argument ascending=False to tell it to sort from highest to lowest.

In [12]:
BiggesttoSmallest=ConfirmedEruptions.sort_values(by='VEI',ascending=False)
BiggesttoSmallest.head()
Out[12]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
8107 305060 Changbaishan 19644 Confirmed Eruption NaN 7.0 ? NaN 942 4.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 41.980 128.080
9746 355210 Blanco, Cerro 20904 Confirmed Eruption NaN 7.0 NaN NaN -2300 160.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN -26.789 -67.765
9493 212040 Santorini 13879 Confirmed Eruption NaN 7.0 ? NaN -1610 14.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 36.404 25.396
10732 300023 Kurile Lake 18903 Confirmed Eruption NaN 7.0 NaN NaN -6440 25.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 51.450 157.120
6083 264040 Tambora 16231 Confirmed Eruption NaN 7.0 NaN NaN 1812 NaN ... Historical Observations NaN 1815.0 NaN 7.0 ? 15.0 NaN -8.250 118.000

5 rows × 24 columns
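
sort_values can also take a list of columns, with one ascending flag per column, which is handy for ordering the ties. The sketch below uses made-up rows (one with a missing VEI) and sorts by VEI from largest to smallest, then by 'Start Year' from oldest to newest. By default, rows with a missing sort value go to the end.

```python
import pandas as pd

# Made-up toy data for illustration only; None becomes NaN in the VEI column
toy = pd.DataFrame({
    'Volcano Name': ['Tambora', 'Etna', 'Santorini', 'Lautaro'],
    'VEI': [7.0, 1.0, 7.0, None],
    'Start Year': [1812, 1893, -1610, 1972],
})

# VEI descending; within equal VEI, Start Year ascending
ranked = toy.sort_values(by=['VEI', 'Start Year'], ascending=[False, True])
print(ranked['Volcano Name'].tolist())
# ['Santorini', 'Tambora', 'Etna', 'Lautaro']
```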

Looks like the biggest eruptions during the Holocene had a VEI of 7.0. Now let's try to get the first 10 rows in this DataFrame. We can do this using .loc, right?

In [19]:
BiggesttoSmallest.loc[0:10]
Out[19]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
0 290390 Alaid 22281 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 21.0 NaN 50.861 155.565
4488 211060 Etna 13763 Confirmed Eruption Central Crater 1.0 NaN NaN 1893 NaN ... Historical Observations NaN 1898.0 NaN 6.0 NaN 0.0 NaN 37.748 14.999
1870 358060 Lautaro 12320 Confirmed Eruption NaN 1.0 NaN NaN 1972 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN -49.020 -73.550
1512 241050 Okataina 14526 Confirmed Eruption Waimangu (Raupo Pond crater) 1.0 NaN NaN 1981 NaN ... Historical Observations NaN 1981.0 NaN 5.0 NaN 16.0 15.0 -38.120 176.500
1518 312030 Pavlof 20144 Confirmed Eruption NaN 1.0 NaN NaN 1981 NaN ... Historical Observations NaN 1981.0 NaN 5.0 NaN 28.0 NaN 55.417 -161.894
1521 345040 Poas 11182 Confirmed Eruption NaN 1.0 NaN NaN 1981 NaN ... Historical Observations NaN 1981.0 NaN 5.0 NaN 16.0 15.0 10.200 -84.233
1523 268040 Gamkonora 16593 Confirmed Eruption NaN 1.0 NaN NaN 1981 NaN ... Historical Observations NaN 1981.0 NaN 7.0 NaN 25.0 NaN 1.380 127.530
1524 285040 Shikotsu 18633 Confirmed Eruption Tarumai 1.0 NaN NaN 1981 NaN ... Historical Observations NaN 1981.0 NaN 2.0 NaN 27.0 NaN 42.688 141.380
1526 344040 Telica 10930 Confirmed Eruption NaN 1.0 NaN NaN 1981 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN 12.606 -86.840
1530 300260 Klyuchevskoy 19461 Confirmed Eruption NaN 1.0 NaN NaN 1981 NaN ... Historical Observations NaN 1981.0 NaN 8.0 ? 4.0 NaN 56.056 160.642
1532 345040 Poas 11181 Confirmed Eruption NaN 1.0 NaN NaN 1980 NaN ... Historical Observations NaN 1980.0 NaN 12.0 NaN 26.0 NaN 10.200 -84.233
1537 241100 Ruapehu 14648 Confirmed Eruption NaN 1.0 NaN NaN 1980 NaN ... Historical Observations NaN 1980.0 NaN 11.0 NaN 3.0 NaN -39.280 175.570
1538 357091 Callaqui 11997 Confirmed Eruption NaN 1.0 NaN NaN 1980 NaN ... Historical Observations NaN 1980.0 NaN 10.0 NaN 16.0 15.0 -37.920 -71.450
1539 255060 Kavachi 15168 Confirmed Eruption NaN 1.0 NaN NaN 1980 NaN ... Historical Observations NaN 1981.0 NaN 2.0 NaN 25.0 NaN -8.991 157.979
1542 282110 Asosan 17340 Confirmed Eruption Naka-dake 1.0 NaN NaN 1980 NaN ... Historical Observations NaN 1980.0 NaN 9.0 NaN 24.0 NaN 32.884 131.104
1543 234070 Marion Island 14413 Confirmed Eruption E-W fissure from summit to W coast 1.0 NaN NaN 1980 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN -46.900 37.750
1544 345040 Poas 11180 Confirmed Eruption NaN 1.0 NaN NaN 1980 NaN ... Historical Observations NaN 1980.0 NaN 9.0 NaN 12.0 NaN 10.200 -84.233
1546 300270 Sheveluch 19580 Confirmed Eruption Center of 1964 crater 1.0 NaN NaN 1980 NaN ... Historical Observations NaN 1981.0 NaN 12.0 NaN 1.0 30.0 56.653 161.360
1554 312030 Pavlof 20142 Confirmed Eruption NaN 1.0 NaN NaN 1980 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN 55.417 -161.894
1556 257020 Gaua 15228 Confirmed Eruption Mt. Garat 1.0 ? NaN 1980 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN -14.270 167.500
1561 290270 Ekarma 18832 Confirmed Eruption NaN 1.0 NaN NaN 1980 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN 48.958 153.930
1563 311310 Makushin 19988 Confirmed Eruption SE side of summit 1.0 NaN NaN 1980 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN 53.891 -166.923
1566 261140 Marapi 15469 Confirmed Eruption NaN 1.0 NaN NaN 1980 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN -0.380 100.474
1515 262000 Krakatau 15615 Confirmed Eruption Anak Krakatau 1.0 NaN NaN 1981 NaN ... Historical Observations NaN 1981.0 NaN 10.0 NaN 20.0 NaN -6.102 105.423
1509 282110 Asosan 17341 Confirmed Eruption Naka-dake 1.0 NaN NaN 1981 NaN ... Historical Observations NaN 1981.0 NaN 6.0 NaN 15.0 NaN 32.884 131.104
1571 284120 Ioto 18415 Confirmed Eruption Kitanohara 1.0 NaN NaN 1980 NaN ... Historical Observations NaN 1980.0 NaN 3.0 NaN 13.0 NaN 24.751 141.289
1508 263200 Dieng Volcanic Complex 15787 Confirmed Eruption Sikidang 1.0 ? NaN 1981 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN -7.200 109.879
1427 283120 Kusatsu-Shiranesan 17844 Confirmed Eruption Yu-gama, Kara-gama 1.0 NaN NaN 1983 NaN ... Historical Observations NaN 1983.0 NaN 12.0 NaN 21.0 NaN 36.618 138.528
1429 282110 Asosan 17342 Confirmed Eruption Naka-dake 1.0 NaN NaN 1983 NaN ... Historical Observations NaN 1983.0 NaN 10.0 NaN 16.0 15.0 32.884 131.104
1435 261140 Marapi 15472 Confirmed Eruption Kepundan Tuo and Kepundan Verbeek 1.0 NaN NaN 1983 NaN ... Historical Observations NaN NaN NaN NaN NaN NaN NaN -0.380 100.474
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52 344020 San Cristobal 22269 Confirmed Eruption NaN 1.0 NaN NaN 2017 NaN ... Historical Observations NaN 2017.0 NaN 4.0 NaN 19.0 NaN 12.702 -87.004
56 343100 San Miguel 22192 Confirmed Eruption NaN 1.0 NaN NaN 2017 NaN ... Historical Observations NaN 2017.0 NaN 1.0 NaN 7.0 NaN 13.434 -88.269
61 261170 Kerinci 22172 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 11.0 NaN 21.0 NaN -1.697 101.264
64 252120 Ulawun 22171 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 11.0 NaN 18.0 NaN -5.050 151.330
67 390090 Saunders 22271 Confirmed Eruption Summit crater 1.0 NaN NaN 2016 NaN ... Historical Observations > 2018.0 NaN 12.0 NaN 2.0 NaN -57.800 -26.483
69 241040 White Island 22147 Confirmed Eruption 2012 lava dome 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 9.0 NaN 13.0 NaN -37.520 177.180
71 268060 Gamalama 22150 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 8.0 NaN 4.0 NaN 0.800 127.330
74 343100 San Miguel 22154 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 6.0 NaN 18.0 NaN 13.434 -88.269
76 345040 Poas 22156 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 8.0 NaN 16.0 NaN 10.200 -84.233
78 241040 White Island 22121 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 4.0 NaN 27.0 NaN -37.520 177.180
79 390080 Bristol Island 22133 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 7.0 NaN 19.0 NaN -59.017 -26.533
84 390130 Zavodovski 22159 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 5.0 NaN 16.0 15.0 -56.300 -27.570
89 343100 San Miguel 22117 Confirmed Eruption NaN 1.0 NaN NaN 2016 NaN ... Historical Observations NaN 2016.0 NaN 1.0 NaN 18.0 NaN 13.434 -88.269
94 261140 Marapi 22111 Confirmed Eruption NaN 1.0 NaN NaN 2015 NaN ... Historical Observations NaN 2015.0 NaN 11.0 NaN 14.0 NaN -0.380 100.474
98 344100 Masaya 22108 Confirmed Eruption NaN 1.0 NaN NaN 2015 NaN ... Historical Observations > 2018.0 NaN 12.0 NaN 15.0 NaN 11.984 -86.161
99 290390 Alaid 21096 Confirmed Eruption NaN 1.0 NaN NaN 2015 NaN ... Historical Observations NaN 2016.0 NaN 8.0 NaN 11.0 NaN 50.861 155.565
102 266100 Lokon-Empung 21093 Confirmed Eruption Tompaluan 1.0 NaN NaN 2015 NaN ... Historical Observations NaN 2015.0 NaN 9.0 NaN 28.0 1.0 1.358 124.792
106 343100 San Miguel 22233 Confirmed Eruption NaN 1.0 NaN NaN 2015 NaN ... Historical Observations NaN 2015.0 NaN 8.0 NaN 13.0 NaN 13.434 -88.269
109 264270 Sirung 21086 Confirmed Eruption NaN 1.0 NaN NaN 2015 NaN ... Historical Observations NaN 2015.0 NaN 7.0 NaN 8.0 NaN -8.508 124.130
51 262000 Krakatau 22188 Confirmed Eruption Summit crater 1.0 NaN NaN 2017 NaN ... Historical Observations NaN 2017.0 NaN 2.0 NaN 19.0 NaN -6.102 105.423
45 345040 Poas 22209 Confirmed Eruption NaN 1.0 NaN NaN 2017 NaN ... Historical Observations NaN 2017.0 NaN 11.0 NaN 6.0 NaN 10.200 -84.233
254 261170 Kerinci 22217 Confirmed Eruption NaN 1.0 NaN NaN 2011 NaN ... Historical Observations NaN 2011.0 NaN 12.0 NaN 15.0 NaN -1.697 101.264
43 263200 Dieng Volcanic Complex 22227 Confirmed Eruption Sileri Crater 1.0 NaN NaN 2017 NaN ... Historical Observations NaN 2017.0 NaN 7.0 NaN 2.0 NaN -7.200 109.879
2 345020 Rincon de la Vieja 22282 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 12.0 NaN 3.0 NaN 10.830 -85.324
3 284096 Nishinoshima 22280 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 7.0 NaN 12.0 NaN 27.247 140.874
5 344040 Telica 22277 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 15.0 NaN 12.606 -86.840
6 262000 Krakatau 22275 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations > 2018.0 NaN 12.0 NaN 20.0 NaN -6.102 105.423
7 353010 Fernandina 22278 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 6.0 NaN 21.0 NaN -0.370 -91.550
8 311120 Great Sitkin 22276 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 8.0 NaN 11.0 NaN 52.076 -176.130
10 261170 Kerinci 22274 Confirmed Eruption NaN 1.0 NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 10.0 NaN 22.0 NaN -1.697 101.264

493 rows × 24 columns

Oops! This didn't work as expected, did it? Instead of the first 10 rows, we got every row that sits between the index labels 0 and 10 in the sorted DataFrame, and those labels are no longer in order. When we sorted by VEI, Pandas kept each row's original index label rather than assigning new ones, and it put the records in no particular order within a particular VEI value. This is a "feature" of sorting functions. So... to get what we really wanted, which was the first 10 records in the BiggesttoSmallest DataFrame, we can use .iloc, which selects by position, instead of .loc, which selects by label.

In [20]:
BiggesttoSmallest.iloc[0:10]
Out[20]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
8107 305060 Changbaishan 19644 Confirmed Eruption NaN 7.0 ? NaN 942 4.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 41.980 128.080
9746 355210 Blanco, Cerro 20904 Confirmed Eruption NaN 7.0 NaN NaN -2300 160.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN -26.789 -67.765
9493 212040 Santorini 13879 Confirmed Eruption NaN 7.0 ? NaN -1610 14.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 36.404 25.396
10732 300023 Kurile Lake 18903 Confirmed Eruption NaN 7.0 NaN NaN -6440 25.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 51.450 157.120
6083 264040 Tambora 16231 Confirmed Eruption NaN 7.0 NaN NaN 1812 NaN ... Historical Observations NaN 1815.0 NaN 7.0 ? 15.0 NaN -8.250 118.000
7823 264030 Rinjani 20843 Confirmed Eruption Samalas 7.0 ? NaN 1257 NaN ... Ice Core NaN NaN NaN NaN NaN NaN NaN -8.420 116.470
10558 322160 Crater Lake 20610 Confirmed Eruption Mt. Mazama summit and flank vents 7.0 NaN NaN -5680 150.0 ... Ice Core NaN NaN NaN NaN NaN NaN NaN 42.930 -122.120
10242 282060 Kikai 16980 Confirmed Eruption Kikai caldera 7.0 NaN ? -4350 NaN ... Radiocarbon (uncorrected) NaN NaN NaN NaN NaN NaN NaN 30.793 130.305
8910 242030 Raoul Island 14686 Confirmed Eruption Denham caldera 6.0 NaN NaN -250 75.0 ... Radiocarbon (uncorrected) NaN NaN NaN NaN NaN NaN NaN -29.270 -177.920
7797 352060 Quilotoa 11597 Confirmed Eruption NaN 6.0 NaN ? 1280 NaN ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN -0.850 -78.900

10 rows × 24 columns

Much better. Now we can see that there were 8 VEI 7.0 eruptions during the Holocene (that we know of).
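
The same label-versus-position behaviour can be reproduced on a made-up four-row DataFrame, which is small enough to follow by eye. The VEI values are chosen without ties so the sorted order is unambiguous.

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    'Volcano Name': ['Alaid', 'Tambora', 'Etna', 'Quilotoa'],
    'VEI': [2.0, 7.0, 1.0, 6.0],
})

ranked = toy.sort_values(by='VEI', ascending=False)
print(ranked.index.tolist())      # [1, 3, 0, 2]: original labels, new order

# .loc slices by LABEL: everything from label 0 through label 2, as ordered now
by_label = ranked.loc[0:2]
print(by_label['Volcano Name'].tolist())      # ['Alaid', 'Etna']

# .iloc slices by POSITION: the first two rows of the sorted DataFrame
by_position = ranked.iloc[0:2]
print(by_position['Volcano Name'].tolist())   # ['Tambora', 'Quilotoa']
```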

On the other hand, the problem we had with indexing can be dealt with by re-indexing our sorted DataFrame. To re-index a Pandas DataFrame, we use the .set_index( ) method.

The cell below sets the index to a list of integers from 0 up to (but not including) the length of the DataFrame.

In [13]:
# make a list of integers between zero up to (but not including) the length of the DataFrame
newIndexValues=list(range(len(BiggesttoSmallest))) 
# reset the indices to this list
BiggesttoSmallest=BiggesttoSmallest.set_index([newIndexValues])
BiggesttoSmallest.head()
Out[13]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
0 305060 Changbaishan 19644 Confirmed Eruption NaN 7.0 ? NaN 942 4.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 41.980 128.080
1 355210 Blanco, Cerro 20904 Confirmed Eruption NaN 7.0 NaN NaN -2300 160.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN -26.789 -67.765
2 212040 Santorini 13879 Confirmed Eruption NaN 7.0 ? NaN -1610 14.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 36.404 25.396
3 300023 Kurile Lake 18903 Confirmed Eruption NaN 7.0 NaN NaN -6440 25.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 51.450 157.120
4 264040 Tambora 16231 Confirmed Eruption NaN 7.0 NaN NaN 1812 NaN ... Historical Observations NaN 1815.0 NaN 7.0 ? 15.0 NaN -8.250 118.000

5 rows × 24 columns
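
As an alternative to building the list of integers by hand, pandas also provides a .reset_index( ) method. This is not what the cell above uses; it is shown here, on made-up data, only as a shorter route to the same result. The argument drop=True discards the old labels instead of keeping them as a new column.

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    'Volcano Name': ['Alaid', 'Tambora', 'Etna'],
    'VEI': [2.0, 7.0, 1.0],
})
ranked = toy.sort_values(by='VEI', ascending=False)
print(ranked.index.tolist())        # [1, 0, 2]

# Renumber the rows 0, 1, 2, ... and throw away the old labels
renumbered = ranked.reset_index(drop=True)
print(renumbered.index.tolist())    # [0, 1, 2]
print(renumbered.loc[0, 'Volcano Name'])   # Tambora
```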

Another thing about indices: We can set the indices to one of the other column names, for example the "Volcano Name".

In [22]:
BiggesttoSmallest=BiggesttoSmallest.set_index('Volcano Name')
BiggesttoSmallest.head()
Out[22]:
Volcano Number Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty Start Month ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
Volcano Name
Changbaishan 305060 19644 Confirmed Eruption NaN 7.0 ? NaN 942 4.0 0.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 41.980 128.080
Blanco, Cerro 355210 20904 Confirmed Eruption NaN 7.0 NaN NaN -2300 160.0 NaN ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN -26.789 -67.765
Santorini 212040 13879 Confirmed Eruption NaN 7.0 ? NaN -1610 14.0 0.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 36.404 25.396
Kurile Lake 300023 18903 Confirmed Eruption NaN 7.0 NaN NaN -6440 25.0 0.0 ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN 51.450 157.120
Tambora 264040 16231 Confirmed Eruption NaN 7.0 NaN NaN 1812 NaN 0.0 ... Historical Observations NaN 1815.0 NaN 7.0 ? 15.0 NaN -8.250 118.000

5 rows × 23 columns
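
Notice that the column count dropped from 24 to 23: 'Volcano Name' is now the index rather than a column. One payoff is that .loc can then look rows up by name. A minimal sketch on made-up data (the years and VEI values are for illustration only):

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    'Volcano Name': ['Tambora', 'Santorini', 'Tambora'],
    'VEI': [7.0, 7.0, 2.0],
    'Start Year': [1812, -1610, 1967],
})

by_name = toy.set_index('Volcano Name')
print(list(by_name.columns))     # ['VEI', 'Start Year']

# A label that occurs more than once returns every matching row
tambora = by_name.loc['Tambora']
print(tambora['Start Year'].tolist())    # [1812, 1967]

# A list of labels plus a column name returns a Series
santorini_vei = by_name.loc[['Santorini'], 'VEI']
print(santorini_vei.tolist())            # [7.0]
```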

Concatenation and Merging

The eruptions DataFrame we read in above only contains events that started before September 2018. Newer data are contained in the 'GVP_Eruption_Results_2' spreadsheet. Let's add this to our previous DataFrame. First let's read in the data:

In [23]:
NewEruptions=pd.read_excel('Datasets/GVP_Eruption_Results_2.xlsx',header=1)
NewEruptions.head()
Out[23]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
0 357040 Planchon-Peteroa 22297 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Historical Observations NaN 2018 NaN 12 NaN 16 NaN -35.223 -70.568
1 357090 Copahue 22299 Confirmed Eruption Agrio Crater NaN NaN NaN 2018 NaN ... Historical Observations NaN 2018 NaN 12 NaN 7 NaN -37.856 -71.183
2 267020 Karangetang 22294 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Historical Observations NaN 2018 NaN 11 NaN 25 NaN 2.781 125.407
3 282050 Kuchinoerabujima 22296 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Historical Observations > 2018 NaN 12 NaN 18 NaN 30.443 130.217
4 268060 Gamalama 22295 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Historical Observations NaN 2018 NaN 10 NaN 6 NaN 0.800 127.330

5 rows × 24 columns

We would like to combine (concatenate) the two datasets into a single DataFrame. Fortunately, Pandas has a concat( ) function, called as pd.concat( ), that allows us to do just that, provided both have the same columns. If we use the argument ignore_index=True, the new (or re-used in this case) DataFrame will be automatically re-indexed for us.

In [24]:
ConfirmedEruptions=pd.concat([NewEruptions,ConfirmedEruptions],ignore_index=True)
ConfirmedEruptions.head()
Out[24]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
0 357040 Planchon-Peteroa 22297 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 12.0 NaN 16.0 NaN -35.223 -70.568
1 357090 Copahue 22299 Confirmed Eruption Agrio Crater NaN NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 12.0 NaN 7.0 NaN -37.856 -71.183
2 267020 Karangetang 22294 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 11.0 NaN 25.0 NaN 2.781 125.407
3 282050 Kuchinoerabujima 22296 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Historical Observations > 2018.0 NaN 12.0 NaN 18.0 NaN 30.443 130.217
4 268060 Gamalama 22295 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Historical Observations NaN 2018.0 NaN 10.0 NaN 6.0 NaN 0.800 127.330

5 rows × 24 columns

In [25]:
ConfirmedEruptions.tail()
Out[25]:
Volcano Number Volcano Name Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Evidence Method (dating) End Year Modifier End Year End Year Uncertainty End Month End Day Modifier End Day End Day Uncertainty Latitude Longitude
9867 241080 Tongariro 14555 Confirmed Eruption NaN NaN NaN ? -9850 NaN ... Radiocarbon (corrected) NaN NaN NaN NaN NaN NaN NaN -39.157 175.632
9868 327812 Red Hill 22193 Confirmed Eruption Cerro Pomo? NaN NaN NaN -9850 500.0 ... Surface Exposure NaN NaN NaN NaN NaN NaN NaN 34.250 -108.830
9869 213020 Nemrut Dagi 13908 Confirmed Eruption NaN NaN NaN NaN -9950 150.0 ... Varve Count NaN NaN NaN NaN NaN NaN NaN 38.654 42.229
9870 324020 Craters of the Moon 21101 Confirmed Eruption Sunset cone 0.0 NaN NaN -10060 NaN ... Radiocarbon (uncorrected) NaN NaN NaN NaN NaN NaN NaN 43.420 -113.500
9871 222161 Igwisi Hills 22141 Confirmed Eruption NE Volcano 1.0 NaN NaN -10450 4800.0 ... Surface Exposure NaN NaN NaN NaN NaN NaN NaN -4.889 31.933

5 rows × 24 columns

Looks like it worked.
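
To see what ignore_index=True is doing, compare the index with and without it on two small made-up DataFrames (the names new and old are hypothetical stand-ins for NewEruptions and ConfirmedEruptions):

```python
import pandas as pd

# Made-up toy data for illustration only
new = pd.DataFrame({'Volcano Name': ['Copahue'], 'Start Year': [2018]})
old = pd.DataFrame({'Volcano Name': ['Alaid', 'Etna'], 'Start Year': [2018, 1893]})

# Without ignore_index, each piece keeps its own labels, so 0 appears twice
kept = pd.concat([new, old])
print(kept.index.tolist())          # [0, 0, 1]

# With ignore_index=True, the rows are renumbered from 0
renumbered = pd.concat([new, old], ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2]
```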

Let's do some science

It would be nice to know more about these volcanoes. Fortunately the Smithsonian Holocene Volcano Database has a lot more stuff in it. These data are in a separate file called 'GVP_Volcano_List_Holocene.xlsx'. Let's read these in.

In [14]:
VolcanoData=pd.read_excel('Datasets/GVP_Volcano_List_Holocene.xlsx',header=1) 
VolcanoData.head()
Out[14]:
Volcano Number Volcano Name Country Primary Volcano Type Activity Evidence Last Known Eruption Region Subregion Latitude Longitude Elevation (m) Dominant Rock Type Tectonic Setting
0 210010 West Eifel Volcanic Field Germany Maar(s) Eruption Dated 8300 BCE Mediterranean and Western Asia Western Europe 50.170 6.85 600 Foidite Rift zone / Continental crust (>25 km)
1 210020 Chaine des Puys France Lava dome(s) Eruption Dated 4040 BCE Mediterranean and Western Asia Western Europe 45.775 2.97 1464 Basalt / Picro-Basalt Rift zone / Continental crust (>25 km)
2 210030 Olot Volcanic Field Spain Pyroclastic cone(s) Evidence Credible Unknown Mediterranean and Western Asia Western Europe 42.170 2.53 893 Trachybasalt / Tephrite Basanite Intraplate / Continental crust (>25 km)
3 210040 Calatrava Volcanic Field Spain Pyroclastic cone(s) Eruption Dated 3600 BCE Mediterranean and Western Asia Western Europe 38.870 -4.02 1117 Basalt / Picro-Basalt Intraplate / Continental crust (>25 km)
4 211001 Larderello Italy Explosion crater(s) Eruption Observed 1282 CE Mediterranean and Western Asia Italy 43.250 10.87 500 No Data (checked) Subduction zone / Continental crust (>25 km)

The 'Volcano Number' in this DataFrame is the same as in the previous one. This is a different problem from the one .concat( ) solves, because the new dataset has different columns. For this case, Pandas has a DataFrame method called .merge( ) that combines two DataFrames by matching values in a shared column. The DataFrame that .merge( ) is called on is the 'left' DataFrame, and the one passed to it as an argument is the 'right' DataFrame. The argument 'how' controls which rows are kept from each. The different types of .merge( ) actions are displayed in the image below (Source: StackOverflow).

In [27]:
Image(filename='Figures/join-types.jpg')
Out[27]:

The ConfirmedEruptions DataFrame can have the same volcano number multiple times (for multiple eruptions), but the VolcanoData DataFrame only has a single entry for each volcano. As such, we want a 'left' join that keeps every row of the ConfirmedEruptions DataFrame and attaches the matching information from the VolcanoData DataFrame to it. So the VolcanoData DataFrame is the 'right' DataFrame. We want to merge on the 'Volcano Number' column, which appears in both the left and right DataFrames.

In [28]:
MergedEruptions=ConfirmedEruptions.merge(VolcanoData,how='left',on='Volcano Number')
MergedEruptions.head()
Out[28]:
Volcano Number Volcano Name_x Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Primary Volcano Type Activity Evidence Last Known Eruption Region Subregion Latitude_y Longitude_y Elevation (m) Dominant Rock Type Tectonic Setting
0 357040 Planchon-Peteroa 22297 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Stratovolcano(es) Eruption Observed 2018 CE South America Central Chile and Argentina -35.223 -70.568 3977.0 Andesite / Basaltic Andesite Subduction zone / Continental crust (>25 km)
1 357090 Copahue 22299 Confirmed Eruption Agrio Crater NaN NaN NaN 2018 NaN ... Stratovolcano Eruption Observed 2018 CE South America Central Chile and Argentina -37.856 -71.183 2953.0 Trachybasalt / Tephrite Basanite Subduction zone / Continental crust (>25 km)
2 267020 Karangetang 22294 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Stratovolcano Eruption Observed 2018 CE Indonesia Sangihe Islands 2.781 125.407 1797.0 Andesite / Basaltic Andesite Subduction zone / Oceanic crust (< 15 km)
3 282050 Kuchinoerabujima 22296 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Stratovolcano(es) Eruption Observed 2018 CE Japan, Taiwan, Marianas Ryukyu Islands and Kyushu 30.443 130.217 657.0 Andesite / Basaltic Andesite Subduction zone / Oceanic crust (< 15 km)
4 268060 Gamalama 22295 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Stratovolcano(es) Eruption Observed 2018 CE Indonesia Halmahera 0.800 127.330 1715.0 Andesite / Basaltic Andesite Subduction zone / Oceanic crust (< 15 km)

5 rows × 36 columns
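To see what a left merge does on a small scale, here is a minimal sketch using two made-up tables (the names `eruptions` and `volcanoes` and all values in them are invented for illustration, not taken from the Smithsonian files). It shows two things that also appear in the output above: rows of the left table without a match get NaN in the new columns, and columns that exist in both tables but are not used as the key pick up the suffixes `_x` (left) and `_y` (right).

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values invented for illustration)
eruptions = pd.DataFrame({
    'Volcano Number': [1, 1, 2, 3],
    'Volcano Name': ['A', 'A', 'B', 'C'],
    'VEI': [2.0, 3.0, 1.0, 0.0],
})
volcanoes = pd.DataFrame({
    'Volcano Number': [1, 2],
    'Volcano Name': ['A', 'B'],
    'Tectonic Setting': ['Subduction zone', 'Rift zone'],
})

# how='left' keeps every eruption row; on= names the shared key column.
# validate='many_to_one' raises an error if a volcano number is repeated
# in the right table, which would silently duplicate eruption rows.
merged = eruptions.merge(volcanoes, how='left', on='Volcano Number',
                         validate='many_to_one')
print(merged)
```

Volcano number 3 has no entry in the right table, so its 'Tectonic Setting' is NaN after the merge. The number of rows is unchanged, which is a quick sanity check worth running after any left merge onto a table with one row per key.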

Using .unique( ) to find a list of categories, and string operations

Now that we have more information, we can start classifying these eruptions by type. For example, what tectonic settings are represented in this dataset? Pandas has a method called .unique( ) that allows us to find all the unique values in a column.

In [29]:
list(MergedEruptions['Tectonic Setting'].unique())
Out[29]:
['Subduction zone / Continental crust (>25 km)',
 'Subduction zone / Oceanic crust (< 15 km)',
 'Subduction zone / Intermediate crust (15-25 km)',
 'Subduction zone / Crustal thickness unknown',
 'Rift zone / Oceanic crust (< 15 km)',
 'Intraplate / Oceanic crust (< 15 km)',
 'Rift zone / Continental crust (>25 km)',
 'Intraplate / Intermediate crust (15-25 km)',
 'Rift zone / Intermediate crust (15-25 km)',
 'Intraplate / Continental crust (>25 km)',
 nan]

This tells us some useful information, including that some of the values are missing (shown as 'nan', short for 'not a number', in Pandish). We can get rid of these rows using the method .dropna( ).

In [30]:
MergedEruptions.dropna(subset=['Tectonic Setting'],inplace=True)
# inplace=True modifies MergedEruptions directly, so we don't have to assign the result to a new DataFrame
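As a small illustration of how .unique( ) and .dropna( ) treat missing values, the sketch below uses a made-up four-row DataFrame (the name `df` and its contents are invented). It uses the non-in-place form of .dropna( ), which returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing setting and one missing VEI
df = pd.DataFrame({
    'Volcano Name': ['A', 'B', 'C', 'D'],
    'Tectonic Setting': ['Rift zone', np.nan, 'Subduction zone', 'Rift zone'],
    'VEI': [1.0, 2.0, np.nan, 0.0],
})

settings = df['Tectonic Setting'].unique()        # NaN counts as one of the values
n_missing = df['Tectonic Setting'].isna().sum()   # how many rows we are about to lose

# Only rows with a missing 'Tectonic Setting' are dropped; the row with a
# missing VEI is kept because 'VEI' is not listed in subset
cleaned = df.dropna(subset=['Tectonic Setting'])

print(settings)
print(n_missing, len(df), len(cleaned))
```

Counting the missing values with .isna( ).sum( ) before dropping them is a cheap way to know how much data the filter removes.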

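Let's also get an approximate duration (in years) of each of these eruptions by converting the start and end dates to decimal years and subtracting the start from the end. The conversion is only approximate, because it treats every month as 1/12 of a year.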

In [31]:
MergedEruptions['StartTime']=MergedEruptions['Start Year']\
                +(MergedEruptions['Start Month']-1)/12+(MergedEruptions['Start Day']-1)/365.25
MergedEruptions['EndTime']=MergedEruptions['End Year']\
                +(MergedEruptions['End Month']-1)/12+(MergedEruptions['End Day']-1)/365.25
MergedEruptions['Duration']=MergedEruptions['EndTime']-MergedEruptions['StartTime']
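To make the arithmetic concrete, here is the same formula wrapped in a small helper (the name `decimal_year` is hypothetical, introduced only for this sketch). The example dates, 2 December 2018 to 7 December 2018, were picked because they reproduce the StartTime, EndTime, and Duration values listed for Copahue in the table further down.

```python
def decimal_year(year, month, day):
    """Approximate decimal year, using the same formula as the cell above."""
    return year + (month - 1) / 12 + (day - 1) / 365.25

start = decimal_year(2018, 12, 2)
end = decimal_year(2018, 12, 7)
duration_years = end - start
duration_days = duration_years * 365.25   # convert back to days for readability

print(round(start, 6), round(end, 6), round(duration_years, 6))
print(round(duration_days, 1))
```

Because the formula uses ordinary arithmetic, a missing month or day (NaN) makes the whole result NaN. Eruptions without a complete start and end date therefore end up with no Duration.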

What if we wanted to know about all the eruptions that occurred in a subduction zone setting? A simple equality test won't work here, because each 'Tectonic Setting' string also includes the crustal type. Fortunately, Pandas columns have a .str accessor that provides many string methods, one of which is .str.contains(SUBSTRING). The .str.contains(SUBSTRING) method returns True for each cell in the column that contains SUBSTRING, for example, 'Subduction'.

For more information on Pandas string methods, see https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html

So now we fish out all the records that have a 'Tectonic Setting' string that contains 'Subduction':

In [32]:
Subduction=MergedEruptions.loc[MergedEruptions['Tectonic Setting'].str.contains('Subduction')]
Subduction.head()
Out[32]:
Volcano Number Volcano Name_x Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Region Subregion Latitude_y Longitude_y Elevation (m) Dominant Rock Type Tectonic Setting StartTime EndTime Duration
0 357040 Planchon-Peteroa 22297 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... South America Central Chile and Argentina -35.223 -70.568 3977.0 Andesite / Basaltic Andesite Subduction zone / Continental crust (>25 km) 2018.957734 2018.957734 0.000000
1 357090 Copahue 22299 Confirmed Eruption Agrio Crater NaN NaN NaN 2018 NaN ... South America Central Chile and Argentina -37.856 -71.183 2953.0 Trachybasalt / Tephrite Basanite Subduction zone / Continental crust (>25 km) 2018.919405 2018.933094 0.013689
2 267020 Karangetang 22294 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Indonesia Sangihe Islands 2.781 125.407 1797.0 Andesite / Basaltic Andesite Subduction zone / Oceanic crust (< 15 km) 2018.899042 2018.899042 0.000000
3 282050 Kuchinoerabujima 22296 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Japan, Taiwan, Marianas Ryukyu Islands and Kyushu 30.443 130.217 657.0 Andesite / Basaltic Andesite Subduction zone / Oceanic crust (< 15 km) 2018.804757 2018.963210 0.158453
4 268060 Gamalama 22295 Confirmed Eruption NaN NaN NaN NaN 2018 NaN ... Indonesia Halmahera 0.800 127.330 1715.0 Andesite / Basaltic Andesite Subduction zone / Oceanic crust (< 15 km) 2018.758214 2018.763689 0.005476

5 rows × 39 columns
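Two optional arguments of .str.contains( ) are worth knowing about when filtering strings like these. The sketch below runs on a small made-up Series (the name `settings` is invented; the strings are copied from the list of tectonic settings above). By default the search text is treated as a regular expression, so characters such as parentheses have a special meaning unless regex=False is passed, and missing values produce NaN in the mask unless na=False is passed.

```python
import numpy as np
import pandas as pd

settings = pd.Series([
    'Subduction zone / Continental crust (>25 km)',
    'Intraplate / Oceanic crust (< 15 km)',
    np.nan,
    'Rift zone / Oceanic crust (< 15 km)',
])

# na=False treats missing values as non-matches, so the mask is purely boolean
is_subduction = settings.str.contains('Subduction', na=False)

# regex=False matches the text literally; without it, '(' and ')' would be
# interpreted as regular-expression grouping characters
is_thin_oceanic = settings.str.contains('Oceanic crust (< 15 km)',
                                        regex=False, na=False)

print(settings[is_subduction])
print(settings[is_thin_oceanic])
```

In the lecture's DataFrame the missing settings were already removed with .dropna( ), so na=False is not needed there, but it avoids an error if the filter is applied before the missing values are dropped.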

Let's compare these to volcanoes found in intraplate oceanic crust settings.

In [33]:
IntraplateOceanic=MergedEruptions.loc[MergedEruptions['Tectonic Setting'].str.contains('Intraplate / Oceanic crust')]
IntraplateOceanic.head()
Out[33]:
Volcano Number Volcano Name_x Eruption Number Eruption Category Area of Activity VEI VEI Modifier Start Year Modifier Start Year Start Year Uncertainty ... Region Subregion Latitude_y Longitude_y Elevation (m) Dominant Rock Type Tectonic Setting StartTime EndTime Duration
27 233020 Fournaise, Piton de la 22261 Confirmed Eruption NaN 0.0 NaN NaN 2018 NaN ... Middle East and Indian Ocean Indian Ocean (western) -21.244 55.708 2632.0 Basalt / Picro-Basalt Intraplate / Oceanic crust (< 15 km) 2018.255476 2018.833333 0.577858
47 233020 Fournaise, Piton de la 22224 Confirmed Eruption NaN 0.0 NaN NaN 2017 NaN ... Middle East and Indian Ocean Indian Ocean (western) -21.244 55.708 2632.0 Basalt / Picro-Basalt Intraplate / Oceanic crust (< 15 km) 2017.535592 2017.657255 0.121663
63 233020 Fournaise, Piton de la 22184 Confirmed Eruption NaN 0.0 NaN NaN 2017 NaN ... Middle East and Indian Ocean Indian Ocean (western) -21.244 55.708 2632.0 Basalt / Picro-Basalt Intraplate / Oceanic crust (< 15 km) 2017.082136 2017.154517 0.072382
80 233020 Fournaise, Piton de la 22145 Confirmed Eruption NaN 0.0 NaN NaN 2016 NaN ... Middle East and Indian Ocean Indian Ocean (western) -21.244 55.708 2632.0 Basalt / Picro-Basalt Intraplate / Oceanic crust (< 15 km) 2016.694045 2016.713210 0.019165
87 233020 Fournaise, Piton de la 22144 Confirmed Eruption Château Fort crater 0.0 NaN NaN 2016 NaN ... Middle East and Indian Ocean Indian Ocean (western) -21.244 55.708 2632.0 Basalt / Picro-Basalt Intraplate / Oceanic crust (< 15 km) 2016.401780 2016.404517 0.002738

5 rows × 39 columns

We can plot 'Start Year' and 'VEI' of both of these types of eruptions to see if there are any relationships with tectonic setting. This time we'll use the plt.scatter() function (a new plotting method for your enjoyment).

In [34]:
help(plt.scatter)
Help on function scatter in module matplotlib.pyplot:

scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None, *, data=None, **kwargs)
    A scatter plot of *y* vs *x* with varying marker size and/or color.
    
    Parameters
    ----------
    x, y : array_like, shape (n, )
        The data positions.
    
    s : scalar or array_like, shape (n, ), optional
        The marker size in points**2.
        Default is ``rcParams['lines.markersize'] ** 2``.
    
    c : color, sequence, or sequence of color, optional
        The marker color. Possible values:
    
        - A single color format string.
        - A sequence of color specifications of length n.
        - A sequence of n numbers to be mapped to colors using *cmap* and
          *norm*.
        - A 2-D array in which the rows are RGB or RGBA.
    
        Note that *c* should not be a single numeric RGB or RGBA sequence
        because that is indistinguishable from an array of values to be
        colormapped. If you want to specify the same RGB or RGBA value for
        all points, use a 2-D array with a single row.  Otherwise, value-
        matching will have precedence in case of a size matching with *x*
        and *y*.
    
        Defaults to ``None``. In that case the marker color is determined
        by the value of ``color``, ``facecolor`` or ``facecolors``. In case
        those are not specified or ``None``, the marker color is determined
        by the next color of the ``Axes``' current "shape and fill" color
        cycle. This cycle defaults to :rc:`axes.prop_cycle`.
    
    marker : `~matplotlib.markers.MarkerStyle`, optional
        The marker style. *marker* can be either an instance of the class
        or the text shorthand for a particular marker.
        Defaults to ``None``, in which case it takes the value of
        :rc:`scatter.marker` = 'o'.
        See `~matplotlib.markers` for more information about marker styles.
    
    cmap : `~matplotlib.colors.Colormap`, optional, default: None
        A `.Colormap` instance or registered colormap name. *cmap* is only
        used if *c* is an array of floats. If ``None``, defaults to rc
        ``image.cmap``.
    
    norm : `~matplotlib.colors.Normalize`, optional, default: None
        A `.Normalize` instance is used to scale luminance data to 0, 1.
        *norm* is only used if *c* is an array of floats. If *None*, use
        the default `.colors.Normalize`.
    
    vmin, vmax : scalar, optional, default: None
        *vmin* and *vmax* are used in conjunction with *norm* to normalize
        luminance data. If None, the respective min and max of the color
        array is used. *vmin* and *vmax* are ignored if you pass a *norm*
        instance.
    
    alpha : scalar, optional, default: None
        The alpha blending value, between 0 (transparent) and 1 (opaque).
    
    linewidths : scalar or array_like, optional, default: None
        The linewidth of the marker edges. Note: The default *edgecolors*
        is 'face'. You may want to change this as well.
        If *None*, defaults to rcParams ``lines.linewidth``.
    
    edgecolors : color or sequence of color, optional, default: 'face'
        The edge color of the marker. Possible values:
    
        - 'face': The edge color will always be the same as the face color.
        - 'none': No patch boundary will be drawn.
        - A matplotib color.
    
        For non-filled markers, the *edgecolors* kwarg is ignored and
        forced to 'face' internally.
    
    Returns
    -------
    paths : `~matplotlib.collections.PathCollection`
    
    Other Parameters
    ----------------
    **kwargs : `~matplotlib.collections.Collection` properties
    
    See Also
    --------
    plot : To plot scatter plots when markers are identical in size and
        color.
    
    Notes
    -----
    
    * The `.plot` function will be faster for scatterplots where markers
      don't vary in size or color.
    
    * Any or all of *x*, *y*, *s*, and *c* may be masked arrays, in which
      case all masks will be combined and only unmasked points will be
      plotted.
    
    * Fundamentally, scatter works with 1-D arrays; *x*, *y*, *s*, and *c*
      may be input as 2-D arrays, but within scatter they will be
      flattened. The exception is *c*, which will be flattened only if its
      size matches the size of *x* and *y*.
    
    .. note::
        In addition to the above described arguments, this function can take a
        **data** keyword argument. If such a **data** argument is given, the
        following arguments are replaced by **data[<arg>]**:
    
        * All arguments with the following names: 'c', 'color', 'edgecolors', 'facecolor', 'facecolors', 'linewidths', 's', 'x', 'y'.
    
        Objects passed as **data** must support item access (``data[<arg>]``) and
        membership test (``<arg> in data``).

In [35]:
# plot the Subduction dates
plt.scatter(Subduction['Start Year'],Subduction['VEI'],c='orange',label='Subduction Zones')
# plot the Intraplate dates
plt.scatter(IntraplateOceanic['Start Year'],IntraplateOceanic['VEI'],c='cyan',\
            label='Intraplate Oceanic')
plt.xlabel('Start Year')
plt.ylabel('VEI')
plt.ylim(-1,9)
plt.legend();
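Because VEI only takes whole-number values, many markers in a plot like this land exactly on top of each other. One common remedy, sketched here with randomly generated stand-in data (the arrays `years` and `vei` are made up and are not the eruption data), is to use the `s` and `alpha` arguments described in the help text above, so that dense regions appear darker.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                  # seeded so the figure is reproducible
years = rng.integers(-10000, 2019, size=200)    # made-up start years
vei = rng.integers(0, 8, size=200)              # made-up integer VEI values

fig, ax = plt.subplots()
# s is the marker area in points**2; alpha < 1 makes markers semi-transparent,
# so overlapping points build up a darker colour
ax.scatter(years[:100], vei[:100], s=20, alpha=0.3, c='orange', label='Group A')
ax.scatter(years[100:], vei[100:], s=20, alpha=0.3, c='cyan', label='Group B')
ax.set_xlabel('Start Year')
ax.set_ylabel('VEI')
ax.set_ylim(-1, 9)
ax.legend()
```

This version calls the methods on an Axes object rather than the plt functions used above. Both styles produce the same kind of figure.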

As we might expect, volcanoes in intraplate oceanic settings are (mostly) less explosive than those in subduction zone settings. This is partially because the lavas produced in intraplate oceanic settings are less viscous and trap less gas.

.groupby( ) and .describe( )

Pandas has a couple more methods that might be useful for looking at the distribution of these data. These are the .groupby( ) and .describe( ) methods. We can use these methods to look at the average VEI for each tectonic setting in our dataset.

.groupby( ) groups things in your DataFrame by unique values in a Series, for example grouping everything by 'Tectonic Setting'. .describe( ) summarizes some useful statistics. So if we wanted to know basic statistics for each tectonic setting (and who wouldn't?), we would do:

In [39]:
MergedEruptions.groupby('Tectonic Setting')['VEI'].describe()
Out[39]:
count mean std min 25% 50% 75% max
Tectonic Setting
Intraplate / Continental crust (>25 km) 83.0 2.108434 1.530280 0.0 1.5 2.0 3.0 7.0
Intraplate / Intermediate crust (15-25 km) 11.0 2.090909 0.539360 1.0 2.0 2.0 2.0 3.0
Intraplate / Oceanic crust (< 15 km) 485.0 0.709278 0.953472 0.0 0.0 0.0 2.0 5.0
Rift zone / Continental crust (>25 km) 148.0 1.425676 1.299352 0.0 0.0 1.0 2.0 6.0
Rift zone / Intermediate crust (15-25 km) 17.0 1.588235 1.277636 0.0 0.0 2.0 2.0 4.0
Rift zone / Oceanic crust (< 15 km) 475.0 1.631579 1.507034 0.0 0.0 2.0 2.0 6.0
Subduction zone / Continental crust (>25 km) 5036.0 2.214654 1.037046 0.0 2.0 2.0 3.0 7.0
Subduction zone / Crustal thickness unknown 336.0 1.723214 0.886480 0.0 1.0 2.0 2.0 6.0
Subduction zone / Intermediate crust (15-25 km) 340.0 2.064706 0.981505 0.0 2.0 2.0 2.0 6.0
Subduction zone / Oceanic crust (< 15 km) 689.0 1.952104 1.205257 0.0 1.0 2.0 3.0 7.0
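The following sketch repeats the same pattern on a made-up five-row DataFrame (the name `df` and its values are invented), where the numbers are easy to check by hand. It also shows .agg( ), which returns only the statistics you ask for.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one Subduction eruption has no VEI assigned
df = pd.DataFrame({
    'Tectonic Setting': ['Rift', 'Rift', 'Subduction', 'Subduction', 'Subduction'],
    'VEI': [0.0, 2.0, 2.0, 4.0, np.nan],
})

grouped = df.groupby('Tectonic Setting')['VEI']

summary = grouped.describe()                        # count, mean, std, min, quartiles, max
custom = grouped.agg(['count', 'mean', 'median'])   # only the statistics requested

print(summary)
print(custom)
```

Note that the count for 'Subduction' is 2, not 3: missing values are skipped. For the same reason, the count column for VEI in a real dataset can be smaller than the number of eruptions in each group.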

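If we use these methods on a column of strings, .describe( ) instead reports, for each tectonic setting, the number of non-missing entries (count), the number of distinct values (unique), the most common value (top), and how often that value occurs (freq).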

In [40]:
MergedEruptions.groupby('Tectonic Setting')['Primary Volcano Type'].describe()
Out[40]:
count unique top freq
Tectonic Setting
Intraplate / Continental crust (>25 km) 210 11 Stratovolcano 97
Intraplate / Intermediate crust (15-25 km) 27 2 Fissure vent(s) 15
Intraplate / Oceanic crust (< 15 km) 599 5 Shield 499
Rift zone / Continental crust (>25 km) 234 12 Shield 82
Rift zone / Intermediate crust (15-25 km) 21 5 Stratovolcano 7
Rift zone / Oceanic crust (< 15 km) 749 11 Stratovolcano 222
Subduction zone / Continental crust (>25 km) 6310 24 Stratovolcano 3287
Subduction zone / Crustal thickness unknown 366 10 Stratovolcano 125
Subduction zone / Intermediate crust (15-25 km) 423 8 Stratovolcano 267
Subduction zone / Oceanic crust (< 15 km) 854 9 Stratovolcano 471

This tells us that out of 599 intraplate oceanic crust eruptions, 499 (about 83%) were at shield volcanoes, but only about half (3287 of 6310, or 52%) of our subduction zone continental crust eruptions were at stratovolcanoes.
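If the fractions themselves are what we are after, one option is to let Pandas compute them with .value_counts(normalize=True) on the grouped column. This is a minimal sketch on made-up data (the name `df` and its eight rows are invented), not a rerun of the eruption dataset.

```python
import pandas as pd

# Hypothetical toy data: four eruptions in each of two settings
df = pd.DataFrame({
    'Tectonic Setting': ['Intraplate'] * 4 + ['Subduction'] * 4,
    'Primary Volcano Type': ['Shield', 'Shield', 'Shield', 'Caldera',
                             'Stratovolcano', 'Stratovolcano', 'Caldera', 'Shield'],
})

# Fraction of eruptions of each volcano type within each tectonic setting
fractions = (df.groupby('Tectonic Setting')['Primary Volcano Type']
               .value_counts(normalize=True))

print(fractions)
print(fractions.loc[('Intraplate', 'Shield')])
```

The result is indexed by both the setting and the volcano type, so a single fraction can be looked up with a (setting, type) pair, as in the last line.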

In [ ]:
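# Unfinished scratch cell. Assumed intent: plot VEI against start year for the confirmed eruptions
StH=ConfirmedEruptions
plt.scatter(StH['Start Year'],StH['VEI'])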