This notebook contains an analysis of the NOAA Deep Sea Coral dataset. I came across this dataset and wanted to see what I could learn from it.
import pandas as pd
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
coral_data = pd.read_csv('deep_sea_corals_data.csv')
/Users/kyle/projects/earthscience/notebooks/env/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3146: DtypeWarning: Columns (3,7,11,15,16,17,19,20,21,22,23,25,26,27,28,29,30,31,32,33,35,36,37) have mixed types. Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
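The DtypeWarning above happens because pandas infers dtypes chunk by chunk, and sentinel strings like '-999' make some columns come out mixed. One hedged way around it (on a toy CSV, not the real file) is to read everything as strings and convert columns deliberately afterwards:

```python
import io
import pandas as pd

# Toy CSV whose second column mixes numeric and non-numeric tokens,
# similar to the '-999' / 'NA' sentinels in the coral data.
csv_text = "id,DepthInMeters\n1,120.5\n2,-999\n3,NA\n"

# dtype=str disables per-chunk dtype guessing (no DtypeWarning);
# each column can then be converted to its real type one at a time.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df.dtypes)
```

Setting `low_memory=False` on `read_csv`, as the warning itself suggests, is the other common fix.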
Drop the first row (index 0) of the table, since it only contains additional metadata about the columns.
coral_data = coral_data.drop(axis=0, index=0)
There are 40 columns in this dataset. What kind of information can we learn from it?
coral_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 762132 entries, 1 to 762132
Data columns (total 40 columns):
 #   Column                  Non-Null Count   Dtype
---  ------                  --------------   -----
 0   ShallowFlag             762132 non-null  float64
 1   DatasetID               762132 non-null  object
 2   CatalogNumber           762132 non-null  float64
 3   SampleID                618498 non-null  object
 4   Repository              755149 non-null  object
 5   ScientificName          762132 non-null  object
 6   VernacularNameCategory  762132 non-null  object
 7   VernacularName          272594 non-null  object
 8   TaxonRank               762132 non-null  object
 9   Family                  537880 non-null  object
 10  Genus                   473151 non-null  object
 11  Species                 217021 non-null  object
 12  Ocean                   762132 non-null  object
 13  Country                 739684 non-null  object
 14  Locality                640014 non-null  object
 15  latitude                762132 non-null  object
 16  longitude               762132 non-null  object
 17  DepthInMeters           762132 non-null  object
 18  ObservationDate         762100 non-null  object
 19  SurveyID                514760 non-null  object
 20  Purpose                 444925 non-null  object
 21  SurveyComments          145271 non-null  object
 22  Station                 452352 non-null  object
 23  EventID                 724008 non-null  object
 24  SamplingEquipment       743855 non-null  object
 25  Cover                   12862 non-null   object
 26  VerbatimSize            165324 non-null  object
 27  MinimumSize             762132 non-null  object
 28  MaximumSize             762132 non-null  object
 29  Condition               465113 non-null  object
 30  Habitat                 330366 non-null  object
 31  Temperature             762132 non-null  object
 32  Salinity                762132 non-null  object
 33  Oxygen                  762132 non-null  object
 34  pH                      762132 non-null  float64
 35  pCO2                    762132 non-null  object
 36  TA                      762132 non-null  object
 37  DIC                     762132 non-null  object
 38  RecordType              762132 non-null  object
 39  DataProvider            762132 non-null  object
dtypes: float64(3), object(37)
memory usage: 238.4+ MB
The ObservationDate column contains date strings in the format 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'. The function below normalizes these values into datetime objects.
from datetime import datetime
import math

def clean_date(date):
    """ Used to clean the observation date of the coral """
    if isinstance(date, float) and math.isnan(date):
        return date  # skip NaN values
    if date == '-999':
        return float('nan')  # -999 is the dataset's missing-value sentinel
    split_date = date.split('-')
    if len(split_date) == 1:       # 'YYYY'
        year = int(split_date[0])
        month = 1
        day = 1
    elif len(split_date) == 2:     # 'YYYY-MM'
        year = int(split_date[0])
        month = int(split_date[1])
        day = 1
    else:                          # 'YYYY-MM-DD'
        year = int(split_date[0])
        month = int(split_date[1])
        day = int(split_date[2])
    return datetime(year=year, month=month, day=day)

coral_data['ObservationDate'] = coral_data['ObservationDate'].map(clean_date)
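A quick standalone sanity check of the three formats and the sentinel (the function is re-declared here, in a slightly compacted but equivalent form, so the snippet runs on its own):

```python
import math
from datetime import datetime

def clean_date(date):
    """Normalize 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' strings to datetimes."""
    if isinstance(date, float) and math.isnan(date):
        return date  # pass NaN through untouched
    if date == '-999':
        return float('nan')  # the dataset's missing-value sentinel
    parts = [int(p) for p in date.split('-')]
    # Default any missing month/day component to 1
    year, month, day = parts + [1] * (3 - len(parts))
    return datetime(year, month, day)

print(clean_date('1998'))        # year only: defaults to January 1st
print(clean_date('2004-06'))     # year-month: defaults to the 1st
print(clean_date('2011-03-14'))  # full date
```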
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_title('Number Of Observations Over Time')
ax.set_ylabel('Observations')
ax.set_xlabel('Year')

# One histogram bin per year between the earliest and latest observation
n_years = coral_data['ObservationDate'].max().year - coral_data['ObservationDate'].min().year
_ = ax.hist(
    coral_data['ObservationDate'],
    bins=n_years,
)
def clean_geopoints(point):
    """ Clean lat/lon points, returning NaN for unparseable values """
    try:
        return float(point)
    except Exception:
        return float('nan')

coral_data['longitude'] = coral_data['longitude'].map(clean_geopoints)
coral_data['latitude'] = coral_data['latitude'].map(clean_geopoints)
def to_float(val):
    """ Convert values to float if possible, mapping failures and the -999 sentinel to NaN """
    try:
        val = float(val)
    except (TypeError, ValueError):
        return float('nan')
    if val == -999.0:
        return float('nan')
    return val

coral_data['Temperature'] = coral_data['Temperature'].map(to_float)
coral_data['Oxygen'] = coral_data['Oxygen'].map(to_float)
coral_data['DepthInMeters'] = coral_data['DepthInMeters'].map(to_float)
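As an aside, pandas can do the same sentinel-aware conversion in one vectorized pass. A small sketch on a toy series (not the real columns): `to_numeric` with `errors='coerce'` turns anything unparseable into NaN, and `replace` then maps the -999 sentinel to NaN as well.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the string-typed sensor columns
raw = pd.Series(['12.5', '-999', 'not-a-number', '3.0'])

# Coerce unparseable strings to NaN, then replace the -999 sentinel
clean = pd.to_numeric(raw, errors='coerce').replace(-999.0, np.nan)
print(clean)
```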
We have data from all over the world. Let's see exactly where these observations are...
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.coastlines()
ax.add_feature(cfeature.LAND)
ax.gridlines(draw_labels=True)
ax.scatter(
    x=coral_data['longitude'],
    y=coral_data['latitude'],
)
Let's take a look at how temperature readings have changed over time. A normal scatter plot will do nicely.
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_title('Temperature Readings (C) Over Time')
ax.set_ylabel('Temperature (C)')
ax.set_xlabel('Year')
ax.grid()
ax.scatter(
    x=coral_data['ObservationDate'],
    y=coral_data['Temperature'],
)
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.coastlines()
ax.add_feature(cfeature.LAND)
ax.gridlines(draw_labels=True)
ax.scatter(
    x=coral_data['longitude'],
    y=coral_data['latitude'],
    c=coral_data['Temperature'], cmap='inferno',
)
Most points are concentrated in the Pacific Ocean. Let's center the map there.
proj = ccrs.PlateCarree(central_longitude=-180)
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1, projection=proj)
ax.set_extent([120, -60, -30, 80])
ax.coastlines()
ax.add_feature(cfeature.LAND)
ax.gridlines(draw_labels=True)
# transform=ccrs.PlateCarree() tells cartopy the lon/lat values are in
# plain geographic coordinates, so it reprojects them to the shifted map
# itself instead of us transforming each point by hand
scat = ax.scatter(
    x=coral_data['longitude'],
    y=coral_data['latitude'],
    c=coral_data['Temperature'], cmap='inferno',
    transform=ccrs.PlateCarree(),
)
fig.colorbar(scat)
So it looks like the Atlantic Ocean temperature readings are higher than most of the Pacific Ocean readings. I will split the dataset by ocean and look at how temperature has changed over time in each. It is possible the Atlantic readings were simply recorded more recently than the Pacific ones, which would make temperatures appear to be rising overall when the change is really driven by where the readings were taken.
atl_coral = coral_data[coral_data['Ocean'].isin(['North Atlantic', 'South Atlantic'])]
pac_coral = coral_data[coral_data['Ocean'].isin(['North Pacific', 'South Pacific'])]
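One way to check the sampling-artifact hypothesis directly is to compare yearly mean temperatures within each ocean. A sketch on a toy stand-in frame (the real notebook would run this on `coral_data` with its actual `Ocean`, `ObservationDate`, and `Temperature` columns):

```python
from datetime import datetime
import pandas as pd

# Toy stand-in with the three columns the real computation needs
df = pd.DataFrame({
    'Ocean': ['North Atlantic', 'North Atlantic',
              'North Pacific', 'North Pacific'],
    'ObservationDate': [datetime(2005, 1, 1), datetime(2015, 1, 1),
                        datetime(2005, 1, 1), datetime(2015, 1, 1)],
    'Temperature': [8.0, 9.0, 4.0, 4.5],
})

# Mean temperature per ocean per observation year: if a trend survives
# within a single ocean, it is not just a sampling-location artifact
yearly = (df
          .assign(Year=df['ObservationDate'].map(lambda d: d.year))
          .groupby(['Ocean', 'Year'])['Temperature']
          .mean())
print(yearly)
```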
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(2, 1, 1)
ax.set_title('Atlantic Ocean Temperature Readings (C) Over Time')
ax.set_ylabel('Temperature (C)')
ax.set_xlabel('Year')
ax.grid()
ax.scatter(
    x=atl_coral['ObservationDate'],
    y=atl_coral['Temperature'],
)

ax2 = fig.add_subplot(2, 1, 2)
ax2.set_title('Pacific Ocean Temperature Readings (C) Over Time')
ax2.set_ylabel('Temperature (C)')
ax2.set_xlabel('Year')
ax2.grid()
ax2.scatter(
    x=pac_coral['ObservationDate'],
    y=pac_coral['Temperature'],
)
fig.tight_layout()  # keep the stacked subplot titles and labels from overlapping
So it looks like the higher overall temperatures are due to the Atlantic Ocean readings. It doesn't seem that there is a general upwards trend in the individual oceans.
Is there any interesting information we can find out via the oxygen content? Let's find out.
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(2, 1, 1)
ax.grid(linestyle='-')
ax.set_title('Pacific Ocean Dissolved Oxygen Content (mg/L) Over Time')
ax.set_xlabel('Year')
ax.set_ylabel('Dissolved Oxygen Content (mg/L)')
ax.scatter(
    x=pac_coral['ObservationDate'],
    y=pac_coral['Oxygen'],
)

ax2 = fig.add_subplot(2, 1, 2)
ax2.set_title('Atlantic Ocean Dissolved Oxygen Content (mg/L) Over Time')
ax2.set_xlabel('Year')
ax2.set_ylabel('Dissolved Oxygen Content (mg/L)')
ax2.scatter(
    x=atl_coral['ObservationDate'],
    y=atl_coral['Oxygen'],
)
fig.tight_layout()  # keep the stacked subplot titles and labels from overlapping
Hmm, I'm not sure what to make of this. I will have to think about a different way to look at this data to learn anything from it.
(coral_data
 .groupby('Species')
 .agg({'Species': 'count', 'DepthInMeters': 'mean'})
 .rename(columns={'Species': 'count'})
 .sort_values(by='count', ascending=False))
| Species | count | DepthInMeters |
|---|---|---|
| ritteri | 25216 | 896.645463 |
| pacifica | 13479 | 625.775428 |
| pertusa | 12596 | 493.537790 |
| occa | 9630 | 1275.515680 |
| lindahlii | 6421 | 843.663292 |
| ... | ... | ... |
| halmaheirense | 1 | 1089.000000 |
| hamanni | 1 | 121.000000 |
| hamatum | 1 | 210.000000 |
| haswelli | 1 | 339.000000 |
| zyggompha | 1 | 125.000000 |

1872 rows × 2 columns
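The count-plus-mean aggregation above can be checked on a toy frame; this uses pandas' named-aggregation spelling, which is equivalent to the `agg`/`rename` chain (`mean_depth` is just an illustrative column name):

```python
import pandas as pd

# Toy stand-in with two species
toy = pd.DataFrame({
    'Species': ['pertusa', 'pertusa', 'ritteri'],
    'DepthInMeters': [400.0, 600.0, 900.0],
})

# Per-species observation count and mean depth, most observed first
summary = (toy.groupby('Species')
              .agg(count=('DepthInMeters', 'size'),
                   mean_depth=('DepthInMeters', 'mean'))
              .sort_values('count', ascending=False))
print(summary)
```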
Where to from here? I will continue to explore this data as I come up with more ideas.