Palmer LTER Inorganic Analysis

Antarctica's western peninsula is a rapidly changing ecosystem. To quantify these changes and prepare for their global biogeochemical and ecological costs, we must look at long-term trends. Here is an analysis of 15 years of inorganic nutrient data taken from Palmer LTER January cruises. The cruise follows a regional grid along the western Antarctic Peninsula (wAP) as follows:

In [90]:

The software in this notebook is intended to help standardize data to expedite analysis and high-quality research. Specifically, this iPython notebook loads the Palmer Inorganic Nutrient dataset into a Pandas dataframe, cleans and standardizes measurements according to a simplified grid, provides annual depth profiles by station, and interpolates station data between and throughout years to get a glimpse of regional inorganic nutrient patterns. CTD data is also used to estimate Mixed Layer Depth (MLD) using gradient and threshold methods. The combined NO2+NO3 is integrated from the surface to the MLD for each station annually to quantify the nutrients removed from the water column by phytoplankton (notably diatoms). From there, we can employ the stoichiometric ratio of marine biota, known as the Redfield Ratio, to calculate how much CO2 is drawn down by this process, called Net Community Production (NCP).
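As a rough sketch of that calculation (all profile values and the 50 m MLD below are made up for illustration), the integrated NO2+NO3 drawdown converts to carbon via the Redfield C:N ratio of 106:16:

```python
import numpy as np

# Hypothetical summer NO2+NO3 profile from the surface down to an assumed 50 m MLD.
depth = np.array([0.0, 10.0, 25.0, 50.0])     # m
no3   = np.array([18.0, 18.5, 22.0, 24.0])    # mmol N m^-3 (made-up values)
deep_no3 = 24.0                               # mmol N m^-3, assumed pre-bloom concentration

# Trapezoidal integration of the observed profile over the mixed layer.
inventory = np.sum((no3[1:] + no3[:-1]) / 2 * np.diff(depth))   # mmol N m^-2

# Drawdown = pre-bloom column inventory minus the observed inventory,
# converted to carbon with the Redfield C:N ratio of 106:16.
drawdown_n = deep_no3 * depth[-1] - inventory   # mmol N m^-2
ncp_c = drawdown_n * 106.0 / 16.0               # mmol C m^-2

print(round(drawdown_n, 2), round(ncp_c, 2))    # 138.75 919.22
```

The same arithmetic is applied per station and per year later in the notebook, with the MLD estimated from CTD data rather than assumed.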
All files necessary to run this notebook can be found in the github repo; all data was downloaded from UCSD's Datazoo, and additional information is available from Palmer LTER.

This notebook is intended to be downloaded and run on your personal machine OR viewed from Jupyter's nbviewer; because of this, the indexing links from the table of contents will not work on github!

Last edited 01.06.16 by Leon Yin.

I want to thank Hugh Ducklow and Ryan Abernathey for their expert guidance.

In [1]:
import os
import json
import glob

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import as cm
from mpl_toolkits.basemap import Basemap
import seaborn
import colorsys
from scipy.interpolate import griddata
from scipy.integrate import simps
from numpy import trapz
from IPython.display import Image

%matplotlib inline
In [2]:
# Global variables
plt.rcParams['figure.figsize'] = (15,9)
skip          = -999                                     # skip value in the dataset
mark          = ['o','p','d','v','s']                    # markers for scatterplots
surfDepth     = 10                                        #(m) depth of surface
maxDepth      = 250
cMap          = 'RdBu_r'
outOfGrid     = ['Trap','Rothera','B','E','notFound','outGrid']
dsPath        = 'data_sets/'                             # data set directory

Let's read the grid into a Pandas dataframe so we can work with it, and define some helper functions so we can continue to do so.

In [11]:
def read_grd(fileIn,col):
    try:
        grid = pd.read_csv(fileIn,sep=',',skiprows=1,names=col,na_filter=True)
        grid = grid[grid.stationID.isin(outOfGrid)==False]
        return grid
    except IOError:
        print (fileIn,"is either unreadable or DNE.")
        raise SystemExit(0)

def draw_wAP(zoom=False,lat=-66.90,lon=-67.05):
    # Stereographic map of the wAP; zoomed-in view when requested.
    w,h = (320000,180000) if zoom else (1800000,1000000)
    map = Basemap(width=w,height=h,projection='stere',resolution='l',
                  lat_0=lat,lon_0=lon)
    map.drawcoastlines()
    return map

def plt_grd(grid_df):
    map = draw_wAP(zoom=False,lat=-65.40,lon=-67.05)
    x1 = grid_df['lon'].values
    y1 = grid_df['lat'].values
    labels = grid_df['stationID']
    xS, yS = map(x1, y1)
    map.scatter(xS, yS, marker='o',s=26,color='k')

    for label, xpt, ypt in zip(labels, xS, yS):
        plt.text(xpt-70000, ypt-5000, label,size=8)
    plt.title('Palmer LTER Regional Grid')
In [13]:
# files
gridInfile = os.path.join(dsPath,'Basic Western Antarctic Peninsula Survey Grid.csv')
gridCol   = ["studyName","gridLine","gridStation","stationID","lat","lon","gridCode"]
In [5]:
# first read in the grid standard
grid = read_grd(gridInfile,gridCol)

Something is not right with the station names, so let's fix it!

In [14]:
newGrid = read_grd(gridInfile,gridCol)
newGrid.gridStation = newGrid.gridStation.astype(int).apply(lambda x: '{0:0>3}'.format(x))
newGrid.gridLine = newGrid.gridLine.astype(int).apply(lambda x: '{0:0>3}'.format(x))
newGrid['stationID'] = newGrid.gridLine+"."+newGrid.gridStation
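The `'{0:0>3}'` format spec above pads each grid line and station number to a fixed three-character width, so the rebuilt stationIDs (e.g. `600.040`) match and sort consistently. A quick illustration:

```python
# '{0:0>3}' right-aligns the value and pads it with zeros to a width of 3.
for n in (4, 40, 400):
    print('{0:0>3}'.format(n))   # 004, 040, 400
```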

Now let's take a look at the cruise measurements from 2000-onwards.

In [3]:
# CSV file from Datazoo
inOrgInfile = os.path.join(dsPath,'Dissolved Inorganic Nutrients.csv')

# Column formats for inorganic nutrients
inOrgCol    = ['studyName','eventNum','cast','bottle','datetime','stationID','gridLine',
               'gridStation','lat','lon','depth','PO4','SiO4','NO2','NO3','NH4',
               'Nitrite','notes']   # column names reconstructed from the dataframe preview below
In [4]:
def read_inOrg(infile,cols):
    try:
        # Import the file
        inOrg = pd.read_csv(infile,skiprows=1,names=cols)
        # Munging...
        inOrg = inOrg[(inOrg['lon'].notnull()) & (inOrg['lat'].notnull()) &
                      (inOrg['depth']!=skip) & (inOrg['depth'].notnull())]
        # Time keeping...
        inOrg = inOrg[inOrg.datetime>'1999-12-31 00:00:00'] # start date
        inOrg.datetime = pd.to_datetime(inOrg.datetime,format='%Y-%m-%d %H:%M:%S',exact=True)
        inOrg['year']  = pd.DatetimeIndex(inOrg['datetime']).year
        inOrg['month'] = pd.DatetimeIndex(inOrg['datetime']).month = abs('{:.5f}'.format).astype(float))*-1 # convert all coords to south
        inOrg.lon = abs(inOrg.lon.apply('{:.5f}'.format).astype(float))*-1 # convert all coords to west
        inOrg = inOrg.reset_index()
        print("The imported database contains",len(inOrg.index),"records.")
        return inOrg
    except IOError:
        print (infile,"is either unreadable or DNE.")
        raise SystemExit(0)
inOrg = read_inOrg(inOrgInfile,inOrgCol)
The imported database contains 9101 records.
index studyName eventNum cast bottle datetime stationID gridLine gridStation lat ... depth PO4 SiO4 NO2 NO3 NH4 Nitrite notes year month
0 7681 LMG00-01 65 NaN 23 2000-01-09 16:10:00 NaN 601.16 38.56 -64.93363 ... 3.0 1.38 82.85 0.15 20.18 0.51 -999 NaN 2000 1
1 7682 LMG00-01 65 NaN 21 2000-01-09 16:10:00 NaN 601.16 38.56 -64.93363 ... 3.5 1.38 82.97 0.15 20.04 0.59 -999 NaN 2000 1
2 7683 LMG00-01 65 NaN 19 2000-01-09 16:10:00 NaN 601.16 38.56 -64.93363 ... 6.3 1.37 83.42 0.16 20.08 0.46 -999 NaN 2000 1
3 7684 LMG00-01 65 NaN 17 2000-01-09 16:10:00 NaN 601.16 38.56 -64.93363 ... 10.1 1.38 83.53 0.16 20.41 0.45 -999 NaN 2000 1
4 7685 LMG00-01 65 NaN 15 2000-01-09 16:10:00 NaN 601.16 38.56 -64.93363 ... 12.3 1.44 83.32 0.15 21.00 0.81 -999 NaN 2000 1

5 rows × 22 columns

Let's take a peek at the spatial-temporal distribution of measurements in the grid...

In [9]:
def plt_interannual(df,zoom=False):
    # Initialize the plot
    map = draw_wAP(zoom)
    yearList    = np.array(df.year.unique())                # List of years from the LTER
    yearVar     = len(yearList)                             # How long has this been going on?
    colorSwatch = cm.YlGnBu(np.linspace(0, 1, yearVar))     # Colors for each year
    for i in range(yearVar):
        querry = df[df.year==yearList[i]]                   # Querry the database per year
        x1 = querry.lon.values
        y1 =
        x, y = map(x1, y1)                                  # Convert the lat and lon to a Basemap-readable format.
        map.scatter(x, y, color=colorSwatch[i], marker='o',s=24,label=np.asarray(querry.year)[0])
    plt.legend(scatterpoints=1,
               loc='lower right')
    plt.title('Interannual Distribution of Palmer LTER Inorganic Nutrients')

It looks like all stations are lined up, with the exception of some outside the station lines... But let's not make any assumptions just yet; let's take a closer look.

In [10]:
plt_interannual(inOrg,zoom=True)
We see that although these points are taken in close proximity, their coordinates aren't flush to the correct precision...

In [11]:
print("We also see that there are",len(,"and",len(inOrg.lon.unique()),"unique latitude and longitude coordinates,")
print("and that the stationID's are not standardized:\n",inOrg.stationID.unique()[:14])
print("We want to see",len(grid.stationID),"coordinate pairs and stationIDs...")
We also see that there are 903 and 940 unique latitude and longitude coordinates,
and that the stationID's are not standardized:
 [nan '500.06' '600.04' '600.06' '600.08' '600.1' '600.12' '600.14' '600.16'
 '600.18' '600.2' '600.22' '600.26' '500.22']
We want to see 120 coordinate pairs and stationIDs...

Standardizing the Dataset

Before we can do any analysis, we need to wrangle the data into a comparable format.
This can be accomplished by standardizing stationIDs and coordinates that fall within reasonable bounds of the idealized station coordinates. The function below compares each gritty coordinate pair with the idealized pairs and outputs a Pandas DF containing corrected coordinates (lat, lon) and stationID.

Input: Pandas dataframe (DF) with latitude and longitude columns
Output: Pandas dataframe containing standardized lat/lon coordinates and stationID
ex. updatedDF = cord2stationID(df)

In [12]:
latThresh = 0.0645
lonThresh = 0.0750

def cord2stationID(df):
    # Let's standardize the coordinates for SW (negative lat and lon).
    latIn = -1*abs(
    lonIn = -1*abs(df.lon)

    # How many points are there?
    x = len(latIn)

    # List to hold corrected stations.
    newStations = []
    # Iterate through each lat lon pair in the list
    for i in range(x):
        # Find standard latitude that is closest to observed.
        querry = newGrid[<=latIn[i]+latThresh]
        querry = querry[>=latIn[i]-latThresh]
        # If DNE, keep the observed coordinates and flag the station...
        if querry.empty:
            newStations.append({'lat':latIn[i],'lon':lonIn[i],'stationID':'notFound',
                                'gridLine':'notFound','stationLine':'notFound'})
        else:
        # If DE, look for standard longitude closest to observed.
            querry = querry[querry.lon<=lonIn[i]+lonThresh]
            querry = querry[querry.lon>=lonIn[i]-lonThresh]
            qLen = len(querry.index)

            # DNE...
            if querry.empty:
                newStations.append({'lat':latIn[i],'lon':lonIn[i],'stationID':'notFound',
                                    'gridLine':'notFound','stationLine':'notFound'})
            # If there is a perfect match
            elif (qLen==1):
                newStations.append({'lat':np.asarray(, dtype=float)[0],
                                    'lon':np.asarray(querry.lon, dtype=float)[0],
                                    'stationID':np.asarray(querry.stationID, dtype=object)[0],
                                    'gridLine':np.asarray(querry.gridLine, dtype=object)[0],
                                    'stationLine':np.asarray(querry.gridStation, dtype=object)[0]})
            # The list has multiple values...
            else:
                qLon = querry.lon.values
                qLat =
                qStation = querry.stationID.values
                qGridLine = querry.gridLine.values
                qStationLine = querry.gridStation.values
                minDist = 1e6
                mindex = 0
                # calculate which lat lon combo is closest to station values
                for j in range(qLen):
                    lonDiff = abs(lonIn[i]-qLon[j])
                    latDiff = abs(latIn[i]-qLat[j])
                    sumDist = lonDiff+latDiff
                    if sumDist < minDist:
                        minDist = sumDist
                        mindex = j
                # update the list with the nearest station
                newStations.append({'lat':qLat[mindex],'lon':qLon[mindex],
                                    'stationID':qStation[mindex],
                                    'gridLine':qGridLine[mindex],
                                    'stationLine':qStationLine[mindex]})
    # Station corrections done, update df with new values.
    df[['lat','lon','stationID','gridLine','stationLine']] = \
        pd.DataFrame(newStations)[['lat','lon','stationID','gridLine','stationLine']]
    return df
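As a self-contained sanity check, the snapping logic can be sketched with a hypothetical two-station grid (the IDs, coordinates, and function name below are made up for illustration; the thresholds are the ones defined above):

```python
# Toy version of the station matching: snap an observed coordinate to the
# nearest idealized station within the lat/lon thresholds, else 'notFound'.
stations = {'600.040': (-64.933, -64.355),   # hypothetical idealized stations
            '600.060': (-65.010, -64.870)}

def snap(lat, lon, latThresh=0.0645, lonThresh=0.0750):
    best, bestDist = 'notFound', float('inf')
    for sid, (sLat, sLon) in stations.items():
        dLat, dLon = abs(lat - sLat), abs(lon - sLon)
        # Candidate must fall within both thresholds; keep the closest one.
        if dLat <= latThresh and dLon <= lonThresh and dLat + dLon < bestDist:
            best, bestDist = sid, dLat + dLon
    return best

print(snap(-64.93, -64.36))   # '600.040'
print(snap(-60.00, -60.00))   # 'notFound'
```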

How does this new cleaned data compare to the old?

In [13]:
inOrgClean = cord2stationID(inOrg)
map = draw_wAP()
querry = inOrgClean[inOrgClean['stationID'].isin(outOfGrid)==False]
x1 = querry.lon.values
y1 =
x, y = map(x1, y1)
map.scatter(x, y, color='k', marker='o',s=24,linewidth=.08,alpha=.8,label='Confirmed Grid Station')

querry = inOrgClean[inOrgClean['stationID']=='notFound']
x2 = querry.lon.values
y2 =
x, y = map(x2, y2)
map.scatter(x, y, marker='o',color='yellow',s=24,linewidth=.08,alpha=.8,label='Unmarked Station')

gridBounds = ['000.220','100.220','200.220','300.220','400.220','500.220','600.220']
gridLabels = newGrid[newGrid['stationID'].isin(gridBounds)]
x3 = gridLabels.lon.values
y3 =
xS, yS = map(x3, y3)
labels = gridLabels.gridLine.values.astype(int)
for label, xpt, ypt in zip(labels, xS, yS):
    plt.text(xpt-84000, ypt+19000, label,size=14)
plt.legend(scatterpoints=1,
           loc='lower right')
plt.title('Standardized Grid')