The example code below shows a pathway for downloading the raw SPKIR data (recorded in a mixed ASCII/binary format) and converting it into a usable form for further processing and analysis. The data is accessible from the OOI Raw Data Server. For this demonstration we use data from the Spring 2016 deployment of the Oregon Shelf Surface Mooring (CE02SHSM).
Before proceeding, you need to obtain a copy of the cgsn_parsers modules used below. Using the Anaconda Python distribution and the conda-forge channel, you can install these modules via:
# Via conda
conda install -c conda-forge cgsn_parsers
# Or via pip if not using Anaconda
pip install git+https://bitbucket.org/ooicgsn/cgsn-parsers
See the README in this repo for further information.
# Load required python modules
import requests
import numpy as np
import pandas as pd
import xarray as xr
from bokeh.plotting import figure, show
from bokeh.palettes import Colorblind as palette
from bokeh.io import output_notebook
import warnings
warnings.filterwarnings('ignore')
# Load the parser for the SPKIR data. Reads in the DCL logged raw data file and converts that data to a Bunch class data object.
from cgsn_parsers.parsers.parse_spkir import Parser
# Coastal Endurance Oregon Shelf Surface Mooring NSIF (7 meters) SPKIR data from June 1, 2016
baseurl = "https://rawdata.oceanobservatories.org/files/CE02SHSM/D00003/cg_data/dcl26/spkir/"
fname = "20160601.spkir.log"
# initialize the Parser object for SPKIR
spkir = Parser(baseurl + fname)
r = requests.get(spkir.infile, verify=True)
# Raw data is available in the raw data object for the parser class.
spkir.raw = r.content
len(spkir.raw), spkir.raw[:50]
# Note, SPKIR data is recorded in a mixed ASCII/binary format. We read this data in differently than we would for an
# instrument that records pure ASCII.
(1370784, b'2016/06/01 00:00:13.348 [spkir:DLOGP8]:Instrument ')
# The parser class method parse_data converts the raw data into a parsed bunch class data object
spkir.parse_data()
spkir.data.keys()
dict_keys(['date_time_string', 'frame_counter', 'serial_number', 'internal_temperature', 'analog_rail_voltage', 'raw_channels', 'input_voltage', 'time', 'timer', 'sample_delay'])
Almost every EA dataset includes multiple sources of timing data. We always use the data logger date/time string (dcl_date_time_string), converted to an Epoch time stamp (seconds since 1970-01-01 UTC), as this time source is directly tied to GPS time. The converted Epoch time stamp is called 'time' in all of the datasets created by the cgsn_parsers. The source date/time string, and any other time sources included in the dataset, are also provided in the raw format recorded in the data file.
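As a minimal illustration of that conversion (assuming the date/time string format shown in the raw output above), the Python standard library produces the same kind of Epoch time stamp:

```python
from datetime import datetime, timezone

# Example DCL date/time string in the format used in these log files
dcl_string = '2016/06/01 00:00:14.302'

# Parse the string and attach the UTC timezone (DCL clocks are tied to GPS/UTC)
dt = datetime.strptime(dcl_string, '%Y/%m/%d %H:%M:%S.%f').replace(tzinfo=timezone.utc)

# Epoch time stamp: seconds since 1970-01-01 UTC, matching the 'time' variable
epoch = dt.timestamp()
print(epoch)  # 1464739214.302
```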
From here, you can save the data to disk as a JSON formatted data file if you so desire. We use this method to store the parsed data files locally for all further processing.
# write the resulting Bunch object via the toJSON method to a JSON
# formatted data file (note, no pretty-printing, keeping things compact)
outfile = '20160601.spkir.json'  # example output file name
with open(outfile, 'w') as f:
    f.write(spkir.data.toJSON())
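A saved file like this can later be reloaded with the standard json module. This sketch round-trips a small stand-in dictionary (hypothetical values, same key structure as spkir.data) to show the pattern:

```python
import json
import pandas as pd

# A minimal stand-in for the parsed spkir.data object (hypothetical values)
data = {'time': [1464739214.302, 1464739215.270],
        'internal_temperature': [127, 127]}

# Write compact JSON to disk, then reload it for further processing
with open('20160601.spkir.json', 'w') as f:
    json.dump(data, f)

with open('20160601.spkir.json') as f:
    reloaded = json.load(f)

# The keys mirror the parsed data object, so it drops straight into a DataFrame
df = pd.DataFrame(reloaded)
print(df.shape)  # (2, 2)
```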
We are going to proceed, instead, by converting the data into a pandas dataframe and then an xarray dataset for the following steps.
# Convert the data into a pandas DataFrame for further analysis.
df = pd.DataFrame(spkir.data)
df['time'] = pd.to_datetime(df.time, unit='s')
df.set_index('time', drop=False, inplace=True)  # use the time variable to set the index
ds = df.to_xarray()
ds.coords['spectra'] = [412, 444, 490, 510, 555, 620, 683]
ds.update({'raw_channels': (('time', 'spectra'), np.vstack(df.raw_channels.values))})
ds
<xarray.Dataset>
Dimensions:               (spectra: 7, time: 16160)
Coordinates:
  * time                  (time) datetime64[ns] 2016-06-01T00:00:14.302000 ...
  * spectra               (spectra) int32 412 444 490 510 555 620 683
Data variables:
    analog_rail_voltage   (time) int64 177 178 177 177 178 178 178 178 177 ...
    date_time_string      (time) object '2016/06/01 00:00:14.302' ...
    frame_counter         (time) int64 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
    input_voltage         (time) int64 283 274 280 284 275 276 285 285 276 ...
    internal_temperature  (time) int64 127 127 127 127 127 127 127 127 127 ...
    raw_channels          (time, spectra) int64 2157272896 2160208704 ...
    sample_delay          (time) int64 -133 -133 -133 -133 -133 -133 -133 ...
    serial_number         (time) int64 297 297 297 297 297 297 297 297 297 ...
    timer                 (time) float64 9.01 9.96 10.93 12.02 12.99 13.95 ...
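One payoff of attaching the spectra coordinate is label-based selection: a single wavelength can be pulled out as a time series by its value in nanometers. A minimal sketch on a synthetic dataset mirroring the (time, spectra) structure above:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Build a tiny dataset that mirrors the (time, spectra) structure of the SPKIR data
times = pd.date_range('2016-06-01', periods=4, freq='s')
counts = np.arange(28).reshape(4, 7)
ds = xr.Dataset({'raw_channels': (('time', 'spectra'), counts)},
                coords={'time': times,
                        'spectra': [412, 444, 490, 510, 555, 620, 683]})

# Label-based selection pulls a single wavelength out as a time series
band_683 = ds['raw_channels'].sel(spectra=683)
print(band_683.values)  # [ 6 13 20 27]
```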
# Provide a simple plot of a day's worth of data
output_notebook()
# make a list of our wavelengths for the legend and set the colors
cols = ['412 nm', '444 nm', '490 nm', '510 nm', '555 nm', '620 nm', '683 nm']
colors = palette[7]
# make the figure
p = figure(x_axis_type="datetime", title="Raw SPKIR Data -- Bursts", width=850, height=500)
p.xaxis.axis_label = 'Date and Time'
p.yaxis.axis_label = 'Counts'
for i in range(7):
    p.line(ds.time.values, ds['raw_channels'][:, i].values, color=colors[i], legend_label=cols[i])
p.toolbar_location = 'above'
show(p)
# The SPKIR data is collected in a burst mode (~1 Hz data sampled for 3 minutes every 15 minutes). We're going to take
# a median average of each burst to clean up variability in the data created by the movement of the NSIF relative to the
# water column and to make the ultimate data files smaller and easier to work with.
burst = ds.resample(time='15min').median()
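To see what the 15-minute median does to burst data, here is a sketch with two simulated 1 Hz bursts, each containing a spike that the median rejects:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two simulated 3-minute bursts at 1 Hz, 15 minutes apart, each with one noisy spike
t0 = pd.date_range('2016-06-01 00:00', periods=180, freq='s')
t1 = pd.date_range('2016-06-01 00:15', periods=180, freq='s')
values = np.r_[np.full(180, 100.0), np.full(180, 200.0)]
values[[10, 190]] = 1e6  # spikes the median is insensitive to

ds = xr.Dataset({'counts': ('time', values)},
                coords={'time': t0.append(t1)})

# Median-average each 15-minute bin, collapsing every burst to a single value
burst = ds.resample(time='15min').median()
print(burst['counts'].values)  # [100. 200.]
```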
# make the figure
p = figure(x_axis_type="datetime", title="Raw SPKIR Data -- Averaged", width=850, height=500)
p.xaxis.axis_label = 'Date and Time'
p.yaxis.axis_label = 'Counts'
for i in range(7):
    p.line(burst.time.values, burst['raw_channels'][:, i].values, color=colors[i], legend_label=cols[i])
p.toolbar_location = 'above'
show(p)
The two functions below combine the work from the examples above into a simple routine we can use to access, download, and initially process the SPKIR data for the month of June (change the example regex to get whatever data you are after).
# Add some additional modules
from bs4 import BeautifulSoup
import re
# Function to create a list of the data files of interest on the raw data server
def list_files(url, tag=''):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    pattern = re.compile(str(tag))
    return [node.get('href') for node in soup.find_all('a', string=pattern)]
# Function to download a file, parse it, apply median-averaging to the bursts and create a final dataframe.
def process_file(file):
    # Initialize the parser, download and parse the data file
    spkir = Parser(baseurl + file)
    r = requests.get(spkir.infile, verify=True)
    spkir.raw = r.content
    spkir.parse_data()
    # Convert the parsed data to a DataFrame
    df = pd.DataFrame(spkir.data)
    df['time'] = pd.to_datetime(df.time, unit='s')
    df.set_index('time', drop=False, inplace=True)  # use the time variable to set the index
    ds = df.to_xarray()
    ds.coords['spectra'] = [412, 444, 490, 510, 555, 620, 683]
    ds.update({'raw_channels': (('time', 'spectra'), np.vstack(df.raw_channels.values))})
    # calculate the burst averages
    burst = ds.resample(time='15min').median()
    return burst
# Create a list of the files from June 2016 using a simple regex as a tag to discriminate the files
files = list_files(baseurl, '201606[0-9]{2}.spkir.log')
# Process the data files for June and concatenate into a single dataset
frames = [process_file(f) for f in files]
june = xr.concat(frames, 'time')
# Plot the burst averaged data for the month of June 2016.
# make the figure
p = figure(x_axis_type="datetime", title="Raw SPKIR Data -- June 2016", width=850, height=500)
p.xaxis.axis_label = 'Date and Time'
p.yaxis.axis_label = 'Counts'
for i in range(7):
    p.line(june.time.values, june['raw_channels'][:, i].values, color=colors[i], legend_label=cols[i])
p.toolbar_location = 'above'
show(p)
At this point, you have the option to save the data, or to apply the processing routines available in pyseas and cgsn_processing to convert the data from raw engineering units to scientific units using the calibration coefficients that are available online.
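Those processing routines are not reproduced here, but as a rough illustration of the idea, a Satlantic OCR-507 style conversion applies, per channel, a dark offset, a scale factor, and an immersion coefficient to the raw counts. The coefficient values below are hypothetical placeholders; real values come from the instrument's calibration file.

```python
import numpy as np

# Hypothetical calibration coefficients for one channel -- real values come
# from the instrument's calibration file, not from this example
a0 = 2147483647.0   # dark offset (counts)
a1 = 2.0e-7         # scale factor
im = 1.35           # immersion coefficient for in-water deployments

raw_counts = np.array([2157272896, 2160208704], dtype=float)

# Satlantic-style conversion from raw counts to irradiance: subtract the dark
# offset, scale, and correct for immersion
ed = im * a1 * (raw_counts - a0)
print(ed)
```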
june['time'] = june.time.values.astype(float) / 10.0**9 # Convert from datetime object in nanoseconds to seconds since 1970
june.to_netcdf('C:\\ooi\\ce02shsm_june2016_raw_spkir.nc')