The example code below shows how to download and convert the raw OPTAA data (recorded in binary format) into a usable form for further processing and analysis. The data are accessible from the OOI Raw Data Server. For this demonstration we use data from the Spring 2016 deployment of the Oregon Shelf Surface Mooring (CE02SHSM).
Before proceeding, you need to obtain a copy of the cgsn_parsers modules used below. Using the Anaconda Python distribution and the conda-forge channel, you can install these modules via:
# Via conda
conda install -c conda-forge cgsn_parsers
# Or via pip if not using Anaconda
pip install git+https://bitbucket.org/ooicgsn/cgsn-parsers
See the README in this repo for further information.
# Load required python modules
import requests
import numpy as np
import pandas as pd
import xarray as xr
from bokeh.plotting import figure, show
from bokeh.palettes import Colorblind as palette
from bokeh.io import output_notebook
import warnings
warnings.filterwarnings('ignore')
from cgsn_parsers.parsers.parse_optaa import Parser
# Coastal Endurance Oregon Shelf Surface Mooring NSIF (7 meters) OPTAA data for June 01, 2016 at 12:30 UTC
baseurl = "https://rawdata.oceanobservatories.org/files/CE02SHSM/D00003/cg_data/dcl27/optaa/"
fname = "20160601_123018.optaa.log"
# initialize the Parser object for OPTAA
optaa = Parser(baseurl + fname)
r = requests.get(optaa.infile, verify=True) # use verify=False for expired certificate
# Raw data is available in the raw data object for the parser class.
optaa.raw = r.content
len(optaa.raw), optaa.raw[:4] # print a snippet of the raw data
(448731, b'\xff\x00\xff\x00')
# The parser class method parse_data converts the raw data into a parsed bunch class data object
optaa.parse_data()
optaa.data.keys() # print the resulting dictionary keys in the data object
dict_keys(['time', 'a_signal_dark', 'elapsed_run_time', 'external_temp_raw', 'pressure_raw', 'num_wavelengths', 'c_signal_dark', 'c_signal_raw', 'serial_number', 'internal_temp_raw', 'a_reference_dark', 'c_reference_dark', 'a_reference_raw', 'c_reference_raw', 'a_signal_raw'])
Almost every dataset will include multiple sources of timing data. In this case, because the OPTAA reports the data in binary format, we use the DCL recorded file start time added to the relative elapsed run time timestamp in the OPTAA data to create the Epoch (seconds since 1970-01-01) time used in these records.
From here, you can save the data to disk as a JSON formatted data file if you so desire. We use this method to store the parsed data files locally for all further processing.
# write the resulting Bunch object via the toJSON method to a JSON
# formatted data file (note, no pretty-printing, keeping things compact)
outfile = fname.replace('.log', '.json')  # example output name; set to your preferred path
with open(outfile, 'w') as f:
    f.write(optaa.data.toJSON())
We are going to proceed, instead, by converting the data into a pandas dataframe and then an xarray dataset for the following steps.
# Convert the data into a pandas dataframe and then an xarray dataset for further analysis.
df = pd.DataFrame(optaa.data)
df['time'] = pd.to_datetime(df.time, unit='s') # use the time variable to set the index
df.set_index('time', drop=False, inplace=True)
ds = df.to_xarray()
# add the wavelength number as a coordinate and dimension to the dataset
ds.coords['wavelength_index'] = np.arange(df.num_wavelengths.values[0])
ds.update({'a_reference_raw': (('time', 'wavelength_index'), np.vstack(df.a_reference_raw.values)),
'a_signal_raw': (('time', 'wavelength_index'), np.vstack(df.a_signal_raw.values)),
'c_reference_raw': (('time', 'wavelength_index'), np.vstack(df.c_reference_raw.values)),
'c_signal_raw': (('time', 'wavelength_index'), np.vstack(df.c_signal_raw.values))})
ds
<xarray.Dataset>
Dimensions:            (time: 656, wavelength_index: 81)
Coordinates:
  * time               (time) datetime64[ns] 2016-06-01T12:30:18 ...
  * wavelength_index   (wavelength_index) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 ...
Data variables:
    a_reference_dark   (time) int64 462 462 463 463 463 463 463 464 465 465 ...
    a_reference_raw    (time, wavelength_index) int32 971 1099 1238 1382 ...
    a_signal_dark      (time) int64 639 640 641 641 642 642 642 643 644 645 ...
    a_signal_raw       (time, wavelength_index) int32 1430 1670 1932 2206 ...
    c_reference_dark   (time) int64 468 468 469 469 469 469 469 470 470 470 ...
    c_reference_raw    (time, wavelength_index) int32 745 854 962 1079 1220 ...
    c_signal_dark      (time) int64 705 706 706 706 706 707 707 707 708 708 ...
    c_signal_raw       (time, wavelength_index) int32 856 1018 1195 1390 ...
    elapsed_run_time   (time) int64 10374 10623 10873 11125 11377 11627 ...
    external_temp_raw  (time) int64 39734 39737 39738 39734 39735 39739 ...
    internal_temp_raw  (time) int64 50678 50670 50674 50675 50672 50671 ...
    num_wavelengths    (time) int64 81 81 81 81 81 81 81 81 81 81 81 81 81 ...
    pressure_raw       (time) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
    serial_number      (time) int64 267 267 267 267 267 267 267 267 267 267 ...
# Provide a simple plot of a burst's worth of data
output_notebook()
# make a list of our columns
cols = ['a_reference_raw', 'a_signal_raw', 'c_reference_raw', 'c_signal_raw']
colors = palette[4]
# make the figure
p = figure(title="Raw OPTAA Data -- Burst", width=850, height=500)
p.xaxis.axis_label = 'Wavelength Number'
p.yaxis.axis_label = 'Counts'
# loop through our columns and colours, plotting every 10th record in the burst
for c, cname in zip(colors, cols):
    for i in range(0, len(ds.time), 10):
        p.line(ds.wavelength_index.values, ds[cname].values[i, :], color=c, legend_label=cname)
p.toolbar_location = 'above'
show(p)
# The OPTAA data is collected hourly in burst mode (~1 Hz data sampled for 2-3 minutes). We need to take a
# median average of each burst to clean up variability in the data created by the movement of the NSIF relative
# to the water column, and to make the final data files smaller and easier to work with.
burst = ds.resample(time='30Min').median()
# make the figure
p = figure(title="Raw OPTAA Data -- Averaged", width=850, height=500)
p.xaxis.axis_label = 'Wavelength Number'
p.yaxis.axis_label = 'Counts'
for c, cname in zip(colors, cols):
    p.line(burst.wavelength_index.values, burst[cname].values[0, :], color=c, legend_label=cname)
p.toolbar_location = 'above'
show(p)
The following two functions, and the implementation below, combine the work from the examples above into a simple routine we can use to access, download, and initially process the OPTAA data for the month of June (change the example regex to select whatever data you are after).
# Add some additional modules
from bs4 import BeautifulSoup
import re
# Function to create a list of the data files of interest on the raw data server
def list_files(url, tag=''):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    pattern = re.compile(str(tag))
    return [node.get('href') for node in soup.find_all('a', string=pattern)]
# Function to download a file, parse it, apply median-averaging to the bursts and create a final dataset.
def process_file(file):
    # Initialize the parser, download and parse the data file
    optaa = Parser(baseurl + file)
    r = requests.get(optaa.infile, verify=True)
    optaa.raw = r.content
    optaa.parse_data()
    # Convert the parsed data to a dataframe and from there to an xarray dataset
    df = pd.DataFrame(optaa.data)
    df['time'] = pd.to_datetime(df.time, unit='s')  # use the time variable to set the index
    df.set_index('time', drop=False, inplace=True)
    ds = df.to_xarray()
    # add the wavelength number as a coordinate and dimension to the dataset, and reset the arrays
    ds.coords['wavelength_index'] = np.arange(df.num_wavelengths.values[0])
    ds.update({'a_reference_raw': (('time', 'wavelength_index'), np.vstack(df.a_reference_raw.values)),
               'a_signal_raw': (('time', 'wavelength_index'), np.vstack(df.a_signal_raw.values)),
               'c_reference_raw': (('time', 'wavelength_index'), np.vstack(df.c_reference_raw.values)),
               'c_signal_raw': (('time', 'wavelength_index'), np.vstack(df.c_signal_raw.values))})
    # apply a median average to the burst
    burst = ds.resample(time='30Min').median()
    return burst
# Create a list of the files from June using a simple regex as tag to discriminate the files
files = list_files(baseurl, '201606[0-9]{2}_[0-9]{6}.optaa.log')
# Process the data files for June and concatenate into a single dataset
frames = [process_file(f) for f in files]
june = xr.concat(frames, 'time')
# Plot the burst averaged data for the month of June 2016 (showing just 1 wavelength, nominally around 676 nm).
p = figure(x_axis_type="datetime", title="Raw OPTAA Data -- June 2016", width=800, height=500)
p.xaxis.axis_label = 'Date and Time'
p.yaxis.axis_label = 'Counts'
# loop through our columns and colours
for c, cname in zip(colors, cols):
    p.line(june.time.values, june[cname].values[:, 69], color=c, legend_label=cname)
show(p)
At this point, you have the option to save the data, or to apply the processing routines available in pyseas and cgsn_processing to convert the data from raw engineering units to scientific units using the calibration coefficients that are available online.
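To give a feel for the shape of that conversion, here is an illustrative sketch only (not the pyseas or cgsn_processing implementation): it applies the general AC-S-style relationship between dark-corrected signal/reference counts and a per-meter value, with a hypothetical per-wavelength clean-water `offset` standing in for the device calibration, and it omits the temperature, salinity, and scattering corrections the real routines perform:

```python
import numpy as np

def counts_to_per_meter(sig, sig_dark, ref, ref_dark, offset=0.0, path_length=0.25):
    """Illustrative conversion of raw OPTAA counts to an uncorrected
    absorption/attenuation value in 1/m. `offset` stands in for the
    per-wavelength clean-water calibration value (hypothetical here);
    path_length assumes the nominal 25 cm AC-S flow path."""
    sig = np.asarray(sig, dtype=float) - sig_dark  # dark-corrected signal counts
    ref = np.asarray(ref, dtype=float) - ref_dark  # dark-corrected reference counts
    return offset - np.log(sig / ref) / path_length

# with signal equal to reference and no offset, the result is 0 1/m;
# a weaker signal relative to the reference yields a larger value
counts_to_per_meter(1000, 0, 1000, 0)
```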
june['time'] = june.time.values.astype(float) / 10.0**9 # Convert from datetime object in nanoseconds to seconds since 1970
june.to_netcdf('C:\\ooi\\ce02shsm_june2016_raw_optaa.nc')