Patrick BROCKMANN
Software engineer at LSCE (Climate and Environment Sciences Laboratory)
Date: 18 November 2015
Release: 0.7
The Linked Paleo Data (LiPD) container, based on the Linked Data JSON (JSON-LD) format, is a practical solution to the problem of organizing and storing hierarchical paleoclimate data in a generalizable schema. It is an important step toward standardizing the representation and linkage of diverse paleoclimate datasets.
In this IPython notebook, I present experimental converters to interact with the LiPD container using ordinary spreadsheets. The motivation for this method comes from the fact that the paleoclimate community mainly uses spreadsheets, not JSON-based formats, to edit and store the data and metadata of their measurements. What is missing is a way to convert such spreadsheet-based data to the LiPD format and vice versa.
Working directly with LiPD has two other disadvantages:
Therefore, I propose to stick with the use of spreadsheets but standardize them into a structured spreadsheet where the data and the metadata are stored in two separate worksheets of the same spreadsheet document. The dot notation is used to represent the hierarchical nature of the metadata attributes. Following the nomenclature of LiPD, I call this structured spreadsheet PDS for Paleo Data Spreadsheet.
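The dot-notation convention can be sketched as a pair of flatten/unflatten helpers. This is an illustration of the idea, not PDSlib code; the function names `flatten` and `unflatten` are my own.

```python
def flatten(d, prefix=''):
    """Flatten a nested metadata dict into dot-notation keys,
    as stored in the Metadata worksheet of a PDS."""
    flat = {}
    for key, value in d.items():
        name = prefix + '.' + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def unflatten(flat):
    """Rebuild the nested hierarchy from dot-notation keys."""
    nested = {}
    for name, value in flat.items():
        parts = name.split('.')
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested

meta = {'depth': {'units': 'cm', 'parameterType': 'measured'}}
flat = flatten(meta)
# flat == {'depth.units': 'cm', 'depth.parameterType': 'measured'}
assert unflatten(flat) == meta
```

The two helpers are inverses as long as no attribute name itself contains a dot, which is the implicit constraint of the convention.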
With a PDS, users can directly edit their data in an ordinary spreadsheet program like Excel or OpenOffice and later convert them to LiPD, which is a good container for storing data in a document database like MongoDB (since it uses JSON). In addition, I have implemented converters to transform a PDS into Python pandas dataframes, which are convenient for subsequent data analysis in, e.g., an IPython notebook.
The Data worksheet:
The Metadata worksheet:
Notes:
1. Currently, pandas does not yet support reading and writing ODS (OpenDocument Spreadsheet) files, but this feature has been widely requested and should be feasible soon (pandas issue #2311).
2. The LiPD container refers to a headerless CSV file where the data are stored. Each column is therefore referenced only by a column number, which is poorly self-documenting and could lead to confusion. It would be safer and clearer to use the first row as a header that names the columns by their parameter names.
3. The LiPD container can contain a list object for values, e.g. [{ "author": [{"name" : "N1"}, {"name" : "N2"}, {"name" : "N3"}] }]. It would be simpler to disable this possibility and allow only unique values.
4. The spreadsheet cells must be formatted correctly, i.e. number cells specified as numbers and not text, and likewise for dates.
5. A compliance checker needs to be built to verify that input files conform to the PDS structure, similar to the netCDF CF-checker.
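Such a compliance checker could start from something as simple as the sketch below. The function name `check_PDS` and the worksheet names `Data` and `Metadata` are assumptions of mine, not part of PDSlib; the worksheets would typically be loaded with `pd.read_excel(filename, sheet_name=None)` in a recent pandas, which returns a dict of DataFrames.

```python
import pandas as pd

def check_PDS(sheets):
    """Return a list of problems found in a PDS spreadsheet,
    given as a dict of worksheet DataFrames
    (e.g. pd.read_excel(filename, sheet_name=None))."""
    problems = []
    # The PDS convention: one Data and one Metadata worksheet.
    for name in ('Data', 'Metadata'):
        if name not in sheets:
            problems.append('missing worksheet: %s' % name)
    if problems:
        return problems
    # Every metadata attribute should start with a data column name;
    # otherwise it is treated as a global attribute.
    columns = set(sheets['Data'].columns)
    for attribute in sheets['Metadata']['Attribute']:
        root = str(attribute).split('.')[0]
        if root not in columns:
            problems.append('attribute %r refers to no data column '
                            '(treated as a global attribute)' % attribute)
    return problems
```

A real checker would also validate cell formats (numbers stored as numbers, dates as dates), mirroring the note above.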
The PAGES (Past Global Changes) working group called the 2k Network proposes to collect data using a PAGES2k/NOAA metadata template.
This template is very detailed but not easily convertible to other structures such as pandas dataframes or the LiPD container.
PDSlib is a python module that offers two-way converters for different structures: PDS, pandas dataframes (df), and LiPD container.
dfD, dfM = PDSlib.PDS_to_df('PDS_01.xls')
dfD
 | depth | year | temperature |
---|---|---|---|
0 | 0 | 2015 | 14.34 |
1 | 100 | 2015 | 12.70 |
2 | 200 | 2015 | 10.30 |
3 | 400 | 2015 | 10.20 |
4 | 450 | 2015 | NaN |
5 | 500 | 2015 | 10.10 |
6 | 1000 | 2015 | 10.10 |
dfM
 | Attribute | Value |
---|---|---|
0 | depth.parameterType | measured |
1 | depth.units | cm |
2 | depth.notes | depth refers to top of sample |
3 | depth.datatype | csvw:NumericFormat |
4 | depth.description | depth below ocean floor |
5 | year.parameterType | inferred |
6 | year.units | AD |
7 | year.method | linear interpolation |
8 | year.datatype | csvw:NumericFormat |
9 | year.description | calendar year AD |
10 | temperature.parameterType | inferred |
11 | temperature.description | sea-surface temperature inferred from Mg/Ca ra... |
12 | temperature.climateInterpretation.seasonality | MJJ |
13 | temperature.climateInterpretation.parameter | T |
14 | temperature.climateInterpretation.parameterDetail | seaSurface |
15 | temperature.climateInterpretation.interpDirection | positive |
16 | temperature.climateInterpretation.basis | Mg/Ca calibration to SST |
17 | temperature.calibration.equation | BAR2005: Mg/Ca=0.794*exp(0.10*SST) |
18 | temperature.calibration.uncertainty | 1.3 |
19 | temperature.calibration.reference | Barker et al., (2005), Thornalley et al., (2009) |
20 | temperature.datatype | csvw:NumericFormat |
21 | temperature.material | foramifera carbonate |
22 | temperature.proxy | Mg/Ca |
23 | temperature.units | deg C |
24 | paleoDataTableName | data |
25 | filename | atlantic0220Thornalley2009.csv |
a_LiPD = PDSlib.df_to_LiPD(dfD, dfM, verbose=True)
import json
print(json.dumps(a_LiPD, sort_keys=False, indent=4, separators=(',', ': ')))
{ "depth": { "parameterType": "measured", "units": "cm", "notes": "depth refers to top of sample", "datatype": "csvw:NumericFormat", "description": "depth below ocean floor" }, "year": { "parameterType": "inferred", "units": "AD", "datatype": "csvw:NumericFormat", "method": "linear interpolation", "description": "calendar year AD" }, "temperature": { "parameterType": "inferred", "description": "sea-surface temperature inferred from Mg/Ca ratios", "climateInterpretation": { "basis": "Mg/Ca calibration to SST", "parameter": "T", "parameterDetail": "seaSurface", "interpDirection": "positive", "seasonality": "MJJ" }, "calibration": { "equation": "BAR2005: Mg/Ca=0.794*exp(0.10*SST)", "uncertainty": 1.3, "reference": "Barker et al., (2005), Thornalley et al., (2009)" }, "datatype": "csvw:NumericFormat", "material": "foramifera carbonate", "proxy": "Mg/Ca", "units": "deg C" }, "paleoDataTableName": "data", "filename": "atlantic0220Thornalley2009.csv" }
This example comes from the article on the Linked Data JSON (JSON-LD) format, but some keys have been changed from column numbers to parameter names, in accordance with note #2.
dfD, dfM = PDSlib.LiPD_to_df(a_LiPD)
dfD
 | depth | year | temperature |
---|---|---|---|
0 | 0 | 2015 | 14.34 |
1 | 100 | 2015 | 12.70 |
2 | 200 | 2015 | 10.30 |
3 | 400 | 2015 | 10.20 |
4 | 450 | 2015 | NaN |
5 | 500 | 2015 | 10.10 |
6 | 1000 | 2015 | 10.10 |
dfM
 | Attribute | Value |
---|---|---|
0 | depth.parameterType | measured |
1 | depth.units | cm |
2 | depth.notes | depth refers to top of sample |
3 | depth.datatype | csvw:NumericFormat |
4 | depth.description | depth below ocean floor |
5 | year.parameterType | inferred |
6 | year.units | AD |
7 | year.datatype | csvw:NumericFormat |
8 | year.method | linear interpolation |
9 | year.description | calendar year AD |
10 | temperature.parameterType | inferred |
11 | temperature.description | sea-surface temperature inferred from Mg/Ca ra... |
12 | temperature.climateInterpretation.basis | Mg/Ca calibration to SST |
13 | temperature.climateInterpretation.parameter | T |
14 | temperature.climateInterpretation.parameterDetail | seaSurface |
15 | temperature.climateInterpretation.interpDirection | positive |
16 | temperature.climateInterpretation.seasonality | MJJ |
17 | temperature.calibration.equation | BAR2005: Mg/Ca=0.794*exp(0.10*SST) |
18 | temperature.calibration.uncertainty | 1.3 |
19 | temperature.calibration.reference | Barker et al., (2005), Thornalley et al., (2009) |
20 | temperature.datatype | csvw:NumericFormat |
21 | temperature.material | foramifera carbonate |
22 | temperature.proxy | Mg/Ca |
23 | temperature.units | deg C |
24 | paleoDataTableName | data |
25 | filename | atlantic0220Thornalley2009.csv |
PDSlib.df_to_PDS(dfD, dfM, 'PDS_02.xls')
PDS_01.xls is identical to PDS_02.xls, which shows that the round-trip conversion is lossless.
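The identity of the two files could also be checked programmatically. Below is a minimal sketch under the assumption that each file has been loaded as a dict of worksheet DataFrames with `pd.read_excel(name, sheet_name=None)`; the helper `same_PDS` is hypothetical, not part of PDSlib.

```python
import pandas as pd

def same_PDS(sheets1, sheets2):
    """True if two PDS spreadsheets, given as dicts of worksheet
    DataFrames, hold the same worksheets with the same values."""
    if set(sheets1) != set(sheets2):
        return False
    # DataFrame.equals also treats NaN values in the same
    # locations as equal, which plain == comparison would not.
    return all(sheets1[name].equals(sheets2[name]) for name in sheets1)
```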
PDS ===> LiPD ===> PDS
a = PDSlib.PDS_to_LiPD('PDS_Uemura.xls', verbose=True)
import json
print(json.dumps(a, sort_keys=False, indent=4, separators=(',', ': ')))
df_to_LiPD: Warning: at line 6 : Publication not present in Data worksheet => considered as global attribut
df_to_LiPD: Warning: at line 7 : Publication not present in Data worksheet => considered as global attribut
df_to_LiPD: Warning: at line 8 : Location not present in Data worksheet => considered as global attribut
df_to_LiPD: Warning: at line 9 : Location not present in Data worksheet => considered as global attribut
df_to_LiPD: Warning: at line 10 : Publication not present in Data worksheet => considered as global attribut
df_to_LiPD: Warning: at line 11 : Publication not present in Data worksheet => considered as global attribut
df_to_LiPD: Warning: at line 13 : Location not present in Data worksheet => considered as global attribut
df_to_LiPD: Warning: at line 14 : Location not present in Data worksheet => considered as global attribut
{
    "Depth": {
        "units": "m"
    },
    "Age": {
        "units": "ka"
    },
    "18O": {
        "units": "per mil"
    },
    "dD": {
        "units": "per mil"
    },
    "d-excess": {
        "units": "per mil"
    },
    "Publication": {
        "date": 2012,
        "DOI": "10.5194/cp-8-1109-2012",
        "citation": "Uemura R., Masson-Delmotte V., Jouzel J., Landais A., Motoyama H., and Stenni B.: Ranges of moisture-source temperature estimated from Antarctic ice cores stable isotope records over glacial-interglacial cycles, Climate of the Past, 8, 1109-1125, 10.5194/cp-8-1109-2012, 2012.",
        "pubString": "Uemura et al., 2012"
    },
    "Location": {
        "siteName": "Dome Fuji",
        "type": "point",
        "coordinates": "-77.31667, 39.6666, 3810",
        "zone": "Antarctica"
    },
    "archiveType": "ice"
}
PDSlib.LiPD_to_PDS(a, 'PDS_04.xls')
dfD, dfM = PDSlib.PDS_to_df('PDS_Uemura.xls')
dfM.sort_values(by='Attribute')
 | Attribute | Value |
---|---|---|
2 | 18O.units | per mil |
1 | Age.units | ka |
0 | Depth.units | m |
12 | Location.coordinates | -77.31667, 39.6666, 3810 |
7 | Location.siteName | Dome Fuji |
13 | Location.type | point |
8 | Location.zone | Antarctica |
5 | Publication.DOI | 10.5194/cp-8-1109-2012 |
9 | Publication.citation | Uemura R., Masson-Delmotte V., Jouzel J., Land... |
6 | Publication.date | 2012 |
10 | Publication.pubString | Uemura et al., 2012 |
11 | archiveType | ice |
4 | d-excess.units | per mil |
3 | dD.units | per mil |
dfD.columns
Index([u'Depth', u'Age', u'18O', u'dD', u'd-excess'], dtype='object')
PDSlib.df_plot(dfD, 1, [2,3,4])
<bokeh.models.widgets.panels.Tabs at 0x7fe408b34b50>
PDSlib.PDS_to_html("PDS_Uemura.xls", 1, [2,3,4], "PDS_Uemura.html", width=600, height=400)
a_LiPD = [{
    "author": [
        {"name": "Thornalley, D.J.R"},
        {"name": "Elderfield, H."},
        {"name": "McCave, N"}
    ],
    "type": "article",
    "identifier": {
        "type": "doi",
        "id": "10.1038/nature07717",
        "url": "http://dx.doi.org/10.1038/nature07717"
    },
    "pubYear": 2009
}]
dfD, dfM = PDSlib.LiPD_to_df(a_LiPD)
dfM
 | Attribute | Value |
---|---|---|
0 | pubYear | 2009 |
1 | identifier.url | http://dx.doi.org/10.1038/nature07717 |
2 | identifier.type | doi |
3 | identifier.id | 10.1038/nature07717 |
4 | type | article |
5 | author.0.name | Thornalley, D.J.R |
6 | author.1.name | Elderfield, H. |
7 | author.2.name | McCave, N |
a_LiPD = { "type": "FeatureCollection",
"features": [
{ "type": "Feature",
"geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
"properties": {"prop0": "value0"}
},
{ "type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": [
[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
]
},
"properties": {
"prop0": "value0",
"prop1": 0.0
}
},
{ "type": "Feature",
"geometry": {
"type": "Polygon",
"coordinates": [
[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
[100.0, 1.0], [100.0, 0.0] ]
]
},
"properties": {
"prop0": "value0",
"prop1": {"this": "that"}
}
}
]
}
dfD, dfM = PDSlib.LiPD_to_df(a_LiPD)
dfM
 | Attribute | Value |
---|---|---|
0 | type | FeatureCollection |
1 | features.0.geometry.type | Point |
2 | features.0.geometry.coordinates | [102.0, 0.5] |
3 | features.0.type | Feature |
4 | features.0.properties.prop0 | value0 |
5 | features.1.geometry.type | LineString |
6 | features.1.geometry.coordinates | [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [10... |
7 | features.1.type | Feature |
8 | features.1.properties.prop0 | value0 |
9 | features.1.properties.prop1 | 0 |
10 | features.2.geometry.type | Polygon |
11 | features.2.geometry.coordinates | [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [1... |
12 | features.2.type | Feature |
13 | features.2.properties.prop0 | value0 |
14 | features.2.properties.prop1.this | that |
These last two conversion examples are not bidirectional, since list items in LiPD containers are flattened using an index number for each item when PDSlib.LiPD_to_df is used.
Should lists be allowed in LiPD? If so, more work is required to recreate a list from numbered items when using PDSlib.df_to_LiPD.
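One possible way to recreate such lists is sketched below, under the assumption that a purely numeric path component always denotes a list position. The helpers are illustrative, not part of PDSlib.

```python
def unflatten_with_lists(flat):
    """Rebuild a nested structure from dot-notation keys, turning
    groups of purely numeric components (author.0, author.1, ...)
    back into ordered lists."""
    nested = {}
    for name, value in flat.items():
        parts = name.split('.')
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return _lists_from_digit_keys(nested)

def _lists_from_digit_keys(node):
    """Recursively convert any dict whose keys are all digits into a
    list, ordered by the numeric value of the keys."""
    if not isinstance(node, dict):
        return node
    node = {key: _lists_from_digit_keys(value) for key, value in node.items()}
    if node and all(key.isdigit() for key in node):
        return [node[key] for key in sorted(node, key=int)]
    return node

flat = {'author.0.name': 'N1', 'author.1.name': 'N2', 'pubYear': 2009}
unflatten_with_lists(flat)
# returns {'author': [{'name': 'N1'}, {'name': 'N2'}], 'pubYear': 2009}
```

The convention is ambiguous for genuine attribute names that happen to be numeric, which is one more argument for deciding explicitly whether lists should be allowed in LiPD.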