## HDF5 (Hierarchical Data Format)

Link to Kyle's [Bitbucket Repo](https://bitbucket.org/yingkaisha/python-in-remote-sensing/src/tip/_libs/) and some [testing data-sets](https://bitbucket.org/yingkaisha/python-in-remote-sensing/src/tip/_data/_demos/).
### Linking HDF5 & Python

To use HDF5 with Python, you'll need to install 2 libraries/packages from the Anaconda distribution. Type the following commands into the terminal prompt:
```
conda install hdf5
conda install h5py
```
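If the install succeeded, a quick sanity check will confirm the bindings import cleanly (a minimal sketch; version numbers will differ on your machine):

```python
import h5py

# Confirm the Python bindings import and report the
# underlying HDF5 C-library version they were built against.
print("h5py version:", h5py.__version__)
print("HDF5 library version:", h5py.version.hdf5_version)
```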
### HDF5 File Structure

The 3 components of an HDF5 file (files appended with ".h5"):

- Datasets
- Groups: hold datasets and other groups
- Attributes: metadata attached to datasets (and groups)
Primary benefits of HDF5 files: Groups and Attributes. Groups act like folders, allowing related datasets to be stored together. Attributes allow the direct attachment of metadata to the actual data they describe. The following example (taken from the book 'Python and HDF5') shows the organizational structure of an HDF5 file...

```python
>>> import h5py
>>> f = h5py.File("weather_data.h5")
>>> f["/15/Temperature"] = temperature_station15
>>> f["/15/Temperature"].attrs["dt"] = 10.0
>>> f["/15/Temperature"].attrs["startTime"] = 1375204299
>>> f["/15/Wind"] = wind
>>> f["/15/Wind"].attrs["dt"] = 5.0
>>> f["/20/Temperature"] = temperature_station20
```
In the above code-chunk:

- The path `/15/...` is similar to a folder system on a computer.
- Each dataset (e.g. `Temperature` & `Wind`) is stored under the group `/15/`.

Similarly, accessing the metadata attributes can be done in the following way...
```python
>>> dataset = f["15/Temperature"]
>>> for key, value in dataset.attrs.iteritems():
...     print "%s: %s" % (key, value)
dt: 10.0
startTime: 1375204299
```
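The same pattern can be tried end-to-end with a small sketch (this rebuilds the weather-station example in memory rather than using the book's actual data, and uses Python 3 syntax, where `.iteritems()` becomes `.items()`):

```python
import h5py
import numpy as np

# driver='core' with backing_store=False keeps the file in RAM;
# nothing is written to disk.
f = h5py.File("weather_demo.h5", "w", driver="core", backing_store=False)
f["/15/Temperature"] = np.random.random(100)  # stand-in for station data
f["/15/Temperature"].attrs["dt"] = 10.0
f["/15/Temperature"].attrs["startTime"] = 1375204299

# In Python 3 / current h5py, iterate attributes with .items().
dataset = f["/15/Temperature"]
for key, value in dataset.attrs.items():
    print("%s: %s" % (key, value))
f.close()
```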
`attrs` is an attribute of a dataset (or group), and behaves like a dictionary with key-value pairs.

Next, import the `h5dump()` module from Kyle's Bitbucket Repo and load `example.h5` into the notebook:
into the notebookimport h5py as h5
from h5lib import h5dump, print_attrs
f = h5.File('example.h5')
print '\nHDF5 file \'example.h5\' just loaded:\n\n%r' % f
print '\nWe see that the file has been opened in read-mode, hence the \'mode +r\'...'
HDF5 file 'example.h5' just loaded: <HDF5 file "example.h5" (mode r+)> We see that the file has been opened in read-mode, hence the 'mode +r'...
Checking the type of the loaded .h5 file:

```python
print '\nThe loaded .h5 file is of type...\n%s' % type(f)
print '\nLoaded file is of type \'File\'.'
```

```
The loaded .h5 file is of type...
<class 'h5py._hl.files.File'>

Loaded file is of type 'File'.
```
### Viewing the metadata stored in HDF5 files

There are a few ways of viewing the stored metadata. One is to use `h5dump()` (a contribution from Kyle). This routine essentially prints out all of the material stored within the .h5 file.

Running `h5dump` on "example.h5":

```python
h5dump(f)
```
```
item name: Example SDS <HDF5 dataset "Example SDS": shape (16, 5), type ">i2">
    HDF4_OBJECT_TYPE: SDS
    HDF4_OBJECT_NAME: Example SDS
    HDF4_REF_NUM: 2
item name: Example Vdata <HDF5 dataset "Example Vdata": shape (10,), type "|V6">
    TITLE: Example Vdata
    CLASS: TABLE
    VERSION: 1.0
    FIELD_0_NAME: Idx
    FIELD_1_NAME: Temp
    FIELD_2_NAME: Dewpt
    HDF4_OBJECT_TYPE: Vdata
    HDF4_OBJECT_NAME: Example Vdata
    HDF4_REF_NUM: 11
item name: Example Vdata_t <HDF5 named type "Example Vdata_t" (dtype |V6)>
item name: HDF4_DIMGROUP <HDF5 group "/HDF4_DIMGROUP" (0 members)>
item name: MonthlyRain <HDF5 group "/MonthlyRain" (2 members)>
    HDF4_OBJECT_TYPE: Vgroup
    HDF4_OBJECT_NAME: MonthlyRain
    HDF4_REF_NUM: 12
item name: MonthlyRain/Data Fields <HDF5 group "/MonthlyRain/Data Fields" (2 members)>
    HDF4_OBJECT_TYPE: Vgroup
    HDF4_OBJECT_NAME: Data Fields
    HDF4_REF_NUM: 13
item name: MonthlyRain/Data Fields/RrLandRain <HDF5 dataset "RrLandRain": shape (28, 72), type ">f4">
    HDF4_OBJECT_TYPE: SDS
    HDF4_OBJECT_NAME: RrLandRain
    HDF4_REF_NUM: 16
    DIMENSION_NAMELIST: ['/HDF4_DIMGROUP/YDim:MonthlyRain' '/HDF4_DIMGROUP/XDim:MonthlyRain']
item name: MonthlyRain/Data Fields/TbOceanRain <HDF5 dataset "TbOceanRain": shape (28, 72), type ">f4">
    HDF4_OBJECT_TYPE: SDS
    DIMENSION_NAMELIST: ['/HDF4_DIMGROUP/YDim:MonthlyRain' '/HDF4_DIMGROUP/XDim:MonthlyRain']
    HDF4_OBJECT_NAME: TbOceanRain
    HDF4_REF_NUM: 15
item name: MonthlyRain/Grid Attributes <HDF5 group "/MonthlyRain/Grid Attributes" (0 members)>
    HDF4_OBJECT_TYPE: Vgroup
    HDF4_OBJECT_NAME: Grid Attributes
    HDF4_REF_NUM: 14
------------------- attributes for the root file -------------------
attribute name: HDFEOSVersion_GLOSDS --- value: HDFEOS_V2.16
attribute name: StructMetadata.0_GLOSDS --- value:
GROUP=SwathStructure
END_GROUP=SwathStructure
GROUP=GridStructure
  GROUP=GRID_1
    GridName="MonthlyRain"
    XDim=72
    YDim=28
    UpperLeftPointMtrs=(0.000000,70000000.000000)
    LowerRightMtrs=(360000000.000000,-70000000.000000)
    Projection=GCTP_GEO
    GROUP=Dimension
    END_GROUP=Dimension
    GROUP=DataField
      OBJECT=DataField_1
        DataFieldName="TbOceanRain"
        DataType=DFNT_FLOAT32
        DimList=("YDim","XDim")
      END_OBJECT=DataField_1
      OBJECT=DataField_2
        DataFieldName="RrLandRain"
        DataType=DFNT_FLOAT32
        DimList=("YDim","XDim")
      END_OBJECT=DataField_2
    END_GROUP=DataField
    GROUP=MergedFields
    END_GROUP=MergedFields
  END_GROUP=GRID_1
END_GROUP=GridStructure
GROUP=PointStructure
END_GROUP=PointStructure
END
```
The loaded .h5 file is organized as an [OrderedDict](https://docs.python.org/2/library/collections.html#collections.OrderedDict) data-container (recall [Python dictionaries](https://docs.python.org/2/tutorial/datastructures.html)).

### Viewing the key-value pairs (i.e. all groups and datasets) stored in the .h5 file

Accessing the `.items()` method of an .h5 file will give you all of the stored datasets and groups.
```python
print '\nStored groups and datasets within .h5 file:\n'.upper()
for each_item in f.items():
    print each_item
```

```
STORED GROUPS AND DATASETS WITHIN .H5 FILE:

(u'Example SDS', <HDF5 dataset "Example SDS": shape (16, 5), type ">i2">)
(u'Example Vdata', <HDF5 dataset "Example Vdata": shape (10,), type "|V6">)
(u'Example Vdata_t', <HDF5 named type "Example Vdata_t" (dtype |V6)>)
(u'HDF4_DIMGROUP', <HDF5 group "/HDF4_DIMGROUP" (0 members)>)
(u'MonthlyRain', <HDF5 group "/MonthlyRain" (2 members)>)
```
If only the keys are desired, then simply use the `.keys()` method.
```python
print '\nPrinting out all the keys in the imported hdf5 file.\n'
for n, each_key in enumerate(f.keys()):
    print 'Key %d:\t%s' % (n + 1, each_key)
print '\n'
```

```
Printing out all the keys in the imported hdf5 file.

Key 1:  Example SDS
Key 2:  Example Vdata
Key 3:  Example Vdata_t
Key 4:  HDF4_DIMGROUP
Key 5:  MonthlyRain
```
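This dictionary-like behaviour is easy to verify on a file you build yourself. The sketch below uses an in-memory file with made-up names that mimic `example.h5` (Python 3 syntax):

```python
import h5py
import numpy as np

# Hypothetical in-memory file standing in for example.h5.
f = h5py.File("demo.h5", "w", driver="core", backing_store=False)
f.create_dataset("Example SDS", data=np.zeros((16, 5), dtype=">i2"))
f.create_group("MonthlyRain")

# .keys() gives the names, .items() gives (name, object) pairs,
# and `in` tests membership -- just like a dictionary.
print(list(f.keys()))      # -> ['Example SDS', 'MonthlyRain']
print("MonthlyRain" in f)  # -> True
for name, obj in f.items():
    print(name, obj)
f.close()
```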
### Viewing the metadata (attributes) of a dataset or group

Can use the `print_attrs()` routine (thanks Kyle). A look at the routine...

```python
def print_attrs(name, obj):
    print("item name: ", name, repr(obj))
    for key, val in obj.attrs.iteritems():
        print("    %s: %s" % (key, val))
```
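Note that `print_attrs` has exactly the `(name, object)` signature that h5py's `visititems()` expects, so the whole-file dump above can be reproduced by passing it as a callback. A minimal sketch on a small in-memory stand-in file (Python 3, so `.iteritems()` becomes `.items()`):

```python
import h5py
import numpy as np

def print_attrs(name, obj):
    # Callback with the (name, object) signature h5py expects.
    print("item name: ", name, repr(obj))
    for key, val in obj.attrs.items():
        print("    %s: %s" % (key, val))

# Small in-memory file to walk (stand-in for example.h5).
f = h5py.File("walk.h5", "w", driver="core", backing_store=False)
grp = f.create_group("MonthlyRain")
dset = grp.create_dataset("RrLandRain", data=np.zeros((2, 2)))
dset.attrs["HDF4_OBJECT_TYPE"] = "SDS"

# visititems() recursively visits every group and dataset in the
# file, which is essentially what h5dump() does.
f.visititems(print_attrs)
f.close()
```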
Attributes of a dataset:

```python
print_attrs('Example SDS', f['Example SDS'])
```

```
item name: Example SDS <HDF5 dataset "Example SDS": shape (16, 5), type ">i2">
    HDF4_OBJECT_TYPE: SDS
    HDF4_OBJECT_NAME: Example SDS
    HDF4_REF_NUM: 2
```
Attributes of a group:

```python
print_attrs('MonthlyRain', f['MonthlyRain'])
print '''\n
Here,
- we see 'MonthlyRain' is a group (of HDF4 object type 'Vgroup').
- ...also see there are 2 additional members (or subgroups) attached to
  this group.
- the subgroups can be accessed with the .keys() method.
\n'''
```

```
item name: MonthlyRain <HDF5 group "/MonthlyRain" (2 members)>
    HDF4_OBJECT_TYPE: Vgroup
    HDF4_OBJECT_NAME: MonthlyRain
    HDF4_REF_NUM: 12

Here,
- we see 'MonthlyRain' is a group (of HDF4 object type 'Vgroup').
- ...also see there are 2 additional members (or subgroups) attached to
  this group.
- the subgroups can be accessed with the .keys() method.
```
```python
for n, each_key in enumerate(f['MonthlyRain']):
    print 'Key %d:\t%s' % (n+1, each_key)
print '''\n
We can check/confirm what object type each group/dataset is, by using the 'type()'
function.\n
'''
```

```
Key 1:  Data Fields
Key 2:  Grid Attributes

We can check/confirm what object type each group/dataset is, by using the 'type()'
function.
```
```python
print 'f[\'MonthlyRain\'] is of type:\n\n%s\n' % type(f['MonthlyRain'])
```

```
f['MonthlyRain'] is of type:

<class 'h5py._hl.group.Group'>
```
Viewing the Data Fields subgroup in f['MonthlyRain'] & accessing its metadata:

```python
print_attrs('Data Fields Metadata', f['MonthlyRain']['Data Fields'])
print '\nHere we see 2 members of the Data Fields group, so we can access \n\
additional fields with the .keys() method.\n'
print 'Additional groups/datasets:\n'
for n, each_key in enumerate(f['MonthlyRain']['Data Fields']):
    print 'Key %d:\t%s' % (n+1, each_key)
```

```
item name: Data Fields Metadata <HDF5 group "/MonthlyRain/Data Fields" (2 members)>
    HDF4_OBJECT_TYPE: Vgroup
    HDF4_OBJECT_NAME: Data Fields
    HDF4_REF_NUM: 13

Here we see 2 members of the Data Fields group, so we can access
additional fields with the .keys() method.

Additional groups/datasets:

Key 1:  RrLandRain
Key 2:  TbOceanRain
```
```python
type(f['MonthlyRain']['Data Fields']['TbOceanRain'])
```

```
h5py._hl.dataset.Dataset
```
The actual data values can be accessed with the .value attribute:

```python
print '\n', f['MonthlyRain']['Data Fields']['TbOceanRain'].value
```

```
[[ -1.  -1.  -1. ...,  -1.  -1.  -1.]
 [ -1.  -1.  -1. ...,  -1.  -1.  -1.]
 [ 93.  -1.  -1. ...,  58.  -1.  -1.]
 ...,
 [ -1.  -1.  -1. ...,  -1.  -1.  -1.]
 [ -1.  -1.  -1. ...,  -1.  -1.  -1.]
 [ -1.  -1.  -1. ...,  -1.  -1.  -1.]]
```
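Be aware that `.value` was deprecated and eventually removed in h5py 3.0; in current h5py you slice the dataset to get a NumPy array instead. A minimal sketch, using an in-memory stand-in dataset with the same shape and fill value as `TbOceanRain`:

```python
import h5py
import numpy as np

# Hypothetical stand-in for the TbOceanRain dataset.
f = h5py.File("vals.h5", "w", driver="core", backing_store=False)
f["TbOceanRain"] = np.full((28, 72), -1.0, dtype=">f4")

# .value is gone in h5py >= 3.0; slice the dataset instead.
data = f["TbOceanRain"][:]     # whole array as a NumPy ndarray
row = f["TbOceanRain"][0, :5]  # or read just a slice of it
print(data.shape)              # -> (28, 72)
f.close()
```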