HOWTO access and display MODIS satellite data with Python 3

First steps in Python3! This example has been run with the python.org Python v. 3.3.3 (installed from the .dmg), with numpy, matplotlib, GDAL, and IPython (and its dependencies for this notebook) installed into a virtual environment with pip.

First, let's check that we're really using Python 3:

In [2]:
import sys
print(sys.version)
3.3.3 (v3.3.3:c3896275c0f6, Nov 16 2013, 23:39:35) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

Very well. So let's load the pylab libraries, set them up for plotting inline, import everything else we need, and off we go. The goal is to open a MODIS Level 1B file, that is, a file that contains satellite data that has been corrected for sensor calibration, but not re-mapped or gridded in any way.

This data comes from NASA in a format called HDF-EOS (http://hdfeos.org/), a hierachchical data format based on HDF4, which is maintained by the HDF Group. Theoretically, the pyhdf library would be suitable for interfacing with such a file, but it has the tendency to be fickle. The web page (http://pysclint.sourceforge.net/pyhdf/) does not give me confidence it has been ported to Python 3.

No matter, the version of GDAL I have installed has been compiled with HDF4 and HDF-EOS support. GDAL supports a huge number of data formats and therefore provides somewhat of a convoluted kitchen-sink of an interface. But it'll do.

In [3]:
%pylab inline
import os, os.path
from osgeo import gdal
Populating the interactive namespace from numpy and matplotlib

Accessing the data in a MODIS L1B file

Here is the file we want to look at, and it comes with a separate geolocation file. We open both and receive GDAL Dataset objects back. The file name tells us it contains data at (nominally) 1 km resolution, and dates from Julian day 199 in 2004.

In [4]:
fn = "MOD021KM.A2004199.2140.005.2010140125955.hdf"
geofn = "MOD03.A2004199.2140.005.2010140193808.hdf"
In [5]:
dat = gdal.Open(fn)
geodat = gdal.Open(geofn)
print(type(dat))
<class 'osgeo.gdal.Dataset'>

Metadata can be accessed from the gdal.Dataset object, for example to confirm the date/time stamp of the data.

In [6]:
metadata = dat.GetMetadata_Dict()
print(metadata['RANGEBEGINNINGDATE'], metadata['RANGEBEGINNINGTIME'])
2004-07-17 21:40:00.000000

Indeed, the data file knows the name of its own georeference data file, though the metadata key naming isn't immediately intuitive. ANCILLARYINPUTPOINTER? I wouldn't have guessed. (There are other ways of getting the same data, in fact.)

In [7]:
dat.GetMetadata_Dict()['ANCILLARYINPUTTYPE'], dat.GetMetadata_Dict()['ANCILLARYINPUTPOINTER']
Out[7]:
('Geolocation', 'MOD03.A2004199.2140.005.2010140193808.hdf')

HDF files are, as the name suggests, hierarchically organized. In the GDAL dataset API this means that a gdal.Dataset object contains subdatasets. For HDF-EOS files we stop at one level of subdatasetting: each subdatasets contains raster bands of the same dimension and data type. What subdatasets there are in a dataset, and what rasterbands in a subdataset, is laid down in NASA's data specification and user manuals for the data in question.

In the case of MODIS, we need to know this: MODIS acquires data in 36 spectral bands, 2 at a nominal resolution of 250 m, 5 at 500 m, and the remaining 19 at 1 km. The data file we have is the 1 km one. It also contains all the higher-resolution datasets, aggregated to 1 km. For reasons that will become clear, I'd like to pull out the subdatasets that were generated from the 250 and 500 m data. Luckily (either because I've looked through all the subdatasets or because I've read NASA's documentation) I know which ones these are. Similarly, I know where to find latitude and longitude arrays in the georeference data.

In [8]:
subdatasets = dat.GetSubDatasets()
geosubdatasets = geodat.GetSubDatasets()
In [9]:
subdatasets[4][1], subdatasets[7][1], geosubdatasets[8][1], geosubdatasets[9][1]
Out[9]:
('[2x2030x1354] EV_250_Aggr1km_RefSB MODIS_SWATH_Type_L1B (16-bit unsigned integer)',
 '[5x2030x1354] EV_500_Aggr1km_RefSB MODIS_SWATH_Type_L1B (16-bit unsigned integer)',
 '[2030x1354] Latitude (32-bit floating-point)',
 '[2030x1354] Longitude (32-bit floating-point)')

There they are. And the subdataset descriptions contain additional information:

  • the shape of the data arrays (2030x1354 samples for each raster band in the dataset)
  • the shape of the latitude and longitude coordinate arrays (the same, luckily, so we don't have to interpolate the data)
  • the data type (16-bit unsigned integer)

We open the subdatasets and see that each indeed contains the expected number of rasterbands.

In [10]:
highres = gdal.Open(subdatasets[4][0], gdal.GA_ReadOnly)
midres = gdal.Open(subdatasets[7][0], gdal.GA_ReadOnly)
latsds = gdal.Open(geosubdatasets[8][0], gdal.GA_ReadOnly)
lonsds = gdal.Open(geosubdatasets[9][0], gdal.GA_ReadOnly)
In [11]:
print(highres.RasterCount, midres.RasterCount, latsds.RasterCount, lonsds.RasterCount)
2 5 1 1

I would like to produce something close to a true color RGB image, so I'd like to access the bands closest to red, green and blue. These would be band 2 (the second from the 250 m subdataset), and bands 4 and 3 (the second and first from the 500 m data). Attention, trap: rasterband indexing in GDAL starts at 1, not 0.

In [12]:
red = highres.GetRasterBand(1)
green = midres.GetRasterBand(2)
blue = midres.GetRasterBand(1)

Finally, let's check that our three color RasterBand objects are of the same size. They are, because each has been aggregated to a resolution of 1 km (at nadir, which means looking straight down - it's actually more than 1 km at the left and right edges).

In [17]:
for color in [red, green, blue]:
    print(color.XSize, color.YSize)
1354 2030
1354 2030
1354 2030

Visualizing MODIS L1B data

To plot the data as an image, we need to do two things:

  • make them available to a numpy array
  • rescale them from 16 bit to 8 bit for easier plotting

For the first, in this demonstation, we're going to be unconcerned about memory and simply allocate a whole $N \times M \times 3$ array containing 8-bit integer data. Note that the XSize is the number of columns and the YSize the number of rows, so in the array declaration they have to be reversed (rows come first).

In [18]:
imgdata = numpy.zeros([blue.YSize, blue.XSize, 3], np.uint8)
imgdata.shape
Out[18]:
(2030, 1354, 3)

In the second step, the maximum and minimum data value are calculated, any invalid data masked out, and the remaining data re-scaled to the interval between 0 and 255. This is done for each RGB band. Note that we get the data as a numpy array from the rasterband object via color.ReadAsArray().

In [19]:
for idx, color in enumerate([red, green, blue]):
    min, max = color.ComputeRasterMinMax()
    data = np.ma.masked_greater_equal(color.ReadAsArray(), max)
    imgdata[:, :, idx] = np.multiply(255 / (max - min), data - min).astype(np.uint8)

The result isn't very "true color", simply because of the high dynamic range of the image in particular the cloud and ice areas, which are much brighter than the land.

In [20]:
fig = plt.figure(figsize=(18, 27))
plt.imshow(imgadata)
Out[20]:
<matplotlib.image.AxesImage at 0x1015e80d0>