Changes in Parquet format releases, limited to those related to data types (not encodings, compression, pages, etc.):
Currently, when writing Parquet files with Arrow (parquet-cpp), we default to Parquet format "1.0". In theory, that would mean we don't use the ConvertedTypes or LogicalTypes introduced in format "2.0"+. In practice, however, we already write most of those annotations with the default of version="1.0" as well: many of the 2.0+ Converted/LogicalTypes (e.g. decimal, date, timestamp with millis/micros) are written by default. The only cases where we do not write converted/logical type annotations with version="1.0" are:

- unsigned integer types (e.g. UINT32, which gets stored as a plain INT64 physical type without any annotation);
- nanosecond timestamps (which get cast to microsecond timestamps instead).
When specifying version="2.0", these two cases also get the appropriate logical type annotations.
All other Converted/LogicalTypes are already written, even with version="1.0". I assume this is because for those types there is no risk of misinterpreting the physical data when the type annotation is not understood (in contrast, e.g. UINT64 is stored as the INT64 physical type, and would thus yield wrong data when interpreted as its physical type).
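To make that risk concrete, here is a small sketch (not part of the original example; just a NumPy bit-level reinterpretation) of what a reader that ignores the unsigned annotation would see:

>>> import numpy as np
>>> value = np.uint64(2**63 + 1)    # only representable as an unsigned 64-bit integer
>>> print(value)
9223372036854775809
>>> print(value.view(np.int64))     # same bits interpreted as signed: silently wrong
-9223372036854775807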
Illustration with code example:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import decimal
>>> pa.__version__
'4.0.0.dev60+gab5fc979c'
>>> table = pa.table({
...     "int64": [1, 2, 3],
...     "int32": pa.array([1, 2, 3], type="int32"),
...     "uint32": pa.array([1, 2, 3], type="uint32"),
...     "date": pa.array([1, 2, 3], type=pa.date32()),
...     "timestamp": pa.array([1, 2, 3], type=pa.timestamp("ms")),
...     "timestamp_tz": pa.array([1, 2, 3], type=pa.timestamp("ms", tz="UTC")),
...     "timestamp_ns": pa.array([1000, 2000, 3000], type=pa.timestamp("ns")),
...     "timestamp_ns_tz": pa.array([1000, 2000, 3000], type=pa.timestamp("ns", tz="UTC")),
...     "string": pa.array(["a", "b", "c"]),
...     "list": pa.array([[1, 2], [3, 4], [5, 6]]),
...     "decimal": pa.array([decimal.Decimal("1.0"), decimal.Decimal("2.0"), decimal.Decimal("3.0")]),
... })
>>> pq.write_table(table, "test_v1.parquet", version="1.0")
>>> pq.write_table(table, "test_v2.parquet", version="2.0")
>>> schema1 = pq.read_metadata("test_v1.parquet").schema
>>> schema1
<pyarrow._parquet.ParquetSchema object at 0x7eff2e9d4d40>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int64 field_id=3 uint32;
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=6 timestamp_tz (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=7 timestamp_ns (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=8 timestamp_ns_tz (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional binary field_id=9 string (String);
  optional group field_id=10 list (List) {
    repeated group field_id=11 list {
      optional int64 field_id=12 item;
    }
  }
  optional fixed_len_byte_array(1) field_id=13 decimal (Decimal(precision=2, scale=1));
}
The corresponding Arrow schema:
>>> print(schema1.to_arrow_schema().to_string(show_field_metadata=False))
int64: int64
int32: int32
uint32: int64
date: date32[day]
timestamp: timestamp[ms]
timestamp_tz: timestamp[ms, tz=UTC]
timestamp_ns: timestamp[us]
timestamp_ns_tz: timestamp[us, tz=UTC]
string: string
list: list<item: int64>
  child 0, item: int64
decimal: decimal(2, 1)
Note: the uint32 column comes back as plain int64, and the nanosecond timestamps come back as microsecond timestamps. Compare with version="2.0":
>>> schema2 = pq.read_metadata("test_v2.parquet").schema
>>> schema2
<pyarrow._parquet.ParquetSchema object at 0x7eff2e9d2080>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int32 field_id=3 uint32 (Int(bitWidth=32, isSigned=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=6 timestamp_tz (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=7 timestamp_ns (Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=8 timestamp_ns_tz (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional binary field_id=9 string (String);
  optional group field_id=10 list (List) {
    repeated group field_id=11 list {
      optional int64 field_id=12 item;
    }
  }
  optional fixed_len_byte_array(1) field_id=13 decimal (Decimal(precision=2, scale=1));
}
>>> print(schema2.to_arrow_schema().to_string(show_field_metadata=False))
int64: int64
int32: int32
uint32: uint32
date: date32[day]
timestamp: timestamp[ms]
timestamp_tz: timestamp[ms, tz=UTC]
timestamp_ns: timestamp[ns]
timestamp_ns_tz: timestamp[ns, tz=UTC]
string: string
list: list<item: int64>
  child 0, item: int64
decimal: decimal(2, 1)
Now all types are preserved.
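The unsigned integer case shows the same contrast. Based on the schema printouts above, the column-level difference looks as follows (a sketch; outputs summarized in the comments rather than shown in full):

>>> schema1.column(2)   # v1: physical_type INT64, no unsigned logical type annotation
>>> schema2.column(2)   # v2: physical_type INT32, logical_type Int(bitWidth=32, isSigned=false)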
Looking more closely at the columns with nanosecond timestamps:
>>> schema1.column(6)  # this actually also has the TIMESTAMP_MICROS converted type set, but it is incorrectly not shown, see https://issues.apache.org/jira/browse/ARROW-11399
<ParquetColumnSchema>
  name: timestamp_ns
  path: timestamp_ns
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE
>>> schema2.column(6)
<ParquetColumnSchema>
  name: timestamp_ns
  path: timestamp_ns
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE
>>> schema1.column(7)
<ParquetColumnSchema>
  name: timestamp_ns_tz
  path: timestamp_ns_tz
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): TIMESTAMP_MICROS
>>> schema2.column(7)
<ParquetColumnSchema>
  name: timestamp_ns_tz
  path: timestamp_ns_tz
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE
For version="2.0", a legacy ConvertedType is no longer set (this is not possible: there is no ConvertedType for nanosecond timestamps). But it also means that such files won't be read correctly by older Parquet readers that do not yet support the nanosecond LogicalType (introduced in Parquet format 2.6.0, Sep 2018). Such readers will read those columns as plain int64 instead of timestamps.
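If compatibility with such older readers matters, write_table can be told to downcast the nanosecond timestamps explicitly while keeping the other version="2.0" annotations. A minimal sketch using the coerce_timestamps option (the output file name is only illustrative):

>>> pq.write_table(
...     table, "test_v2_us.parquet",
...     version="2.0",
...     coerce_timestamps="us",           # cast ns -> us on write, avoiding the nanosecond LogicalType
...     allow_truncated_timestamps=True,  # don't raise if sub-microsecond precision would be dropped
... )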