Changes in Parquet format releases, limited to those related to data types (not encodings, compression, pages, etc.):
Currently, when writing Parquet files with Arrow (parquet-cpp), we default to Parquet format "1.0". In theory, that would mean we don't use the ConvertedTypes or LogicalTypes introduced in format "2.0"+. In practice, however, we already write most of those annotations with the default of version="1.0" as well: many of the 2.0+ Converted/LogicalTypes (e.g. decimal, date, timestamp with millis/micros) are written by default. The only cases where we do not write converted/logical type annotations with version="1.0" are:

- unsigned integer types (e.g. UINT32, which gets stored as a plain INT64 physical type without any annotation);
- nanosecond timestamps (which get cast to microsecond timestamps instead).
When specifying version="2.0", these two cases also get the appropriate logical type annotations.
All other Converted/LogicalTypes are already written, even with version="1.0". I assume this is because for those types there is no risk of misinterpreting the physical data when the type annotation is not understood (in contrast, e.g. UINT64 is stored as the INT64 physical type, and would thus yield wrong data when interpreted as its physical type).
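To make that risk concrete, here is a small sketch (not part of the original example; just a NumPy bit-level reinterpretation) of what a reader that ignores the unsigned annotation would see:

>>> import numpy as np
>>> value = np.uint64(2**63 + 1)    # only representable as an unsigned 64-bit integer
>>> print(value)
9223372036854775809
>>> print(value.view(np.int64))     # same bits interpreted as signed: silently wrong
-9223372036854775807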
Illustration with code example:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import decimal
>>> pa.__version__
'4.0.0.dev60+gab5fc979c'
>>> table = pa.table({
...     "int64": [1, 2, 3],
...     "int32": pa.array([1, 2, 3], type="int32"),
...     "uint32": pa.array([1, 2, 3], type="uint32"),
...     "date": pa.array([1, 2, 3], type=pa.date32()),
...     "timestamp": pa.array([1, 2, 3], type=pa.timestamp("ms")),
...     "timestamp_tz": pa.array([1, 2, 3], type=pa.timestamp("ms", tz="UTC")),
...     "timestamp_ns": pa.array([1000, 2000, 3000], type=pa.timestamp("ns")),
...     "timestamp_ns_tz": pa.array([1000, 2000, 3000], type=pa.timestamp("ns", tz="UTC")),
...     "string": pa.array(["a", "b", "c"]),
...     "list": pa.array([[1, 2], [3, 4], [5, 6]]),
...     "decimal": pa.array([decimal.Decimal("1.0"), decimal.Decimal("2.0"), decimal.Decimal("3.0")]),
... })
>>> pq.write_table(table, "test_v1.parquet", version="1.0")
>>> pq.write_table(table, "test_v2.parquet", version="2.0")
>>> schema1 = pq.read_metadata("test_v1.parquet").schema
>>> schema1
<pyarrow._parquet.ParquetSchema object at 0x7eff2e9d4d40>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int64 field_id=3 uint32;
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=6 timestamp_tz (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=7 timestamp_ns (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=8 timestamp_ns_tz (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional binary field_id=9 string (String);
  optional group field_id=10 list (List) {
    repeated group field_id=11 list {
      optional int64 field_id=12 item;
    }
  }
  optional fixed_len_byte_array(1) field_id=13 decimal (Decimal(precision=2, scale=1));
}
The corresponding Arrow schema:
>>> print(schema1.to_arrow_schema().to_string(show_field_metadata=False))
int64: int64
int32: int32
uint32: int64
date: date32[day]
timestamp: timestamp[ms]
timestamp_tz: timestamp[ms, tz=UTC]
timestamp_ns: timestamp[us]
timestamp_ns_tz: timestamp[us, tz=UTC]
string: string
list: list<item: int64>
  child 0, item: int64
decimal: decimal(2, 1)
Note: the uint32 column comes back as plain int64, and the nanosecond timestamps come back as microsecond timestamps. Compare with version="2.0":
>>> schema2 = pq.read_metadata("test_v2.parquet").schema
>>> schema2
<pyarrow._parquet.ParquetSchema object at 0x7eff2e9d2080>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int32 field_id=3 uint32 (Int(bitWidth=32, isSigned=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=6 timestamp_tz (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=7 timestamp_ns (Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=8 timestamp_ns_tz (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional binary field_id=9 string (String);
  optional group field_id=10 list (List) {
    repeated group field_id=11 list {
      optional int64 field_id=12 item;
    }
  }
  optional fixed_len_byte_array(1) field_id=13 decimal (Decimal(precision=2, scale=1));
}
>>> print(schema2.to_arrow_schema().to_string(show_field_metadata=False))
int64: int64
int32: int32
uint32: uint32
date: date32[day]
timestamp: timestamp[ms]
timestamp_tz: timestamp[ms, tz=UTC]
timestamp_ns: timestamp[ns]
timestamp_ns_tz: timestamp[ns, tz=UTC]
string: string
list: list<item: int64>
  child 0, item: int64
decimal: decimal(2, 1)
Now all types are preserved.
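The unsigned integer case shows the same contrast. Based on the schema printouts above, the column-level difference looks as follows (a sketch; outputs summarized in the comments rather than shown in full):

>>> schema1.column(2)   # v1: physical_type INT64, no unsigned logical type annotation
>>> schema2.column(2)   # v2: physical_type INT32, logical_type Int(bitWidth=32, isSigned=false)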
Looking more closely at the columns with nanosecond timestamps:
>>> schema1.column(6)  # this actually also has the TIMESTAMP_MICROS converted type set, but it is incorrectly not shown, see https://issues.apache.org/jira/browse/ARROW-11399
<ParquetColumnSchema>
  name: timestamp_ns
  path: timestamp_ns
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE
>>> schema2.column(6)
<ParquetColumnSchema>
  name: timestamp_ns
  path: timestamp_ns
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE
>>> schema1.column(7)
<ParquetColumnSchema>
  name: timestamp_ns_tz
  path: timestamp_ns_tz
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): TIMESTAMP_MICROS
>>> schema2.column(7)
<ParquetColumnSchema>
  name: timestamp_ns_tz
  path: timestamp_ns_tz
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE
For version="2.0", a legacy ConvertedType is no longer set (this is not possible: there is no ConvertedType for nanosecond timestamps). But it also means that such files won't be read correctly by older Parquet readers that do not yet support the nanosecond LogicalType (introduced in Parquet format 2.6.0, Sep 2018). Such readers will read those columns as plain int64 instead of timestamps.
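If compatibility with such older readers matters, write_table can be told to downcast the nanosecond timestamps explicitly while keeping the other version="2.0" annotations. A minimal sketch using the coerce_timestamps option (the output file name is only illustrative):

>>> pq.write_table(
...     table, "test_v2_us.parquet",
...     version="2.0",
...     coerce_timestamps="us",           # cast ns -> us on write, avoiding the nanosecond LogicalType
...     allow_truncated_timestamps=True,  # don't raise if sub-microsecond precision would be dropped
... )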