[GH-3609]: Add new sort order for int96 timestamps #3610
| return sizeStatisticsEnabled; | ||
| } | ||
|
|
||
| public boolean getInt96TimestampStatisticsEnabled() { |
Why make this an option? Since we don't use int96 stats otherwise, I think it would be perfectly fine to keep it simple and just produce the new stats all the time.
| /** | ||
| * Chronological order for INT96 timestamps: values are compared by the Julian day (the last 4 | ||
| * bytes, as a little-endian signed int32), then by the nanoseconds within the day (the first 8 | ||
| * bytes, as a little-endian signed int64). Only supported for the INT96 physical type. |
I don't think that this Javadoc is the right place for documenting the format and how to compare. This is part of the spec so the spec needs to be clear and this needs to state what the order means.
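For readers following the thread, the ordering that the Javadoc under review describes can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: the class name `Int96TimestampOrderSketch` and the `encode` helper are hypothetical, and values are assumed to arrive as raw 12-byte arrays.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Comparator;

// Hypothetical sketch; none of these names come from the PR.
public class Int96TimestampOrderSketch {

    // Compares two 12-byte INT96 values chronologically:
    // Julian day first (bytes 8..11, little-endian signed int32),
    // then nanoseconds within the day (bytes 0..7, little-endian signed int64).
    // Arrays shorter than 12 bytes make the absolute reads throw IndexOutOfBoundsException.
    public static final Comparator<byte[]> CHRONOLOGICAL = (a, b) -> {
        ByteBuffer x = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer y = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        int byDay = Integer.compare(x.getInt(8), y.getInt(8));
        if (byDay != 0) {
            return byDay;
        }
        return Long.compare(x.getLong(0), y.getLong(0));
    };

    // Illustrative helper: lays out nanos-of-day followed by the Julian day.
    public static byte[] encode(long nanosOfDay, int julianDay) {
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    public static void main(String[] args) {
        // Julian day 2440588 corresponds to 1970-01-01.
        byte[] lastNanoOfDay = encode(86_399_999_999_999L, 2_440_588);
        byte[] startOfNextDay = encode(0L, 2_440_589);
        // Prints true: the day field is compared before the nanoseconds.
        System.out.println(CHRONOLOGICAL.compare(lastNanoOfDay, startOfNextDay) < 0);
    }
}
```

Because the nanoseconds occupy the leading bytes, a plain bytewise comparison would order these two values the other way around, which is why a dedicated order is needed.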
| } | ||
| }; | ||
|
|
||
| /* |
Did you intend for this to be Javadoc?
| * @param columnOrder the column order | ||
| * @return a new PrimitiveType with the same fields and the given column order | ||
| */ | ||
| public PrimitiveType withColumnOrder(ColumnOrder columnOrder) { |
If we always produce INT96 stats using the timestamp order, would we need this addition to the API? I think we could always produce a PrimitiveType with the right order when constructing types. We would just need to make sure that deserialization correctly distinguishes between unordered and timestamp order.
|
|
||
| private boolean readInt96TimestampStatisticsEnabled() { | ||
| return options == null | ||
| || options.isEnabled(ParquetInputFormat.INT96_TIMESTAMP_STATISTICS_READING_ENABLED, true); |
Why would we have an option for reading INT96 stats? Wouldn't this always be true if we know the int96 stats are using the timestamp order?
| this.parquetFileWriter = parquetFileWriter; | ||
| this.writeSupport = Objects.requireNonNull(writeSupport, "writeSupport cannot be null"); | ||
| - this.schema = schema; | ||
| + this.schema = ParquetFileWriter.applyInt96TimestampOrder(schema, props); |
I don't think this is the right place to fix up the order. It would be better to always use the new order for INT96 when constructing schemas. When converting from file schemas, we would need to detect timestamp vs unordered, but anything going through the write path should automatically use timestamp order because the write path should always produce it.
| * INT96 timestamp statistics are enabled, so that statistics are accumulated with the | ||
| * chronological comparator and the proper column order is written to the footer. | ||
| */ | ||
| static MessageType applyInt96TimestampOrder(MessageType schema, ParquetProperties props) { |
See my comments above about this, but I'm skeptical that we should fix up the schema.
| * key to configure whether INT96 min/max statistics written with the INT96_TIMESTAMP_ORDER | ||
| * column order are read (enabled by default) | ||
| */ | ||
| public static final String INT96_TIMESTAMP_STATISTICS_READING_ENABLED = "parquet.int96.timestamp.statistics.read.enabled"; |
I don't think this should be an option.
Rationale for this change
See parquet-format proposal for rationale: apache/parquet-format#584
What changes are included in this PR?
Add new `INT96_TIMESTAMP_ORDER` sort order and related parsing. On the writer side, the PR adds a default-false `parquet.int96.timestamp.statistics.enabled` flag. When enabled, int96 columns are written with the new order. When disabled, no stats are emitted for int96 columns. On the reader side, the PR adds a default-true `parquet.int96.timestamp.statistics.read.enabled` flag: int96 statistics are propagated only if the flag is enabled and they use the new order.
Are these changes tested?
Yes, new unit tests.
Are there any user-facing changes?
Two new flags and a Thrift change.
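As a rough illustration of how the two flags described in this PR interact, here is a minimal sketch. It assumes configuration is a plain string map; the class `Int96StatsFlagsSketch`, the `ColumnOrderKind` enum and both method names are hypothetical stand-ins, and only the two flag keys and their defaults come from the description.

```java
import java.util.Map;

// Hypothetical sketch of the flag semantics; not parquet-java code.
public class Int96StatsFlagsSketch {

    // Flag keys as named in the PR description.
    public static final String WRITE_FLAG = "parquet.int96.timestamp.statistics.enabled";
    public static final String READ_FLAG = "parquet.int96.timestamp.statistics.read.enabled";

    // Stand-in for the column order recorded in the file footer.
    public enum ColumnOrderKind { UNDEFINED, TYPE_DEFINED_ORDER, INT96_TIMESTAMP_ORDER }

    // Writer side: the flag defaults to false, so INT96 stats are emitted only on opt-in.
    public static boolean writeInt96Stats(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(WRITE_FLAG, "false"));
    }

    // Reader side: the flag defaults to true, and stats are used only
    // when the column was written with the new order.
    public static boolean useInt96Stats(Map<String, String> conf, ColumnOrderKind order) {
        boolean readEnabled = Boolean.parseBoolean(conf.getOrDefault(READ_FLAG, "true"));
        return readEnabled && order == ColumnOrderKind.INT96_TIMESTAMP_ORDER;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of();
        System.out.println(writeInt96Stats(defaults));                                      // false
        System.out.println(useInt96Stats(defaults, ColumnOrderKind.INT96_TIMESTAMP_ORDER)); // true
        System.out.println(useInt96Stats(defaults, ColumnOrderKind.UNDEFINED));             // false
    }
}
```

If the stats were always produced and always trusted, as the review comments suggest, both helpers would reduce to the column-order check alone.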
Closes #3609