[GH-3609]: Add new sort order for int96 timestamps #3610

Open
divjotarora wants to merge 1 commit into apache:master from divjotarora:int96-stats

@divjotarora

Contributor

Rationale for this change

See parquet-format proposal for rationale: apache/parquet-format#584

What changes are included in this PR?

Adds a new INT96_TIMESTAMP_ORDER sort order and the related parsing. On the writer side, the PR adds a parquet.int96.timestamp.statistics.enabled flag that defaults to false: when it is enabled, INT96 columns are written with the new order, and when it is disabled, no statistics are emitted for INT96 columns. On the reader side, the PR adds a parquet.int96.timestamp.statistics.read.enabled flag that defaults to true: INT96 statistics are propagated only if the flag is enabled and the statistics use the new order.
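
For reviewers, the flag semantics described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class `Int96StatsFlagsSketch`, the `SortOrder` enum, and both method names are hypothetical, and a plain `Map` stands in for the real configuration object. Only the two flag names and their defaults come from the description.

```java
import java.util.Map;

// Hypothetical sketch of the flag semantics; not the classes touched by this PR.
final class Int96StatsFlagsSketch {
  static final String WRITE_FLAG = "parquet.int96.timestamp.statistics.enabled";
  static final String READ_FLAG = "parquet.int96.timestamp.statistics.read.enabled";

  // Stand-in for the sort order recorded for a column (assumption, simplified).
  enum SortOrder { UNDEFINED, INT96_TIMESTAMP_ORDER }

  // Writer side: the flag defaults to false, so INT96 stats are off unless requested.
  static boolean writeInt96Stats(Map<String, String> conf) {
    return Boolean.parseBoolean(conf.getOrDefault(WRITE_FLAG, "false"));
  }

  // Reader side: the flag defaults to true, but stats are only propagated
  // when the file also records the new order for the column.
  static boolean useInt96Stats(Map<String, String> conf, SortOrder recordedOrder) {
    boolean readEnabled = Boolean.parseBoolean(conf.getOrDefault(READ_FLAG, "true"));
    return readEnabled && recordedOrder == SortOrder.INT96_TIMESTAMP_ORDER;
  }
}
```

With an empty configuration, this sketch writes no INT96 statistics and reads them only from files that record the new order.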

Are these changes tested?

Yes, new unit tests.

Are there any user-facing changes?

Two new flags and a Thrift change.

Closes #3609

return sizeStatisticsEnabled;
}

public boolean getInt96TimestampStatisticsEnabled() {

Contributor

Why make this an option? Since we don't use int96 stats otherwise, I think it would be perfectly fine to keep it simple and just produce the new stats all the time.

/**
* Chronological order for INT96 timestamps: values are compared by the Julian day (the last 4
* bytes, as a little-endian signed int32), then by the nanoseconds within the day (the first 8
* bytes, as a little-endian signed int64). Only supported for the INT96 physical type.

Contributor

I don't think that this Javadoc is the right place for documenting the format and how to compare values. This is part of the spec, so the spec needs to be clear, and this needs to state what the order means.
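
For reference while the wording is settled, the comparison described in the Javadoc under review could look roughly like the sketch below. It assumes a 12-byte value laid out as that Javadoc states (first 8 bytes little-endian signed int64 nanoseconds within the day, last 4 bytes little-endian signed int32 Julian day). The class and helper names are hypothetical and are not taken from this PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Comparator;

// Hypothetical sketch of the chronological INT96 comparison; not the PR's comparator.
final class Int96TimestampOrderSketch {

  // Compare by Julian day first, then by nanoseconds within the day.
  static final Comparator<byte[]> CHRONOLOGICAL = (a, b) -> {
    int dayCmp = Integer.compare(julianDay(a), julianDay(b));
    return dayCmp != 0 ? dayCmp : Long.compare(nanosOfDay(a), nanosOfDay(b));
  };

  // Last 4 bytes: little-endian signed int32 Julian day.
  static int julianDay(byte[] value) {
    checkLength(value);
    return ByteBuffer.wrap(value).order(ByteOrder.LITTLE_ENDIAN).getInt(8);
  }

  // First 8 bytes: little-endian signed int64 nanoseconds within the day.
  static long nanosOfDay(byte[] value) {
    checkLength(value);
    return ByteBuffer.wrap(value).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
  }

  // Illustration-only helper that builds a 12-byte value in the same layout.
  static byte[] encode(long nanosOfDay, int julianDay) {
    return ByteBuffer.allocate(12)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(nanosOfDay)
        .putInt(julianDay)
        .array();
  }

  private static void checkLength(byte[] value) {
    if (value.length != 12) {
      throw new IllegalArgumentException("INT96 values must be 12 bytes, got " + value.length);
    }
  }
}
```

In this sketch, `encode(5L, 2440588)` sorts before `encode(1L, 2440589)` because the day is compared first, whereas an unsigned byte-wise comparison of the same two values would order them the other way around, since the nanoseconds occupy the leading bytes.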

}
};

/*

Contributor

Did you intend for this to be Javadoc?

* @param columnOrder the column order
* @return a new PrimitiveType with the same fields and the given column order
*/
public PrimitiveType withColumnOrder(ColumnOrder columnOrder) {

Contributor

If we always produce INT96 stats using the timestamp order, would we need this addition to the API? I think we could always produce a PrimitiveType with the right order when constructing types. We would just need to make sure that deserialization correctly distinguishes between unordered and timestamp order.


private boolean readInt96TimestampStatisticsEnabled() {
return options == null
|| options.isEnabled(ParquetInputFormat.INT96_TIMESTAMP_STATISTICS_READING_ENABLED, true);

Contributor

Why would we have an option for reading INT96 stats? Wouldn't this always be true if we know the int96 stats are using the timestamp order?

  this.parquetFileWriter = parquetFileWriter;
  this.writeSupport = Objects.requireNonNull(writeSupport, "writeSupport cannot be null");
- this.schema = schema;
+ this.schema = ParquetFileWriter.applyInt96TimestampOrder(schema, props);

Contributor

I don't think this is the right place to fix up the order. It would be better to always use the new order for INT96 when constructing schemas. When converting from file schemas, we would need to detect timestamp vs unordered, but anything going through the write path should automatically use timestamp order because the write path should always produce it.
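
One way to picture the suggestion above is the rough sketch below. It is not code from parquet-java: `PhysicalType`, `Order`, and both methods are simplified, hypothetical stand-ins, and the handling of non-INT96 types is an assumption made only to keep the example self-contained.

```java
// Hypothetical sketch: pick the order at schema construction, detect it when reading files.
final class Int96OrderResolutionSketch {
  enum PhysicalType { INT32, INT64, INT96, BINARY }

  enum Order { UNDEFINED, TYPE_DEFINED_ORDER, INT96_TIMESTAMP_ORDER }

  // Write path: a newly constructed INT96 type always carries the timestamp order.
  static Order orderForNewSchema(PhysicalType type) {
    return type == PhysicalType.INT96 ? Order.INT96_TIMESTAMP_ORDER : Order.TYPE_DEFINED_ORDER;
  }

  // Read path: trust INT96 statistics only when the file explicitly records the
  // timestamp order; any other recorded order is treated as unordered.
  static Order orderFromFile(PhysicalType type, Order recorded) {
    if (type != PhysicalType.INT96) {
      return recorded; // assumption: other types keep whatever the file recorded
    }
    return recorded == Order.INT96_TIMESTAMP_ORDER ? Order.INT96_TIMESTAMP_ORDER : Order.UNDEFINED;
  }
}
```

Under this shape, no schema fix-up is needed in the writer, and the only place that must distinguish unordered from timestamp order is the conversion from file metadata.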

* INT96 timestamp statistics are enabled, so that statistics are accumulated with the
* chronological comparator and the proper column order is written to the footer.
*/
static MessageType applyInt96TimestampOrder(MessageType schema, ParquetProperties props) {

Contributor

See my comments above about this, but I'm skeptical that we should fix up the schema.

* key to configure whether INT96 min/max statistics written with the INT96_TIMESTAMP_ORDER
* column order are read (enabled by default)
*/
public static final String INT96_TIMESTAMP_STATISTICS_READING_ENABLED = "parquet.int96.timestamp.statistics.read.enabled";

Contributor

I don't think this should be an option.
