From 570ec73d863a42ca4f95fa4aaa6145a5d0507414 Mon Sep 17 00:00:00 2001
From: Divjot Arora
Date: Tue, 23 Jun 2026 12:53:14 +0000
Subject: [PATCH] WIP: add file using INT96_TIMESTAMP_ORDER
---
data/README.md | 1 +
data/int96_timestamp_order.md | 77 +++++++++++++++++++++++++++++
data/int96_timestamp_order.parquet | Bin 0 -> 427 bytes
3 files changed, 78 insertions(+)
create mode 100644 data/int96_timestamp_order.md
create mode 100644 data/int96_timestamp_order.parquet
diff --git a/data/README.md b/data/README.md
index f2fe47e..eb823e1 100644
--- a/data/README.md
+++ b/data/README.md
@@ -61,6 +61,7 @@
| datapage_v2_empty_datapage.snappy.parquet | A compressed FLOAT column with DataPageV2, a single row, value is null, the file uses Snappy compression, but there is no data for uncompression (see [related issue](https://github.com/apache/arrow-rs/issues/7388)). The zero bytes must not be attempted to be uncompressed, as this is an invalid Snappy stream. |
| unknown-logical-type.parquet | A file containing a column annotated with a LogicalType whose identifier has been set to an arbitrarily high value to check the behaviour of an old reader reading a file written by a new writer containing an unsupported type (see [related issue](https://github.com/apache/arrow/issues/41764)). |
| int96_from_spark.parquet | Single column of (deprecated) int96 values that originated as Apache Spark microsecond-resolution timestamps. Some values are outside the range typically representable by 64-bit nanosecond-resolution timestamps. See [int96_from_spark.md](int96_from_spark.md) for details. |
+| int96_timestamp_order.parquet | Single `required int96` column written with the `INT96_TIMESTAMP_ORDER` column order ([parquet-format #584](https://github.com/apache/parquet-format/pull/584)). The values are chosen so that a byte-wise comparison disagrees with chronological order; the min/max statistics (and column index) are therefore correct only for a reader that honors the new order. See [int96_timestamp_order.md](int96_timestamp_order.md) for details. |
| binary_truncated_min_max.parquet | A file containing six columns with exact, fully-truncated and partially-truncated max and min statistics and with the expected is_{min/max}_value_exact. (see [note](Binary-truncated-min-and-max-statistics)).|
TODO: Document what each file is in the table above.
diff --git a/data/int96_timestamp_order.md b/data/int96_timestamp_order.md
new file mode 100644
index 0000000..4d9cfd8
--- /dev/null
+++ b/data/int96_timestamp_order.md
@@ -0,0 +1,77 @@
+
+
+# `int96_timestamp_order.parquet`
+
+A single `required int96` column written with the `INT96_TIMESTAMP_ORDER` column order added in
+[parquet-format #584](https://github.com/apache/parquet-format/pull/584). It exercises a reader's
+ability to honor the new order: the column carries min/max statistics and a column index, and the
+footer's `column_orders[0]` is set to `INT96_TIMESTAMP_ORDER` (union field 3) rather than
+`TYPE_ORDER`.
+
+INT96 timestamps occupy 12 bytes: a little-endian 8-byte nanoseconds-within-the-day value followed
+by a little-endian 4-byte Julian day. The defined order compares the Julian day (as a signed int32)
+first, then the nanoseconds (as a signed int64), which yields chronological order.
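
The layout and ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the parquet-java implementation: the helper names `decode_int96` and `chronological_key` are hypothetical, and the two sample byte strings are the min/max values listed later in this file.

```python
import struct


def decode_int96(raw: bytes) -> tuple:
    """Split a 12-byte INT96 timestamp into (julian_day, nanos_of_day)."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi": little-endian signed 64-bit nanoseconds, then signed 32-bit Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return julian_day, nanos_of_day


def chronological_key(raw: bytes) -> tuple:
    """Sort key for the described order: Julian day first, then nanoseconds."""
    return decode_int96(raw)


early = bytes.fromhex("7B00000000000000403B2500")     # nanos 123, Julian day 2440000
next_day = bytes.fromhex("00000000000000008D3D2500")  # nanos 0, Julian day 2440589

print(decode_int96(early))                                       # (2440000, 123)
print(chronological_key(early) < chronological_key(next_day))    # True
print(early < next_day)                                          # False: raw bytes compare 0x7B > 0x00
```

Comparing the decoded `(julian_day, nanos_of_day)` tuples gives the chronological result, while comparing the raw bytes gives the opposite answer for this pair.
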
+
+## Why this file is non-trivial
+
+The values are deliberately chosen so that a **byte-wise (lexicographic) comparison disagrees with
+the chronological order**. Because the low-order nanosecond bytes come first in the little-endian
+layout, a reader that compares the raw 12 bytes (or that ignores the new order) derives the wrong
+min/max. A reader must implement the chronological comparison to interpret the stored min/max correctly.
+
+| Value          | Julian day | Nanos of day      | Timestamp                          | First byte |
+|----------------|------------|-------------------|------------------------------------|------------|
+| EARLY | 2440000 | 123 | 1968-05-23 00:00:00.000000123 | `0x7B` |
+| SAME_DAY_EARLY | 2440588 | 1000 | 1970-01-01 00:00:00.000001000 | `0xE8` |
+| LATE_IN_DAY | 2440588 | 86399999999999 | 1970-01-01 23:59:59.999999999 | `0xFF` |
+| NEXT_DAY | 2440589 | 0 | 1970-01-02 00:00:00.000000000 | `0x00` |
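
The derived columns of the table can be recomputed from the Julian day and nanos-of-day alone. The sketch below assumes only that Julian day 2440588 corresponds to 1970-01-01, as the table states; the helper names are hypothetical.

```python
from datetime import date, timedelta

UNIX_EPOCH_JULIAN_DAY = 2440588  # Julian day of 1970-01-01, per the table


def julian_day_to_date(julian_day: int) -> date:
    """Convert a Julian day number to a calendar date."""
    return date(1970, 1, 1) + timedelta(days=julian_day - UNIX_EPOCH_JULIAN_DAY)


def format_nanos_of_day(nanos: int) -> str:
    """Render nanoseconds-within-the-day as HH:MM:SS.fffffffff."""
    seconds, frac = divmod(nanos, 1_000_000_000)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}.{frac:09d}"


def first_byte(nanos: int) -> int:
    """Least significant byte of the little-endian int64, i.e. byte 0 of the INT96."""
    return nanos.to_bytes(8, "little", signed=True)[0]


rows = [
    ("EARLY", 2440000, 123),
    ("SAME_DAY_EARLY", 2440588, 1000),
    ("LATE_IN_DAY", 2440588, 86399999999999),
    ("NEXT_DAY", 2440589, 0),
]
for name, jd, nanos in rows:
    print(name, julian_day_to_date(jd).isoformat(),
          format_nanos_of_day(nanos), f"0x{first_byte(nanos):02X}")
```
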
+
+Values are written to the file out of order: `LATE_IN_DAY, NEXT_DAY, EARLY, SAME_DAY_EARLY` (so that
+the correct min/max are also neither the first nor the last value).
+
+- Correct (`INT96_TIMESTAMP_ORDER`) min/max: **EARLY / NEXT_DAY**
+- Byte-wise (incorrect) min/max would be: **NEXT_DAY / LATE_IN_DAY** (ordered by the leading
+ nanosecond byte `0x00 < 0x7B < 0xE8 < 0xFF`)
+
+The min/max written to the statistics (and the column index) are therefore:
+
+```
+min = 0x 7B 00 00 00 00 00 00 00 40 3B 25 00 (EARLY: nanos 123, Julian day 2440000)
+max = 0x 00 00 00 00 00 00 00 00 8D 3D 25 00 (NEXT_DAY: nanos 0, Julian day 2440589)
+```
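
To see both outcomes side by side, the four values can be encoded and reduced under each ordering. This is an illustrative sketch under the layout described above, not the writer's actual code; `encode_int96` and `chronological_key` are hypothetical helpers.

```python
import struct


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack nanos (little-endian int64) followed by Julian day (little-endian int32)."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def chronological_key(raw: bytes) -> tuple:
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


# Same order in which the values are written to the file.
values = {
    "LATE_IN_DAY": encode_int96(2440588, 86399999999999),
    "NEXT_DAY": encode_int96(2440589, 0),
    "EARLY": encode_int96(2440000, 123),
    "SAME_DAY_EARLY": encode_int96(2440588, 1000),
}

# Chronological order: decode, then compare (julian_day, nanos_of_day).
correct_min = min(values, key=lambda name: chronological_key(values[name]))
correct_max = max(values, key=lambda name: chronological_key(values[name]))

# Byte-wise order: compare the raw 12 bytes as unsigned bytes.
bytewise_min = min(values, key=lambda name: values[name])
bytewise_max = max(values, key=lambda name: values[name])

print("correct :", correct_min, "/", correct_max)    # EARLY / NEXT_DAY
print("bytewise:", bytewise_min, "/", bytewise_max)  # NEXT_DAY / LATE_IN_DAY
print("min =", values[correct_min].hex(" ").upper())
print("max =", values[correct_max].hex(" ").upper())
```

The two hex lines printed at the end should match the `min` and `max` bytes shown in the block above.
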
+
+## How it was generated
+
+Written by parquet-java (parquet-mr 1.18.0-SNAPSHOT) via the
+`TestInt96TimestampStatistics#writeInt96TimestampOrderInteropFile` test:
+
+```
+mvn -pl parquet-hadoop test \
+ -Dtest='TestInt96TimestampStatistics#writeInt96TimestampOrderInteropFile' \
+ -Dparquet.testing.data.dir=/data
+```
+
+## Schema
+
+```
+message int96_timestamp_order {
+ required int96 ts;
+}
+```
diff --git a/data/int96_timestamp_order.parquet b/data/int96_timestamp_order.parquet
new file mode 100644
index 0000000000000000000000000000000000000000..e9e0422f28ba0f6cc26905f99bd70c044e2da1a6
GIT binary patch
literal 427
zcmY+B%}T>S6oqfnP$P&^a6$&M$R?pe8#=_a4OXzbu41w7YksB>XxchWRdDH63yR>*
zm+01I*KT|T7rsTE@dw(Ofw|{!@0ky#+dd;mK^J_#e7r7qTS!F;z-kpdJ_i8z#}@pr
z2VUR)iIXD>z6!rW^dR7KDq7qv-nR`AAtZoY_yk;%vA(lD$mMz_lCcJW4Q(!=IRa@NvkfWe
zy-CO}XEd*7%)Fv(nvO>%aY)mtReM3Z+v}X3o8@Sh4imHKyCf#D=MBgp3SHk%l7>fI
cf6(xvIH9$uc}UzwTsy!92ZB8bpv%AM8(;)d@c;k-
literal 0
HcmV?d00001