From 570ec73d863a42ca4f95fa4aaa6145a5d0507414 Mon Sep 17 00:00:00 2001
From: Divjot Arora <div.arora@databricks.com>
Date: Tue, 23 Jun 2026 12:53:14 +0000
Subject: [PATCH] WIP: add file using INT96_TIMESTAMP_ORDER

---
 data/README.md                     |   1 +
 data/int96_timestamp_order.md      |  77 +++++++++++++++++++++++++++++
 data/int96_timestamp_order.parquet | Bin 0 -> 427 bytes
 3 files changed, 78 insertions(+)
 create mode 100644 data/int96_timestamp_order.md
 create mode 100644 data/int96_timestamp_order.parquet
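
Note (not part of the commit): a minimal Python sketch that reproduces the
min/max claims made in data/int96_timestamp_order.md. It assumes only the
layout described there (8-byte little-endian nanos-of-day followed by a
4-byte little-endian Julian day). The helper names `encode_int96` and
`decode_int96` are illustrative and do not come from parquet-java or any
other Parquet library.

```python
import struct
from typing import Tuple


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp as 12 bytes: int64 nanos, then int32 Julian day (both LE)."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw: bytes) -> Tuple[int, int]:
    """Return (julian_day, nanos_of_day); tuples compare in chronological order."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return julian_day, nanos_of_day


# The four values from the description, in the order they are written to the file.
NAMED = {
    "LATE_IN_DAY": encode_int96(2440588, 86399999999999),
    "NEXT_DAY": encode_int96(2440589, 0),
    "EARLY": encode_int96(2440000, 123),
    "SAME_DAY_EARLY": encode_int96(2440588, 1000),
}
values = list(NAMED.values())

# Chronological comparison: Julian day first, then nanos-of-day.
chrono_min = min(values, key=decode_int96)
chrono_max = max(values, key=decode_int96)

# Naive comparison: Python orders bytes objects as unsigned, lexicographic.
bytewise_min = min(values)
bytewise_max = max(values)
```

Under these assumptions, `chrono_min`/`chrono_max` should be EARLY/NEXT_DAY,
while `bytewise_min`/`bytewise_max` should be NEXT_DAY/LATE_IN_DAY.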

diff --git a/data/README.md b/data/README.md
index f2fe47e..eb823e1 100644
--- a/data/README.md
+++ b/data/README.md
@@ -61,6 +61,7 @@
 | datapage_v2_empty_datapage.snappy.parquet | A compressed FLOAT column with DataPageV2, a single row, value is null, the file uses Snappy compression, but there is no data for uncompression (see [related issue](https://github.com/apache/arrow-rs/issues/7388)). The zero bytes must not be attempted to be uncompressed, as this is an invalid Snappy stream. |
 | unknown-logical-type.parquet | A file containing a column annotated with a LogicalType whose identifier has been set to an abitrary high value to check the behaviour of an old reader reading a file written by a new writer containing an unsupported type (see [related issue](https://github.com/apache/arrow/issues/41764)). |
 | int96_from_spark.parquet | Single column of (deprecated) int96 values that originated as Apache Spark microsecond-resolution timestamps. Some values are outside the range typically representable by 64-bit nanosecond-resolution timestamps. See [int96_from_spark.md](int96_from_spark.md) for details. |
+| int96_timestamp_order.parquet | Single `required int96` column written with the `INT96_TIMESTAMP_ORDER` column order ([parquet-format #584](https://github.com/apache/parquet-format/pull/584)). Values are chosen so that a byte-wise comparison disagrees with chronological order; a reader interprets the min/max statistics (and column index) correctly only if it honors the new order. See [int96_timestamp_order.md](int96_timestamp_order.md) for details. |
 | binary_truncated_min_max.parquet | A file containing six columns with exact, fully-truncated and partially-truncated max and min statistics and with the expected is_{min/max}_value_exact.  (see [note](Binary-truncated-min-and-max-statistics)).|
 
 TODO: Document what each file is in the table above.
diff --git a/data/int96_timestamp_order.md b/data/int96_timestamp_order.md
new file mode 100644
index 0000000..4d9cfd8
--- /dev/null
+++ b/data/int96_timestamp_order.md
@@ -0,0 +1,77 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+# `int96_timestamp_order.parquet`
+
+A single `required int96` column written with the `INT96_TIMESTAMP_ORDER` column order added in
+[parquet-format #584](https://github.com/apache/parquet-format/pull/584). It exercises a reader's
+ability to honor the new order: the column carries min/max statistics and a column index, and the
+footer's `column_orders[0]` is set to `INT96_TIMESTAMP_ORDER` (union field 3) rather than
+`TYPE_ORDER`.
+
+INT96 timestamps are 12 bytes: an 8-byte little-endian nanoseconds-of-day value followed by a
+4-byte little-endian Julian day. The defined order compares the Julian day (as a signed int32) first,
+then the nanoseconds (as a signed int64), which is chronological order.
+
+## Why this file is non-trivial
+
+The values are deliberately chosen so that a **byte-wise (lexicographic) comparison disagrees with
+the chronological order**. Because the low-order nanosecond bytes come first in the little-endian
+layout, a reader that compares the raw 12 bytes (or that ignores the new order) computes the wrong
+min/max. A reader must implement the chronological comparison to reproduce the stored statistics.
+
+| Value          | Julian day | Nanos of day      | Timestamp                          | First byte |
+|----------------|------------|-------------------|------------------------------------|------------|
+| EARLY          | 2440000    | 123               | 1968-05-23 00:00:00.000000123      | `0x7B`     |
+| SAME_DAY_EARLY | 2440588    | 1000              | 1970-01-01 00:00:00.000001000      | `0xE8`     |
+| LATE_IN_DAY    | 2440588    | 86399999999999    | 1970-01-01 23:59:59.999999999      | `0xFF`     |
+| NEXT_DAY       | 2440589    | 0                 | 1970-01-02 00:00:00.000000000      | `0x00`     |
+
+Values are written to the file out of order: `LATE_IN_DAY, NEXT_DAY, EARLY, SAME_DAY_EARLY`, so that
+neither the correct min nor the correct max is the first or the last value.
+
+- Correct (`INT96_TIMESTAMP_ORDER`) min/max: **EARLY / NEXT_DAY**
+- Byte-wise (incorrect) min/max would be: **NEXT_DAY / LATE_IN_DAY** (ordered by the leading
+  nanosecond byte `0x00 < 0x7B < 0xE8 < 0xFF`)
+
+The min/max written to the statistics (and the column index) are therefore:
+
+```
+min = 0x 7B 00 00 00 00 00 00 00 40 3B 25 00   (EARLY:    nanos 123, Julian day 2440000)
+max = 0x 00 00 00 00 00 00 00 00 8D 3D 25 00   (NEXT_DAY: nanos 0,   Julian day 2440589)
+```
+
+## How it was generated
+
+Written by parquet-java (parquet-mr 1.18.0-SNAPSHOT) via the
+`TestInt96TimestampStatistics#writeInt96TimestampOrderInteropFile` test:
+
+```
+mvn -pl parquet-hadoop test \
+  -Dtest='TestInt96TimestampStatistics#writeInt96TimestampOrderInteropFile' \
+  -Dparquet.testing.data.dir=<parquet-testing>/data
+```
+
+## Schema
+
+```
+message int96_timestamp_order {
+  required int96 ts;
+}
+```
diff --git a/data/int96_timestamp_order.parquet b/data/int96_timestamp_order.parquet
new file mode 100644
index 0000000000000000000000000000000000000000..e9e0422f28ba0f6cc26905f99bd70c044e2da1a6
GIT binary patch
literal 427
zcmY+B%}T>S6oqfnP$P&^a6$&M$R?pe8#=_a4OXzbu41w7YksB>XxchWRdDH63yR>*
zm+01I*KT|T7rsTE@dw(Ofw|{!@0ky#+dd;mK^J_#e7r7qTS!F;z-kpdJ_i8z#}@pr
z2VUR)iIXD>z6!rW^dR7KDq7qv-nR`AAtZoY_yk;%vA(lD$mMz_lCcJW4Q(!=I<J5m
zl@3ZrxnJ++X-21cm`(a)mQaRdKt&65IiG^2VgnmC7^sLXMa9BI+|}MBe(bqqt41Bl
z8*gob0IqOSMTEl7)dJ9-WTa_J2?Zd6R9ocj3qHMc&C;AQCmSbp=#0XQ>Ra@NvkfWe
zy-CO}XEd*7%)Fv(nvO>%aY)mtReM3Z+v}X3o8@Sh4imHKyCf#D=MBgp3SHk%l7>fI
cf6(xvIH9$uc}UzwTsy!92ZB8bpv%AM8(;)d@c;k-

literal 0
HcmV?d00001