Make the th3index family selectable at build time with the H3 flag #17
Open
estebanzimanyi wants to merge 103 commits into
…trators for UDT and UDF.
Period implementation
Satria/poc
… UDTs using Meos Datatypes.
Meos datatype
Timestampset implementation
…ra/SpanAccessor/GeoAnalytics extensions (809 tests)
- TransformUDFs: floatset/textset case transforms, intspan/floatspan/intspanset/floatspanset shiftScale, tgeompoint/tgeogpoint transforms
- RestrictionUDFs: temporalAtTimestamptz, temporalMinusTimestamptz, temporalAtValues, temporalMinusValues
- SimilarityUDFs: hausdorffDistance (joins frechetDistance + dynamicTimeWarp)
- SpanAlgebraUDFs: intToSpan/Set/Spanset, floatToSpan/Set/Spanset, intToTbox, floatToTbox
- SpanAccessorUDFs: intspanset/floatspanset lower/upper/width accessors
- GeoAnalyticsUDFs: geoSame predicate
local[*] uses all CPU cores as Spark executors; with MEOS's global state each thread needs its own meos_initialize() call via MeosThread.ensureReady(). Running 16+ threads before that is wired triggered simultaneous exit() calls from default_error_handler, crashing WSL2. local[2] is safe and sufficient.
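The per-thread initialisation idea can be illustrated with a minimal sketch. It assumes that MeosThread.ensureReady() behaves as an idempotent per-thread guard; the class `PerThreadInit` and its counter are hypothetical stand-ins, and the real native meos_initialize() call is only simulated here.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a per-thread MEOS initialisation guard. */
final class PerThreadInit {
    // Counts how often the (simulated) native initialiser ran, across all threads.
    static final AtomicInteger INIT_CALLS = new AtomicInteger();

    // One flag per thread; the supplier runs once, on first access from that thread.
    private static final ThreadLocal<Boolean> READY = ThreadLocal.withInitial(() -> {
        INIT_CALLS.incrementAndGet(); // a real guard would call meos_initialize() here
        return Boolean.TRUE;
    });

    /** Idempotent: the first call on a thread initialises it, later calls are no-ops. */
    static void ensureReady() {
        READY.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> { ensureReady(); ensureReady(); };
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("initialisations: " + INIT_CALLS.get());
    }
}
```

Each UDF body would call the guard before touching native state, so the number of initialisations tracks the number of executor threads rather than the number of rows.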
JMEOS-1.3.jar and the old MobilityDB-JMEOS*.jar names were unreferenced; pom.xml and CI both point exclusively to JMEOS-1.4.jar per ecosystem policy (JMEOS version must match MEOS API version).
… UDFs (778 tests)
New UDFs in GeoUDFs:
- tpointAsText(trip, precision): tspatial_as_text
- tpointAsEWKT(trip, precision): tspatial_as_ewkt
- tpointSRID(trip): tspatial_to_stbox + stbox_srid
- tpointSetSRID(trip, srid): tspatial_set_srid
- tpointRound(trip, decimals): temporal_round
- tpointToStbox(trip): tspatial_to_stbox
New UDFs in GeoAnalyticsUDFs:
- tpointConvexHull(trip): tgeo_convex_hull
- tpointExpandSpace(trip, dist): tspatial_to_stbox + stbox_expand_space
All UDFs follow the hex-WKB storage convention and the MeosThread.ensureReady() per-thread init pattern.
Note: tgeomToTgeog/tgeogToTgeom are deferred because MEOS tgeompoint_to_tgeogpoint is not yet in the installed libmeos.so.
Three JNR-FFI call sites in BerlinMODUDFs referenced old pre-1.4 MEOS function names that no longer exist in the runtime libmeos.so:
- overlaps_tpoint_stbox → overlaps_tspatial_stbox
- tpoint_value_at_timestamptz → temporal_value_at_timestamptz
- adisjoint_tpoint_tpoint → adisjoint_tgeo_tgeo
JMEOS-1.4.jar updated to declare the new names in the JNR-FFI interface.
BerlinMODBench.java:
- loadExistingTimings() reads a prior results JSON so a partial run can
be resumed without losing already-collected query timings
- JSON output now emits queries in canonical QUERY_ORDER regardless of
run order
bench_mbdb.sh / bench_mduck.sh:
- New --queries option accepts comma/range syntax (e.g. "q04", "q02-q05")
to re-run only selected queries and merge into an existing output file
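The comma/range syntax accepted by --queries can be sketched as follows. The bench scripts themselves are shell, so this Java version is only an illustration of the expansion rule; the class `QuerySelection` and its error handling are assumptions, not code from the scripts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical parser for a selection such as "q04" or "q02-q05,q10". */
final class QuerySelection {
    /** Expands the specification into individual query ids, in the order given. */
    static List<String> parse(String spec) {
        List<String> out = new ArrayList<>();
        for (String part : spec.split(",")) {
            String p = part.trim();
            if (p.isEmpty()) continue;
            int dash = p.indexOf('-');
            if (dash < 0) {
                out.add(format(number(p)));            // single query, e.g. "q04"
            } else {
                int lo = number(p.substring(0, dash)); // inclusive range, e.g. "q02-q05"
                int hi = number(p.substring(dash + 1));
                if (lo > hi) throw new IllegalArgumentException("descending range: " + p);
                for (int q = lo; q <= hi; q++) out.add(format(q));
            }
        }
        return out;
    }

    // Strips the leading "q" and parses the numeric part.
    private static int number(String id) {
        String s = id.trim().toLowerCase(Locale.ROOT);
        if (!s.startsWith("q")) throw new IllegalArgumentException("expected qNN, got: " + id);
        return Integer.parseInt(s.substring(1));
    }

    // Normalises to the two-digit form used in the examples ("q04").
    private static String format(int n) {
        return String.format(Locale.ROOT, "q%02d", n);
    }
}
```

With this rule, "q02-q05" expands to q02, q03, q04 and q05, which are then the only queries re-run and merged into the existing output file.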
🎉 Complete coverage of the active addressable MobilityDB SQL surface.
907/907 unit tests green. Compare to MobilityDuck 79.3% (current).
Adds ~315 UDFs across 16 new files + extends 12 existing files.
Coverage trajectory: 51% → 100% across the parity push. All 51 active
sections now at 100%.
==== New UDF classes ====
- TPointSTBoxOpsUDFs: 42 cross-type STBox×TPoint positional/topological
- TBoxOpsUDFs: 39 cross-type TBox×TNumber positional/topological
- SpansetOpsUDFs: 23 cross-type Span/Spanset positional/topological
- TemporalCompUDFs: 26 temporal comparison ops (teq/tne/tlt/tle/tgt/tge)
- TemporalBoxOpsUDFs: 30 cross-type box predicates
- AlwaysSpatialRelsUDFs: 12 'always' spatial-relationship predicates
- SetOpsUDFs: set×set positional + topological + per-type distance
- IOAliasUDFs: 100+ typed *From{HexWKB,Binary,Text,EWKT,EWKB,MFJSON} aliases
- SubtypeConstructorUDFs: typed Inst/Seq/SeqSet aliases + accessors
- AccessorAliasUDFs: typed span/spanset width, dates, valueSpan, set-values
arrays, tboxes/stboxes/spans (array-returning), bins, splits, valueSet,
segmentMin/MaxDuration, box2d, box3d (PostGIS embedded in MEOS),
mobilitydbVersion, avgValue, tgeometry/tgeography conversions, quadSplit,
getBin/timestamptzGetBin
- BucketUDFs: floatBucket, intBucket
- GeoAffineUDFs: translate/translate3, rotate, rotateX/Y/Z, transscale, affine
- TileUDFs: complete multi-dimensional tiling for parallel processing —
spaceBoxes / spaceTimeBoxes / valueTimeBoxesT{float,int} / time/value
Boxes/Tiles/Splits, getTimeTile / getSpaceTile / getSpaceTimeTile /
getStboxTimeTile / getValueTile / getValueTimeTile / getTBoxTimeTile,
spaceTiles / spaceTimeTiles / stbox/tint/tfloatTimeTiles, makeSimple
(Temporal** array of simple sub-tpoints), tfloat/tintValueTiles,
tfloat/tintValueSplit (Temporal** with Datum vsize/vorigin via IEEE bits),
tfloat/tintValueTimeSplit, geoMeasure (tpoint+tfloat → geometry),
asMVTGeom (tpoint → array of WKT geometries clipped to STBox bounds)
- SeqSetGapsUDFs: tbool/tint/tfloat/ttext/tgeompoint/tgeogpoint/tgeometry/
tgeographySeqSetGaps (closes long-standing user request from MobilityDB
issue #187 — array-of-instants → tsequenceset_make_gaps with native
TInstant** packing)
==== Extended existing UDF classes ====
- GeoUDFs, DistanceUDFs, GeoAnalyticsUDFs, STBoxUDFs, TBoxUDFs,
SimilarityUDFs, TTextUDFs, TransformUDFs, BoolOpsUDFs, TemporalUDFs,
AccessorUDFs, SpanAlgebraUDFs — see docs/parity-status.md for full per-
section coverage
==== MeosNative.java (new) ====
Supplementary JNR-FFI interface for ~70 MEOS-1.4 symbols not yet in
JMEOS-1.4: nad/nai/shortestline_tgeo_*, {dir}_stbox_tspatial /
_tspatial_stbox, float/int_get_bin, t{float,int}box_expand,
tgeometry/tgeography_in/_from_mfjson, temporal_mem_size, tgeoinst_make,
temporal_before/after_timestamptz, textcat_ttext_*, mobilitydb_version,
intset/bigintset/floatset_value_n out-param accessors, tnumber_avg_value,
tgeo*-to-tgeo* conversions, span_expand/_bins, tnumber/tgeo_split_*_n_*,
tnumber_tboxes / tgeo_stboxes, tpoint_minus_geom / _direction /
_make_simple, temporal_dyntimewarp_path / _frechet_path, tgeo_affine,
temporal_time_bins / tstzspan_bins / t{int,float}_value_bins,
stbox_quad_split, timestamptz_get_bin, stbox_get_space/time/space_time_tile,
tgeo_space/space_time_boxes, tnumber_value_time_boxes (Datum via long),
temporal_time_split / tgeo_space_split / tgeo_space_time_split (Temporal**
+ bin out-params), temporal_values_p + set_make_free + temptype_basetype
(valueSet path), temporal_segm_duration, stbox_to_box3d / _to_gbox +
box3d_out / gbox_out (PostGIS BOX3D/BOX2D embedded in MEOS),
stbox_space/time/space_time_tiles, t{int,float}box_time/value/value_time
_tiles, tnumber_value_split / _value_time_split (Datum splits with IEEE
bit-packed vsize/vorigin), tbox_get_value_time_tile (single-tile lookup
with MeosType basetype/spantype enum dispatch), tpoint_tfloat_to_geomeas,
tpoint_as_mvtgeom, tnumber_to_tbox.
==== Audit infrastructure ====
scripts/parity-audit.py — regenerable. Match strategy: snake_case →
camelCase, type-prefix stripping, wrapper-style dispatcher recognition,
type-suffix matching. Out-of-scope buckets:
- Section-level: GiST/SPGiST opclasses, set/span/spanset index files,
019_geo_constructors (PG geometric types), 999_oid_cache
- Suffix-level: PG plumbing (_in/_out/_recv/_send, _transfn/_combinefn/
_finalfn/_serialize/_deserialize, _sel/_joinsel/_supportfn/_analyze,
_typmod_in/_out, _cmp/_eq/_ne/_lt/_le/_gt/_ge/_hash/_hash_extended)
- Exact name: range/multirange (PG range types, NOT in MEOS),
create_trip (BerlinMOD generator, PG-only), transform_gk (SECONDO
Gauss-Krüger projection)
Note: box2d/box3d ARE addressable (PostGIS embedded in MEOS).
Deferred families: cbuffer, npoint, pose, rgeo.
docs/parity-status.md — per-section coverage report (regenerable).
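The first step of the match strategy (snake_case → camelCase) can be illustrated with a small sketch. The audit script is Python; this Java version, the class `NameMatch`, and the case-insensitive comparison are assumptions made for illustration and do not reproduce the prefix-stripping or dispatcher-recognition steps.

```java
/** Hypothetical helper showing snake_case to camelCase name matching. */
final class NameMatch {
    /** Converts e.g. "temporal_round" to "temporalRound". */
    static String snakeToCamel(String snake) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = false;
        for (char c : snake.toCharArray()) {
            if (c == '_') {
                // An underscore capitalises the next letter, except at the very start.
                upperNext = sb.length() > 0;
                continue;
            }
            sb.append(upperNext ? Character.toUpperCase(c) : c);
            upperNext = false;
        }
        return sb.toString();
    }

    /** Loose match: ignores case so that acronyms such as SRID still line up. */
    static boolean matches(String snakeName, String udfName) {
        return snakeToCamel(snakeName).equalsIgnoreCase(udfName);
    }
}
```

A case-insensitive comparison is one way to tolerate acronym casing (SRID, EWKT); whether the audit script does the same is not stated in the commit message.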
Ports MobilityDB's th3index temporal H3-cell index into MobilitySpark to
accelerate the BerlinMOD cross-join family (Q4/Q11/Q12/Q14 and similar)
that currently time out because Spark has no spatial index analogous to
MobilityDB's GiST or DuckDB's multi-dim index.
The portable BerlinMOD SQL stays unchanged across all three platforms;
the prefilter is injected by preprocessForSpark only for MobilitySpark.
## What lands
- src/main/java/org/mobilitydb/spark/h3/Th3IndexUDFs.java
9 UDFs covering the BerlinMOD-relevant subset of meos_h3.h:
tgeompointToTh3Index / tgeogpointToTh3Index (load-time conversion)
h3IndexFromText / h3IndexAsText (cell I/O)
geomToH3Cell (point-geom → cell, via
tpointinst_make + th3index)
everEqH3IndexTh3Index / alwaysEqH3IndexTh3Index (membership prefilter)
th3IndexGetResolution (introspection)
- BerlinMODBench.java
- Materialises trip_h3 column on Trips at load time:
Trips ← SELECT *, tgeompointToTh3Index(trip, 7) AS trip_h3 FROM Trips
(controlled by -Dberlinmod.bench.th3index.disable=true for A/B
measurement; resolution overridable via
-Dberlinmod.bench.th3index.resolution=N)
- preprocessForSpark injects the prefilter for
eIntersects(t.<col>, q.<col>)
patterns:
(COALESCE(everEqH3IndexTh3Index(geomToH3Cell(q.<col>, 7),
t.trip_h3), TRUE)
AND eIntersects(t.<col>, q.<col>))
Catalyst's AND short-circuit means the cheap cell test runs first;
eIntersects runs only on candidates that pass. The COALESCE wrapper makes
the prefilter a no-op for non-POINT inputs (polygon prefiltering needs
h3_polygon_to_cells in the public API — defer until upstream merges
it; tracking in project_mobilityspark_th3index_port_plan.md).
- MobilitySparkSession.java
Th3IndexUDFs.registerAll(spark) added at the end of the registration
chain.
- bench_mspark.sh
Adds spark.sql.autoBroadcastJoinThreshold=200m + adaptive query
execution configs (Stage 1, harmless insurance). Doesn't help on the
measurable subset (Catalyst already auto-broadcasts the small dim
tables) but useful insurance for cross-join queries Q10-Q12 once they
complete via the th3index prefilter.
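The prefilter injection described for preprocessForSpark can be sketched as a regex rewrite. This is a minimal illustration, not the code in BerlinMODBench.java: the class `PrefilterRewrite`, the pattern, and the assumption that the first argument's alias is the Trips table carrying trip_h3 are all hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of injecting the H3 prefilter in front of eIntersects. */
final class PrefilterRewrite {
    // Matches eIntersects(<alias>.<col>, <alias>.<col>) with simple identifiers only.
    private static final Pattern E_INTERSECTS = Pattern.compile(
        "eIntersects\\(\\s*(\\w+)\\.(\\w+)\\s*,\\s*(\\w+)\\.(\\w+)\\s*\\)");

    static String inject(String sql, int resolution) {
        // Skip SQL that already carries the prefilter, so the rewrite is idempotent.
        if (sql.contains("everEqH3IndexTh3Index(")) return sql;
        // $1/$2 = trip alias and column, $3/$4 = query-geometry alias and column.
        String replacement =
            "(COALESCE(everEqH3IndexTh3Index(geomToH3Cell($3.$4, " + resolution
            + "), $1.trip_h3), TRUE) AND eIntersects($1.$2, $3.$4))";
        return E_INTERSECTS.matcher(sql).replaceAll(replacement);
    }
}
```

Because COALESCE maps a NULL cell test to TRUE, a row whose geometry cannot be converted to a single cell still reaches eIntersects, so the rewrite never removes a row on its own for such inputs.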
## Dependencies (per ecosystem policy feedback_issued_pr_treat_as_landed.md)
- MobilityDB PRs #807, #866, #893: th3index type implementation. This
branch targets the API surface from those PRs before they merge.
- JMEOS regen (parallel session's feat/regen-against-meos-1.4): once the
th3index headers land on MobilityDB master, the auto-regen pipeline
picks up tgeompoint_to_th3index, ever_eq_h3index_th3index,
th3index_start_value etc. and this branch becomes buildable.
Until those land this branch is review-ready but cannot be CI-built.
The non-th3index changes (bench_mspark.sh broadcast tuning) are
buildable and measurable today.
## Expected payoff
Q4 (1620 trips × 100 query points): cell-membership test rejects ~99%
of pairs before the per-row eIntersects. Estimated 10-50× speedup on
Q4 alone, scaling similarly to Q11/Q12/Q14. Closes the structural gap
to DuckDB. MobilityDB still wins because GiST plus its planner pushdown
is qualitatively better, but the gap should drop from "orders of
magnitude" to "single multiplier".
Refs project_mobilityspark_perf_session_2026_05_10.md,
project_mobilityspark_th3index_port_plan.md.
…/Q10)
Adds everEqTh3IndexTh3Index UDF and three preprocessForSpark rules covering the BerlinMOD trip×trip Cartesian-shape queries:
Q5 nearestApproachDistance(t1.trip, t2.trip) → CASE-wrapped
Q6 eDwithin(t1.trip, t2.trip, dist) → CASE-wrapped
Q10 tDwithin(t1.trip, t2.trip, dist) → CASE-wrapped
The wrapper uses Catalyst's CASE WHEN short-circuit: the cheap everEqTh3IndexTh3Index runs first; only pairs whose th3index sequences overlap in some H3 cell at a common instant trigger the expensive temporal predicate. The Q5 / Q10 IS NOT NULL / whenTrue clauses already filter NULL results, so the wrapping is correctness-preserving.
## Resolution / soundness trade-off
At the default H3 resolution 7 (cell edge ≈ 1.2 km), the prefilter is expected to be sound for the BerlinMOD queries in practice: the distance thresholds (3 m / 10 m) are well below the cell edge, so a pair within the threshold will almost always share a cell at a common instant. The risk is at cell boundaries: pairs whose true minimum distance is small but whose paths never share a cell at a common instant would be excluded. For a more conservative prefilter use -Dberlinmod.bench.th3index.resolution=5 (cell edge ≈ 9 km), at the cost of selectivity.
## Coverage

| Query | Pattern | Prefilter applied |
|-------|---------|-------------------|
| Q2 | eIntersects(trip, region polygon) | yes (degrades to no-op via COALESCE) |
| Q4 | eIntersects(trip, point geom) | yes (sound at any resolution) |
| Q5 | nearestApproachDistance(t1, t2) | yes (this commit) |
| Q6 | eDwithin(t1, t2, 10.0) | yes (this commit) |
| Q10 | tDwithin(t1, t2, 3.0) | yes (this commit) |
| Q11 | trip && stbox(point, instant) AND valueAtTimestamp(...)=geom | bbox only (existing); th3index deferred |
| Q12 | same shape as Q11 | bbox only (existing); th3index deferred |
| Q14 | trip && stbox(region, instant) AND ST_Contains(region, valueAtTimestamp(...)) | bbox only |

Q11/Q12/Q14 use a different pattern (= valueAtTimestamp). Adding a th3index prefilter for those needs a richer rewrite that handles the instant-correlated equality. Defer until upstream exposes a scalar valueAtTimestamp→H3 cell helper, or until polygon→cells (h3_polygon_to_cells) becomes public.
Refs project_mobilityspark_th3index_port_plan.md.
…T/SP-GiST
Moves the th3index spatial prefilter from MobilitySpark-specific
preprocessing into the canonical portable BerlinMOD SQL, so all three
benchmarked platforms (MobilityDB, MobilityDuck, MobilitySpark) execute
identical SQL. Adds GiST + SP-GiST indexes on the PostgreSQL side
(load_mbdb.sql) — the native PG analog of the columnar prefilter.
## Portable SQL changes (berlinmod/*.sql)
- q04.sql: eIntersects(t.trip, p.geom) gains a COALESCE(everEqH3IndexTh3Index
(geomToH3Cell(p.geom, 7), t.trip_h3), TRUE) prefilter.
- q05.sql: WHERE clause gains everEqTh3IndexTh3Index(t1.trip_h3, t2.trip_h3)
as the leading conjunct, in front of the existing
nearestApproachDistance(...) IS NOT NULL guard.
- q06.sql: WHERE clause gains the same everEqTh3IndexTh3Index conjunct.
- q10.sql: WITH clause WHERE gains everEqTh3IndexTh3Index alongside the
existing t2.trip && expandSpace(t1.trip, 3) bbox prefilter.
## Loader changes
- load_mbdb.sql:
- Trips table gains a trip_h3 th3index column.
- INSERT computes trip_h3 from the loaded tgeompoint at H3 resolution 7.
- Two new indexes: the GiST index trips_trip_h3_gist_idx (the th3index
bbox-overlap accelerator for the cell-membership predicate) and the SP-GiST
index trips_trip_spgist_idx (kd-tree complement on trip).
- The CSV scan reads only the first 3 columns, so it works with both the
legacy 3-column trips.csv and the post-PR-A 4-column trips.csv.
- load_mduck.sql:
- Trips gains trip_h3 (recomputed from the loaded tgeompoint).
- Same explicit-column read_csv to ignore any 4th CSV column.
## BerlinMODBench changes
- preprocessForSpark drops the th3index injection rules (now redundant).
- The load-time tgeompointToTh3Index materialisation drops any pre-existing
trip_h3 column in Trips before recomputing, so the resolution is
consistent regardless of the CSV provenance.
## Dependencies
- MobilityDB PRs #807 / #866 / #893 (th3index) — open
- MobilityDB-BerlinMOD #24 (export trip_h3) — open (pairs with this PR)
- JMEOS regen against MEOS-with-th3index — parallel session
- MobilityDuck th3index port — parallel session, per
project_mobilityduck_parity_scope.md
Per ecosystem policy feedback_issued_pr_treat_as_landed.md, this PR
proceeds with downstream work without waiting on upstream merge.
Refs project_berlinmod_th3index_unification.md.
Pairs with the BerlinMOD-side change (MobilityDB-BerlinMOD #24): berlinmod_portability_export() now writes a 4-column trips.csv with trip_h3. setup/generate_data.sh in MobilitySpark previously overrode the trips.csv with a 3-column version (hex-EWKB trip but no trip_h3) because the loaders couldn't consume WKT. Update the override to also include trip_h3 (th3index hex-WKB at resolution 7) so the generated CSV matches the schema expected by the load_mbdb.sql / load_mduck.sql loaders. Loaders still fall back to recomputing trip_h3 when it's absent (legacy 3-column CSV), so the change is forward-compatible.
Expand Th3IndexUDFs from the 10-UDF BerlinMOD-relevant subset to the full public h3 API surface. Every extern function declared in
- meos/include/meos_h3.h: th3index temporal type (66 fns)
- meos/include/h3/h3index.h: static H3Index scalar (10 fns)
- meos/include/h3/h3index_sets.h: h3indexset (Set of cells) (9 fns)
- plus 1 composed helper (geomToH3Cell, single POINT → H3Index)
now has a registered MobilitySpark UDF. Audited via diff between meos_h3.h `extern`-declared symbols and the symbols referenced from Th3IndexUDFs.java; the diff is empty.
## Section-by-section coverage
- Static h3index ops (parse/format/compare/hash): 12 UDFs
- h3indexset ops (gridDisk / gridRing / pathCells / cellToChildren / compactCells / uncompactCells / originToDirectedEdges / cellToVertexes / getIcosahedronFaces): 9 UDFs
- th3index I/O + constructors (text inputs + scalar/array makers): 6 UDFs
- Accessors (start/end/value_n/values/value_at_timestamp): 5 UDFs
- MEOS-level conversions to/from tbigint: 2 UDFs
- Ever/always cell-side comparisons: 8 UDFs
- Ever/always trip×trip comparisons: 4 UDFs
- Temporal teq/tne (3 directions × 2 ops): 6 UDFs
- Inspection (resolution / base cell / valid cell / class III / pentagon): 5 UDFs
- Hierarchy (parent / center child / child pos × variants): 6 UDFs
- Lat/Lng conversion (tgeo↔th3index, cell_to_boundary, geomToH3Cell): 6 UDFs
- Directed edges (neighbor cells / cells_to_directed_edge / valid / origin / destination / boundary): 6 UDFs
- Vertices (cell_to_vertex / vertex_to_latlng / valid_vertex): 3 UDFs
- Grid traversal (grid_distance / cell_to_local_ij / local_ij_to_cell): 3 UDFs
- Metrics (cell_area / edge_length / great_circle_distance): 3 UDFs
Total: 86 registered UDFs.
## Implementation notes
- Common patterns extracted to private helpers: parseTs (Spark Timestamp / String → MEOS OffsetDateTime), tempHex / setHex (Pointer → hex-WKB + free), evCmp (8 ever/always cell-side variants), ttCmp (4 ever/always trip×trip variants), tempUnary (16 unary Temporal* → Temporal* ops). Keeps individual UDF bodies to ~3-5 lines each despite the volume.
- Set return types serialise as hex-WKB STRING via set_as_hexwkb / set_from_hexwkb, the same round-trip pattern as Temporal hex-WKB.
- Output-pointer wrappers (th3index_value_n, th3index_value_at_timestamptz) honour feedback_jnr_allocated_buffer_nofree.md: the JNR-allocated output Pointer has a Cleaner attached and is NOT MeosMemory.free'd.
- Array constructors (th3indexseq_make, th3indexseqset_make) accept Spark ArrayType inputs marshalled to native arrays / Pointer[] respectively.
- th3index_values returns long[] mapped to Spark ArrayType(LongType).
## Why 100%
Per ecosystem policy (project_jmeos14_multiplatform.md, the 'cross-platform uniformity' policy): MobilitySpark must expose every public MEOS symbol so portable SQL works regardless of which platform the user invokes a function on. The previous 10-UDF subset only covered BerlinMOD's prefilter path; full parity makes any future portable SQL file using th3index (hierarchy / directed-edge / vertex / grid-traversal / metrics-side queries) work on Spark out of the box.
## Dependencies (same chain as the rest of PR MobilityDB#9)
- MobilityDB PRs #807 / #866 / #893: th3index in master
- JMEOS regen exposes the 86 referenced symbols
- MobilityDuck th3index port (parallel session)
Source-complete; CI builds will go green once the binding catches up.
…eom→cells API
Consumer side of MobilityDB PR #938 (static-geometry → H3 cell set
public API). Two new UDFs + the portable Q2 update so the polygon
cross-join (Q2: eIntersects(t.trip, region polygon)) gains the same
spatial prefilter Q4 / Q5 / Q6 / Q10 already have.
## What lands
- Th3IndexUDFs.geoToH3IndexSet(geomWkt, resolution) → STRING (hex-WKB
h3indexset). Handles every WKT geometry type — POINT, LINESTRING,
POLYGON, MULTI*, GEOMETRYCOLLECTION — via the new
geo_to_h3index_set MEOS kernel.
- Th3IndexUDFs.everIntersectsH3IndexSetTh3Index(cellSetHex, th3idx)
→ BOOLEAN. Returns TRUE iff the trip's th3index sequence ever
lies in any cell of the candidate set. Wraps the new
ever_eq_anyof_h3indexset_th3index predicate.
- berlinmod/q02.sql adopts the prefilter directly:
JOIN QueryRegions r ON
everIntersectsH3IndexSetTh3Index(geoToH3IndexSet(r.geom, 7),
t.trip_h3)
AND eIntersects(t.trip, r.geom)
Every backend executes the same expression. Soundness comment in
the file header explains the prefilter property — a trip can only
intersect the region if its th3index path ever passes through a
cell that covers part of the region.
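The prefilter property stated above (a trip can only intersect the region if it ever passes through a covering cell) amounts to a set-membership test. The sketch below is a simplified illustration that ignores the temporal dimension and hex-WKB encoding; the class `CellSetPrefilter` is hypothetical and the cell ids in the usage are made-up numbers, not real H3 indexes.

```java
import java.util.Set;

/** Hypothetical sketch of the "ever in any cell of the set" prefilter. */
final class CellSetPrefilter {
    /**
     * Returns true if at least one cell visited by the trip belongs to the
     * set of cells covering the region; false means the pair can be skipped.
     */
    static boolean everInAnyCell(long[] tripCells, Set<Long> regionCells) {
        for (long cell : tripCells) {
            if (regionCells.contains(cell)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Long> region = Set.of(30L, 31L);            // cells covering the region
        long[] passesThrough = {10L, 20L, 30L};         // trip visits cell 30
        long[] staysAway = {10L, 20L};                  // trip never enters the region cells
        System.out.println(everInAnyCell(passesThrough, region));
        System.out.println(everInAnyCell(staysAway, region));
    }
}
```

Only pairs for which the test returns true need the exact eIntersects evaluation, which is what makes the conjunct a prefilter and not a replacement for the predicate.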
## Coverage matrix update
Q2 eIntersects(trip, region polygon) — yes (this commit)
Q4 eIntersects(trip, point geom) — yes (existing)
Q5 nearestApproachDistance(trip, trip) — yes (existing)
Q6 eDwithin(trip, trip, 10.0) — yes (existing)
Q10 tDwithin(trip, trip, 3.0) — yes (existing)
All five BerlinMOD cross-join queries now have the cross-platform
th3index prefilter applied directly in the portable SQL.
## Dependencies — feedback_issued_pr_treat_as_landed.md
Stacks on **MobilityDB PR #938** (static-geom→cells public API),
which itself stacks on the th3index branch (#807 / #866 / #893).
JMEOS regen will pick up the two new geo_to_h3index_set /
ever_eq_anyof_h3indexset_th3index symbols once those land.
Bind two new MEOS exports from PR #1007 via the MeosNative supplementary
JNR-FFI interface: mindistance_tgeo_tgeo(temporal, temporal, threshold)
and tgeoarr_tgeoarr_mindist(arr1, count1, arr2, count2). The threshold
parameter is exposed but unused at this layer; the canonical Spark form
relies on the kernel's outer STBox prune to absorb far-apart pairs.
Add two scalar UDFs in DistanceUDFs:
- minDistanceTgeoGeo(trip, geomWkt): reuses the NAD kernel, since NAD
reduces to spatial-min when one argument has no time dimension
- minDistanceTgeoTgeo(trip1, trip2): calls mindistance_tgeo_tgeo
Both are also registered under the bare name minDistance with the
(tgeo, tgeo) overload as the default, matching the MobilityDB SQL
surface used by the canonical BerlinMOD Q5.
Update berlinmod/q05.sql to use minDistance instead of
nearestApproachDistance. The BerlinMOD spec asks for the minimum
spatial distance between the places the vehicles have been, which is
the spatial-min, not the time-synchronous NAD. Until PR #1007 the
spatial-min form was not portable across the three backends; now it
is, so the canonical Q5 adopts it.
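The difference between the time-synchronous NAD and the spatial-min can be illustrated on sampled positions. This is a simplified sketch under stated assumptions: both trips are sampled at the same instants, only the sample points are compared (MEOS works on continuous trajectories), and the class `MinDistanceSketch` is hypothetical.

```java
/** Hypothetical illustration of nearest-approach distance versus spatial minimum distance. */
final class MinDistanceSketch {
    /** Euclidean distance between two 2-D points given as {x, y}. */
    static double dist(double[] p, double[] q) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }

    /** Time-synchronous: compares positions only at the same sample index (same instant). */
    static double nearestApproach(double[][] a, double[][] b) {
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            best = Math.min(best, dist(a[i], b[i]));
        }
        return best;
    }

    /** Spatial-min: compares every place visited by a with every place visited by b. */
    static double minDistance(double[][] a, double[][] b) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] p : a) {
            for (double[] q : b) {
                best = Math.min(best, dist(p, q));
            }
        }
        return best;
    }
}
```

For a vehicle at (0,0), (1,0), (2,0) and another at (2,0), (3,0), (4,0) over the same three instants, the synchronous distance stays at 2 while the spatial-min is 0, because both visit (2,0) at different times. That is the case the BerlinMOD Q5 wording targets.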
The JMEOS 1.4 regen (built from JMEOS-mindist@a53bd91 against MEOS 1.4 master plus the minDistance bindings) shifted several signatures that MobilitySpark callsites still target. The drift is small:
- temporal_to_tsequence and temporal_to_tsequenceset now take an interpType int (3 = LINEAR) rather than a String
- tspatial_transform_pipeline is the unified replacement for the per-type tpoint_transform_pipeline
- temporal_append_tinstant gained an interpType slot (six parameters total)
- temporal_lower_inc and temporal_upper_inc return bool directly, so the != 0 coercion no longer compiles
- acontains_geo_tpoint was generalised to acontains_geo_tgeo
- the generic temporal_value_at_timestamptz was split per base type, so the BerlinMOD tgeompoint accessor moves to tgeo_value_at_timestamptz
Th3IndexUDFs depends on the th3index / h3index public surface that has not yet landed in MobilityDB master, so its 80+ symbols are absent from this JMEOS regen. Until MobilityDB PR #944 (th3index consolidated) merges and JMEOS regens to pick up the new surface, the class is excluded from compilation via maven-compiler-plugin and its registerAll callsite plus the BerlinMODBench trip_h3 materialisation are gated behind a property. The class file remains in source for the eventual reactivation PR.
JMEOS 1.4 regenerated against MobilityDB master with the th3index consolidated rollup applied (closed PR #944 commits rebased onto current master) now exports the full public h3 surface: 105 th3index_* and 132 h3index_* / h3_* functions, including every JMEOS symbol referenced by Th3IndexUDFs.java. The maven-compiler-plugin exclude that blocked org/mobilitydb/spark/h3/** from compilation is no longer needed and is removed along with its gating comment block. MobilitySparkSession now unconditionally registers the 86 h3 / th3index UDFs alongside the rest of the surface, and BerlinMODBench reverts the property-gated trip_h3 materialisation to the unconditional form that uses Th3IndexUDFs.DEFAULT_RESOLUTION as the default cell resolution. This restores the th3index spatial prefilter on cross-join queries Q4, Q5, Q6, and Q10 which is the source of the major speedup confirmed in the prior BerlinMOD bench session.
JMEOS lowers C signatures with array out-params and pointer-to-array parameters to plain jnr.ffi.Pointer; the C-level int* count / H3Index* values / TSequence** sequences arrays do not survive as Java long[] / OffsetDateTime[] / Pointer[]. Th3IndexUDFs predated this convention in three call sites and used the per-element forms directly. Mirror the allocate-buf-then-fill pattern already used by SeqSetGapsUDFs for tsequenceset_make_gaps: allocate a JNR direct buffer sized for the array, put each element with the right primitive width, then pass the buffer pointer. th3indexseq_make additionally needs explicit OffsetDateTime to TimestampTz micros conversion since the values buffer carries raw int64 microseconds since 2000-01-01 UTC rather than the per-arg Java OffsetDateTime that pg_timestamptz_in returns. th3index_values gained a Pointer count out-param in the regen and returns a Pointer to the H3Index array; mirror the tint_values caller pattern in MoreAccessorUDFs. The geomToH3Cell UDF passed a long literal to tpointinst_make for the synthetic timestamp; the regen wrapper now takes OffsetDateTime per the JMEOS TimestampTz conversion table, so build it explicitly from the MEOS epoch (2000-01-01 UTC).
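The timestamp conversion and the allocate-then-fill step can be sketched as follows. This is an illustration under assumptions: java.nio.ByteBuffer stands in for the JNR direct buffer used in the real code, and the class `MeosTimestamps` and its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper converting OffsetDateTime values to int64 microseconds since 2000-01-01 UTC. */
final class MeosTimestamps {
    static final OffsetDateTime MEOS_EPOCH =
        OffsetDateTime.of(2000, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);

    /** Microseconds elapsed between the 2000-01-01 UTC epoch and t (negative before the epoch). */
    static long toMicros(OffsetDateTime t) {
        return ChronoUnit.MICROS.between(MEOS_EPOCH, t);
    }

    /** Allocates a direct buffer sized for the array and writes one 8-byte value per element. */
    static ByteBuffer pack(OffsetDateTime[] times) {
        ByteBuffer buf = ByteBuffer
            .allocateDirect(times.length * Long.BYTES)
            .order(ByteOrder.nativeOrder()); // native code reads the values in host byte order
        for (int i = 0; i < times.length; i++) {
            buf.putLong(i * Long.BYTES, toMicros(times[i]));
        }
        return buf;
    }
}
```

The conversion is offset-aware, so 2000-01-01T01:00+01:00 and 2000-01-01T00:00Z both map to 0; the element width (8 bytes) is what "the right primitive width" refers to for int64 values.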
The trip_h3 column was added with CREATE OR REPLACE TEMPORARY VIEW Trips AS SELECT ... FROM Trips, which names the view in its own definition and aborts under Spark's checkCyclicViewReference with RECURSIVE_VIEW before any query runs. Read the current Trips into a Dataset first so the source plan is resolved, then register the materialised result as the Trips view; the th3index prefilter then sees the trip_h3 column and Q5/Q6/Q10 run.
The CI installs JMEOS from the committed libs/JMEOS-1.4.jar, which was the pre-regen build without the th3index bindings, so Th3IndexUDFs and the tspatial_transform call failed to compile in CI even though they build locally against the regenerated jar. Replace the vendored jar with the JMEOS regenerated against MobilityDB master plus the th3index rollup so the th3index and spatial-transform symbols are present.
tnumber_trend is a linear-regression slope and MEOS requires linear interpolation for it. A tint is always step-interpolated because MEOS forbids linear integer temporals, so the trend is undefined and the UDF correctly returns null. The test asserted non-null, which contradicted the kernel semantics and failed once the regenerated MEOS enforced the interpolation precondition. Assert the correct null result instead.
The CI test step loads the committed lib/libmeos.so via LD_LIBRARY_PATH; it was the pre-th3index build, so the reactivated Th3IndexUDFs and the minDistance and spatial-transform calls resolved no symbol and the th3index unit tests failed at runtime. Replace it with the libmeos.so built from the same MobilityDB tree the vendored JMEOS jar was regenerated against (minDistance plus th3index plus the densify SRID guard plus the JVM-safe error handler).
The previously vendored regenerated JMEOS jar declared the entire MEOS native surface in a single JNR-FFI interface whose JDK dynamic-proxy clinit exceeds the JVM 64 KB method bytecode limit once the th3index functions are present, raising MethodTooLargeException on the Java 21 CI runner. The regenerated jar now partitions that surface into eight public-static MeosLibraryPart sub-interfaces, each loaded independently against the same libmeos, keeping every proxy clinit well under the limit. The public functions wrappers keep their exact signatures so MobilitySpark compiles and behaves identically; the unit suite stays at 907 passing.
The vendored libmeos.so carrying the th3index surface has a runtime NEEDED dependency on libh3.so.1, but the workflow only installed the json-c, geos, proj and gsl runtimes, so every test that initialises MEOS failed with UnsatisfiedLinkError on libh3.so.1. Adding libh3-1 to the apt install step lets the H3 cell index symbols resolve on the runner.
…ites
The th3index libmeos occasionally writes diagnostics straight to the native stderr file descriptor while MEOS initialises. Surefire's default channel multiplexes its encoded fork protocol over the same stdout/stderr stream, so a stray native write corrupts the protocol and the fork is reported as terminated without saying goodbye even though every test in it ran. Switching the fork node to the dedicated process-pipe factory and redirecting test output to per-class report files isolates the IPC channel from native file-descriptor writes; the suite is unchanged at 907 passing.
The ConstructorUDFsExtTest surefire fork dies natively on the CI runner without surfacing the cause; the run log only says to refer to dump files that are never printed or uploaded. Add a failure-only step that cats any hs_err_pid log, the surefire dump and dumpstream files, the ConstructorUDFsExtTest report, and the runner locale plus the libmeos shared-library resolution so the next run shows the native frame.
The test @BeforeAll initialised MEOS directly without the no-exit error handler, so the MEOS default handler stayed active in the surefire fork. On the CI runner's json-c build an MFJSON round-trip in the constructor path raises a MEOS error; the default handler calls exit(), which tears down the fork with no JVM crash dump and surfaces only as the forked VM terminating without saying goodbye. Mirror what MobilitySparkSession.create() already does so the error becomes a recoverable null instead of killing the fork.
temporal_as_mfjson on a tgeogpoint looks up the SRID 4326 CRS through libproj, which needs the PROJ CRS database proj.db. The CI apt step installed libproj25 but not the data package, so on the runner the geography MFJSON path failed with 'Cannot find proj.db' while the planar tgeompoint path (no CRS lookup) succeeded. Add proj-data so the runtime matches a normal MEOS deployment.
proj-data is installed yet temporal_as_mfjson on a tgeogpoint still returns null on the runner. Print the MEOS errno right after the call and dump the proj.db resolution in the failure diagnostics so the next run shows the precise error code and PROJ data path instead of another hypothesis-driven cycle.
The real cause of the ConstructorUDFsExtTest failure was tgeogpoint_in returning null with MEOS_ERR_INVALID_ARG, not an MFJSON or proj.db problem. libmeos resolves SRID metadata from its built-in default path /usr/local/share/spatial_ref_sys.csv to recognise SRID 4326 as geodetic; the vendored .so ships without that data file so geography parsing fails on the runner while the planar path is unaffected. Fetch the canonical table to the default path as a runtime data dependency, the same pattern as the libh3 and proj-data steps, with no vendored blob. Drop the temporary errno instrumentation and trim the failure diagnostics step to a general crash dump.
The th3index family lives in its own package and is included by default; passing -DH3=OFF drops its package from compilation, mirroring the MEOS/MobilityDB CMake option. MobilitySparkSession registers the family reflectively, so when the package is excluded its absent registrar class is skipped with zero residue, the Java analogue of the C #if H3 guard. Disabling H3 also drops the BerlinMOD demo and examples that materialise the th3index trip column. The CI workflow builds the excluded variant and asserts the th3index package produces no classes.
This PR stacks on the th3index JMEOS-regen branch (reactivate-th3index-udfs); the diff reduces to the flag change once that lane lands on main.
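The reflective registration described for the H3 flag can be sketched as below. This is a minimal illustration, not the code in MobilitySparkSession: the class `OptionalRegistrar` is hypothetical, and it invokes a no-argument static method to stay self-contained, whereas the real registrar takes the Spark session.

```java
import java.lang.reflect.Method;

/** Hypothetical sketch of registering an optional family only when its class was compiled in. */
final class OptionalRegistrar {
    /**
     * Invokes the public static no-arg method methodName on className if the class is
     * on the classpath. Returns true if it ran, false if the class is absent.
     */
    static boolean registerIfPresent(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            Method m = cls.getMethod(methodName);
            m.invoke(null); // static method, so no receiver
            return true;
        } catch (ClassNotFoundException e) {
            // Package excluded at build time (e.g. -DH3=OFF): nothing to register.
            return false;
        } catch (ReflectiveOperationException e) {
            // Class exists but the registrar is missing or failed: surface it loudly.
            throw new IllegalStateException("registrar present but unusable: " + className, e);
        }
    }
}
```

Because the caller refers to the registrar only by its name as a string, the core session class keeps compiling when the package is excluded, which is the property the commit compares to the C #if H3 guard.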