Rename everIntersectsH3IndexSetTh3Index to everEqH3IndexSetTh3Index#21
Open
estebanzimanyi wants to merge 147 commits into
Open
Rename everIntersectsH3IndexSetTh3Index to everEqH3IndexSetTh3Index#21estebanzimanyi wants to merge 147 commits into
estebanzimanyi wants to merge 147 commits into
Conversation
…trators for UDT and UDF.
Period implementation
Satria/poc
… UDTs using Meos Datatypes.
Meos datatype
Timestampset implementation
…dex libmeos The uplifted branch vendors the th3index libmeos.so, which NEEDs libh3.so.1 and resolves geodetic SRID metadata from /usr/local/share/spatial_ref_sys.csv. Add libh3-1 and proj-data to the runtime deps and fetch the canonical spatial_ref_sys table (runtime data dependency, not a vendored blob), so the full suite runs green in CI. Mirrors the proven fork reactivation CI setup.
…ys network) npoint_make / nsegment_make call ensure_route_exists, which reads the ways network cache absent in the network-less CI/unit environment, so sample construction returns null there (it passed locally only because a ways table happened to be loaded). The npoint*/nsegment* comparison UDFs stay registered and parse-and-compare network-valid inputs correctly (parsing does not validate the route); only constructing a fresh sample needs the network. The eight network-free types exercise the identical code paths.
parity-audit.py classifies each MobilityDB CREATE FUNCTION and CREATE AGGREGATE name into one coverage tier (exact, overload/type-dispatch heuristic, RFC #920 bare-name, substring, missing), counts aggregates and comparison/ordering operators, and writes docs/parity-status.md with the per-tier headline and per-section table. docs/parity-100.md states the two parity axes and the known gaps (cbuffer_hash determinism, rgeo v_clip via MobilityDB #963, the windowed and set/span union aggregates, and per-type verification of the bare-name operators).
Register setUnion (set_union_transfn + set_union_finalfn, returning a set), spanUnion (span_union_transfn + spanset_union_finalfn, returning a span set) and merge (temporal_merge_transfn + temporal_tagg_finalfn, returning a temporal), with set / span-set hex-WKB output serializers and AggregateUDAFs test coverage.
A windowed temporal aggregate over a sliding window. Each row encodes "temporalHex|intervalText" since Spark UDAFs are single-input and the window interval is constant across the group; the WindowedFn base parses both, applies the type-specific *_w*_transfn, and finalizes via temporal_tagg_finalfn (tnumber_tavg_finalfn for wAvg). Registers wIntMax/wFloatMax, wIntMin/wFloatMin, wIntSum/wFloatSum and wAvg, with AggregateUDAFs test coverage.
…to integration/family-flags-base # Conflicts: # .github/workflows/maven.yml
…ing families The cbuffer, npoint, pose and rgeo sequence/sequence-set constructors call temporal_to_tsequence and temporal_to_tsequenceset with the LINEAR interpolation expressed as the integer interpType constant (3) from meos.h, matching the regenerated JMEOS 1.4 signature used across the temporal surface.
…ily flags Each optional extended temporal-type family (cbuffer, npoint, pose, rgeo, h3) is included or excluded at build time through a Maven flag whose name mirrors the MEOS/MobilityDB CMake options: -DCBUFFER=OFF, -DNPOINT=OFF, -DPOSE=OFF, -DRGEO=OFF and -DH3=OFF drop that family's package from compilation. RGEO depends on POSE so disabling POSE also drops rgeo, and disabling H3 also drops the BerlinMOD demo and examples that materialise the th3index trip column. MobilitySparkSession registers the families reflectively, so an excluded package's absent registrar class is skipped with zero residue while the remaining families stay intact. The CI workflow builds the fully excluded variant and asserts the dropped packages produce no classes.
…acc/all # Conflicts: # src/test/java/org/mobilitydb/spark/temporal/ConstructorUDFsExtTest.java
…/all # Conflicts: # .github/workflows/maven.yml # src/main/java/org/mobilitydb/spark/MobilitySparkSession.java # src/test/java/org/mobilitydb/spark/temporal/MathUDFsExtTest.java
… into acc/all # Conflicts: # src/main/java/org/mobilitydb/spark/cbuffer/CbufferUDFs.java
… swap the unified JMEOS jar
…mpiles on GeneratedFunctions
cbuffer_cmp is not a stable total order under standalone MEOS: the embedded geometry carries uninitialized padding (the same reason cbuffer_hash and cbuffer_hash_extended are unbound), so a pairwise gt/cmp scalar invariant on two values is non-deterministic across calls. MobilityDB exercises cbuffer ordering through 151_cbufferset_tbl as numValues(set(array_agg(DISTINCT cb ORDER BY cb))) — an order-independent distinct count where cbuffer_cmp only feeds the aggregate's ORDER BY. The Spark test mirrors that form: it dedups and orders the values (TreeSet) and asserts the count of distinct cbuffers, which is deterministic because equal cbuffers serialize to identical hex-WKB.
The MobilitySpark UDF layer resolves the MEOS native surface through the unified functions.GeneratedFunctions facade (the MEOS-API/meos-idl.json codegen output shared with MobilityFlink), bundled as libs/JMEOS-1.4.jar, and the bundled lib/libmeos.so the CI copies into place carries the matching surface: th3index, the mul_ temporal multiplication naming, the always-covers spatial relations, the JVM-safe noexit error handler, and reentrant GEOS. Owned char* returns are freed in the facade, so String-returning UDFs leave no native allocation behind. The per-worktree isolated Maven repository is dropped from tracking and ignored.
The native-leak tests sample VmRSS after System.gc(), which never returns glibc's freed malloc arenas to the OS, so a UDF that allocates and frees a large transient buffer (the merged trajectory geometry in trajectory()) leaves the freed chunks resident in the arena. That retention is glibc-version dependent, making the RSS proxy unstable across environments. forceGc() now calls malloc_trim(0) through a minimal libc binding before sampling, so VmRSS reflects genuinely-retained native memory; a real leak survives the trim and still trips the assert. glibc-only — the binding is null and skipped elsewhere.
minDistance(tgeompoint[], tgeompoint[]) is registered under the SQL name minDistance, backed by GeneratedFunctions.mindistance_tgeoarr_tgeoarr. The UDF marshals each array of hex trips into a native Temporal*[] and keeps the buffers strongly reachable across the native call with reachabilityFence, so the kernel never reads memory the JVM reclaimed mid-call. BerlinMOD Q5 uses the set-set form minDistance(array_agg(trips), array_agg(trips)) over the licence pairs -- identical SQL across MobilityDB, MobilityDuck and MobilitySpark, the N-by-N resolved inside the aggregate by the STBox prune -- so the Spark-specific q05_spark variant is removed. The bundled jar and libmeos carry the mindistance_tgeoarr_tgeoarr name and reentrant GEOS.
queries.sql is the one `-- @query`-delimited source that all three runners (PostgreSQL, DuckDB, MobilitySpark) split, so the benchmark SQL cannot drift between platforms; the Spark runner applies preprocessForSpark as a dialect transform on each section. The bench measures each query with the noop sink, which forces full materialisation of projection-only expressions such as the set-set minDistance(tgeompoint[], tgeompoint[]) in Q5. Query geometries parse with the configurable GEOM_SRID (-Dberlinmod.srid) so they match the SRID the trips carry, avoiding mixed-SRID spatial operators in Q11/Q12/Q15. The trip_h3 column materialises through the registered tgeompointToTh3Index UDF under the berlinmod.bench.th3index.enable flag.
Spark registers a name as exclusively scalar or aggregate, so a name backing both fails to load. The bare merge stays the scalar form (AccessorUDFs) and the column aggregate is mergeAgg, tracking the upstream Agg-suffix rename of the aggregate forms.
…e type The canonical overloaded aggregates tmin/tmax (renamed tminAgg/tmaxAgg by MobilityDB #828) and tsum span tint, tfloat and ttext signatures. Spark registers a UDAF name as exclusively one aggregate and cannot overload by signature, so each becomes a single UDAF that dispatches on the base type returned by temporal_basetype_name (#1139) — int4, float8 or text — to the matching per-type transfn. This replaces the invented per-type tIntMin / tFloatMin / tTextMin … surface with the canonical names; tsum keeps the bare name since no box accessor collides with it. The vendored libmeos.so and JMEOS jar export temporal_basetype_name.
VmRSS-based leak detection bounds native-heap growth, but glibc 2.39 on the ubuntu-noble runner retains ~15-21 MB of freed malloc arena across the 5 000-call probe even after malloc_trim(0), so a 10 MB bound flags freed-but-unreturned memory rather than a leak. The 50 MB bound clears that arena floor while still catching real Temporal* leaks (≥100 KB/call → ≥500 MB) with a 10x margin.
The UDF wraps ever_eq_h3indexset_th3index; the name carries the correct verb (eq not intersects), matching the C symbol convention. The portable queries.sql is updated in the same commit so the SQL and the registered name remain consistent.
Rebuilds JMEOS from the current canonical pin (PROJ per-thread context fix, uniform result out-param naming) so GeneratedFunctions matches the meos.h surface consumed by the Spark UDF callers.
The regenerated GeneratedFunctions from pin 67fcb0e63c changes two groups of public wrapper signatures: - tXXX_value_at_timestamptz(Pointer,OffsetDateTime,boolean): the result out-param is now managed internally; the wrapper returns Pointer (null when not found). Callers in MoreAccessorUDFs and BerlinMODUDFs drop the explicit allocation and boolean-found check, reading the value directly from the returned pointer. - XXXset_values(Pointer,Pointer count): a count out-param is added to match the MEOS C API; callers in SpanAccessorUDFs pass null (count is obtained separately via set_num_values). - h3index_in/h3index_out dropped from the public surface; replaced by h3index_parse (returns long) and h3index_to_string (returns String).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The UDF wraps
ever_eq_h3indexset_th3index; the nameeverEqH3IndexSetTh3Indexcarries the correct verb (eq, not intersects), matching the C symbol under the
mechanical camelCase rule for the DuckDB/Spark surface.
berlinmod/queries.sqlis updated in the same commit so the SQL and the registered name are consistent.
Stacks on #18 (green); the single new commit is the rename.