Generate the Spark Connect registrar from the canonical named surface (stacks on #25) #26

Open
estebanzimanyi wants to merge 14 commits into MobilityDB:main from estebanzimanyi:feat/named-surface-codegen

@estebanzimanyi
Member

Extract the canonical named-operation surface from the MobilityDB SQL CREATE FUNCTION catalog and the doxygen @sqlfn/@csqlfn chain, derive the canonical-name to Spark-impl mapping from the MobilitySpark UDF bodies, and emit MobilitySparkConnectExtensionsGen.scala: a SparkSessionExtensions that injects each canonical function under its identity name with no camelCase remap and no hand-written table. Single-impl functions bind to the one impl; multi-impl names whose first argument differs in MEOS type dispatch per row on the arg0 WKB type tag (meos_typeof_hexwkb) to the type-matching impl, with the Temporal-receiver impl as the catch-all default. The facade targets Java 17 so it runs on the Spark 3.5 runtime. The registrar serves the OGC reads (asMFJSON, stbox, the Xmin/Ymin/Xmax/Ymax/Tmin/Tmax accessors, numSequences, sequenceN, trajectory) under identity names over Spark Connect.

… surface

Bump codegen/input/meos-idl.json to the MEOS-API IDL and regenerate
functions.GeneratedFunctions over the full consolidated superset: mul_* (incl.
tbigint); minDistance; the circular-buffer and network-point MF-JSON readers; the
ever- and always-covers families (ecovers_*/acovers_*); trgeo_*; the H3 /
th3index family (ever_eq_h3indexset_th3index, h3index_in/out, H3Index lowered to
long); PostgreSQL type I/O; tgeogpoint_great_circle_distance;
meos_initialize_noexit_error_handler. 2916 functions.
…ld flags

The functions.GeneratedFunctions facade is generated at build time from the MEOS
IDL with the optional type families selected by the same flag names and ON|OFF
(also 1|0) values as the MobilityDB/MEOS build: -DCBUFFER, -DNPOINT, -DPOSE,
-DRGEO, -DH3. Every family is included by default; passing -DCBUFFER=OFF (or =0)
drops that family's functions from the generated binding so a subset jar ships
without it (RGEO needs POSE). FunctionsGenerator maps each function's source
header to its family and omits excluded families; jmeos-core runs the generator
at generate-sources (so the flag flows through mvn) and compiles the generated
functions.GeneratedFunctions.
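
The flag handling described above can be sketched as follows. This is a minimal illustration, not the FunctionsGenerator code: the helper names, the header-to-family table, and the choice to reject RGEO without POSE (rather than silently dropping RGEO) are all assumptions.

```python
# Hypothetical sketch of family selection; none of these names come from the PR.
FAMILIES = ("CBUFFER", "NPOINT", "POSE", "RGEO", "H3")
_ON = {"ON", "1"}
_OFF = {"OFF", "0"}


def resolve_families(flags):
    """Return the set of enabled families given {flag name: value} overrides."""
    enabled = set(FAMILIES)  # every family is included by default
    for name, value in flags.items():
        if name not in FAMILIES:
            raise ValueError(f"unknown family flag: {name}")
        text = str(value).upper()
        if text in _OFF:
            enabled.discard(name)
        elif text not in _ON:
            raise ValueError(f"{name} expects ON|OFF or 1|0, got {value!r}")
    # RGEO needs POSE. Assumption: an inconsistent selection is an error.
    if "RGEO" in enabled and "POSE" not in enabled:
        raise ValueError("RGEO requires POSE")
    return enabled


def filter_functions(functions, header_family, enabled):
    """Keep functions whose source header maps to no family or to an enabled one."""
    kept = []
    for fn in functions:
        family = header_family.get(fn["header"])  # None means a core header
        if family is None or family in enabled:
            kept.append(fn["name"])
    return kept


# Illustrative data: only meos_rgeo.h is named in this PR; the rest is made up.
HEADER_FAMILY = {"meos_rgeo.h": "RGEO", "example_cbuffer.h": "CBUFFER"}
FUNCTIONS = [
    {"name": "core_fn", "header": "example_core.h"},
    {"name": "cbuffer_fn", "header": "example_cbuffer.h"},
    {"name": "rgeo_fn", "header": "meos_rgeo.h"},
]
subset = filter_functions(FUNCTIONS, HEADER_FAMILY, resolve_families({"CBUFFER": "OFF"}))
```

Here `subset` keeps the core and RGEO functions and omits the circular-buffer one, which is the shape of the subset jar the commit describes.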
…try C API

MobilityDB #1137 renamed the public rigid-geometry C API from trgeo to
trgeometry. The MEOS IDL the facade is generated from adopts the new names
(verified 1:1 against the master meos_rgeo.h: 67 trgeo->trgeometry renames; the
trgeoinst_make instant constructor is unchanged, matching master), so the
generated functions.GeneratedFunctions and the bundled jar resolve against a
post-#1137 libmeos.

Bumps codegen/input/meos-idl.json to the public+bound MEOS surface of the
ecosystem pin: the set-set spatial-join family (edwithin/tdwithin/adisjoint
_tgeoarr_tgeoarr), the mindistance_tgeoarr_tgeoarr rename, the trgeometry
analytics (frechet/hausdorff/dyntimewarp/centroid/length/speed), tpose and
tnpoint value accessors, tcbuffer traversed-area, and the aggregate combine
functions. 3031 bound functions (was 2916).

Hoists the tier-aware MeosOps* facade (62 classes) into JMEOS so every
JVM binding inherits the one canonical Java idiom from the shared jar
instead of duplicating it per engine. The facade forwards to
functions.GeneratedFunctions under a package-private MeosOpsRuntime probe
gated by the canonical -Dmeos.enabled property; javadoc is engine-neutral.
Relocates the maintained generator (regen_facade_from_jar + the gap / sql /
tbigint / h3 emitters + parity_audit + meos-ref) under jmeos-core/tools so
the facade stays regenerated, not hand-edited; regeneration is idempotent
against the pin jar.

MeosSetSetJoin exposes the MEOS *_tgeoarr_tgeoarr family as eDwithinPairs /
tDwithinPairs / aDisjointPairs over two arrays of temporal-geometry handles:
it marshals the native pointer arrays the kernel prunes in C, keeps them
reachable across the call with reachabilityFence, and reads back the
flattened 0-based index pairs (and, for tDwithin, the per-pair tstzspanset of
in-range times). Both JVM engines call it from the shared org.mobilitydb.meos
layer, so the NxN spatial-join surface derives once. Verified against
libmeos.

The IDL and bundled libmeos carry the 54a9d4bc54 public surface: the per-thread
PROJ context, the box3d_in/gbox_in parsers, and tpose_to_tpoint. The parity-gap
forwarders bind the value-at-timestamptz wrappers through their result-returning
form and drop the pointcloud initializer absent from the surface.

Compile jmeos-core for Java 17 and rewrite the type-pattern switches in STBox and
the time types as instanceof-pattern if/else chains (instanceof patterns are
available in Java 17, whereas type-pattern switches are only a preview feature
there). The facade bytecode then loads on the Spark Connect server's Java 17
runtime (a JRE that Spark 3.5 supports) and still runs on later runtimes.

Add extract_named_surface.py, which produces meos-named-surface.json from the two
canonical sources already in the MobilityDB tree: the SQL CREATE FUNCTION catalog
(named functions, overloads, per-argument DEFAULTs -> valid call arities) and the
doxygen chain (@sqlfn on the PG wrapper, @csqlfn on the MEOS function) linking
each SQL name to its PG and MEOS C functions. This is the layer above the C-FFI
IDL from which a binding's named surface and its Spark Connect registrar are
generated, rather than hand-maintained. 1284 named functions, asMFJSON resolving
to temporal_as_mfjson with minArity 1 / maxArity 4.
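
To make the "DEFAULTs -> valid call arities" step concrete, here is a minimal sketch. It is not extract_named_surface.py: the function names are hypothetical, the argument list in the example is an assumed shape chosen to reproduce the minArity 1 / maxArity 4 figure quoted above, and the splitter ignores commas inside quoted literals.

```python
import re

# Matches a trailing "DEFAULT <literal>" on one argument declaration.
_DEFAULT = re.compile(r"\s+DEFAULT\s+(.+)$", re.IGNORECASE)


def split_args(arg_list):
    """Split an argument list on top-level commas (parentheses may nest)."""
    parts, depth, current = [], 0, []
    for ch in arg_list:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def call_arities(arg_list):
    """Return (min_arity, max_arity, defaults) for one overload.

    defaults[i] is the DEFAULT literal of argument i, or None if it is required.
    """
    defaults = []
    for arg in split_args(arg_list):
        match = _DEFAULT.search(arg)
        defaults.append(match.group(1).strip() if match else None)
    min_arity = sum(1 for d in defaults if d is None)
    return min_arity, len(defaults), defaults


# Assumed argument list, for illustration only.
ASMFJSON_ARGS = ("temporal, options int4 DEFAULT 0, flags int4 DEFAULT 0, "
                 "maxdecimaldigits int4 DEFAULT 15")
arity = call_arities(ASMFJSON_ARGS)
```

Every arity from the minimum to the maximum is a valid call, because PostgreSQL requires all arguments after the first defaulted one to carry defaults too.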
…tter

extract_spark_impls.py scans the MobilitySpark UDFs (register name + field +
body GeneratedFunctions call) and joins on the named surface's SQL->MEOS C
linkage to recover canonical name -> Spark impl mechanically, so the emitter
needs no hand-written remap. The join classifies each function for emission:
single-impl (identity name over one impl), multi-impl (identity name with a
WKB-type-tag dispatch builder), and join gaps to close.
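
The join and its three-way classification might look like the sketch below. The data shapes and the `classify` helper are assumptions made for illustration; only the asMFJSON -> temporal_as_mfjson and speed -> tpoint_speed links are taken from this PR, and the remaining names are invented.

```python
from collections import defaultdict


def classify(named_surface, spark_impls):
    """Join canonical SQL names to Spark UDFs through their MEOS C functions.

    named_surface: {sql_name: set of MEOS C function names}
    spark_impls:   list of {"udf": registered name, "meos_fn": C function called}
    Returns (single, multi, gaps).
    """
    by_meos = defaultdict(list)
    for impl in spark_impls:
        by_meos[impl["meos_fn"]].append(impl["udf"])

    single, multi, gaps = {}, {}, []
    for name in sorted(named_surface):
        udfs = sorted({u for fn in named_surface[name] for u in by_meos.get(fn, [])})
        if not udfs:
            gaps.append(name)        # join gap to close
        elif len(udfs) == 1:
            single[name] = udfs[0]   # identity name over one impl
        else:
            multi[name] = udfs       # needs a type-tag dispatch builder
    return single, multi, gaps


# Illustrative inputs; the "example_*" and "unboundExample" names are hypothetical.
SURFACE = {
    "asMFJSON": {"temporal_as_mfjson"},
    "speed": {"tpoint_speed"},
    "atTime": {"example_at_span", "example_at_set"},
    "unboundExample": {"example_missing"},
}
IMPLS = [
    {"udf": "temporalAsMfjson", "meos_fn": "temporal_as_mfjson"},
    {"udf": "speed", "meos_fn": "tpoint_speed"},
    {"udf": "atTimeSpan", "meos_fn": "example_at_span"},
    {"udf": "atTimeSet", "meos_fn": "example_at_set"},
]
single, multi, gaps = classify(SURFACE, IMPLS)
```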
…face

generate_spark_registrar.py joins the canonical named surface with the Spark impl
scan and emits MobilitySparkConnectExtensionsGen.scala: a SparkSessionExtensions
that injects each canonical function under its identity name (asMFJSON, not
temporalAsMfjson), no hand-written remap. Shipped ScalaUDF closures live in a
companion object so they capture only the serializable UDF; the builder null-pads
the impl's optional args to the call-site arity. The 81 single-impl functions are
generated, compiled, and served live over Spark Connect under their identity names;
the 139 multi-impl names are listed for the per-row meos_typeof_hexwkb dispatch.
…trar

A multi-impl canonical name (one SQL name over several type-specific Spark impls)
whose first argument differs in MEOS type is emitted as a single ScalaUDF that
peeks meostype_name(meos_typeof_hexwkb(arg0)) per row and routes to the impl whose
receiver type matches, with the Temporal-receiver impl as the catch-all default.
The receiver category is read from the first C-parameter type of the impl's primary
MEOS function (the last non-marshaling GeneratedFunctions call in the UDF body).
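
The per-row routing can be pictured with this small sketch. The emitted registrar is Scala, so this is only a model of the logic: `type_name_of` stands in for meostype_name(meos_typeof_hexwkb(arg0)), and the fake tagger, the route table, and the impl functions are hypothetical.

```python
def make_dispatcher(routes, default_impl, type_name_of):
    """Build one callable that routes each row on the type tag of its first argument.

    routes:       {MEOS type name: impl}
    default_impl: the Temporal-receiver impl, used when no route matches
    type_name_of: callable returning the MEOS type name of a hex-WKB value
    """
    def dispatch(*args):
        tag = type_name_of(args[0])
        impl = routes.get(tag, default_impl)  # catch-all default
        return impl(*args)
    return dispatch


# Stand-in for the real type peek: here the tag is just a prefix of the string.
def fake_type_name(hexwkb):
    return hexwkb.split(":", 1)[0]


def stbox_from_span(value):       # hypothetical span-receiver impl
    return "span-impl"


def stbox_from_temporal(value):   # hypothetical Temporal-receiver impl
    return "temporal-impl"


stbox = make_dispatcher({"tstzspan": stbox_from_span}, stbox_from_temporal, fake_type_name)
```

A value tagged `tstzspan` goes to the span impl, while any tag without a route, such as a temporal point, falls through to the Temporal impl.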

The registrar serves the /items-collection OGC function set under identity names:
asMFJSON, stbox, the Xmin/Ymin/Xmax/Ymax/Tmin/Tmax accessors, numSequences,
sequenceN, trajectory. Functions that differ only on a later argument (atTime on
its time argument) are listed for the arg-N dispatch extension.
…th the SQL default

Several MobilitySpark UDFs register one MEOS operation under both a bare name and a
camelCase name (asText/tpointAsText, getTime/time, cumulativeLength/...). When a
canonical name's impls all share one primary MEOS function, bind the identity name
to a single impl rather than treating it as a type dispatch.
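
A minimal sketch of that alias check, assuming each impl is summarised as a (UDF name, primary MEOS function) pair. The helper name, the choice of the first alias as the bound impl, and the `example_*` function names are hypothetical.

```python
def bind_aliases(impls):
    """Decide how to bind one canonical name.

    impls: list of (udf_name, primary_meos_fn) pairs.
    Returns ("single", udf_name) when every impl wraps the same MEOS function,
    otherwise ("dispatch", [udf_name, ...]).
    """
    primaries = {primary for _, primary in impls}
    if len(primaries) == 1:
        # Aliases of one operation. Assumption: any of them will do, take the first.
        return ("single", impls[0][0])
    return ("dispatch", [udf for udf, _ in impls])


as_text = bind_aliases([("asText", "example_as_text"), ("tpointAsText", "example_as_text")])
at_time = bind_aliases([("atTimeSpan", "example_at_span"), ("atTimeSet", "example_at_set")])
```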

Capture each optional argument's SQL DEFAULT literal in the named surface and fill
an omitted optional argument with it, but only when the impl exposes a full
overload's worth of arguments (impl arity equals that overload's maxArity, so the
positions align); otherwise null-pad and let the impl's own default hold. This
serves asText/asEWKT (maxdecimaldigits default 15) while keeping asMFJSON at full
coordinate precision (its impl exposes a non-leading argument subset).
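
The fill rule can be sketched as below. The function and parameter names are hypothetical, and the arity of 3 used for the asMFJSON impl is an assumed figure standing in for "a non-leading argument subset"; the maxdecimaldigits default of 15 is the one quoted above.

```python
def fill_args(call_args, impl_arity, overload_max_arity, sql_defaults):
    """Extend call_args to impl_arity, using SQL defaults only when positions align.

    sql_defaults: one entry per overload argument; None marks a required argument.
    """
    args = list(call_args)
    if len(args) > impl_arity:
        raise ValueError("more arguments than the impl accepts")
    # Positions line up only if the impl exposes the overload's full argument list.
    aligned = impl_arity == overload_max_arity
    for pos in range(len(args), impl_arity):
        if aligned and sql_defaults[pos] is not None:
            args.append(sql_defaults[pos])  # canonical SQL default
        else:
            args.append(None)               # null pad: the impl's own default holds
    return args


# asText-like case: impl arity equals maxArity, so the default is filled in.
as_text_args = fill_args(["T"], 2, 2, [None, 15])
# asMFJSON-like case: impl arity (assumed 3) differs from maxArity 4, so null pad.
as_mfjson_args = fill_args(["T"], 3, 4, [None, 0, 0, 15])
```

Testing with `is not None` rather than truthiness matters here, since a default of `0` or `FALSE` is still a real default.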
Generalize the multi-impl dispatch from arg0 to the first argument position at
which the type-specific impls differ, peeking that argument's MEOS type tag: atTime
routes on its time argument (tstzspan/tstzset/tstzspanset), duration routes on its
span argument. The concrete type for a generic Span/Set receiver is taken from the
MEOS type-name embedded in the impl's primary function name.
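
Finding the dispatch position reduces to a scan over the impls' argument types, as in this sketch. The helper name and the "temporal" receiver label are assumptions; the three time types are the ones listed above for atTime.

```python
def dispatch_position(signatures):
    """Return the first argument index at which the impls' types differ, or None.

    signatures: one tuple of MEOS type names per impl.
    """
    width = min(len(sig) for sig in signatures)
    for pos in range(width):
        if len({sig[pos] for sig in signatures}) > 1:
            return pos
    return None


AT_TIME = [
    ("temporal", "tstzspan"),
    ("temporal", "tstzset"),
    ("temporal", "tstzspanset"),
]
at_time_pos = dispatch_position(AT_TIME)
```

For atTime the receiver is the same in every impl, so the scan lands on index 1, the time argument, and the dispatcher peeks that argument's type tag instead of arg0.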

Each dispatch route carries its impl's arity and SQL-default fills, so an omitted
optional argument of the chosen impl is filled with the canonical default rather
than a null pad (duration on a tstzspanset supplies boundspan=FALSE); boolean SQL
defaults are emitted alongside integer ones.

The named surface is regenerated from the pin, so speed(tgeompoint) resolves
through the dedicated Tpoint_speed wrapper to tpoint_speed and binds to the speed
impl.