From 287fd13993278d846fccf566d6be968dfb8a7acf Mon Sep 17 00:00:00 2001
From: Ze'ev Klapow
Date: Wed, 10 Jun 2026 09:47:05 -0400
Subject: [PATCH] Add ABI, async protocol, and hostgen DSL reference docs

---
 docs/src/SUMMARY.md                    |  4 ++
 docs/src/reference/abi.md              | 73 +++++++++++++++++++++
 docs/src/reference/async-protocol.md   | 26 ++++++++
 docs/src/reference/extension-abi.md    | 84 ++++++++++++++++++++++++
 docs/src/reference/hostgen-dsl.md      | 88 ++++++++++++++++++++++++++
 examples/custom-python-build/README.md | 31 +--------
 examples/rust-host/README.md           |  8 +--
 7 files changed, 278 insertions(+), 36 deletions(-)
 create mode 100644 docs/src/reference/abi.md
 create mode 100644 docs/src/reference/async-protocol.md
 create mode 100644 docs/src/reference/extension-abi.md
 create mode 100644 docs/src/reference/hostgen-dsl.md

diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
index 98871d7..79e47ee 100644
--- a/docs/src/SUMMARY.md
+++ b/docs/src/SUMMARY.md
@@ -20,4 +20,8 @@
 
 # Reference
 
+- [Base ABI Specification](reference/abi.md)
+- [Extension ABI JSON & Lowering](reference/extension-abi.md)
+- [Async Wire Protocol](reference/async-protocol.md)
+- [hostgen DSL Reference](reference/hostgen-dsl.md)
 - [Glossary](reference/glossary.md)
diff --git a/docs/src/reference/abi.md b/docs/src/reference/abi.md
new file mode 100644
index 0000000..5614b1d
--- /dev/null
+++ b/docs/src/reference/abi.md
@@ -0,0 +1,73 @@
+# Base ABI Specification
+
+This page specifies the contract between a host and `boomslang.wasm`: the functions the guest exports and the conventions for calling them. It is the contract `PythonInstance` implements on the Java side, and what a non-Java embedder must implement directly. (Host functions the guest *imports* are covered by the [extension ABI](extension-abi.md).)
+
+Source of truth: `python-host-core/src/export.rs` (guest) and `core/src/main/java/com/hubspot/boomslang/PythonInstance.java` (Java host).
+
+> There is currently no ABI version export; compatibility between a host and a wasm artifact is by construction (build them from the same commit). A version handshake is tracked in [issue #43](https://github.com/HubSpot/boomslang/issues/43).
+
+## Conventions
+
+- The guest exports a single linear memory. All pointers are `i32` offsets into it.
+- **The host owns buffer lifecycles.** Allocate guest buffers with `alloc`, write through the exported memory, pass `(ptr, len)` pairs, and free with `dealloc` after the call. The guest never frees host-allocated buffers, and the guest's internal allocations are not the host's concern.
+- All strings are UTF-8. Passing invalid UTF-8 where a string is expected returns `-1`.
+- Every execution-family export (`compile_source`, `load_bytecode`, `execute`, `execute_function`, `install_module`, `uninstall_module`) **clears the captured stdout/stderr buffers on entry**. Read outputs after each call, before the next one.
+- Error reporting is two-channel: a coarse return code, plus the Python traceback captured in the stderr buffer. Detailed error strings only exist in stderr.
+
+## Exports
+
+| Export | Signature | Semantics |
+| --- | --- | --- |
+| `alloc` | `(size: i32) -> i32` | Allocate `size` bytes in guest memory (mimalloc); returns pointer. |
+| `dealloc` | `(ptr: i32, size: i32)` | Free an `alloc`'d buffer. `size` is currently ignored but pass the allocated size. |
+| `compile_source` | `(source_ptr: i32, source_len: i32, output_ptr: i32, output_max_len: i32) -> i32` | Compile Python source to marshal bytecode, written to the caller-provided output buffer. Returns the bytecode length, `-1` on invalid UTF-8 or compile error (traceback in stderr), `-3` if the bytecode exceeds `output_max_len`. |
+| `load_bytecode` | `(ptr: i32, len: i32) -> i32` | Unmarshal and execute bytecode from `compile_source`. `0` ok; `1` Python error (traceback in stderr). |
+| `execute` | `(script_ptr: i32, script_len: i32) -> i32` | Execute Python source in `__main__`. `0` ok; `1` Python error; `-1` invalid UTF-8. |
+| `execute_function` | `(name_ptr: i32, name_len: i32, args_ptr: i32, args_len: i32) -> i32` | Call a named function from previously loaded code with one string argument (`args_len` 0 → empty string). `0` / `1` / `-1` as above. |
+| `get_stdout_len` / `get_stderr_len` | `() -> i32` | Byte length of the captured stream. |
+| `get_stdout` / `get_stderr` | `(ptr: i32, max_len: i32) -> i32` | Copy up to `max_len` bytes of the captured stream into the caller's buffer; returns bytes written. |
+| `install_module` | `(name_ptr: i32, name_len: i32, source_ptr: i32, source_len: i32) -> i32` | Install a pure-Python module under `name` (dotted names allowed). `0` / `1` / `-1`. |
+| `uninstall_module` | `(name_ptr: i32, name_len: i32) -> i32` | Remove an installed module. `0` / `1` / `-1`. |
+| `reset_state` | `()` | Clear capture buffers and reset the `__main__` namespace. Note: the Java host does not call this — it resets by restoring the copy-on-write memory snapshot, which is stricter. |
+| `get_heap_pages` | `() -> i32` | Current guest memory size in 64 KiB pages. Used by hosts to size snapshots. |
+
+## Imports
+
+A complete embedder must provide, on the same linker/instance:
+
+1. **WASI preview1** — filesystem, clock, random, stdio.
+2. **Extension imports** — the bundled runtime imports `boomslang.call` and `boomslang.log` ([extension ABI](extension-abi.md)); custom builds import whatever their extensions declare.
+
+Instantiation fails on any missing import.
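Reviewer sketch: the return-code conventions from the Exports table can be summarized in a small host-side helper. This is a minimal illustration in Python, not code from the repository; `describe_status` and `describe_compile` are hypothetical names, and the message strings are only illustrative.

```python
def describe_status(rc: int) -> str:
    """Interpret the i32 status of execute, execute_function,
    install_module and uninstall_module (load_bytecode uses only 0 and 1)."""
    if rc == 0:
        return "ok"
    if rc == 1:
        return "python error (traceback in stderr)"
    if rc == -1:
        return "invalid UTF-8 input"
    return f"unexpected return code {rc}"


def describe_compile(rc: int) -> str:
    """Interpret the i32 returned by compile_source."""
    if rc >= 0:
        return f"ok: {rc} bytes of bytecode"
    if rc == -1:
        # Overloaded: invalid UTF-8 or a compile error; stderr tells them apart.
        return "invalid UTF-8 or compile error (check stderr)"
    if rc == -3:
        return "bytecode exceeds output_max_len"
    return f"unexpected return code {rc}"
```

A host would pair either description with the captured stderr, since the return code alone cannot carry the traceback.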
+
+## Call sequences
+
+**Execute a script and read output** (what `PythonInstance.execute` does):
+
+```text
+ptr = alloc(len(script)) # write script bytes at ptr
+rc = execute(ptr, len(script)) # 0 ok, 1 python error, -1 bad utf-8
+dealloc(ptr, len(script))
+n = get_stdout_len()
+buf = alloc(n); get_stdout(buf, n) # read n bytes from memory at buf
+dealloc(buf, n) # same dance for stderr
+```
+
+**Compile once, run many** (`compile` / `loadCode`):
+
+```text
+out = alloc(MAX) # Java uses MAX = 10 MiB
+n = compile_source(src, len, out, MAX) # n = bytecode length, or -1 / -3
+bytecode = memory[out .. out+n]; dealloc(out, MAX)
+...
+ptr = alloc(len(bytecode)) # later, possibly many times
+rc = load_bytecode(ptr, len(bytecode)) # 0 / 1
+```
+
+The bytecode is CPython marshal data — valid only for the exact runtime build that produced it.
+
+## Known sharp edges
+
+- `-1` is overloaded: it means both "invalid UTF-8 input" and "Python-level failure" for `compile_source`. Disambiguate via stderr.
+- There is no structured error channel; hosts surface failures by pairing the return code with the captured stderr.
+- Output larger than the host's configured cap (Java default 10 MB) is rejected host-side, not guest-side.
diff --git a/docs/src/reference/async-protocol.md b/docs/src/reference/async-protocol.md
new file mode 100644
index 0000000..0df6cf5
--- /dev/null
+++ b/docs/src/reference/async-protocol.md
@@ -0,0 +1,26 @@
+# Async Wire Protocol (v1)
+
+`boomslang_host.asyncio` (the Python client) and the host-side `AsyncHostRegistry` talk over a small, versioned protocol invoked through the stock `boomslang_host.call(name, args)` bridge. This page is the wire-level specification; usage is in the [async guide](../guide/async.md).
+
+The `__async_*` names are a **reserved control namespace** — extension host functions may not use them (hostgen validation rejects them).
+
+| Control call | Args | Returns |
+|---|---|---|
+| `__async_protocol__` | — | integer protocol version (currently `1`) |
+| `__async_start__` | `name\npayload` | decimal token for a registered named async handler |
+| `__async_poll__` | timeout ms (`<0` blocks, `0` polls) | one header line per ready completion: `token\t{1\|0}\t` |
+| `__async_result__` | token | base64 of that completion's value bytes (consumes it) |
+| `__async_cancel__` | token | cancels the in-flight future |
+
+Typed async extension functions bypass `__async_start__`: their WASM import returns the `i64` token directly from the shared registry. Polling, result retrieval, and cancellation still flow through the control calls above.
+
+## Design rationale
+
+- **Versioned.** The Python client is frozen into each consumer's WASM Wizer snapshot, so the host must stay compatible with already-shipped clients. `__async_protocol__` lets a client refuse a host older than the protocol it was built against; bump `AsyncHostRegistry.PROTOCOL_VERSION` only for breaking wire changes.
+- **Poll and result are decoupled.** `__async_poll__` returns only headers (token, ok flag, length); values are fetched one at a time via `__async_result__`. A batch of completions therefore never exceeds the single host-call result buffer. (A single value larger than that buffer is still a limitation — chunked retrieval is a future protocol addition.)
+- **Failures never hang.** Synchronous handler errors are recorded via `AsyncHostRegistry.startFailed` and surface as a failed completion (the coroutine raises `HostAsyncError`); the client also rejects any non-positive token immediately.
+- **Binary-safe value channel.** Completion values are carried as base64 of raw bytes, so extending async returns to `bytes` later needs no wire change.
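Reviewer sketch: the framing described above (tab-separated poll headers, base64 result values, `name\npayload` start arguments) might be handled on the client side roughly as follows. This is a hedged illustration in Python, not the actual `boomslang_host/asyncio.py`. The names `CompletionHeader`, `parse_poll_response`, `decode_result`, and `encode_start_args` are hypothetical, and two details are assumptions: that header lines are newline-separated, and that the third header field is the value's byte length as a decimal integer.

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletionHeader:
    token: int
    ok: bool
    length: int  # assumed: decimal byte length of the completion value


def parse_poll_response(text: str) -> list:
    """Parse the header lines returned by `__async_poll__`."""
    headers = []
    for line in text.splitlines():
        if not line:
            continue  # tolerate an empty response or a trailing newline
        token_s, ok_s, length_s = line.split("\t")
        token = int(token_s)
        if token <= 0:
            # The protocol says clients reject non-positive tokens immediately.
            raise ValueError(f"non-positive token: {token}")
        headers.append(CompletionHeader(token, ok_s == "1", int(length_s)))
    return headers


def decode_result(payload: str) -> bytes:
    """Decode the base64 text returned by `__async_result__` into raw bytes."""
    return base64.b64decode(payload)


def encode_start_args(name: str, payload: str) -> str:
    """Build the `name\\npayload` argument string for `__async_start__`."""
    return f"{name}\n{payload}"
```

Keeping header parsing separate from value decoding mirrors the poll/result split: a client can inspect a whole batch of headers, then fetch each value with its own `__async_result__` call.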
+
+## Implementations
+
+The protocol is implemented by the Java `AsyncHostRegistry` (`core/`), the generated Rust host registry (hostgen's `rust_host.rs` template), and the Python client (`boomslang_host/asyncio.py`). They must agree byte-for-byte; consolidation of the duplicated implementations is tracked in [issue #45](https://github.com/HubSpot/boomslang/issues/45).
diff --git a/docs/src/reference/extension-abi.md b/docs/src/reference/extension-abi.md
new file mode 100644
index 0000000..81930b9
--- /dev/null
+++ b/docs/src/reference/extension-abi.md
@@ -0,0 +1,84 @@
+# Extension ABI JSON & Lowering
+
+An extension declares its host functions once, in `build.rs`, with the [hostgen DSL](hostgen-dsl.md). The build emits an **ABI JSON** file — the language-neutral contract from which host adapters (Java, Rust, or hand-written for any runtime) are generated.
+
+## Schema
+
+```json
+{
+  "abi_version": 1,
+  "extension": {
+    "name": "boomslang_host",
+    "wasm_module": "boomslang",
+    "prewarm": ["_boomslang_host", "boomslang_host", "boomslang_host.asyncio"]
+  },
+  "functions": [
+    {
+      "name": "call",
+      "params": [
+        { "name": "name", "type": "string" },
+        { "name": "args", "type": "string" }
+      ],
+      "returns": "string",
+      "async": false
+    },
+    {
+      "name": "log",
+      "params": [
+        { "name": "level", "type": "int" },
+        { "name": "message", "type": "string" }
+      ],
+      "returns": null,
+      "async": false
+    }
+  ]
+}
+```
+
+| Field | Meaning |
+| --- | --- |
+| `abi_version` | Schema version. Generators require an **exact match** (currently `1`) and fail with a clear error otherwise. If omitted, defaults to `1`. |
+| `extension.name` | Extension identifier. Drives generated names: Python module ``, guest file `ext_.rs`, Java class `HostFunctions`, Rust host file `host_.rs`. |
+| `extension.wasm_module` | The WASM **import module** the functions live under (e.g. import `boomslang.call`). Defaults to the extension name when omitted. |
+| `extension.prewarm` | Python modules imported during Wizer initialization, frozen into the golden snapshot. |
+| `functions[].name` | Function name; becomes the import name and the Python-visible function. |
+| `functions[].params` | Ordered typed parameters. |
+| `functions[].returns` | Return type or `null` for none. Async functions must return `string`. |
+| `functions[].async` | Whether the function is an async host call (see below). |
+
+Types are a closed enum: `string`, `int`, `float`, `bytes`. Unknown type values fail parsing.
+
+## Lowering to WASM signatures
+
+The ABI JSON decides the import signatures and memory protocol. For a function with declared params and return:
+
+| Declared | Lowered |
+| --- | --- |
+| `string` / `bytes` param | `i32 ptr, i32 len` (UTF-8 bytes for strings) |
+| `int` param | `i32` |
+| `float` param | `f64` |
+| `string` / `bytes` return | caller appends `i32 result_ptr, i32 result_max_len`; host writes the value into that buffer and returns the written byte length as `i32` |
+| no return | `i32` status return |
+| async function | returns an `i64` host token instead of a value (see the [async wire protocol](async-protocol.md)) |
+
+So declared `call(name: string, args: string) -> string` becomes the import:
+
+```text
+boomslang.call(name_ptr: i32, name_len: i32,
+               args_ptr: i32, args_len: i32,
+               result_ptr: i32, result_max_len: i32) -> i32
+```
+
+**Result buffer protocol:** the guest allocates the result buffer (currently capped at 1 MiB per call) and passes it to the host. A negative return signals failure: `-1` for a handler error, `-2` when the value did not fit in `result_max_len`. The guest surfaces any negative return as a Python exception.
+
+Behavioral note: on malformed pointers the generated Java host traps the instance, while the generated Rust host returns `-1`; aligning these is tracked in [issue #44](https://github.com/HubSpot/boomslang/issues/44).
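Reviewer sketch: the lowering table can be read as a small function from one `functions[]` entry of the ABI JSON to a WASM signature. The Python below is a minimal illustration, not hostgen's generator. `lower_signature` and `result_length` are hypothetical names. It assumes an async import takes no result-buffer parameters (the table only says it returns a token instead of a value), and it leaves `int` / `float` returns unimplemented because the table does not list them.

```python
# Lowering of declared parameter types to WASM value types, per the table.
PARAM_LOWERING = {
    "string": ["i32", "i32"],  # ptr, len
    "bytes": ["i32", "i32"],   # ptr, len
    "int": ["i32"],
    "float": ["f64"],
}


def lower_signature(function):
    """Return (param_types, return_type) for one ABI JSON function entry."""
    params = []
    for param in function["params"]:
        if param["type"] not in PARAM_LOWERING:
            raise ValueError(f"unknown type: {param['type']}")
        params.extend(PARAM_LOWERING[param["type"]])

    if function.get("async", False):
        # Assumption: no result buffer is appended for async imports.
        return params, "i64"

    returns = function.get("returns")
    if returns is None:
        return params, "i32"  # status return
    if returns in ("string", "bytes"):
        # Caller appends result_ptr and result_max_len; i32 is the written length.
        return params + ["i32", "i32"], "i32"
    # int/float returns are not covered by the lowering table.
    raise NotImplementedError(f"lowering for {returns!r} returns is not specified")


def result_length(rc):
    """Map the i32 return of a string/bytes call to a length, raising on failure."""
    if rc == -1:
        raise RuntimeError("host handler error")
    if rc == -2:
        raise RuntimeError("value did not fit in result_max_len")
    if rc < 0:
        raise RuntimeError(f"host call failed with code {rc}")
    return rc
```

Applied to the `call` entry from the schema example, `lower_signature` yields six `i32` parameters and an `i32` return, matching the `boomslang.call` import shown above.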
+
+## Generated artifacts
+
+From one declaration, hostgen produces:
+
+- **Rust guest** (`ext_.rs`, included via `include!` into your extension crate): the `extern` imports, a Python module exposing typed functions, and `register()` / `prewarm()` hooks for `boomslang_host_core::init`.
+- **Java host adapter** (`HostFunctions.java`): typed functional interfaces + a builder producing a `BoomslangExtension` for `PythonExecutorFactory.addExtension`.
+- **Rust host adapter** (`host_.rs`): a typed builder with `register(&mut wasmtime::Linker<_>)`.
+
+Function names prefixed `__async_` are reserved for the async control namespace and rejected by validation.
diff --git a/docs/src/reference/hostgen-dsl.md b/docs/src/reference/hostgen-dsl.md
new file mode 100644
index 0000000..e77defd
--- /dev/null
+++ b/docs/src/reference/hostgen-dsl.md
@@ -0,0 +1,88 @@
+# hostgen DSL Reference
+
+`boomslang-hostgen` is both a Rust library (used from an extension crate's `build.rs`) and a CLI. The library declares an extension and emits generated code + [ABI JSON](extension-abi.md); the CLI consumes ABI JSON and generates host adapters.
+
+## Declaring an extension (`build.rs`)
+
+```rust
+use boomslang_hostgen::{Build, ExtensionSpec, Type};
+
+fn main() {
+    let ext = ExtensionSpec::new("myext")
+        .wasm_module("myext")
+        .prewarm(["_myext"])
+        .function("do_thing", |f| {
+            f.param("input", Type::String).returns(Type::String)
+        })
+        .function("lookup", |f| {
+            f.r#async()
+                .param("request", Type::String)
+                .param("shard", Type::Int)
+                .returns(Type::String)
+        });
+
+    Build::new(ext).emit().generate().expect("generate myext");
+
+    println!("cargo:rerun-if-changed=build.rs");
+}
+```
+
+### `ExtensionSpec`
+
+| Method | Effect |
+| --- | --- |
+| `ExtensionSpec::new(name)` | Start a spec; `name` is the extension/Python module name. |
+| `.wasm_module(module)` | WASM import module for the functions (defaults to the extension name). |
+| `.prewarm([modules])` | Python modules to import during Wizer init (frozen into the snapshot). |
+| `.function(name, \|f\| ...)` | Declare a host function via the closure. |
+
+### `FunctionSpec` (inside the closure)
+
+| Method | Effect |
+| --- | --- |
+| `.param(name, Type)` | Append a typed parameter (order matters). |
+| `.returns(Type)` | Declare the return type (omit for none). |
+| `.r#async()` | Mark as an async host call — Python awaits it; the host handler is asynchronous. Async functions must return `Type::String`. |
+
+`Type` is `String`, `Int`, `Float`, or `Bytes`. See [lowering rules](extension-abi.md#lowering-to-wasm-signatures) for the WASM signatures these produce.
+
+### `Build`
+
+| Method | Output |
+| --- | --- |
+| `Build::new(spec)` | Start from a spec. |
+| `.emit()` | Shorthand for `.emit_rust_guest().emit_abi_json()` — the standard build.rs setup. |
+| `.emit_rust_guest()` | `$OUT_DIR/ext_.rs` — guest code, consumed by `include!(concat!(env!("OUT_DIR"), "/ext_.rs"))`. |
+| `.emit_abi_json()` | `$OUT_DIR/.abi.json`. |
+| `.emit_abi_json_to(path)` | ABI JSON at a stable path of your choosing (recommended when other builds consume it — `$OUT_DIR` paths contain build fingerprints). |
+| `.emit_java_host(out_dir, package)` | `HostFunctions.java` under `out_dir//`. Prefer running the CLI after the build instead of writing into a source tree from `build.rs`. |
+| `.emit_rust_host(out_dir)` | `host_.rs` Wasmtime adapter. |
+| `.generate()` | Validate the manifest and write everything requested. |
+
+Validation enforces: exact `abi_version` match, identifier-safe names (no Java/Rust keywords), no reserved `__async_*` function names, and string returns for async functions.
+
+## The CLI
+
+```text
+boomslang-hostgen [--java-out DIR [--java-package PKG]] [--rust-host-out DIR]
+```
+
+| Flag | Effect |
+| --- | --- |
+| `--java-out DIR` | Generate the Java host adapter into `DIR` (package subdirectories created). |
+| `--java-package PKG` | Java package for generated code (default `com.hubspot.boomslang.extensions`). |
+| `--rust-host-out DIR` | Generate the Rust Wasmtime host adapter into `DIR`. |
+
+With no output flag the CLI validates the ABI JSON, then exits nonzero with `no output requested`.
+
+From source: `cargo run --manifest-path boomslang-hostgen/Cargo.toml -- `.
+
+## Library entry points
+
+For build tooling that wants codegen without the CLI:
+
+- `read_abi(path) -> Manifest` — parse + validate an ABI JSON file.
+- `generate_java(abi_path, out_dir, package)` — Java adapter from a file.
+- `generate_rust_host(abi_path, out_dir)` — Rust host adapter from a file.
+
+The serde-serializable `Manifest` / `Extension` / `Function` / `Param` / `Type` structs are public; the [ABI JSON schema](extension-abi.md#schema) is their serialized form.
diff --git a/examples/custom-python-build/README.md b/examples/custom-python-build/README.md
index 8d82efe..be9af61 100644
--- a/examples/custom-python-build/README.md
+++ b/examples/custom-python-build/README.md
@@ -138,32 +138,5 @@ fn main() {
 Generated async functions preserve the normal typed argument handling: the Java handler receives typed params and returns a `CompletionStage`, registered alongside a shared `AsyncHostRegistry`. The full Java and Python usage is documented in the [async host calls guide](https://github.hubspot.com/boomslang/guide/async.html).
 
-### Async wire protocol (v1)
-
-`boomslang_host.asyncio` and `AsyncHostRegistry` talk over a small, versioned protocol invoked
-through the stock `boomslang_host.call(name, args)` function. The `__async_*` names are a
-**reserved control namespace** — do not define extension host functions with these names:
-
-| Control call | Args | Returns |
-|---|---|---|
-| `__async_protocol__` | — | integer protocol version (currently `1`) |
-| `__async_start__` | `name\npayload` | decimal token for a registered named async handler |
-| `__async_poll__` | timeout ms (`<0` blocks, `0` polls) | one header line per ready completion: `token\t{1\|0}\t` |
-| `__async_result__` | token | base64 of that completion's value bytes (consumes it) |
-| `__async_cancel__` | token | cancels the in-flight future |
-
-Why this shape matters:
-
-- **Versioned.** The Python client is frozen into each consumer's WASM Wizer snapshot, so the Java
-  host must stay compatible with already-shipped clients. `__async_protocol__` lets a client refuse
-  a host older than the protocol it was built against; bump `AsyncHostRegistry.PROTOCOL_VERSION`
-  only for breaking wire changes.
-- **Poll and result are decoupled.** `__async_poll__` returns only headers (token, ok, length);
-  values are fetched one at a time via `__async_result__`. A batch of completions therefore never
-  exceeds the single host-call result buffer. (A single value larger than that buffer is still a
-  limitation — chunked retrieval is a future protocol addition.)
-- **Failures never hang.** Synchronous handler errors are recorded via `AsyncHostRegistry.startFailed`
-  and surface as a failed completion (the coroutine raises `HostAsyncError`); the client also rejects
-  any non-positive token immediately.
-- **Binary-safe value channel.** Completion values are carried as base64 of raw bytes, so extending
-  async returns to `bytes` later needs no wire change.
+The `__async_*` control calls and their framing are specified in the
+[async wire protocol reference](https://github.hubspot.com/boomslang/reference/async-protocol.html).
diff --git a/examples/rust-host/README.md b/examples/rust-host/README.md
index df41643..dc1ba13 100644
--- a/examples/rust-host/README.md
+++ b/examples/rust-host/README.md
@@ -25,13 +25,7 @@ cargo run --manifest-path boomslang-hostgen/Cargo.toml -- \
 
 ## What The ABI Drives
 
-The ABI JSON decides the Wasmtime import signatures and memory lowering:
-
-- `string` and `bytes` params lower to `i32 ptr, i32 len`.
-- `int` params lower to `i32`.
-- `float` params lower to `f64`.
-- `string` and `bytes` returns use caller-provided `i32 result_ptr, i32 result_max_len` params and return the written byte length as `i32`.
-- async functions return an `i64` host token.
+The ABI JSON decides the Wasmtime import signatures and memory lowering — pointer/length pairs for strings and bytes, caller-provided result buffers for returns, `i64` tokens for async. The full rules are specified in the [extension ABI reference](https://github.hubspot.com/boomslang/reference/extension-abi.html).
 
 The generated host binding is typed: